Modern Methods of
Drug Discovery
Edited by A. Hillisch and R. Hilgenfeld
Springer Basel AG
The publisher and editor can give no guarantee for the information on dosage and administration contained
in this publication. The respective user must check its accuracy by consulting other sources of reference
in each individual case.
The use of registered names, trademarks etc. in this publication, even if not identified as such, does not
imply that they are exempt from the relevant protective laws and regulations or free for general use.
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is con-
cerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in other ways, and storage in data banks. For any kind of use, permission
of the copyright owner must be obtained.
ISBN 3-7643-6081-X
http://www.birkhauser.ch
Contents
Preface
2 Proteomics
Martin J. Page, Chris Moyses, Mary J. Cunningham, Gordon Holt
and Alastair Matheson
3 Bioinformatics
David B. Jackson, Eric Minch and Robin E. Munro
Index
List of contributors
Adel Nefzi, Torrey Pines Institute for Molecular Studies, 3550 General Atomics
Court, San Diego, CA 92121, USA
John M. Ostresh, Torrey Pines Institute for Molecular Studies, 3550 General
Atomics Court, San Diego, CA 92121, USA
Martin J. Page, OSI Pharmaceuticals, Cancer Biology, Watlington Road, Oxford
OX4 6LT, UK; e-mail: mpage@osip.com
Douglas C. Rohrer, Pharmacia, Discovery Research, 7000 Portage Road,
Kalamazoo, MI 49008, USA
Isabel Sattler, Hans-Knöll-Institut für Naturstoff-Forschung e.V., Beutenberg-
straße 11a, D-07745 Jena, Germany
Simon F. Semus, Wyeth-Ayerst Research, Department of Biological Chemistry,
CN8000, Princeton, NJ 08543, USA; present address: GlaxoSmithKline, 709
Swedeland Road, King of Prussia, PA 19406, USA; e-mail:
Simon.F.Semus@gsk.com
Ralf Thiericke, CyBio Screening GmbH, Winzerlaer Strasse 2a, D-07745 Jena,
Germany; e-mail: ralf.thiericke@cybio-ag.com
Han van de Waterbeemd, Pfizer Global Research and Development, PDM, Sandwich,
Kent CT13 9NJ, UK; e-mail: han_waterbeemd@sandwich.pfizer.com
John H. van Drie, Vertex Pharmaceuticals, 130 Waverly St, Cambridge, MA
02139; e-mail: vandrie@mindspring.com
Martin Vogtherr, Institut für Organische Chemie, Johann Wolfgang von Goethe-
Universität Frankfurt/Main, Marie-Curie-Straße 11, D-60431 Frankfurt a.M.,
Germany
Preface

1 Modern methods of drug discovery: An introduction
1.1 Introduction
1.2 Economics of drug discovery and development
1.3 The drug discovery process: methods and strategies
1.3.1 Target identification
1.3.2 Target validation
1.3.3 Lead identification
1.3.4 Lead optimization
1.4 Trends and visions in drug development
1.4.1 Genomic approaches
1.4.2 Genetic approaches
1.5 Conclusions
1.6 Acknowledgements
1.7 References
1.1 Introduction
[Figure 1: bar chart, 1988-2000E, of yearly NCE launches by the top 10 pharmaceutical companies versus the yearly NCE launches required to achieve 10% sales growth by the top 10 pharma (the "productivity gap"), with R&D expenditure ($0B-$25B) on the secondary axis.]
Figure 1. The innovation deficit in the pharmaceutical industry: The gap between the number of new
chemical entities (NCEs) required to be launched in order to accomplish an annual 10% revenue increase
and the actual number of NCEs introduced in the market by the top 10 pharmaceutical companies. In 2000
this gap was 1.5 NCEs.
[Figure 2: horizontal bar chart of the average duration (0-16 years) of preclinical testing, the clinical phases, FDA review and post-marketing testing.]
Source: PhRMA, based on data from Center for the Study of Drug Development, Tufts University, 1995
Figure 2. Average periods for steps in the drug discovery and development procedure (FDA: Food and
Drug Administration).
[Figure 3: horizontal bar chart (0-100%) of R&D allocation by function; labeled segments include bioavailability and "other" (2.4%).]
Figure 3. Allocation of R&D by function in percentage terms (IND: Investigational New Drug, NDA: New
Drug Application).
One reason for "innovation deficit" is the increased demand on safety of drugs.
The average number of clinical trials per new drug application (NDA) steadily
increased from 30 in the 1970s, about 40 in the 1980s to 70 in the 1990s [7]. If
1,500 patients were required for an NDA in the early 1980s, today the average
number of patients is around 4,000. The increased demand on safety is conse-
quently reflected in a prolonged duration of the drug development process. In the
1960s, the total development time was 8.1 years. This rose to 11.8 years in the
1970s, 14.2 years in the 1980s to finally 14.9 years in the 1990s [6].
Since drug safety cannot be compromised, methods that enhance R&D pro-
ductivity, namely accelerate drug discovery and reduce failure rates during later
drug development, are desirable.
Can the new technologies pertaining to target and drug discovery fulfill these
needs? Recent advances, for example, in functional genomics, proteomics, assay
development, combinatorial chemistry, high-throughput screening, structure-
based ligand design and pharmacogenomics will definitively enrich the basis for
discovering and developing innovative therapeutic products, but can these meth-
ods also improve the economics of drug research, development and commercial-
ization per se?
In the following, current and future drug discovery methods and strategies are
reviewed in chronological order throughout the drug discovery process. Many of
these techniques are described in more detail in the other chapters of this book.
We refer solely to small organic molecules and not therapeutic proteins ("biolog-
icals"), for which the situation may be quite different.
1.3 The drug discovery process: methods and strategies

The drug discovery process can be divided into four steps: drug target identifi-
cation, target validation, lead compound identification and optimization. In
Figure 4 some of the key methods used to discover new drugs are depicted.
1.3.1 Target identification

The identification of new and clinically relevant molecular targets for drug inter-
vention is of outstanding importance to the discovery of innovative drugs.
It is estimated that up to 10 genes contribute to multifactorial diseases [8].
Typically, these "disease genes" are linked to another five to 10 gene products in
physiological or pathophysiological circuits which are also suitable for interven-
tion with drugs. If these numbers are multiplied by the number of diseases that
pose a major medical problem in the industrial world, which is between 100 and
150, we end up with about 5,000 to 10,000 potential drug targets. A recent analy-
sis shows that current drug therapy is based on less than 500 molecular targets [8],
with 45% of these being G-protein coupled receptors, 28% enzymes, 11% hor-
mones and factors, 5% ion channels and 2% nuclear receptors. Thus, the number
[Figure 4: flow chart of the drug discovery phases. The lead optimization box lists: medicinal chemistry; parallel synthesis (Chapter 6); design of focused compound libraries (Chapters 7, 10); molecular modeling and QSAR (Chapter 11); structure-based design (Chapter 8); in vivo pharmacology; pharmacokinetics and toxicology (Chapters 12, 13). It is followed by preclinical and clinical development.]
Figure 4. Phases of the drug discovery process. After target identification and validation, chemical sub-
stances are discovered and optimized with regard to efficacy, potency, selectivity as well as pharmacoki-
netic, toxicological and metabolic aspects. This process is followed by preclinical and clinical develop-
ment. Methods reviewed in this book are indicated by references to the respective chapters.
of drug targets that can be exploited in the future is at least 10 times the number
of targets currently implemented in drug therapy.
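This back-of-the-envelope estimate can be checked directly. A minimal Python sketch using only the ranges quoted above (taking the 100-disease figure reproduces the quoted 5,000 to 10,000 span; all numbers come from the text, not new data):

# Rough count of potential drug targets, from the ranges quoted in the text.
genes_per_disease = 10        # up to 10 genes contribute per multifactorial disease
linked_per_gene = (5, 10)     # each linked to another 5-10 druggable gene products
n_diseases = 100              # 100-150 medically important diseases

low = genes_per_disease * linked_per_gene[0] * n_diseases    # 5,000
high = genes_per_disease * linked_per_gene[1] * n_diseases   # 10,000
print(f"{low:,} to {high:,} potential targets vs. <500 exploited today")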
Besides classical methods of cellular and molecular biology, new techniques of
target identification are becoming increasingly important. These methods aim at
i) discovering new genes and proteins and ii) quantifying and analyzing gene and
protein expression between diseased and normal cells.
Methods routinely used to identify new drug targets are genomics [9] and pro-
teomics [10, 11] (see Chapter 2).
The term genomics was coined in the mid 1980s. This new discipline has
evolved from two independent advances:
With gene expression microarrays and especially gene chip technol-
ogy, scientists today can use a single micro-device to evaluate and compare the
expression of up to 20,000 genes of healthy and diseased individuals at once [24].
Thirty-three thousand genes of the human genome can thus be analyzed simulta-
neously using two high-density arrays of oligonucleotide probes, each about a
square centimeter in size (Affymetrix technology) [25].
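To make the kind of comparison such arrays enable concrete, the following minimal Python sketch computes log2 expression ratios between a diseased and a healthy sample and flags genes changed at least two-fold. The gene names, intensities and the two-fold cutoff are hypothetical illustrations, not values from the text:

import numpy as np

# Hypothetical expression intensities for five genes, as read off a
# microarray for a healthy and a diseased sample.
genes    = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]
healthy  = np.array([120.0, 85.0, 430.0, 60.0, 310.0])
diseased = np.array([610.0, 90.0, 415.0, 14.0, 295.0])

# log2 ratio: +1 means two-fold up in disease, -1 means two-fold down.
log2_ratio = np.log2(diseased / healthy)

for gene, ratio in zip(genes, log2_ratio):
    if abs(ratio) >= 1.0:  # at least a two-fold change
        direction = "up" if ratio > 0 else "down"
        print(f"{gene}: {direction}-regulated in disease (log2 ratio {ratio:+.2f})")

Genes flagged this way are candidate disease genes; whether they make useful drug targets is a question for target validation.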
However, it is becoming increasingly evident that the complexity of biological
systems lies at the level of the proteins, and that genomics alone will definitely not
suffice to understand these systems. It is also at the protein level that disease
processes become manifest and at which at least 91% of drugs act [8]. Thus, the
analysis of proteins (including protein-protein-, protein-nucleic acid- and protein-
ligand-interactions) will be of utmost importance to target discovery. Proteomics
(see Chapter 2), the systematic high-throughput separation and characterization of
proteins within biological systems, helps to fulfill some of these needs. Concerning
expression analysis, it has been shown that the correlation between RNA and pro-
tein expression is rather weak and ranges in yeast from 10% to 40% for lower-
abundance proteins up to 94% for higher-abundance proteins [26]. Some of these
differences can be explained by posttranslational modifications, accessible only
to proteomics analysis in this context. Target identification with proteomics is per-
formed by comparing the protein expression levels in normal and diseased tissues.
Two-dimensional polyacrylamide gel electrophoresis (2D-PAGE) is used to sepa-
rate the proteins which are subsequently identified and fully characterized with
matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS) [27].
Differential protein expression between normal and diseased samples probably
indicates that the protein of interest is involved in the pathophysiology and could
therefore be a potential drug target. This assumption has to be confirmed during
the target validation process.
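The weak mRNA-protein relationship cited above is easy to quantify once paired measurements exist. A minimal Python sketch with invented paired abundances; only the idea of computing a correlation coefficient between the two measurement types is the point:

import numpy as np

# Hypothetical paired abundances for ten gene products: mRNA levels
# (e.g., from microarrays) and protein levels (e.g., from 2D-PAGE).
mrna    = np.array([1.0, 2.1, 0.4, 3.0, 1.5, 0.9, 2.6, 0.7, 1.8, 2.2])
protein = np.array([0.8, 0.9, 1.1, 2.7, 0.5, 1.4, 2.9, 0.6, 0.7, 1.2])

# Pearson correlation between mRNA and protein abundance.
r = np.corrcoef(mrna, protein)[0, 1]
print(f"mRNA-protein correlation r = {r:.2f}")
# A moderate r at best: transcript levels only partially track protein
# levels, which is why proteomic expression analysis is needed.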
1.3.2 Target validation

The validation of a drug target involves demonstrating the relevance of the target
protein in a disease process and ideally requires both gain and loss of function
studies. This is accomplished primarily with highly specialized knock-out (loss of
function) and knock-in (gain of function) animal models that are capable of
mimicking the targeted disease state [28-30], or with the use of small molecules
(inhibitors, agonists, antagonists), antisense nucleic acid constructs [31],
ribozymes and neutralizing antibodies [32] (see also Chapter 2). The development
of in vivo models is always costly and time-consuming. Given that genomics
and proteomics will uncover novel biological targets involving new pathome-
chanisms, many of the required models have yet to be designed. Nevertheless,
they will always be an approximation of the real clinical situation in humans and
can thus not necessarily contribute to the predictability of drug development.
Methods used for target validation at the in vitro level are RNA and protein
expression analysis and cell-based assays.
By far not all target proteins that are relevant for a certain disease process can
be affected by (preferably orally bioavailable) small molecules. Since strong
interactions between a protein and its ligand are characterized by a high degree of
complementarity, knowledge of the protein three-dimensional structure will
enable the prediction of "druggability" of the protein (see Chapter 8). For exam-
ple, a protein with a large hydrophobic ligand-binding pocket will also interact
preferably with large apolar ligands that are certainly problematic cases with
respect to physicochemical (e.g., aqueous solubility, logP) and pharmacokinetic
properties. In the future, protein structure information (x-ray, NMR and compar-
ative models) may be applied to rule out such difficult drug targets. To cope with
the thousands of potential drug targets, structural genomics (systematic, large-
scale efforts towards the structural characterization of all proteins) [33, 34] will
play an important role. It is conceivable that tool compounds, designed on the
basis of protein three-dimensional structure information, will help to unravel the
physiological roles of target proteins and contribute to "chemical" target valida-
tion [35-37] (see Chapter 8). Probably such tool compounds can later be quickly
turned into drugs if the target shows disease relevance.
In summary, target validation is one of the bottlenecks in the drug discovery
process, since this phase is less adaptable to automation. The careful validation of
target proteins, not only with respect to disease relevance, but also "druggablity,"
will reduce the failure rate and increase the overall efficiency of the drug discov-
ery process.
1.3.3 Lead identification

In the lead generation phase, compounds are identified which interact with the tar-
get protein and modulate its activity. Such compounds are mainly identified using
random (screening) or rational (design) approaches.
High-throughput screening (HTS, see Chapter 4) is used to test large numbers
of compounds for their ability to affect the activity of target proteins. Today,
entire in-house compound libraries with millions of compounds can be screened
with a throughput of 10,000 (HTS) up to 100,000 compounds per day (uHTS,
ultra high-throughput screening) [38, 39] using very robust test assays (see
Chapter 4). Combinatorial chemistry and parallel synthesis (Chapter 6) are
employed to generate such huge numbers of compounds. However, there are
some concerns regarding the pure "number"-approach [40]. In theory, generating
the entire or at least significant parts of chemical space for drug molecules and
testing them would be an elegant approach to drug discovery. In practice, this is
not feasible, since we are faced with a chemical space of 10⁶²-10⁶³ compounds
that comprise carbon, oxygen, nitrogen, hydrogen, sulfur, phosphorous, fluorine,
chlorine and bromine, having a molecular weight of less than 500 Da [41].
Unfortunately, even the largest chemical libraries available today, including up to
10 million compounds (10⁷), comprise only a tiny fraction of the entire chemical
space. With existing chemistry technologies we will not exceed library sizes of
10⁸ compounds in the near future. Since it seems that we have to live with that
problem, concepts are needed to synthesize and select biologically relevant com-
pounds. One solution, according to Wess et al., could be to accumulate as much
knowledge as possible on biological targets (e.g., structure, function, interactions,
ligands) and fill the limited chemical space with targeted approaches of chemical
synthesis [40]. Natural compounds and synthetic derivatives thereof can con-
tribute essentially to such approaches [42] (Chapter 5). This is simply because
nature has provided clues as to the structural requirements for biological activity.
In other words, if lead identification is compared to the proverbial "search for the
needle in the haystack," then simply adding more "hay" will not yield the solu-
tion. Chemical synthesis, be it traditional solution phase chemistry, natural prod-
ucts derivatization, combinatorial or parallel synthesis, has to work closely
together with design approaches like structural biology, molecular modeling,
library design and physicochemical approaches, in order to select smart
"haystacks" with more "needles."
In conclusion, the design of compound libraries will gain further importance
for high- or medium-throughput screening, since besides molecular similarity and
diversity issues, physicochemical (Chapter 12) and toxicological aspects (Chapter
13) can now be considered to select "drug-like" compounds for screening.
Another crucial point for reliable results with high-throughput screening is the
robustness and quality of the biological test assays. Homogeneous "mix and meas-
ure" assays are preferred for HTS as they avoid filtration, separation, and wash
steps that can be time-consuming and difficult to automate. Assays for HTS can be
grouped into two categories: so-called solution-based biochemical assays and cell-
based assays [43, 44]. The former are based on radioactive (scintillation proximi-
ty assay, SPA) and fluorescence (fluorescence resonance energy transfer FRET,
fluorescence polarization FP, homogeneous time-resolved fluorescence HTRF, and
fluorescence correlation spectroscopy FCS) detection methods to quantify the
interaction of test compounds with biological target molecules. Scintillation prox-
imity assays in HTS have largely replaced heterogeneous assays that make use of
radio-labeled ligands with subsequent filtration steps to measure high-affinity
binding to receptors. Cell-based assays for HTS can be categorized as follows: i)
second messenger assays that monitor signal transduction, ii) reporter gene assays
that monitor cellular responses at the transcriptional/translational level, and iii) cell
proliferation assays that detect induction or inhibition of cell growth.
A different approach to lead compound discovery is in silico or virtual screen-
ing [45, 46] (see also Chapter 10). With this computer method, three-dimensional
(3D) structures of compounds from virtual or physically existing libraries are
docked into binding sites of target proteins with known or predicted structure.
Empirical scoring functions are used to evaluate the steric and electrostatic com-
plementarity (the fit) between the compounds and the target protein. The highest
ranked compounds are then suggested for biological testing. These software tools
are attractive and cost-effective approaches to generate chemical lead structures
virtually and before using expensive synthetic chemistry. Furthermore, they allow
rapid and thorough understanding of the relationship between chemical structure
10 H. Giersiefen et al.
and biological function. The virtual screening of small molecules takes less than
a minute per chemical structure per computer processor (CPU) when using the
more sophisticated computer algorithms that are available today [45]. Utilizing
clusters of CPUs results in a high degree of parallelization. The throughput with
100 parallel CPUs is even higher compared to current uHTS technologies. The
main advantage is that the method does not depend on the availability of com-
pounds, meaning that not only in-house libraries can be searched, but also exter-
nal or virtual libraries. The application of scoring functions to the resulting data
sets facilitates smart decisions about which chemical structures bear the potential to
exhibit the desired biological activity.
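A quick calculation with the figures quoted above (one structure per minute per CPU, 100 CPUs, uHTS at 100,000 compounds per day) makes the throughput comparison concrete:

# Virtual-screening throughput versus uHTS, from the figures in the text.
minutes_per_compound_per_cpu = 1     # "less than a minute"; take 1 min
n_cpus = 100                         # a modest CPU cluster
uhts_per_day = 100_000               # uHTS: ~100,000 compounds/day

docked_per_day = n_cpus * 24 * 60 // minutes_per_compound_per_cpu
print(f"in silico: {docked_per_day:,} compounds/day")  # 144,000
print(f"uHTS:      {uhts_per_day:,} compounds/day")
# 100 CPUs already outpace uHTS, and the docked compounds need not
# physically exist.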
However, one important prerequisite for these technologies is the availability of
structural data of the target, if possible in complex with the biological ligand
or an effector molecule. Currently, there exists experimental high-resolution
structure information on only about 1% of all highly annotated protein sequences
(see Chapter 8). Comparative models for more than 40% of these proteins are
available (see Chapters 3 and 8). So, in silico screening can currently be applied
on average to every third and probably in the future to every second drug discovery project.
Structural biology will therefore become an increasingly important discipline in
drug discovery. This is grounded in three factors that determine the ability of in
silico screening tools to identify lead compounds:
i) quality of the structural models determined by nuclear magnetic resonance
spectroscopy (NMR) or x-ray diffraction, ii) current understanding and assump-
tions about protein (target) and ligand (drug) binding interactions and the trans-
position of this understanding into suitable models, and iii) ability to apply com-
putational chemistry as an enabler of a cross-functional and reiterative process
whereby laboratory data demonstrating the desired effect of a ligand can provide
insights for refining and optimizing chemical structures. The power of in silico
screening is expected to increase dramatically in the future due to the availability
of protein structures from structural genomics initiatives (Chapter 8), larger struc-
ture-activity data sets (Chapter 11) and the experience necessary to refine today's
computer algorithms and increase their predictive power (Chapter 10).
Another screening approach, namely NMR-based screening (Chapter 9) fills
the gap between HTS and virtual screening [47]. This method combines the ran-
dom screening approach (currently up to 1000 compounds/day) with the rational
structure-based approach to lead discovery. Small organic molecules that bind to
proximal subsites of a protein are identified through screening, and linked togeth-
er in a rational approach, to produce high-affinity ligands (Chapter 9).
Once hits (compounds that elicit a positive response in a particular biological
assay) have been identified by applying the different screening approaches, these
are validated by re-testing them and checking the purity and structure of the com-
pounds. This is to avoid spending too much time with further characterizations of
false positives from the screening. Only if the hits fulfill certain criteria are they
regarded as leads. The criteria can originate from different directions such as:
• pharmacodynamic properties: efficacy, potency and selectivity in vitro and in
vivo;
1.3.4 Lead optimization

During the lead optimization phase, small organic molecules are chemically mod-
ified and subsequently pharmacologically characterized in order to obtain com-
pounds with suitable pharmacodynamic and pharmacokinetic properties to
become a drug. This process ideally requires the simultaneous optimization of
multiple parameters and is thus a time-consuming and costly step, which proba-
bly constitutes the "tightest" bottleneck in drug discovery. However, by turning a
biologically active chemical into an effective and safe drug, lead optimization
contributes essentially towards added value in the drug discovery process.
Lead optimization is mainly a cross-talk between the traditional disciplines in
drug discovery: medicinal chemistry and in vitro/in vivo pharmacology. Leads are
characterized with respect to pharmacodynamic properties such as efficacy, poten-
cy and selectivity in vitro and in vivo, physicochemical properties (see Chapters 7
and 12) and pharmacokinetic properties (absorption, distribution, metabolism and
elimination, ADME, see Chapter 7 and 12), and toxicological aspects (see Chapter
13). In vitro test assays allow a larger number of compounds to be characterized.
The assays used to test for potency and selectivity are similar to those described
for HTS (see Chapter 4) with the exception that more time-consuming detection
methods (e.g. calorimetric, chromatographic, surface plasmon resonance methods)
are used [50].
In parallel to compound characterization with respect to potency and selectivi-
ty, in vitro assays for the prediction of pharmacokinetic properties should be per-
formed. Such in vitro systems utilize either Caco-2 or Madin-Darby Canine
Kidney cells, human hepatic microsomes, heterologously expressed cytochrome
P450 enzymes or human hepatocytes [49]. Once compounds with desirable in
vitro profiles have been identified, these are characterized using in vivo models
[50]. These tests usually require larger quantities of compounds and are conduct-
ed with selected molecules only.
Compounds with properties which do not fulfill the requirements for a success-
ful drug development candidate have to be optimized through the synthesis of "bet-
ter suited" derivatives. Hints and ideas on how to modify a lead compound with
respect to pharmacodynamic properties (e.g., binding to a receptor or enzyme) can
originate from molecular modeling, quantitative structure activity relationship
(QSAR)-studies (see Chapter 11) and from structural biology (see "structure-based
drug design cycle" in Chapter 8). If the three-dimensional structure of the target
protein is not available, the ligand-based approach of molecular modeling can lead
to new insights into structure-activity-relationships (see Chapter 11). If, for exam-
ple, the steric and electrostatic properties of two different series of lead compounds
that bind to the same site at the target protein are compared on the computer screen
in three-dimensions, one can infer important substituents from one series to
another. QSAR studies can yield mathematical equations which correlate biologi-
cal activity with chemical structure (see Chapter 11). Biological activity for com-
pounds that may or may not exist can be predicted from such equations. In cases
where the three-dimensional structure of the target protein is known, docking cal-
culations are applied to predict the relative orientation of ligands with respect to the
target protein. The thorough analysis of the complexes allows one to predict how
to increase the binding affinity of the ligands. These rational approaches are valu-
able tools in the lead optimization phase, since they connect biology with chem-
istry and allow a thorough understanding of the relationship between chemical
structure and biological function [35]. This has been shown to accelerate the lead
optimization phase (see Chapter 8).
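To illustrate the kind of equation a QSAR study yields, the sketch below fits a simple linear model, activity = c0 + c1*logP + c2*MW, by least squares in Python and applies it to an untested compound. The choice of descriptors, their values and the activities are all hypothetical; real QSAR models use larger series and validated descriptor sets (see Chapter 11):

import numpy as np

# Hypothetical training data for five compounds of a lead series:
# logP and molecular weight as descriptors, measured activity (e.g., pIC50).
logp     = np.array([1.2, 2.5, 3.1, 1.8, 2.9])
mw       = np.array([310.0, 355.0, 402.0, 330.0, 388.0])
activity = np.array([5.1, 6.0, 6.4, 5.5, 6.3])

# Design matrix with intercept; fit activity = c0 + c1*logP + c2*MW.
X = np.column_stack([np.ones_like(logp), logp, mw])
(c0, c1, c2), *_ = np.linalg.lstsq(X, activity, rcond=None)
print(f"activity = {c0:.2f} + {c1:.2f}*logP + {c2:.4f}*MW")

# Predict a compound that may not even have been synthesized yet.
new_logp, new_mw = 2.2, 360.0
print(f"predicted activity: {c0 + c1 * new_logp + c2 * new_mw:.2f}")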
Concerning the generation of ideas on how to modify compounds with respect
to physicochemical and pharmacokinetic properties, in silico methods play an
important role [51]. Probably the most widely used ADME-model is Lipinski's
"rule-of-five," which is used to filter out compounds likely to be poorly absorbed
through the human intestine, based on four simple rules related to molecular prop-
erties (see Chapter 7 and 12). Other models for human intestinal absorption, blood-
brain barrier (BBB) permeation or human bioavailability are based on empirical
approaches such as QSAR, and require a significant quantity of high quality data
from which to deduce a relationship between structure and activity [52] (see
Chapter 11). Quantum mechanical simulations, structure-based design methods and
expert systems can be applied to predict metabolism and toxicity (see Chapter 13).
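A minimal Python sketch of such a rule-of-five filter. The descriptor values are assumed to be precomputed (in practice a cheminformatics toolkit would supply them), the compounds are invented, and the common reading that a compound is flagged only when more than one of the four limits is violated is used:

def passes_rule_of_five(mw, logp, h_donors, h_acceptors):
    """Lipinski's rule-of-five: poor absorption is likely when more
    than one of the four property limits is violated."""
    violations = sum([
        mw > 500,          # molecular weight above 500 Da
        logp > 5,          # calculated logP above 5
        h_donors > 5,      # more than 5 hydrogen-bond donors
        h_acceptors > 10,  # more than 10 hydrogen-bond acceptors
    ])
    return violations <= 1

# Hypothetical candidates: (name, MW, logP, H-bond donors, H-bond acceptors)
candidates = [
    ("cmpd_1", 342.4, 2.1, 2, 5),
    ("cmpd_2", 689.8, 6.3, 4, 12),
]
for name, *props in candidates:
    verdict = "keep" if passes_rule_of_five(*props) else "poor absorption likely"
    print(f"{name}: {verdict}")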
The chemical synthesis of the proposed, better suited compounds is accom-
plished using traditional medicinal chemistry. Serendipity (chance findings) at
this stage of drug discovery also plays an important role and should not be over-
looked. Synthesizing compounds simply because they are chemically easily
accessible is a common approach and has led to important discoveries. The
preparation of focused combinatorial compound libraries is also a powerful
approach for the derivatization of leads (see Chapter 6 and 10), but requires mean-
ingful HTS-assays for subsequent biological testing. However, in order to supply
the most advanced and complicated test systems (the animal models [50]) with
sufficient amounts of pure test compounds, "traditional" chemical synthesis plays
a dominant role in the optimization phase.
In conclusion, it is vital to conceive lead optimization as a simultaneous multi-
dimensional optimization process rather than a sequential one. Optimizing leads
The recent regulatory approval of the Novartis drug Gleevec®, formerly known as
STI-571, an inhibitor of the Bcr-Abl kinase and a treatment for chronic myeloid
leukemia (CML), set a new benchmark in terms of development and approval time.
Given the specificity challenges associated with kinase inhibition, the rigorous
and selective deployment of modern drug discovery tools translated into clear cus-
tomer value in terms of unsurpassed efficacy, fewer side-effects and a pharmaco-
kinetic profile that supports once-daily oral administration.
The most important lever in terms of improving the economics of R&D is the
ability to distinguish failures from successful candidates early in the process. This
lever exists in every segment of the pharmaceutical value chain and has a dramat-
ic impact on success rates of R&D portfolios. Considering, for example, all bio-
logical targets and corresponding lead drug candidates of a given pharmaceutical
portfolio that fail prior to entering clinical development-if that company were
able to avoid just one out of 10 earlier failures it could save an additional $100
million per drug.
Genomic-based information combined with the ability to analyze it produc-
tively will provide companies an enormous advantage. Such companies can now
make more informed decisions on targets and leads to be pursued. However,
improving portfolio management and decision-making will take more than just
acquiring and implementing genomics technologies. It will also require strategic
thinking, for example, the definition and application of rigorous selection criteria
that will ultimately shape the company's product portfolio and can reduce devel-
opment risk, decisions on building in-house competence or partnering, acquisition
and divestiture of targets and leads, etc.
The cascade of portfolio decisions to be made again and again extends into
drug discovery and drug development. Historically, pharmaceutical companies
tended to develop new drugs inductively, whereby the data generated in each seg-
ment of the value chain were driving subsequent experiments and studies. In other
words, experiments were designed based on what one knew about the target and
the drug candidate. Such an approach bears significant risks in terms of identify-
ing candidates that are likely to fail during drug development later rather than earlier,
resulting in increased development costs and time-to-market. Furthermore, the
generated data may not always support the ultimate value proposition that a drug
has to deliver to its customers. The generation of "scientific noise" may even have
posed questions that cannot be answered easily.
Especially when pursuing new targets and pathomechanisms that may arise
from the genomics approach, it is crucial to envision the desired clinical benefit
early and prior to committing funds for expensive formal drug development. The
ability of a company to translate clinical relevance to customers into meaningful
experiments and studies in drug discovery and development will influence portfo-
lio success rates. The discovery and development of new drugs should be guided
by an understanding of the anticipated product benefit. Drug product prototypes
that are structured around the key questions that customers always ask may serve
as a portfolio management tool by providing a basis for stringent decision-mak-
ing throughout the process. Customer questions and anticipated answers may
cover essential value criteria such as efficacy, safety, dosing, quality, risk/benefit
and cost/performance ratios. Experiments and studies will not only serve the gen-
eration of scientific data which is ultimately required to obtain market approval,
but more importantly, the continuous analysis and interpretation of this data rela-
tive to the defined product value proposition will serve rational portfolio deci-
sions. In case it becomes obvious from the data analysis that the envisioned value
proposition cannot be accomplished, early discontinuation or mid-term correc-
tions may be necessary. Drug discovery and development become deductive,
whereby scientists become more interested in what they should know about their
targets and drugs rather than what they already know.
While pharmaceutical companies are attempting to adopt and cope with the tech-
nological advances described in this book, the race for new technologies contin-
ues.
Having discussed the genomics wave and its potential to enhance R&D pro-
ductivity, we now turn to genetics, which will shape the future of how we discover
and develop new therapies. Moreover, genetics can ultimately impact the entire
health care system.
Two genetic approaches can impact R&D at different stages of the value chain:
disease genetics and pharmacogenetics [53-55].
Disease genetics is implemented earlier in drug discovery and involves the
search for genes that determine the human susceptibility for certain diseases.
Once more, the discovery of new targets is the objective. However, this time new
targets may not only enrich the physician's future armamentarium, they also
specifically hold the prospect of transforming R&D success rates.
Pharmacogenetics will influence R&D later, in particular during clinical devel-
opment. It entails the study of genetic variations between individu-
als and how these variations influence drug response. Inherited differences in
DNA sequence such as single-nucleotide polymorphism (SNP) contribute to phe-
notypic variation, influencing an individual's anthropometric characteristics, risk
of disease and response to the environment (e.g., a drug) [56]. SNPs occur (on
average) every 1,000-2,000 bases when two human chromosomes are compared.
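Taken at face value, the quoted density implies on the order of a few million SNPs in any pairwise genome comparison; a quick Python check (the roughly 3 billion base genome size is a commonly cited figure, not from the text):

# Expected SNP count between two human chromosome sets, from the quoted
# density of one SNP per 1,000-2,000 bases.
genome_bases = 3.0e9            # ~3 billion bases (commonly cited size)
snps_low  = genome_bases / 2_000
snps_high = genome_bases / 1_000
print(f"expected SNPs per comparison: {snps_low:,.0f} to {snps_high:,.0f}")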
With the availability of whole-genome SNP maps, it will soon be possible to cre-
ate an SNP profile for patients who experience adverse events or who respond
clinically to the medicine. Pharmacogenetics can thereby help to understand and
predict the efficacy and safety of a potential drug candidate. There are three rele-
vant approaches available today:
• Pharmacogenetics predicts drug responses by analyzing genetic variations on
a DNA level;
• Expression pharmacogenetics accomplishes the same by specifically compar-
ing RNA levels in different samples to determine which genes are expressed in
certain disease states;
• Proteomic pharmacogenetics compares protein readings in different tissue
samples to identify proteins that differ in structure or in expression levels (see
Chapter 2).
In the short term, disease genetics and pharmacogenetics will enable drug developers to
predict drug responses by studying underlying genetic variants or polymorphisms.
By comparing the individual's genetic variations against the "standard genome,"
scientists should be able to determine the risks for developing specific diseases or
how well suited a patient is to a particular therapeutic intervention. The econom-
ic impact is twofold: The approach described will help to refine inclusion criteria
for clinical trials and can thereby reduce trial cost and the probability of study fail-
ure in the most expensive segment of the value chain. This is especially important
for disease areas that often suffer from non-responders, for example, cancer or
inflammatory diseases. These diseases can have multiple underlying pathomech-
anisms. Therapeutic success in such cases may be limited when attacking mono-
specific biological targets. In addition, the predictive knowledge regarding drug
response can positively influence the prescription habit of physicians. Savings to
the health care system could be paramount considering that today's drugs work
only for 60% of patients at most.
In the long term, pharmacogenetics can deliver individualized therapeutic options.
Customized drugs for patient sub-populations have the potential to provide pre-
dictable and improved clinical outcomes in terms of efficacy and safety. This
opportunity includes the treatment of patients for whom current treatment options
are ineffective or unsafe; other important drugs that have been abandoned due to
a lack of safety in only a few patients could be resuscitated and be made available
to those who tolerate them.
Pharmacogenetics will revise the financial basis for pharmaceutical R&D.
Economic savings are likely to come from improved efficiency in discovery,
improved target validation and clinical trial outcomes. The improved efficiency
will be based on a consolidation of target discovery into a single step; the
improved success rates are based on the ability to pinpoint targets associated with
disease susceptibility. Unlike traditional targets that are usually identified through
the use of cell-based or animal models and whose clinical relevance is initially
speculative, pharmacogenetics begins with target validation using human disease
phenotypes.
Despite the potential cost savings, pharmacogenetics will not support the indus-
try's beloved blockbuster model. Excluding patients from clinical trials due to
increasingly stringent inclusion criteria can eventually result in restricted labeling.
Therefore, the pharmaceutical industry will have to carefully analyze compensa-
tory effects including potential market upsides, price premiums that are justifiable
from a pharmacoeconomic standpoint, shifts in market share and new patients eli-
gible for individualized therapies.
The pharmaceutical industry is changing its approach towards R&D by apply-
ing knowledge and new techniques to understand the root causes rather than the
symptoms of diseases. This has fundamentally changed the way new drugs are dis-
covered and developed. The advances in drug discovery and development evolved
from a detailed understanding of the human genome, forming a rich basis for the
discovery of many potential drug therapies. Furthermore, disease genetics and
pharmacogenetics can add to the understanding of how genes and proteins work
collectively in a normal and a diseased living system. New techniques for screen-
ing, designing and synthesizing chemicals using robots and computers, and most
importantly, advances in information technology have facilitated this development.
This new way of discovering and developing drugs is information-based. The
information generated and how we use it will further accelerate and change the
processes leading to innovative therapies. Recently, new software tools such as the
E-CELL system were presented [57]. These tools may be able to further utilize
available genetic information to ultimately simulate molecular processes in cellu-
lar systems. Naturally, the resulting independence from living systems to study
drug responses would turn pharmaceutical R&D upside down. However, at pres-
ent the replacement of in vivo models with E-CELL approaches is not realistic.
1.5 Conclusions
New technologies in drug discovery and development reviewed in this book can
only provide economic impact if they are accompanied by strategic thinking.
Only if pharma managers envision new technologies in terms of their ability to
deliver innovative medicines at lower cost, in reduced time and with superior quali-
ty, will they be able to implement and utilize them in an economically sound
way. None of the reviewed methods present all-in-one strategies for pharmaceuti-
cal R&D. For any known disease, they need to be well orchestrated in order to
deliver the desired outcomes.
1.6 Acknowledgements
The authors thank Peter Kengel, Frank Holfter (EnTec GmbH), Tanis Hogg
(JenaDrugDiscovery GmbH) and Johannes von Langen (IMB-Jena) for their help
in preparing the manuscript and Walter Elger (EnTec GmbH) for stimulating dis-
cussions.
1.7 References
1 PhRMA (2000) Based on data from PhRMA Annual Survey and Standard & Poor's Compustat, a divi-
sion of McGraw-Hill
2 PhRMA Annual Survey (2000)
3 Report issued by Bain & Company Germany Inc. (2001)
4 Boston Consulting Group. A revolution in R&D. (2001)
5 DiMasi JA (2001) Risks in new drug development: approval success rates for investigational drugs.
Clin Pharmacol Ther 69: 297-307
6 Report issued by the Tufts Center for the Study of Drug Development, Tufts University, Boston, MA,
USA (1998)
7 Peck CC (1997) Drug development: improving the process. Food Drug Law J 52: 163-167
8 Drews J (2000) Drug discovery: a historical perspective. Science 287: 1960-1964
9 Ward SJ (2001) Impact of genomics in drug discovery. Biotechniques 31: 626-630
10 Cunningham MJ (2000) Genomics and proteomics: the new millennium of drug discovery and devel-
opment. J Pharmacol Toxicol Methods 44: 291-300
11 Dongre AR, Opiteck G, Cosand WL et al (2001) Proteomics in the post-genome age. Biopolymers 60:
206-211
12 Augen J (2002) The evolving role of information technology in the drug discovery process. Drug
Discov Today 7: 315-323
13 Davies EK, Richards WG (2002) The potential of Internet computing for drug discovery. Drug Discov
Today 7: S99-S103
14 Lander ES, Linton LM, Birren B et al (2001) Initial sequencing and analysis of the human genome.
Nature 409: 860-921
15 Venter JC, Adams MD, Myers EW et al (2001) The sequence of the human genome. Science 291:
1304-1351
16 National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD 20894,
http://www.ncbi.nlm.nih.gov (2002)
17 Searls DB (2000) Using bioinformatics in gene and drug discovery. Drug Discov Today 5: 135-143
18 Hogenesch JB, Ching KA, Batalov S et al (2001) A comparison of the Celera and Ensembl predicted
gene sets reveals little overlap in novel genes. Cell 106: 413-415
19 Wright FA, Lemon WJ, Zhao WD et al (2001) A draft annotation and overview of the human genome.
Genome Biol 2: RESEARCH0025.1-0025.18
20 Duckworth DM, Sanseau P (2002) In silico identification of novel therapeutic targets. Drug Discov
Today 7: S64-S69
21 Sanseau P (2001) Impact of human genome sequencing for in silico target discovery. Drug Discov
Today 6: 316-323
22 Tang CM, Moxon ER (2001) The impact of microbial genomics on antimicrobial drug development.
Annu Rev Genomics Hum Genet 2: 259-269
23 Herrmann R, Reiner B (1998) Mycoplasma pneumoniae and Mycoplasma genitalium: a comparison
of two closely related bacterial species. Curr Opin Microbiol 1: 572-579
24 Noordewier MO, Warren PV (2001) Gene expression microarrays and the integration of biological
knowledge. Trends Biotechnol 19: 412-415
25 Kennedy GC (2000) The impact of genomics on therapeutic drug development. EXS 89: 1-10
26 Gygi SP, Rochon Y, Franza BR et al (1999) Correlation between protein and mRNA abundance in
yeast. Mol Cell Biol 19: 1720-1730
27 Loo JA, Dejohn DE, Du P et al (1999) Application of mass spectrometry for target identification and
characterization. Med Res Rev 19: 307-319
28 Abuin A, Holt KH, Platt KA et al (2002) Full-speed mammalian genetics: in vivo target validation in
the drug discovery process. Trends Biotechnol 20: 36-42
29 Tornell J, Snaith M (2002) Transgenic systems in drug discovery: from target identification to human-
ized mice. Drug Discov Today 7: 461-470
30 Sanseau P (2001) Transgenic gene knockouts: a functional platform for the industry. Drug Discov
Today 6: 770-771
31 Dean NM (2001) Functional genomics and target validation approaches using antisense oligonu-
cleotide technology. Curr Opin Biotechnol 12: 622-625
32 Tse E, Lobato MN, Forster A et al (2002) Intracellular antibody capture technology: application to
selection of intracellular antibodies recognising the BCR-ABL oncogenic protein. J Mol Biol 317:
85-94
33 Burley SK (2000) An overview of structural genomics. Nat Struct Biol 7 Suppl: 932-934
34 Service RF (2000) Structural genomics offers high-speed look at proteins. Science 287: 1954-1956
35 Dean PM, Zanders ED (2002) The use of chemical design tools to transform proteomics data into drug
candidates. Biotechniques Suppl: 28-33
36 Willson TM, Jones SA, Moore JT et al (2001) Chemical genomics: functional analysis of orphan
nuclear receptors in the regulation of bile acid metabolism. Med Res Rev 21: 513-522
37 Lenz GR, Nash HM, Jindal S (2000) Chemical ligands, genomics and drug discovery. Drug Discov
Today 5: 145-156
38 Hertzberg RP, Pope AJ (2000) High-throughput screening: new technology for the 21st century. Curr
Opin Chem Biol 4: 445-451
39 Wolcke J, Ullmann D (2001) Miniaturized HTS technologies - uHTS. Drug Discov Today 6: 637-646
40 Wess G, Urmann M, Sickenberger B (2001) Medicinal chemistry: challenges and opportunities.
Angew Chem Int Ed Engl 40: 3341-3350
41 Drews J (2000) Drug discovery today - and tomorrow. Drug Discov Today 5: 2-4
42 Harvey AL (1999) Medicines from nature: are natural products still relevant to drug discovery? Trends
Pharmacol Sci 20: 196-198
43 Sundberg SA (2000) High-throughput and ultra-high-throughput screening: solution- and cell-based
approaches. Curr Opin Biotechnol 11: 47-53
44 Silverman L, Campbell R, Broach JR (1998) New assay technologies for high-throughput screening.
Curr Opin Chem Biol 2: 397-403
45 Schneider G, Böhm HJ (2002) Virtual screening and fast automated docking methods. Drug Discov
Today 7: 64-70
46 Toledo-Sherman LM, Chen D (2002) High-throughput virtual screening for drug discovery in parallel.
Curr Opin Drug Discov Devel 5: 414-421
47 Shuker SB, Hajduk PJ, Meadows RP et al (1996) Discovering high-affinity ligands for proteins: SAR
by NMR. Science 274: 1531-1534
48 Brennan MB (2000) Drug Discovery: filtering out failures early in the game. Chemical & Engineering
News 78: 63-73
49 Bachmann KA, Ghosh R (2001) The use of in vitro methods to predict in vivo pharmacokinetics and
drug interactions. Curr Drug Metab 2: 299-314
50 Vogel HG (2001) Drug Discovery and Evaluation. Pharmacological Assays. Springer-Verlag, Berlin
Heidelberg
51 Beresford AP, Selick HE, Tarbit MH (2002) The emerging importance of predictive ADME simulation
in drug discovery. Drug Discov Today 7: 109-116
52 Butina D, Segall MD, Frankcombe K (2002) Predicting ADME properties in silico: methods and mod-
els. Drug Discov Today 7: S83-S88
53 Roses AD (2001) Pharmacogenetics. Hum Mol Genet 10: 2261-2267
54 Roses AD (2000) Pharmacogenetics and the practice of medicine. Nature 405: 857-865
55 Vesell ES (2000) Advances in pharmacogenetics and pharmacogenomics. J Clin Pharmacol 40:
930-938
56 McCarthy JJ, Hilfiker R (2000) The use of single-nucleotide polymorphism maps in pharmacoge-
nomics. Nat Biotechnol 18: 505-508
57 Tomita M (2001) Whole-cell simulation: a grand challenge of the 21st century. Trends Biotechnol 19:
205-210
Modern Methods of Drug Discovery
ed. by A. Hillisch and R. Hilgenfeld
© 2003 Birkhäuser Verlag/Switzerland
2 Proteomics
Martin J. Page¹, Chris Moyses², Mary J. Cunningham³, Gordon Holt² and
Alastair Matheson⁴
2.1 Introduction
2.2 Protein mapping
2.2.1 Challenges for 2D-PAGE
2.3 Protein characterization
2.4 New directions in proteomics technologies
2.5 Applications of proteomics to drug discovery
2.5.1 Databases
2.5.2 Diagnostics
2.5.3 Target discovery
2.5.4 Target validation
2.5.5 Candidate selection
2.5.6 Preclinical toxicology
2.5.7 Clinical development
2.6 Conclusions
2.7 References
2.1 Introduction
Several disciplines have recently emerged within biotechnology which for the first
time enable biological systems to be studied on a scale commensurate with their
inherent complexity. From the point of view of drug discovery, a critical role is
fulfilled by proteomics, the systematic high-throughput separation and character-
ization of proteins within biological systems. Importantly, it is at the protein level
that disease processes become manifest and at which most drugs act. Proteomics
is consequently assuming a central place in modern drug development, with a
wide spectrum of practical applications embracing diagnostics, target discovery,
target validation, lead compound selection, investigation of drug modes of action,
toxicology and clinical development.
The central technologies of proteomics as it is currently practiced can be
grouped into two stages (Fig. 1):
• Technologies for protein mapping, i.e., separating, distinguishing and quanti-
fying the proteins present in individual samples.
• Technologies for identifying proteins and characterizing their full structural
properties and functional roles.
[Figure 1: the two-stage proteomics workflow. Samples, tracked by a LIMS (laboratory information management system), first pass through the protein-mapping stage (robotics/bioinformatics): protein separation by 2D-PAGE, staining and imaging, then image analysis to produce a digital protein map, and comparative analysis to identify differentially-expressed proteins. Selected proteins then enter the characterization stage: extraction from the gel, chemical/enzymatic digestion, HPLC, mass spectrometry, sequence analysis and analysis of post-translational modifications. Results are stored for interrogation.]
At the major proteomics centers, these stages, each of which involves many indi-
vidual steps, are integrated into high-throughput, automated systems, supported
by sophisticated robotics and dependent at every step on powerful bioinformatics
(see Chapter 3).
sion profile for a given cell/sample type, after which greater resolution can be
achieved by the use of a series of gels with narrower but overlapping pH ranges.
For highly basic proteins with isoelectric points between 10 and 12, the older 2D-PAGE tech-
nology using carrier ampholytes is still useful to achieve maximum resolution.
Sample preparation is critical to successful 2D-PAGE. Proteins must be effec-
tively disaggregated and solubilized without degrading or modifying their con-
stituent polypeptides. Subfractionation into membrane, cytosol, nuclear,
cytoskeletal and other subcellular fractions has become a key technique, bolster-
ing the overall resolving power of 2D-PAGE by increasing the copy number of
rare proteins and enabling buffers to be customized for different cellular fractions.
It also yields information on the subcellular localization of proteins that may
prove vital in assessing their function and pharmaceutical potential. Innovations
in sample preparation are also addressing the problem of rare proteins becoming
obscured on gels by more abundant proteins that migrate to similar gel locations.
A number of groups have developed techniques based on immuno-affinity chro-
matography for removing abundant proteins from samples prior to the 2D-PAGE
step. These include immuno-depletion methodology developed and extensively
used by Oxford GlycoSciences (OGS; Abingdon, UK) and the ProteoClear™
technology developed by Proteomix Inc (San Diego, CA, USA).
Most image-analysis systems work by first detecting the center of gravity or
peak density of a feature, then recording other parameters such as the size of the
feature and its overall abundance. An alternative is edge-detection software, which
is useful for complex feature outlines. Additional bioinformatics procedures come
into play during the curation of feature analysis data obtained from individual
gels. This involves landmarking and warping of images such that they all align
into a common geometry. Usually several replicates of the test sample are run to
derive multiple images, from which an integrated, composite image or master-
group is derived. This is an important part of the experimental design, since it cap-
tures the natural biological variation present, and assists in reducing minor process
variations.
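A minimal Python sketch of the first of these steps under simplifying assumptions: threshold a synthetic gel image, label connected regions as spots, and record each spot's center of gravity and integrated abundance with scipy.ndimage. Real systems add staining calibration, landmarking and warping on top of this:

import numpy as np
from scipy import ndimage

# Synthetic 2D gel image: noisy background plus two Gaussian "spots".
rng = np.random.default_rng(0)
img = rng.normal(10.0, 1.0, size=(200, 200))
yy, xx = np.mgrid[0:200, 0:200]
for cy, cx, amp in [(60, 80, 50.0), (140, 50, 30.0)]:
    img += amp * np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * 4.0 ** 2))

# Threshold, then label connected regions as candidate protein spots.
mask = img > 20.0
labels, n_spots = ndimage.label(mask)

# Center of gravity and integrated intensity (abundance) for each spot.
spots = range(1, n_spots + 1)
centers = ndimage.center_of_mass(img, labels, index=spots)
volumes = ndimage.sum(img, labels, index=spots)

for i, (center, vol) in enumerate(zip(centers, volumes), start=1):
    print(f"spot {i}: center=({center[0]:.1f}, {center[1]:.1f}), abundance={vol:.0f}")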
The result of 2D-PAGE and subsequent image analysis is a two-dimensional
map, or protein expression map (PEM™) [1], of the proteins present in a particu-
lar cell type under defined conditions. Figure 2 shows composite PEMs™ derived
from 10 matched clinical preparations of purified ductal and myoepithelial cells
obtained from the human breast, constructed by Oxford GlycoSciences in collab-
oration with the Ludwig Institute for Cancer Research (LICR; London, UK) [2].
These are currently being compared to another PEM™ derived from a large series
of purified human breast cancer samples in order to identify disease-specific pro-
tein changes and ultimately novel pharmacological targets against breast cancer.
Figure 2. Protein expression map (PEM™) derived from normal human breast luminal and myoepithelial
cells.
highly impressive compared with other separation technologies, is still below the
approximately 6000 proteins, or more, believed to be expressed by any given tis-
sue type [3]. The resolving power of 2D-PAGE is limited for several categories of
protein. These include very basic proteins, which remain difficult to separate in
reproducible patterns, and large proteins, which do not load or migrate well on
current gels. Most importantly, hydrophobic proteins do not readily solubilize in
the aqueous media on which current 2D-PAGE techniques are based. This is a sig-
nificant problem because membrane proteins with hydrophobic domains include
many cellular receptors and transporters which may provide valuable drug targets.
One approach to this problem is the development of new zwitterionic detergents
which increase the aqueous solubility of these proteins [4], although it is possible
that hydrophobic media capable of supporting electrophoretic separation may be
developed in the future.
2.3 Protein characterization
Once proteins have been separated, imaged and incorporated into a mastergroup,
selected proteins of interest are identified and fully characterized using mass spec-
trometry (MS). In contemporary proteomics there has been particular emphasis on
two ionization techniques, electrospray ionization (ESI) and matrix-assisted laser
desorption/ionization (MALDI). Tandem MS (MS/MS), involving successive
phases of fragmentation and mass analysis, is also frequently used. It is highly
accurate, very useful in the analysis of post-translational modifications and N-ter-
minal sequencing, and applicable to mixtures of peptides.
The use of MS to characterize proteins is highly dependent on bioinformatics
programs that seek to match the MS-derived data with data from existing EST,
gene and protein sequence databases (see Chapter 3). In the technique known as
peptide mass finger-printing, the spectrum for a given protein digest is compared
against a database of predicted spectra generated for each of the peptides within a
primary sequence database, assuming these peptides had been subjected to the
same pattern of fragmentation.
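A minimal Python sketch of peptide mass fingerprinting under strong simplifications: an idealized tryptic digest (cleave after K or R except before P, no missed cleavages), monoisotopic masses, a flat 0.2 Da tolerance, an invented protein sequence as a one-entry "database," and observed peaks chosen to match it:

# Monoisotopic residue masses (Da); water is added on hydrolysis.
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
           "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
           "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
           "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
           "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}
WATER = 18.01056

def tryptic_peptides(seq):
    # Idealized trypsin digest: cleave after K or R, but not before P.
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        if aa in "KR" and (i + 1 == len(seq) or seq[i + 1] != "P"):
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides

def peptide_mass(pep):
    return sum(RESIDUE[aa] for aa in pep) + WATER

# Toy one-protein "database" and hypothetical observed peak masses (Da).
database = {"PROT_X": "MKTAYIAKQRQISFVK"}
observed = [665.37, 720.42]
TOL = 0.2

for name, seq in database.items():
    predicted = [peptide_mass(p) for p in tryptic_peptides(seq)]
    hits = sum(any(abs(pm - obs) < TOL for pm in predicted) for obs in observed)
    print(f"{name}: matched {hits} of {len(observed)} observed peaks")

In a real search the database holds predicted spectra for every sequence entry, and the protein whose predicted peptides explain the most observed peaks (with statistical weighting) is reported.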
The use of genomics-derived sequence databases in MS-based analysis consti-
tutes a key crossover point between genomics and proteomics. On the one hand,
the identification of proteins by proteomic analysis benefits from matches
obtained from genomics databases (see Chapter 3). On the other, finding a match
within such a database helps to annotate the genomic information by specifying a
specific cell type and physiological situation in which the gene's protein product
is expressed. Post-translational modifications, however, cannot be identified with
reference to genomic data, and are typically identified following extensive frag-
mentation using MS/MS by reference to a database of spectra generated by spe-
cific side chain variants. MS/MS can also be deployed in place of traditional
Edman degradation to fully sequence novel proteins. These are discovered when
no match is found between the protein being subjected to MS and data from the
genomics databases. Such de novo data represents an important level of real dis-
covery by virtue of the proteomic approach.
hybrid and related assays have proved a useful way of analyzing protein-protein
interactions and when used to analyze the full genomes of simple organisms are a
powerful approach to the elucidation of protein interaction networks [5].
2.5 Applications of proteomics to drug discovery

During the next few years, proteomics is widely expected to supersede genomics
as the single most productive application of biotechnology in pharmaceutical
development. Its principal applications include diagnostics, discovery of targets
and naturally occurring protein therapeutics, target validation, drug candidate
selection and mode of action studies and toxicology (Tab. 1). There are also appli-
cations within clinical development.
[Table 1: applications of proteomics in preclinical and clinical development.]
2.5.1 Databases
One of the most important applications of proteomics has been the establishment
of subscription databases documenting the protein profiles of specific cell types
under different conditions. For instance, a series of integrated
proteomics/genomics databases has been developed jointly by OGS and Incyte
(Palo Alto, CA, USA) using a range of clinically important human and microbial
cell types. This combined approach is proving extremely useful since for each
sample type, it uniquely brings together information linking the genotype and
phenotype, via genomic (genes), mRNA (transcriptional) and proteomic (transla-
tional) data.
2.5.2 Diagnostics
Proteomics is ideally placed for the development of novel diagnostics based on the
analysis of human body fluids. Secreted proteins are a more abundant and specif-
ic source of markers for diseased tissues than nucleic acids, so that whereas
genomics-based diagnostic approaches may require a sample of the diseased tis-
sue, proteomics-based approaches typically do not. By comparing protein expres-
sion profiles obtained from the body fluids of healthy and diseased individuals,
changes in the abundance of proteins can readily be detected, supporting the
development of markers for the diagnosis and monitoring of diseases. Proteomics
projects are researching a range of diagnostic markers for conditions as diverse as
bladder cancer [6], endometrial hyperplasia and carcinoma [7], ovarian cancer [8],
chronic transplant rejection [9], male infertility [10], cardiac hypertrophy [11],
schizophrenia [12] and others.
Target discovery, in which samples from normal and diseased tissues are com-
pared directly, is one of the most important pharmaceutical applications of pro-
teomics (Tab. 2). Taking oncology as an example, proteomic studies have been
reported in the analysis of neuroblastomas [13], breast cancer [2, 14], lung [15],
colon [16], renal cell carcinoma [17] and others.
More detailed target validation ideally requires both gain and loss of function
studies to determine whether a change in the expression of a given protein leads
to a disease phenotype and whether correction of the change reverses it. To vali-
date function, proteomic techniques are combined with established molecular
biology approaches, including the use of small molecule inhibitors and agonists,
antisense nucleic acid constructs, dominant-negative proteins, and neutralizing
antibodies microinjected into cells. It is to be expected that the continuing devel-
opment of functional proteomics will lead to more rapid screens for function than
are presently available.
Once the relevant components of a signal transduction pathway causative of
disease have been identified, immunological techniques can significantly increase
the resolving power of proteomics to analyze their behavior during target valida-
tion studies. Proteomes obtained from normal and diseased tissues can, for
instance, be blotted onto membranes and probed using antibodies against select-
ed proteins. Since proteomics has the power to resolve over 2000 proteins on a
single gel, data on, for example, specific isoforms of signaling molecules such as
glycosylated and phosphorylated variants can readily be derived using this
approach. Frequently such subtle changes constitute the primary difference
between a normal and disease phenotype.
Once drug candidates have been raised against validated targets, proteomics can
play an integral role in the selection and optimization of lead candidates for fur-
ther development. In particular, it is important to confirm that their efficacy is
achieved through the expected mechanism of action and to compare the effects of
candidates against a range of markers of efficacy and toxicity.
One example of the power of proteomics to characterize mechanisms of action
is provided by a recent study of clofibrate's effects on rat liver in vivo. Clofibrate
is a peroxisome proliferator-activated receptor (PPAR) inducer which stimulates
fatty acid oxidation by a complex multi-pathway modulation [18] and also
induces hepatocellular carcinomas in rodents via a non-genotoxic transformation
that is largely uncharacterized. In a proteomic study comparing liver cells from
four clofibrate-treated rats with four controls, a total of 304 out of 92,000 gel fea-
tures examined differed significantly between treated and control animals.
Sequence annotations of these features demonstrated that many were proteins
known to be affected by clofibrate treatment, such as catalase, glutathione sulfo-
transferase, enoyl-CoA hydratase, and estrogen metabolic enzymes. In effect, all of
the key findings in hundreds of clofibrate-focused publications were confirmed in
one simple eight-animal proteomics study. Moreover, many new discoveries were
made. For example, clofibrate was found to modulate the levels of over a dozen
proteins that have never before been described. These novel proteins may prove to
be entirely new pharmaceutical targets for the dynamic field of PPAR-alpha stim-
ulation. Furthermore, proteomic analyses allowed entire metabolic pathways to be
monitored for treatment-induced changes. As shown in Figure 3, enzymes along
the entire citric acid cycle pathway, as well as entry and exit points into a variety
of adjoining pathways, could be surveyed in a single experiment.

Figure 3. Proteomic analysis of the effects of clofibrate on the citric acid cycle in the rat liver. The diagram distinguishes enzymes affected by clofibrate, enzymes not affected by clofibrate, and enzymes not observed by proteomics.
2.6 Conclusions

Proteomics has assumed a major role in modern drug discovery within just a few
years of its inception. The ability of proteomics to screen proteins on a mass scale
and to analyze their structure, location, relationships and functional organization
using parallel, high-throughput techniques has greatly enhanced the ability of
researchers to identify and validate new drug targets.
2.7 References
1 Page MJ, Amess B, Rohlff C et al (1999) Proteomics: a major new technology for the drug discovery
process. Drug Discovery Today Development and Therapeutics 4: 55-62
2 Page MJ, Amess B, Townsend RR et al (1999) Proteomic definition of normal human luminal and
myoepithelial breast cells purified from reduction mammoplasties. Proc Natl Acad Sci USA 96:
12589-12594
3 Celis JE, Gromov P (1999) 2D protein electrophoresis: can it be perfected? Curr Opin Biotechnol 10: 16-21
4 Chevallet M, Santoni V, Poinas A et al (1998) New zwitterionic detergents improve the analysis of
membrane proteins by two-dimensional electrophoresis. Electrophoresis 19: 1901-1909
5 Frederickson RM (1998) Macromolecular matchmaking: advances in two-hybrid and related tech-
nologies. Curr Opin Biotechnol 9: 90-96
6 Ostergaard M, Wolf H, Orntoft TF et al (1999) Psoriasin (S100A7): a putative urinary marker for the
follow-up of patients with bladder squamous cell carcinomas. Electrophoresis 20: 349-354
7 Byrjalsen I, Mose Larsen P, Fey SJ et al (1999) Two-dimensional gel analysis of human endometrial
proteins: characterization of proteins with increased expression in hyperplasia and adenocarcinoma.
Mol Hum Reprod 5: 748-756
8 Alaiya AA, Franzen B, Moberger B et al (1999) Two-dimensional gel analysis of protein expression
in ovarian tumors shows a low degree of intratumoral heterogeneity. Electrophoresis 20: 1039-1046
9 Faulk WP, Rose M, Meroni PL et al (1999) Antibodies to endothelial cells identify myocardial dam-
age and predict development of coronary artery disease in patients with transplanted hearts. Hum
Immunol 60: 826-832
10 Shetty J, Naaby-Hansen S, Shibahara H et al (1999) Human sperm proteome: immunodominant sperm
surface antigens identified with sera from infertile men and women. Biol Reprod 61: 61-69
11 Arnott D, O'Connell KL, King KL et al (1998) An integrated approach to proteome analysis: identifi-
cation of proteins associated with cardiac hypertrophy. Anal Biochem 258: 1-18
12 Edgar PF, Schonberger SJ, Dean B et al (1999) A comparative proteome analysis of hippocampal tis-
sue from schizophrenic and Alzheimer's disease individuals. Mol Psychiatry 4: 173-178
13 Wimmer K, Thorval D, Asakawa J et al (1996) Two-dimensional separations of the genome and pro-
teome of neuroblastoma cells. Electrophoresis 17: 1741-1751
14 Williams K, Chubb C, Huberman E et al (1998) Analysis of differential protein expression in normal
and neoplastic human breast epithelial cell lines. Electrophoresis 19: 333-343
15 Hirano T, Franzen B, Uryu K et al (1995) Detection of polypeptides associated with the histopatho-
logical differentiation of primary lung carcinoma. Br J Cancer 72: 840-848
16 Ji H, Reid GE, Moritz RL et al (1997) A two-dimensional gel database of human colon carcinoma pro-
teins. Electrophoresis 18: 605-613
17 Sarto C, Marocchi A, Sanchez J-C et al (1997) Renal cell carcinoma and normal kidney protein expres-
sion. Electrophoresis 18: 599-604
18 Knopp RH (1999) Drug treatment of lipid disorders. N Engl J Med 341: 498-511
19 Moyses C (1999) Pharmacogenetics, genomics, proteomics: the new frontiers in drug development. Int
J Pharm Med 13: 197-202
3 Bioinformatics
David B. Jackson, Eric Minch and Robin E. Munro
3.1 Introduction
3.2 Sequence comparison
3.2.1 Scoring matrices
3.2.2 Affine gap penalties
3.3 Database searches
3.3.1 BLAST
3.3.2 BLASTP
3.3.3 PSI-BLAST
3.3.4 PHI-BLAST
3.3.5 FASTA
3.4 Multiple sequence alignment
3.5 Biological databases
3.5.1 Sequence databases
3.5.2 Domain databases
3.5.3 Structural classification databases
3.5.4 Comparative databases
3.5.5 Mutation databases
3.5.6 Bibliographic databases
3.6 The bioinformatics tool-box
3.7 Gene characterization
3.8 EST clustering
3.9 Protein structure prediction
3.9.1 Secondary structure prediction
3.9.2 Tertiary structure prediction
3.9.3 Homology modeling
3.9.4 Fold recognition/threading
3.9.5 Ab initio modeling
3.9.6 Prediction of structural elements
3.10 Cellular localization prediction
3.11 Phylogenetic analysis
3.12 Metabolic simulation
3.12.1 Quantitative approach
3.12.2 Qualitative approach
3.12.3 Semi-quantitative approach
3.12.4 Hybrid approaches
3.13 Automated sequence annotation
3.14 Conclusions and perspectives
3.15 References
3.1 Introduction
The writer T.S. Eliot once mused, "Where is the knowledge we have lost in infor-
mation?" [1]. From a biological perspective, the answer to this profound question
is today having far-reaching consequences for the future of biomedical research
and, in particular, the drug discovery process.
Like the greatest authors' works, a genome may be viewed as a superior work
of complex linguistics, sculpted by evolution, to tell the story of life.
Unfortunately, when it comes to genome linguistics we are much like a four-year-
old child attempting to read and comprehend a Shakespearean masterpiece. Thus,
we are not wholly illiterate, but we still have much to learn. Despite the fact that
the fundamental elements of evolution's alphabet are limited to a set of four char-
acters, to say that the manner in which it has "written" the story of life is complex,
constitutes a gross understatement. With language comprehension being a pivotal
element in the purveyance of knowledge, it is not surprising that our efforts to
decipher the secrets of evolution's prose have witnessed a recent and rapid growth.
But why this sudden imperative? It was predicted in the early 1980s that the key
scientific pursuits of the 21st century would be those related to the disparate sci-
ences of molecular biology and computing. Now, as we begin a new millennium,
we find that this prophecy is not without substance. To be more accurate, howev-
er, what we have recently witnessed is an amalgamation of these disciplines into
a single science; the science of bioinformatics. This fusion was in retrospect
inevitable, since biology is very much an information science. However, it was the
technological advances in both fields that provided the deciding cohesive force.
Improved cloning techniques, the advent of high-throughput chip technologies
and the profusion in genomic sequencing efforts led to an "information explosion"
that could only be handled by employing new advances in information technolo-
gy and computing power. Bioinformatics was thus born.
In simple terms, bioinformatics may be thought of as a theoretical, computer-
based science dedicated to the extrapolation of biological knowledge from bio-
logical information. It operates via the acquisition, analysis and administration of
information, typically obtained by DNA sequencing, laboratory experiment, data-
base searches, scientific publications, instrumentation or modeling.
It differs from its sibling, biological computation, in which strategies and struc-
tures derived from biology are applied to the design of computational algorithms
and architectures (including neural networks, genetic programming, DNA com-
putation, etc.).
The aforementioned challenge to understand genome linguistics is now the forte
of bioinformatics. This is an important consideration since the challenge not only
holds the key to rapid drug discovery techniques, but also promises to change the
face of biology forever. In fact, despite its recent birth, bioinformatics is already
having profound implications for pharmaceutical and diagnostic concerns. Costs
incurred while pursuing a specific drug target and, indeed, the number of such
putative targets have increased enormously over the past decade. While the appli-
cation of genomics and high-throughput methodologies early in the chain of drug
The eagerly awaited transition from fluorescent slab gel based sequencers (e.g.,
Applied Biosystems 373/377, Licor-4000/4200 IR2) to capillary based methods
has ushered in a new era in rapid genomic-scale DNA sequencing. By mid-2002
the entire genomes of 695 viruses, 22 viroids, 13 archaea, 57 bacteria and 4
eukaryotes, including one fungus, one plant, two animals and the human, had been
sequenced. Among the 3 billion base pairs of the human genome lie an estimat-
ed 65,000-75,000 genes, of which many could be the pharmaceutical "golden
nuggets" of the future (see Chapter 1).
"Mining" such extensive genomic data to extract relevant biological informa-
tion is the pivotal function of bioinformatics. The foundation upon which
this data mining process rests is the mutual arrangement and comparison of
two or more sequences (DNA or protein), with the aim of defining regions of puta-
tive similarity and/or identity. This strategy, referred to as sequence alignment,
allows us to extrapolate quantitative expressions of inter-relatedness and to infer
possible structural, functional or evolutionary relationships. While alignments
provide a powerful way to compare related sequences, they can also be used to
express different kinds of relationship. The alignment of two residues might be viewed
as reflecting a common evolutionary origin, or could be viewed as representing a
common structural motif, which, according to the forces of convergent evolution,
may be completely independent of origin. Interestingly, there is no documented
example to date where two sequences have clear sequence similarity, but dissim-
ilar three-dimensional (3D) structures. Thus it is generally assumed that a com-
mon ancestor always implies a common 3D structure, a notion which forms the
basis of homology based modeling, discussed later.
Because of their fundamental importance, numerous classes of comparison
algorithms and scoring matrices have been developed, the majority of which are
applicable to both DNA and protein. Sequence-comparison methods fall into two
broad categories 1) pairwise methods (e.g., BLAST [2], FASTA [3]) in which two
sequences are compared and 2) methods in which models or profiles are built from
multiple sequence alignments and compared either with other sequences or pro-
files (e.g., HMMER).
Figure 1. The basic difference between global (a) and local (b) alignment strategies. Global alignments incorporate the entire lengths of the two sequences, whereas local methods only align their most homologous parts. Arrows indicate the alignment process.
For global alignments the statistical situation is ill-defined, since very little is known about the
random distribution of their scores. One of the very few methods available generates an empirical score
distribution from the alignment of random sequences of the same lengths as the
two sequences being compared. A so-called Z-score for the alignment is then cal-
culated from this distribution. An accurate significance estimate for the Z-score
cannot currently be derived, due to the ill-defined nature of the score distribution.
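A minimal sketch of this empirical procedure, assuming a caller-supplied alignment scoring function align_score(a, b) (hypothetical here); shuffling the query preserves its length and residue composition:

import random
from statistics import mean, stdev

def z_score(a, b, align_score, n_shuffles=100):
    # Estimate the random score distribution by shuffling sequence a.
    observed = align_score(a, b)
    scores = []
    for _ in range(n_shuffles):
        letters = list(a)
        random.shuffle(letters)  # same length and composition as a
        scores.append(align_score(''.join(letters), b))
    return (observed - mean(scores)) / stdev(scores)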
For local alignments, the situation is thankfully simpler. In this case, the ran-
dom score distribution for an optimal ungapped alignment has been proven to fol-
low an extreme value distribution. While unproven for gapped local alignments,
assessments suggest that such a distribution also holds true. Both the BLAST and
FASTA sequence comparison algorithms report so-called raw scores for each
comparison, including an assessment of their statistical significance, based upon
the extreme value distribution, called E-values. For a given alignment, the E-value
is dependent on 1) the length of the query sequence, 2) the size of the database
searched and 3) the raw score.
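For ungapped local alignments this dependence takes the textbook Karlin-Altschul form E = K·m·n·exp(-λS), sketched below in Python. The constants K and λ depend on the scoring system; the defaults shown are placeholders, not the parameters BLAST would actually report.

import math

def e_value(raw_score, query_len, db_len, K=0.13, lam=0.32):
    # E = K * m * n * exp(-lambda * S); K and lambda are illustrative.
    return K * query_len * db_len * math.exp(-lam * raw_score)

# e_value(60, 250, 1_000_000) -> roughly 0.15 with these example values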
Quantifying the degree of similarity between two sequences requires three com-
ponents: 1) a scoring scheme, 2) a gap model and 3) an alignment optimizing
algorithm. Scoring schemes are represented as matrices that define the value for
aligning each possible pair of residues. For amino acids, the scoring scheme is a
20 × 20 matrix of numbers. Such matrices are usually symmetrical, since Asp
aligned with Glu has the same score as Glu aligned with Asp. The most elemen-
tal scoring scheme is identity scoring. Here, the aligned units are scored as either
identical (+1) or non-identical (0). Using this strategy, the normalized sum of the
alignment scores is popularly quoted as "percentage identity." Although simple in
nature, identity scoring does not reflect the three-dimensional scenario. Another
strategy, this time specific to proteins, known as genetic code scoring, considers
the minimum number of nucleotide changes (one to three) that would be required to inter-
convert the codons for two amino acids. While this scheme is rarely employed for
today's alignment tasks, it is often useful for constructing phylogenetic trees.
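The identity scheme described above is easily made concrete. A minimal sketch, assuming the two sequences are already aligned, and using one common convention (normalizing by the number of gap-free columns; other definitions divide by the full alignment length):

def percent_identity(a, b):
    # Identity scoring: +1 for identical aligned residues, 0 otherwise;
    # columns containing a gap ('-') are excluded from the count.
    assert len(a) == len(b), 'sequences must be pre-aligned'
    pairs = [(x, y) for x, y in zip(a, b) if x != '-' and y != '-']
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)

# percent_identity('ACD-EF', 'ACEGEF') -> 80.0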
Chemical similarity scoring schemes give greater relevance to the alignment of
amino acids with similar physico-chemical properties. Here, amino acids of sim-
ilar character (e.g., aspartate and glutamate) are given higher scores than those of
different character (e.g., phenylalanine and glutamate). Despite the large number
of alternative matrices available, it is those based on evolutionary aspects, or
observed substitutions, that have become the most widely used and reliable to
date.
One of the most popular observed substitution schemes, until recent times, was
developed by the late Margaret Dayhoff and co-workers [7]. They devised a model
of protein evolution from which they extrapolated a mutation matrix scoring
scheme. The underlying principle of this approach assumes that the estimation of
mutation rates for closely related proteins can be extrapolated to more distant relatives.
BLOSUM62 is the default matrix for BLAST, and the less stringent BLOSUM50 is
the FASTA and Smith-Waterman (see below) default.
Comparison of 3D structures also allows alignment of more distantly related
proteins. Many groups have used this knowledge to extrapolate structure-based
substitution matrices. While in theory this approach should provide the most accu-
rate comparison matrices, the size of the available dataset limits its present appli-
cability.
A word of caution before we proceed. The use of BLOSUM matrices by
default is only a general rule. Time and experience have shown that no single
matrix is the solution to all questions. One study [10], for example, compared the
BLOSUM, PAM, PET, 3D-based, Gonnet and multiple PAM matrices, using
ungapped BLAST. While concluding that BLOSUM62 is generally most effec-
tive, there were certain protein families for which all the other matrices per-
formed better. The moral of the story: failure to detect a significant alignment in
a search does not always imply that there are no detectable homologues. In such
instances, adjust the matrix and try again.
As sequences evolve and diverge, they often accumulate insertions and/or dele-
tions. Two parameters, a) the gap creation penalty and b) the gap extension penal-
ty, have been devised to penalize the cost of these events.
GAP creation (length-independent) penalty: The application of a negative scor-
ing penalty to the substitution matrix for the first residue in a sequence alignment
gap. FASTA, for example, penalizes the opening of a gap by scoring a default of
-12 for proteins and -16 for DNA.
GAP extension (length dependent) penalty: The application of a negative scor-
ing penalty for every additional residue in a gap. FASTA applies a default score
of -2 for proteins and -4 for DNA.
Thus, combining the two components implies that, for a gap of length x, the
gap penalty P is expressed as:
P(x) = a + bx
where a is the gap opening penalty and b is the gap extension penalty.
This form of penalty function is referred to as affine and has many efficiency
advantages over more elaborate penalty types.
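As a small illustration, the affine penalty P(x) = a + b·x can be summed over the gap runs of an aligned sequence. The defaults below echo the FASTA protein values quoted above, though conventions differ on whether the opening residue also pays the extension cost.

import re

def total_gap_cost(aligned_seq, a=12, b=2):
    # Sum P(x) = a + b*x over every run of gap characters, following
    # the generic affine form given in the text.
    return sum(a + b * len(run.group())
               for run in re.finditer('-+', aligned_seq))

# total_gap_cost('AC---DE-F') -> (12 + 3*2) + (12 + 1*2) = 32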
In most cases the penalty for a gap is equal at all positions in an alignment.
However, in cases where one would like to align say, a domain, to a longer
sequence, then penalties at the ends of the domain should be reduced. This facil-
itates accurate alignment of the domain. Similar considerations also hold true
when the secondary structure of a protein is known. For example, increasing the
gap penalty within core secondary structure elements will reduce the likelihood of
placing gaps there.
3.3.1 BLAST
The BLAST (Basic Local Alignment Search Tool) suite of sequence alignment
algorithms [2] was developed against a backdrop of advances in the statistical
treatment of ungapped sequence alignments. Today, the BLAST2 family general-
ly performs gapped alignments and is defined by seven algorithms, BLASTP (pro-
tein against protein database), BLASTN (DNA against DNA database), BLASTX
(translated DNA against protein database), TBLASTN (protein against translated
DNA database), TBLASTX (translated DNA against translated DNA database),
position specific iterative (PSI-) BLAST [2], and pattern hit initiated (PHI-)
BLAST [11].
3.3.2 BLASTP
BLASTP is now the most widely used program for sequence comparison, prima-
rily because of its accurate statistics, rapid speed and high sensitivity. The algo-
rithm operates in six intuitive steps:
6) The resulting alignment is reported if the expectation value (E-value; i.e., the
possibility of such a homology occurring by chance in a search of any given
database) is low enough to deem it of interest.
3.3.3 PSI-BLAST
3.3.4 PHI-BLAST
Pattern Hit Initiated (PHI) BLAST [11] is the newest addition to the BLAST fam-
ily of sequence comparison algorithms that implements a "hypothesis-directed"
search method. It does this by restricting a BLAST search to those protein
sequences within a database that contain a specified pattern. Thus it requires as
input both a protein query sequence and a query seed pattern. The latter is speci-
fied according to the PROSITE [13] pattern search convention and must occur at
least once in the query sequence and less than once every 5,000 database residues,
i.e., the pattern must have four fully specified residues.
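A pattern in the PROSITE convention maps naturally onto a regular expression, which is one simple way to pre-screen a database for sequences eligible for such a pattern-restricted search. The converter below is a simplified sketch covering only the common operators (x, [..], {..} and repeat counts), not the full PROSITE grammar.

import re

def prosite_to_regex(pattern):
    # Handles x (any residue), [..] (any of), {..} (none of) and
    # (n) / (n,m) repeat counts; other PROSITE operators are omitted.
    regex = ''
    for element in pattern.strip('.').split('-'):
        count = ''
        m = re.match(r'(.+?)\((\d+(?:,\d+)?)\)$', element)
        if m:
            element, count = m.group(1), '{%s}' % m.group(2)
        if element == 'x':
            regex += '.' + count
        elif element[0] == '[':
            regex += element + count
        elif element[0] == '{':
            regex += '[^%s]' % element[1:-1] + count
        else:
            regex += element + count
    return regex

# prosite_to_regex('[LIVM]-x(2)-H-x-[ST]') -> '[LIVM].{2}H.[ST]'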
3.3.5 FASTA
Like BLAST, the FASTA family of programs [3] is composed of seven algorithms:
FASTA3 (protein against protein or DNA against DNA), ssearch3 (the Smith-
Waterman algorithm), FASTX3/FASTY3 (translated DNA against protein, but
only in three frames), TFASTX3/TFASTY3 (protein against translated DNA data-
base), TFASTA3 (protein against translated DNA database in six frames), FASTF3
(compares a mixed peptide sequence against a protein database) and TFASTF3 (com-
pares a mixed peptide sequence against a translated DNA database).
In comparing a protein sequence against a protein database, the FASTA3 algo-
rithm employs a six-step strategy:
1) Locate regions shared by two sequences with the highest density of single
(ktup = 1) or double (ktup = 2) identities (Fig. 2, panel a).
2) Take the ten highest-density regions and score them using BLOSUM50. The
value of the highest scoring region is assigned to the variable init1 and reported
later (Fig. 2, panel b).
3) If several regions exist with a score greater than the CUTOFF value, establish
if they can be joined to form an approximate alignment with gaps (Fig. 2, panel
c).
4) The similarity score (initn) is calculated by adding the scores of the separate
joined regions and subtracting the gap penalties (Fig. 2, panel d).
5) For sequences with scores greater than a threshold, an optimal local alignment
is constructed and scored; the so-called opt-score.
6) Calculate Z-scores, rank the alignments accordingly and display them using the
Smith-Waterman algorithm.
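Step 1 lends itself to a simple lookup-table implementation: index every ktup word of the query, then count word hits per diagonal (i - j), since the densest diagonals correspond to the regions sought in steps 1 and 2. A minimal sketch, with illustrative names:

from collections import defaultdict

def diagonal_counts(query, subject, ktup=2):
    # Index every ktup word of the query by position.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    # Count word hits per diagonal (i - j).
    diagonals = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            diagonals[i - j] += 1
    # Return the ten best diagonals, echoing step 2.
    return sorted(diagonals.items(), key=lambda kv: -kv[1])[:10]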
Figure 2. Graphical representation of the alignment procedure employed by FASTA. See accompanying steps for details.
1) For a given set of sequences, all unique pairwise comparisons are made and the similarity scores
for the comparisons recorded in a table. These pairwise alignments may be scored
according to percentage identity, normalized alignment score (where the raw score
for alignment of two sequences is divided by the length of the alignment) or a sta-
tistical score from Monte-Carlo shuffling of the sequences. 2) Hierarchical clus-
ter analysis is then performed on the table of pairwise scores generating a den-
drogram of similarity groups. 3) By then following the dendrogram from leaves to
root, the multiple sequence alignment is generated. In general, the denser the tree,
the better the alignment.
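The dendrogram step can be sketched with standard tools, here assuming SciPy is available: a table of pairwise percent identities is converted to distances and clustered hierarchically, and the resulting leaf order is the order in which a progressive method would align the sequences. The names and table values are purely illustrative.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

names = ['seqA', 'seqB', 'seqC', 'seqD']            # illustrative labels
identity = np.array([[100, 80, 35, 30],             # % identity table
                     [ 80, 100, 40, 32],
                     [ 35, 40, 100, 70],
                     [ 30, 32, 70, 100]], dtype=float)
distance = 100.0 - identity                         # similarity -> distance
np.fill_diagonal(distance, 0.0)
tree = linkage(squareform(distance), method='average')      # UPGMA-like
print(dendrogram(tree, labels=names, no_plot=True)['ivl'])  # leaf order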
One of the most sophisticated hierarchical multiple alignment programs is
ClustalW [14]. ClustalW applies different pair-score matrices when aligning
sequences of differing similarity. The program also modifies gap-penalties
depending upon the composition and degree of conservation at each position.
Instead of using a single substitution matrix, ClustalW incorporates a progressive
series, such that the matrix chosen is appropriate for the level of dissimilarity
encountered. The AMPS program [15], in contrast, implements pairwise global
languages such as SQL (structured query language). The relational model, however,
still suffers from the use of different database management systems, a fact that
hinders efficient cross-database querying. However, object oriented models of data
in relational databases permit inter-database communication. When the objects in
two or more databases are similar, standards such as the "Common Object
Request Broker Architecture" (CORBA) can be used to communicate between
them in a flexible and dynamic manner.
One of the most important developments in biological database querying was
the introduction of integrative systems that allow queries over numerous hetero-
geneous databases. SRS (Sequence Retrieval System; Fig. 3) [21] and ENTREZ
from NCBI were developed with this goal in mind, allowing easy hypertext based
navigation between entries in different member databases.

Figure 3. A selection of the most commonly used databases with particular emphasis on the integrative powers of SRS (Sequence Retrieval System).

SRS version 6 provides three different querying interfaces, suiting novice and expert users alike. Because
the system provides intricate database linking, one can easily progress from
sequence information, to metabolic information (e.g., KEGG [97]), to possible
disease associations (e.g., OMIM the "Online Mendelian Inheritance in Man"
database [22]) to proprietary data on lead-compounds, and examine the results
using either pre-defined or user-defined views.
With the fervent progress in database development, it has become increasingly
difficult to obtain a current overview of available databases. While centrally curat-
ed catalogues such as DBCAT [23] and the former LiMB [24] help users locate
databases, escalating efforts are required to keep up with the current flux of devel-
opment. SRS deals with this problem by generating a database of "DATA-
BANKS" by traversing public SRS installations around the world, providing a
unified system for accessing some 400 different biological databases. Currently, the
public network of SRS servers provides access to around 1,300 databank copies, distrib-
uted over 40 sites in 26 countries.
against sequence databases to add more sequences to the family. In contrast to the
procedures used to construct PROSITE and PRINTS, which concentrate on the
conserved regions of a family's sequences, Pfam's gapped alignments may
include long regions of uncertain alignment between conserved regions.
All of the above represent domain collections that cover a wide spectrum of
cellular functions. This broad scope comes at the cost of optimal sensitivity, speci-
ficity and annotation quality. For this reason, the developers of the SMART
(Simple Modular Architecture Research Tool) database have collected gapped
alignments of signaling and extracellular domains that are imperfectly covered in
the aforementioned databases. Three iterative methods (HMMER, MoST and
WiseTools) were used to detect distant homologues that were then scored for sta-
tistical significance by searching with BLAST. Hidden Markov models of each
domain were then constructed. SMART domains are annotated by providing links
to i) recent literature via Entrez, ii) homologues with known three-dimensional
structure via PDBsum [32], and iii) the domain or motif collections of PROSITE,
Pfam and BLOCKS.
In addition to curated compilations, protein family databases based solely on
sequence similarity have been constructed by automated clustering of protein
sequence databanks. Despite the fact that these databases contain many more
entries than curated compilations, groupings are made solely on the basis of
sequence similarity with many entries including only a few sequences. Moreover,
family designations change after clustering each new release of a database.
The most advanced uncurated effort is the current version of ProDom [33],
which uses PSI-BLAST to cluster Swiss-Prot. Each ProDom entry corresponds to
a single module or domain which is represented by a gapped multiple alignment,
together with a consensus sequence derived from it. The database is re-construct-
ed with each release of Swiss-Prot. DOMO is a similarly constructed collection of
protein families which tries to group sequences that share multiple modules into
a single entry [34]. Like ProDom, a DOMO entry is represented by a gapped mul-
tiple alignment.
While the large number of databases ensures maximal coverage of all possible
domains, there is a considerable degree of redundancy between them. Moreover,
using these databases to annotate new sequences is difficult because different fam-
ily representations and search algorithms are employed. To address this problem,
a recent venture, in the form of the InterPro consortium [35], plans to integrate
PROSITE, PRINTS and Pfam, with the possibility of others at a later stage. In so
doing they will not only improve the functional annotation of TrEMBL entries,
and thus their speed of promotion to Swiss-Prot status, but every other func-
tional annotation effort as well. The BLOCKS+ [20] database was a similar ini-
tiative that used the automated block making PROTOMAT system to delineate
sets of blocks representing conserved regions and the LAMA program to detect
overlap of family groupings. As a result, the BLOCKS+ database has about twice
the coverage of the original BLOCKS database.
There are presently a few hundred classified protein folds, with a maximum esti-
mate of 1,400 from sequence analysis. Despite this, we generally categorize a
structure using the simple classification of all-alpha, all-beta, or alpha-beta.
Databases such as SCOP [36] and CATH [37] classify protein folds at many lev-
els, but the primary classification comes with the aforementioned assignment.
SCOP is a highly comprehensive description of the structural and evolutionary
relationships between all known protein structures. The hierarchical arrangement
is constructed by mainly visual inspection in conjunction with a variety of auto-
mated methods. The principal levels are "family," "superfamily" and "fold." The
family category contains proteins with clear evolutionary relationships. Members
of the superfamily have probable common origin; there may be low sequence
identity, but structure and functional details suggest a common origin. The fold
level groups together proteins with the same major secondary structure elements
in the same arrangement with identical topological connections.
CATH is a hierarchical domain classification of structures solved at a resolution
better than 3.0 Å. The database is constructed by automatic methods wherever pos-
sible. The four levels in the hierarchy are: class (C), architecture (A), topology (T)
and homologous superfamily (H); a further level called sequence families (S) is
sometimes included. "C" classifies proteins into mainly alpha, mainly beta and
alpha-beta; "A" describes the shape of the structure, or fold; "T" describes the
connectivity and shape; "H" indicates groups thought to have a common ancestor,
i.e., homologous; "S" structures are clustered on sequence identity.
It is now clear that the majority of microbial proteins are highly conserved, with
some 70% containing ancient conserved regions. Proteome versus proteome com-
parisons are now utilizing this fact to establish protein orthology, i.e., direct pro-
tein counterparts in different genomes typically retaining the same physiological
function. The Clusters of Orthologous Groups (COGs) [38] approach identifies the
closest homologs in each of the sequenced genomes for each protein, even at lev-
els of seemingly insignificant similarity. The system is built upon the simple
notion that any group of at least three genes from distant genomes, which are more
similar to each other than they are to any other genes from the same genomes, is
most likely to belong to an orthologous family. Each COG consists of individual
proteins or groups of paralogs from at least three different lineages (Fig. 4, left).
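The triangle notion is straightforward to express in code. The sketch below assumes a hypothetical precomputed table best[(gA, gB)] mapping each gene of genome gA to its best-matching gene in genome gB (e.g., from all-against-all BLAST), and reports triples that are reciprocal best hits on all three sides:

from itertools import combinations

def cog_triangles(best):
    # best[(gA, gB)][gene] -> best-matching gene of genome gB.
    genomes = sorted({g for pair in best for g in pair})
    triangles = []
    for gA, gB, gC in combinations(genomes, 3):
        for x, y in best.get((gA, gB), {}).items():
            z = best.get((gB, gC), {}).get(y)
            if z is None:
                continue
            # Require every edge of the triangle to be reciprocal.
            if (best.get((gB, gA), {}).get(y) == x and
                    best.get((gC, gB), {}).get(z) == y and
                    best.get((gC, gA), {}).get(z) == x and
                    best.get((gA, gC), {}).get(x) == z):
                triangles.append(((gA, x), (gB, y), (gC, z)))
    return triangles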
The original COGs database was derived by comparing seven complete pro-
teomes. As of late 2002, the database includes an additional 43 proteomes and now rep-
resents 30 major phylogenetic lineages. Another effort called MGDB (microbial
genome database for comparative analysis) is dedicated to orthologue identifica-
tion, paralogue clustering, motif analysis and gene order comparison. Given the
present pace of microbial genome acquisition, the number of representatives will
Figure 4. A minimal COG consists of three genes from three different lineages and can be illustrated by a triangle (left). COGs were expanded by combining triangles that shared sides (right). Each of the 21 genomes on the COGs web-site is represented by a different color. Reciprocal best matches between genomes are represented by a solid line. Paralogous relationships, where one genome sequence has a best match to a sequence in another genome but the reverse is not true, are indicated by a dashed line. (Reproduced with permission from NCBI News and the COGs service.)
no doubt increase enormously over the coming years making the COGs and
MGDB databases invaluable resources for the annotation of microbial sequences.
One of the most pressing and fundamental problems in this era of profuse DNA
sequencing is the accurate prediction of gene structure, and in particular, open
reading frames. While numerous in silico methodologies have appeared over the
past decade, extracting this knowledge has proved less than easy. Currently, the
correlation between predicted and actual genes is around 70%, with just 40-50%
of exons predicted correctly. The most successful methods rely on the recognition of
three key factors: 1) structural components such as initiation codons, splice sig-
nals and polyadenylation sites, 2) the compositional tendencies of coding regions,
and 3) homologous sequences.
Detecting functional components in genomic DNA is very much a difficult
exercise in pattern recognition. The process is problematic because the structures
that delineate a gene are ill-defined, highly unspecific and diverse. However, a
number of methods have been applied to the task, the most simple of which use
exact word, regular expression or plain consensus sequence (e.g., a poly-adenine
stretch) searches. Weight matrices provide an extra level of complexity allowing
each position to match any residue but with different scores, while the most
sophisticated methods employ neural networks. Alone, these methods are insuffi-
cient for gene identification since they result in a computationally untreatable
array of potential products. Instead, developers have opted to combine this
approach with those dedicated to detection of coding potential.
While identifying prokaryotic coding sequence is a relatively easy task, with
typically single contiguous stretches of open reading frame coding for a single
protein, the situation for eukaryotes is complex. Applied methods utilize statistical
models.
While a working draft of the human genome provides an important foundation for
gene discovery, it will take a considerable period of time to reconstruct the exact
amino acid sequence of all human proteins from these data. Crucial to this effort
will be a combination of both genomic and expressed sequence tag (EST) data
analysis. As the name suggests, ESTs are short segments of DNA sequence
derived from the high throughput sequencing of specifically constructed cDNA
libraries. It is by no means surprising that many of today's more successful
biotechnology concerns have reaped the benefits of ESTs in the drug discovery
process. ESTs provide a rapid route to gene discovery, allow "in silico northern
blot" analysis and reveal alternative splice variants. This all comes at a price, how-
ever. Due to the high-throughput nature of the process, EST sequences tend to be
poor in quality. In order to improve the situation, pre-processing (e.g., masking),
clustering and post-processing of the results is required.
Clustering information is a prerequisite of all EST clustering tasks. While
shared annotation later provides joining information, the most accurate criterion for
cluster membership is sequence identity. Small scale clustering projects involving
at most a few thousand sequences can easily apply standard tools of contiguous
assembly (i.e., the fusion of multiple overlapping DNA sequence clones into one
contiguous sequence) or multiple alignment. However, the task of clustering the
millions of publicly available ESTs is a venture of greater magnitude. A number
of purpose-built clustering methods have risen to the challenge. ICA tools [51],
for example, represents an alignment based approach that uses a BLASTN [2]
type of algorithm to examine if one sequence is a sub-set of another. A component
thereof, N2tool, is a dedicated clustering tool that uses an indexed file format and
local alignment to compare all sequences to each other and identify those which
share regions of similarity. D2-cluster is a non-alignment based clustering method
that uses the approach of comparing word composition within two sequence win-
dows, thus identifying sequences that are greater than 96% identical over a win-
dow of 150 bases. The algorithm is used to produce the initial loose clusters in the
STACK clustering system. The well known TIGR-ASSEMBLER has also been
used for EST clustering and assembly. The system uses BLAST and FASTA to
identify all sequence overlaps and stores them in a relational database. So-called
"transitive closure groups" are then formed and subjected to assembly using the
TIGR-ASSEMBLER. Systems such as Genexpress Index [52], Unigene (at the
NCBI) and the Merck Gene Index [53] also group sequences into clusters based
on sequence overlap above a given threshold.
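An alignment-free comparison in the spirit of D2-cluster can be sketched by comparing word composition directly; the word length shown and any clustering threshold applied to the resulting distance are illustrative choices, not the published D2 parameters.

from collections import Counter

def word_distance(a, b, k=6):
    # Count the k-word composition of each sequence and sum the
    # squared count differences; small distances suggest the two
    # ESTs belong in the same loose cluster.
    wa = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    wb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    return sum((wa[w] - wb[w]) ** 2 for w in set(wa) | set(wb))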
facilitates the many accessory cellular factors that participate in ensuring its
desired structure. However, the accurate automated ab initio prediction of 3D
structure using solely this endogenous aspect remains an unsolved
conundrum. The results of the first four CASP (Critical Assessment of protein
Structure Prediction; a comparative assessment within the structural biology com-
munity of the current state of the art in protein structure) experiments have also
embraced this notion. While theoretical approaches such as molecular dynamic
simulations have been examined, they remain practically hindered by inadequate
computing power and knowledge base limitations. These facts have led to an ever-
increasing void between sequence information and the structural knowledge it
harbors (see also Chapter 8). Offsetting this imbalance can to some extent be
achieved using secondary structure prediction algorithms, of which there are
many.
Secondary structure prediction has a long and varied history, roughly defined by
three generations of strategy. First generation algorithms examined the three-state
structural propensities (defined here as H for helices, E for extended strands, and
L for non-regular loops) of single residues, using a very restricted information base
(e.g., Chou-Fasman [54]). The three-state method is the most commonly used
scoring scheme when assessing the relative performance of secondary structure
prediction algorithms (an alternative eight-state model is also available). This sim-
ple scheme reflects the percentage of correctly predicted residues, denoted as Q3:

Q3 = 100 × (Σi ci) / N

where ci is the number of correctly predicted residues in state i (H, E, L), and N
is the number of residues in the protein. For first generation algorithms, scores
were typically below the 50% level.
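Computing Q3 from a predicted and an observed three-state string is a one-liner in practice; a minimal sketch:

def q3(predicted, observed):
    # Q3 = 100 * (residues predicted in the correct state) / N,
    # over the three states H (helix), E (strand) and L (loop).
    assert len(predicted) == len(observed)
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

# q3('HHHLLEEE', 'HHLLLEEE') -> 87.5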
Growing bodies of data, information about local physicochemical influences,
the use of statistical inference, and greater understanding of tertiary structure all
contributed to the enhancements of second generation algorithms (e.g., Garnier-
Osguthorpe-Robson [55]). At this stage, secondary structure prediction made an
important step away from simply looking at the structural propensities of single
residues to include information about the influence of adjacent residues.
Predictions thus became a process akin to pattern recognition, relying on the fact
that segments of contiguous sequence, typically within a window of 13-21
residues contain an endogenous bias for certain secondary structure states.
However, despite these conceptual advances, second generation algorithms, in
common with their predecessors, produce results that are often erroneous.
Undersized α-helices, near-random prediction of β-sheets and a Q3 value <70%
are very much the norm. Note that despite this fact, many bioinformatics providers
continue to offer these algorithms as standards.
The advent of third-generation algorithms, heralded by the development of
PHD [56], has seen the inclusion of evolutionary information using sequence
alignments, coupled with pattern recognition strategies such as neural networks or
nearest neighborhood methods. The evolutionary aspect is important because of
the implicit information contained about long-range intra-protein interactions,
derived from correlated mutation/conservation at key residues.
PHDsec is the most accurate and best known exponent of this multi-level
approach. Here, the query sequence is first screened against a non-redundant data-
base of protein sequences. Homologues are then extracted and compared using the
Figure 5. Three-state per-residue accuracy of various prediction methods. Shaded bars: methods of the 1st and 2nd generation; filled bars: methods of the 3rd generation. The left axis shows the normalized three-state per-residue accuracy, for which a random prediction would rate 0%, and an optimal prediction by homology modeling would rate as 100%. Only methods were included for which the accuracy had been compiled based on comparable data sets; the sets in particular are: K&S62, 62 proteins taken from [102]; LPAG60, 60 proteins taken from [103]; Pre124, 124 unique proteins taken from [104]. The methods were: C+F Chou & Fasman (1st generation) [54]; Lim (1st) [105]; GORI (1st) [55]; Schneider (2nd) [106]; ALB (2nd) [107]; GORIII (2nd) [108]; LPAG (3rd) [103]; COMBINE (2nd) [109]; S83 (2nd) [110]; NSSP (3rd) [58]; PHD (3rd) [56].
The ultimate aim of protein structure prediction is to take a protein with unknown
structure and, from its sequence alone, predict the tertiary, or 3D, structure.
Despite the simplicity with which the basic problem can be stated, over the 40
years that people have been considering it, no method has ever proved to be gen-
erally (some would say, even partly) successful. The intellectual challenge of the
problem, despite its apparent intractability, has ensured that many have been (and
still are) willing to look at it. Although no general theory has resulted, all this
effort has not been in vain as there are now many methods that, although they can-
not predict a full tertiary structure, can provide insight into the sort of structure
that a sequence might adopt.
Generally, when a protein has high sequence identity to a protein with known
3D-structure, models can be generated using homology modeling (see Chapter 8).
Where no sequence homology can be determined, then threading can be employed
to find appropriate folds. Only when there is no structure with any sequence
homology and no confident threading prediction must purely ab initio methods be
used.
tures in sequences with known structures to find a possible match. Methods that
fall into this category include: TOPITS [66], MAP [67] and H3P2 [68].
Helix-turn-helix
The best known exponent of helix-turn-helix (HTH) prediction is HTHScan. The
program uses a profile derived from Pfam Release 2.0 [30], constructed using
MEME 2.1 [19], to detect the presence of H-T-H motifs in protein sequences.
Transmembrane domains
Intrinsic membrane proteins comprise some 20-30% of the eukaryotic proteome.
Transiting the membrane by means of α-helices, each containing approximately
20 amino acids of hydrophobic character, their functions are diverse. Some act as
growth factor receptors, transducing extracellular signaling events and controlling
processes such as cell proliferation, differentiation and apoptosis. Others, such as
the G-protein coupled receptors, are of immense pharmaceutical interest being
responsible for functions as diverse as olfactory/visual transduction, hormone
reception, and the purveyance of neuromodulatory signals. Not surprisingly, the
search for putative membrane proteins is an integral part of any genome char-
acterization project. Luckily, predicting α-helical transmembrane domains is a far
easier task than predicting the same domain in a globular context. This derives
from constraints placed by the lipid bilayer, which cause a hydrophobic bias. This
bias is so strong that even a simple strategy, such as calculating a propensity scale
using a sliding window and cutoff, performs admirably well. However, an addi-
tional level of complexity is introduced if we wish to assess the topology of a
membrane protein, i.e., whether the N-terminal segment persists in the intra-
cellular or extracellular environment. One general rule of thumb, known as the "positive-
inside rule," suggests that the basic amino acids arginine and lysine are found on
the cytoplasmic side.
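The simple sliding-window strategy mentioned above can be sketched in a few lines using the Kyte-Doolittle hydropathy scale; the window size and cutoff are commonly quoted choices but should be treated as illustrative.

KD = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5,
      'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9,
      'M': 1.9, 'F': 2.8, 'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9,
      'Y': -1.3, 'V': 4.2}

def tm_windows(seq, window=19, cutoff=1.6):
    # Average hydropathy over a sliding window; windows above the
    # cutoff are flagged as putative transmembrane segments.
    hits = []
    for i in range(len(seq) - window + 1):
        score = sum(KD[aa] for aa in seq[i:i + window]) / window
        if score >= cutoff:
            hits.append((i, round(score, 2)))
    return hits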
Numerous algorithms of varying strategy have been developed in search of
transmembrane proteins. TopPred [76], for example, applies a dual empirical
hydrophobicity cutoff to the output of a sliding trapezoid window, thus producing
a list of putative transmembrane domains. A potential problem with this technique
comes from applying a fixed hydrophobicity threshold, since many α-helical bun-
dle-type transmembrane proteins use their non-hydrophobic residues for inter-heli-
cal contacts. The PHDhtm program [77] uses evolutionary information, in the
form of a profile-based multiple sequence alignment coupled with neural net-
works and post-processing, to produce what has repeatedly proven to be the most
accurate transmembrane prediction strategy. Dynamic programming and a set of
statistical tables derived from well-characterized membrane proteins form the
basis of the Memsat program [78]. This program uses separate propensity curves
for amino acids found in the head and tail regions of the lipid bilayer. The TMAP
algorithm [79] also utilizes this strategy, this time in combination with the fre-
quency biases of 12 residue types. Recently, an entirely new approach has been
reported, that summons the help of HMMs ("hidden Markov models," see
Multiple alignments). The algorithm, TMHMM [80], encompasses many of the
conceptual and methodological aspects of the aforementioned methods. Its origi-
nality, however, derives from the probabilistic framework of HMMs, which close-
ly resemble the biological system. This avoids the need for specialized post-pro-
cessing or dynamic programming methodologies. In terms of accuracy TMHMM
produces a comparable single TM prediction accuracy to PHDhtm, however, the
overall accuracy of TMHMM is not as high as for PHDhtm when topology is also
considered.
cal clustering, e.g., of genes in expression array studies. Since the 1950s it has
been assumed that phylogenetic relationships can be inferred from sequence sim-
ilarity. Species with very similar sequences for one or more genes or proteins can
be assumed to have diverged from a relatively recent common ancestor, while the
common ancestor for those with more different sequences is relatively more
ancient. The basis for this assumption is simply that the mutational events under-
lying speciation can be regarded as random and cumulative. Two facts complicate
this picture: first, mutations can be recurrent, and second, they can be convergent.
Since transitions (purine-purine or pyrimidine-pyrimidine) are much more fre-
quent than transversions (purine-pyrimidine), almost all multiple mutations are
recurrent, in that they revert to the original non-mutant type. Thus polymorphic
loci in mutational hot spots (such as the hypervariable mtDNA D-loop) are not
very informative for longer time scales. Similarly, microsatellite markers, whose
mutation rate is an order of magnitude higher than other polymorphisms, are
about as likely to gain as to lose a repeat (except at the extremes of their allelic
range).
Convergent mutation (or homoplasy) occurs often enough at the phenotype
level to be a problem in phenetic phylogenetic reconstruction: the anatomy of the
eye, for example, shows unexpected similarities among widely separated species.
At the molecular or sequence level, however, this becomes less of a problem
(although the fundamental distinction between genotype and phenotype becomes
less clear when we can directly measure DNA sequence!)
The two main types of tree-building algorithms are those based on discrete
characters (e.g., maximum likelihood and maximum parsimony) and those based
on distance matrices (e.g., neighbor joining and average linkage). It is always pos-
sible, of course, to transform character data into distances, but not vice versa, so
character data retain more information than do distances. On the other hand, the
distance matrix methods are computationally much less demanding than the char-
acter-based methods. The practical and theoretical advantages and disadvantages
of these are still under debate. Felsenstein's Phylogeny Page [84] has pointers to
161 different software programs and packages related to phylogenetic reconstruc-
tion. The review by Wills [85] covers many of the pros and cons, and includes rec-
ommendations for the appropriate use of different methods. Setubal and Meidanis
[86] provide a somewhat more general and rigorous introduction. Since even the
distance matrix methods are overwhelmed when given thousands of cases and/or
characters, much of the recent work has been directed towards the development of
approximation or sampling methods. This includes, among others, the disk cover-
ing method [87], routed insertion [88], and perfect matchings [89].
Metabolic modeling is entering an exciting era. Today, we have detailed and fair-
ly accurate representations of various metabolic pathways in different organisms.
We have holistic approaches which encompass entire organisms, albeit in less
Continuous kinetic models were the first to be applied to limited systems of chem-
ical reactions, as there was already an extensive history of dynamical systems sim-
ulation using differential equations. Use of this paradigm has continued, especial-
ly with increasing computational power over the last decades, and some of the
most recent efforts are based on pure quantitative models.
Several computer packages have been specifically developed for quantitative
metabolic modeling (i.e., for kinetic simulation and optimization), including
MIST [92] and Gepasi [93]. The Gepasi system is well-maintained, quite flexible,
easy to use, and exemplifies the state of the art for quantitative modeling. It allows
specification of a variety of reaction mechanisms, can search for solutions and
estimate parameters based on empirical data, and characterize steady states using
metabolic control analysis (MCA) and linear kinetic stability analysis. An even
more recent example applying this paradigm is E-CELL [94]. E-CELL is, like
Gepasi, a general-purpose biochemical simulation environment, which has been
used to model the primary metabolism of a whole cell (an in silico chimera of
Mycoplasma genitalium and Escherichia coli), resulting in a counterintuitive
prediction of the glucose starvation response which still awaits empirical verifica-
tion.
Using the data of EcoCyc, Heidtke and Schulze-Kremer have developed a quali-
tative simulation framework (BioSim) [95] and a model description language
(MDL), to derive qualitative models from the EcoCyc knowledge base. The Lisp-
based frame definitions within EcoCyc were translated to Prolog, then the classes
were defined within MDL as being either objects (compounds, elements,
enzymes, genes, proteins) or processes (reactions, pathways). Thus a model could
be constructed from a query to the EcoCyc knowledge base. An interpreter was
then applied to the model to generate qualitative behavior (i.e., indications of the
increase or decrease in quantity of a substance).
It should also be mentioned in this context that a good many efforts are under
way in reconstructing and representing genetic regulatory networks [96]. Although
these do not model metabolism per se, they do constitute qualitative models whose
function will eventually be necessary for accurate metabolic models.
The drawback of the qualitative approach is that the quantitative aspects of the
system have been abstracted away, so that accurate modeling of gradual change is
impossible; even to approximate it is difficult without considerable complication
of the knowledge- and rule-base. Variations in concentration of metabolites, in
rate of reactions, or in delay between events, cannot be both easily and accurate-
ly modeled using purely qualitative methods.
The predominant model used in the semi-quantitative approach is the Petri net for-
malism. A Petri net is a bipartite graph, i.e., a graph in which the vertices are par-
titioned into two subsets, such that vertices from each set are connected only to
vertices in the other set. For purposes of metabolic modeling, one set of vertices
(called places) is used to represent metabolites; the other set of vertices (called
transitions) is used to represent reactions. The characteristic of Petri nets which is
useful for modeling is that each of the places contains "tokens," and when every
place incident to a transition is occupied by a token, the transition fires, which
changes the state of each of the places connected to the transition.
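A toy version of this firing rule makes the formalism tangible; the place and transition names below are illustrative, not drawn from any published model.

def fire(marking, transitions):
    # marking maps place -> token count; each transition is a pair
    # (input places, output places). A transition fires when every
    # input place holds at least one token, consuming one token from
    # each input and adding one to each output.
    for inputs, outputs in transitions:
        if all(marking.get(p, 0) > 0 for p in inputs):
            for p in inputs:
                marking[p] -= 1
            for p in outputs:
                marking[p] = marking.get(p, 0) + 1
    return marking

marking = {'glucose': 1, 'ATP': 1}
hexokinase = (('glucose', 'ATP'), ('glucose-6-P', 'ADP'))
print(fire(marking, [hexokinase]))
# -> {'glucose': 0, 'ATP': 0, 'glucose-6-P': 1, 'ADP': 1}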
Petri nets have been extended in many different ways (e.g., black/white versus
colored, delayed or timed, capacity, self-modifying, stochastic, etc.). By incorpo-
rating one or more of these extensions, it is possible to model at any level or com-
bination of levels from qualitative to quantitative, i.e., one can include as much
quantitative information as available on concentrations, rates, etc.
While flexible, the formal frameworks of the semi-quantitative approach are
still unfamiliar, and not yet very well standardized. In addition, the more accu-
rately quantitative the model is to be, the more extensions must be made to the
supporting formalism, making it somewhat complex and unwieldy. Nonetheless,
the representational power and generality of this approach means that it will prob-
ably be used more often in the future.
A hybrid model contains two or more classes of objects, such that the interactions
among the members of each class are defined according to a set of rules peculiar
to that class (constituting an integral model), while the interactions between class-
es (the "glue" linking the various integral representations) are specified by ad hoc
rules peculiar to the model. Hybrid models are thus non-generic, special-purpose
models which must be hand-crafted for each target system.
McAdams and Shapiro [97] present an outstanding example of this approach,
in which they simulate the kinetics and signal logic of the bacteriophage lambda
lysis/lysogeny decision circuit. Their model involves twelve genes organized in five operons with seven promoters, and includes mRNAs, direct gene products, and posttranscriptional modifications. The kinetics of transcription and signaling are
modeled using numerical integration of differential equations, while the switching
logic is superimposed on the signal levels using both sigmoid-thresholded and
edge-triggered boolean gates. Signal delays are produced by distance between
genes on the DNA, transcription rates, concentration thresholds for signaling pro-
teins, and rates of protein degradation. The model is remarkably detailed and
accurate, and allows verification of the logic as well as identification of the con-
trol function of different components.
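The hybrid principle can be illustrated with a deliberately small sketch: continuous kinetics by numerical integration, with a sigmoid-thresholded boolean gate superimposed on the signal level. All names, parameters, and thresholds below are illustrative and are not those of the lambda circuit model.

import math

def sigmoid_gate(signal, threshold, steepness=10.0):
    # Soft boolean: ~0 below the threshold, ~1 above it.
    return 1.0 / (1.0 + math.exp(-steepness * (signal - threshold)))

repressor, dt = 0.0, 0.01
for _ in range(2000):
    production = 1.0                        # constant synthesis rate
    decay = 0.5 * repressor                 # first-order degradation
    repressor += dt * (production - decay)  # Euler integration step
promoter_on = sigmoid_gate(repressor, threshold=1.5) > 0.5
print(f"repressor ~ {repressor:.2f}, promoter on: {promoter_on}")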
One advantage of hybrid models such as this is that (like semi-quantitative
models) they can represent aspects of continuous kinetics at short time scales and
discrete switching mechanisms at longer time scales. A further advantage, distinguishing them from the semi-quantitative approach, is that they utilize algorithms for these different time scales which have been highly optimized over decades of use and are both efficient and familiar. The practical disadvantage is that they
must be recreated by hand for each particular target system.
Currently, the main obstacles to metabolic modeling include the practical prob-
lems of obtaining relevant data, and the theoretical problem of allowing both
quantitative and qualitative interpretations of the same events at different levels.
The practical problems will inevitably yield, bit by bit, to sustained effort. The
theoretical problem can be circumvented, for a time, by hybrid models or by rely-
ing on increases in computing power, but its solution requires a new paradigm
involving a change in perspective whose nature is yet unknown.
statistics, motifs, expression patterns and function are all extractable at the push
of a button.
While predicted characteristics such as the isoelectric point (pI) and molecular weight are static entities, at least in the absence of post-translational modifica-
tions, functional assignments based on tentative sequence homology must always
be treated with caution. Couple this with the fact that database annotations are
often erroneous and one can appreciate that such automated methods should be
treated as facilitators of the manual approach, at least for the foreseeable future.
Indeed, the bioSCOUT system supports this contention by assigning a confidence
level to its annotation, thus allowing the manual annotator to focus only on diffi-
cult cases for which the confidence is low. Moreover, an automated update proce-
dure allows the results of subsequent database searches to be immediately incor-
porated into the function prediction process, with so-called "feature reports"
updated accordingly. Another important feature is the unique "alerting" capabili-
ty of the system. Not only can a sequence be automatically searched against data-
base updates every evening, but strategies can be combined using boolean opera-
tors to include (or exclude) specified PROSITE patterns, keywords and even
HMMs. The result is a system that attempts to overcome the stated limitations of
the automated process, by staying constantly up-to-date with new sequence data
and associated annotations.
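The boolean combination of patterns that such alerting strategies rely on is easy to picture; in the sketch below, a PROSITE-style pattern is translated to a regular expression (a standard correspondence), while the strategy syntax itself is our own illustration rather than bioSCOUT's.

import re

def prosite_to_regex(pattern):
    # Translate a PROSITE-style pattern such as C-x(2)-C into a regex.
    return (pattern.replace("-", "")
                   .replace("x", ".")
                   .replace("(", "{").replace(")", "}"))

def alert(sequence, annotation, include_pattern, exclude_keyword):
    # Fire only if the motif is present AND the keyword is absent.
    has_motif = re.search(prosite_to_regex(include_pattern), sequence)
    excluded = exclude_keyword.lower() in annotation.lower()
    return bool(has_motif) and not excluded

seq = "MKVCLACHHAG"
print(alert(seq, "putative zinc finger", "C-x(2)-C", "hypothetical"))
# True: the C-x(2)-C motif matches and no excluded keyword is present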
3.15 References
1 T.S. Eliot, Choruses from The Rock
2 Altschul SF, Madden TL, Schaffer AA et al (1997) Gapped BLAST and PSI-BLAST: a new genera-
tion of protein database search programs. Nucleic Acids Res 25: 3389-3402
3 Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad
Sci USA 85: 2444-2448
4 Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the
amino acid sequence of two proteins. J Mol Biol 48: 443-453
5 Sellers PH (1974) On the theory and computation of evolutionary distances. SIAM J Appl Math 26:
787-793
6 Smith TF, Waterman MS (1981) Comparison of bio-sequences. Adv Appl Math 2: 482-489
7 Dayhoff MO, Schwartz RM, Orcutt BC (1978) Atlas of protein sequence and structure. Nat Biomed
Res Foundation, Washington D.C., USA 5, Suppl 3: 345-352
8 Jones DT, Taylor WR, Thornton JM (1992) A new approach to protein fold recognition. Nature 358:
86-89
9 Gonnet GH, Cohen MA, Benner SA (1992) Exhaustive matching of the entire protein-sequence data-
base. Science 256: 1443-1445
10 Henikoff S, Henikoff JG (1993) Performance evaluation of amino acid substitution matrices. Proteins
17: 49-61
11 Zhang Z, Schaffer AA, Miller Wet al (1998) Protein sequence similarity searches using patterns as
seeds. Nucleic Acids Res 26: 3986-3990
12 Teichmann SA, Chothia C, Gerstein M (1999) Advances in structural genomics. Curr Opin Struct Biol
9: 390-399
13 Bairoch A (1991) PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Res 19 Suppl:
2241-2245
14 Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673-4680
15 Barton GJ (1994) The AMPS package for multiple protein sequence alignment. Methods Mol Biol 25:
327-347
16 Gribskov M, McLachlan AD, Eisenberg D (1987) Profile analysis: Detection of distantly related pro-
teins. Proc Natl Acad Sci USA 84: 4355-4358
17 Hughey R, Krogh A (1996) Hidden Markov models for sequence analysis: extension and analysis of
the basic method. Comput Appl Biosci 12: 95-107
18 Neuwald AF, Liu JS, Lipman DJ et al (1997) Extracting protein alignment models from the sequence
database. Nucleic Acids Res 25: 1665-1677
19 Grundy WN, Bailey TL, Elkan CP et al (1997) Meta-MEME: Motif-based hidden Markov models of protein families. Comput Appl Biosci 13: 397-406
20 Henikoff JG, Henikoff S, Pietrokovski S (1999) New features of the Blocks Database servers. Nucleic
Acids Res 27: 226-228
21 Etzold T, Argos P (1993) SRS - an indexing and retrieval tool for flat file data libraries. Comput Appl
Biosci 9: 49-57
22 Online Mendelian Inheritance in Man, OMIM (TM). McKusick-Nathans Institute for Genetic
Medicine, Johns Hopkins University (Baltimore, MD) and National Center for Biotechnology
Information, National Library of Medicine (Bethesda, MD), 2000. World Wide Web URL: http://www.ncbi.nlm.nih.gov/omim/
23 Discala C, Benigni X, Barillot E (2000) DBcat: a catalog of 500 biological databases. Nucleic Acids Res 28: 8-9
24 Lawton JR, Martinez FA, Burks C (1989) Overview of the LiMB database. Nucleic Acids Res 17:
5885-5899
25 Bairoch A, Apweiler R (1999) The SWISS-PROT protein sequence data bank and its supplement
TrEMBL. Nucleic Acids Res 27: 49-54
26 Barker WC, Garavelli JS, Huang H et al (2000) The protein information resource (PIR). Nucleic Acids
Res 28: 41-44
27 Walsh S, Anderson M, Cartinhour SW (1998) ACEDB: a database for genome information. Methods
Biochem Anal 39: 299-318
28 Hofmann K, Bucher P, Falquet L et al (1999) The PROSITE database, its status in 1999. Nucleic Acids
Res 27: 215-219
29 Attwood TK, Flower DR, Lewis AP et al (1999) PRINTS prepares for the new millennium. Nucleic
Acids Res 27: 220-225
30 Sonnhammer EL, Eddy SR, Birney E et al (1998) Pfam: multiple sequence alignments and HMM-pro-
files of protein domains. Nucleic Acids Res 26: 320-322
31 Ponting CP, Schultz J, Milpetz F et al (1999) SMART: identification and annotation of domains from
signalling and extracellular protein sequences. Nucleic Acids Res 27: 229-232
32 Laskowski RA (2001) PDBsum: summaries and analyses of PDB structures. Nucleic Acids Res 29:
221-222
33 Corpet F, Gouzy J, Kahn D (1998) The ProDom database of protein domain families. Nucleic Acids
Res 26: 323-326
34 Gracy J, Argos P (1998) DOMO: a new database of aligned protein domains. Trends Biochem Sci 23:
495-497
35 Apweiler R, Attwood TK, Bairoch A et al (2000) InterPro-an integrated documentation resource for
protein families, domains and functional sites. Bioinformatics 16: 1145-1150
36 Hubbard TJ, Ailey B, Brenner SE et al (1999) SCOP: a Structural Classification of Proteins database.
Nucleic Acids Res 27: 254-256
37 Orengo CA, Pearl PM, Bray JE et al (1999) The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res 27: 275-279
96 D'haeseleer P, Liang S, Somogyi R (1999) Gene expression data analysis and modeling. Tutorial ses-
sion at Pacific Symposium on Biocomputing, Hawaii, January: 4-9; also see
http://www.cgl.ucsf.edu/psb/psb99/genetutorial.pdf
97 McAdams HH, Shapiro L (1995) Circuit simulation of genetic networks. Science 269: 650-656
98 Kanehisa M, Goto S (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28:
27-30
99 Scharf M, Schneider R, Casari G et al (1994) GeneQuiz: a workbench for sequence analysis. Proc Int
Conf Intell Syst Mol Biol 2: 348-353
100 Frishman D, Mewes H-W (1997) PEDANTic genome analysis. Trends Genet 13: 415-416
101 Gaasterland T, Sensen CW (1996) MAGPIE: automated genome interpretation. Trends Genet 12:
76-78
102 Kabsch W, Sander C (1983) How good are predictions of protein secondary structure? FEBS Lett 155:
179-182
104 Levin JM, Pascarella S, Argos P et al (1993) Quantification of secondary structure prediction
improvement using multiple alignment. Protein Eng 6: 849-854
104 Rost B, Sander C (1994) Combining evolutionary information and neural networks to predict protein
secondary structure. Proteins 19: 55-72
105 Lim VI (1974) Structural Principles of the Globular Organization of Protein Chains. A Stereochemical
Theory of Globular Protein Secondary Structure. J Mol Biol 88: 857-872
106 Schneider R (1989) Sekundärstrukturvorhersage von Proteinen unter Berücksichtigung von Tertiärstrukturaspekten. Diploma thesis, Universität Heidelberg, Germany
107 Ptitsyn OB, Finkelstein AV (1983) Theory of protein secondary structure and algorithm of its predic-
tion. Biopolymers 22: 15-25
108 Gibrat J-F, Garnier J, Robson B (1987) Further developments of protein secondary structure prediction using information theory. New parameters and consideration of residue pairs. J Mol Biol 198:
425-443
109 Garnier J, Gibrat J-F, Robson B (1996) GOR method for predicting protein secondary structure from amino acid sequence. Meth Enzymol 266: 540-553
110 Kabsch W, Sander C (1983) Segment83. Unpublished
111 Brenner SE, Barken D, Levitt M (1999) The PRESAGE database for structural genomics. Nucleic
Acids Res 27: 251-253
4 High-throughput screening technologies
Ralf Thiericke
4.1 Introduction
4.2 Present status
4.3 Logistics
4.4 Sample sourcing
4.5 Compound depository
4.6 Bioassay automation and robotics
4.7 Microplate technologies
4.8 Detection
4.9 Chip technologies in drug discovery
4.10 Data handling
4.11 Economics
4.12 Future aspects
4.13 References
4.1 Introduction
In the past five years, the world of high-throughput screening (HTS), which targets lead discovery for both pharmaceutical and agrochemical applications, has been in a stage of highly rapid development. Today, restructuring drug discovery programs for highest performance is a key issue for pharmaceutical companies. In an increasingly competitive environment, prioritizing innovation has become of major strategic importance. Driving forces are the pressure to discover more innovative and even better drugs, as well as time to market. Driven by technological innovation, the hybrid "science" of HTS [1, 2] depends on and merges the latest developments in molecular biology, chemistry, pharmacology, laboratory automation, bioinformatics, and computing to meet the demands of "economics," "big numbers," "high speed," and "high information content" necessary for the discovery of new and innovative drugs.
For an effective HTS infrastructure, the fundamental working fields of target discovery and validation, screen design, assay technology, detection methods, sample generation and handling, laboratory automation and robotics, as well as data management have to be integrated, coordinated and optimized. At present, the latest technology and laboratory automation in all parts of the early stages of drug discovery play crucial roles in the success of HTS strategies [3]. Enabling technologies that characterize HTS are listed in Table 1 [4]. However, drug discovery is extremely complex, and the goals, company resources, product portfolios, as well as future expectations of the different players vary widely.
This chapter is intended to provide an overview of the processes, technology, and instrumentation in use for HTS from a personal, more technological point of view. The latest technologies as well as future trends like miniaturization, a higher degree of parallel processing, uHTS, and chip technologies are given priority. However, space does not permit a complete summary of the rapidly growing field of HTS technologies, especially of in-house developments and the high number of recent technologies developed and marketed by start-up biotech companies on a proprietary basis. Furthermore, for novel technologies in, for example, target discovery (Chapters 2 and 3), compound library design (Chapters 5 and 7), or combinatorial chemistry (Chapter 6), see the corresponding chapters in this book.
4.2 Present status

Laboratory automation has grown from a novel technology with various difficulties and problems about 10 years ago to powerful, more reliable, and easy-to-handle tools today. Applications range from sample generation and preparation, genomics [5] and proteomics [6] for target finding and validation (see
Chapters 2 and 3), automated multiple parallel synthesis stations, robotic storage
devices, parallel liquid handling systems, workstations and fully integrated robots
for primary biological screening, to automation of analytical problems (sample
purity, mass detection etc.). Today, about 40% of all HTS operations are auto-
mated. With the increasing number of hits, however, technology development as
well as automation strategies are expanded to the steps after hit-discovery, like
hit-profiling, pharmacological testing, re-synthesis in gram quantities and in vivo
profiling.
Today, HTS for drug discovery (up to 10,000 wells analyzed per day) has been established in most pharmaceutical companies, and overall, HTS is already conducted in more than 500 laboratories worldwide [7]. Often, the infrastructure exists to process more than 100,000 test samples in the 96- or 384-well format within a reasonable time, such as a week. A number of companies have reached this throughput per day (ultra-high-throughput screening, uHTS; 100,000 wells analyzed per day) [4, 8]. Screening capacity is usually 20 to 50 different assays per year, with a strong upward trend. Data management, including acquisition, analysis, report generation, and mining, is established.
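The plate arithmetic behind these figures is straightforward; the short sketch below merely derives plate counts from the well numbers quoted above.

# Plates per day implied by HTS (10,000 wells/day) and uHTS
# (100,000 wells/day) at the common plate densities.
for wells_per_day, label in [(10000, "HTS"), (100000, "uHTS")]:
    for fmt in (96, 384, 1536):
        plates = -(-wells_per_day // fmt)   # ceiling division
        print(f"{label}: {wells_per_day} wells/day = "
              f"{plates} plates/day in the {fmt}-well format")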
About 50 to 60% of all assays in HTS are already run in the 384-well format. The 1536-well format is close to moving from an experimental stage to becoming routine. Companies report more than 1,000,000 stock compounds ready to be screened. The sample collections are stored in specially designed compound depositories, mostly in the 96-well format. However, pharmaceutical companies have already begun to run depositories in the 384-well format and to establish the logistics for handling this format in both HTS primary screening and secondary profiling. A wide diversity of equipment necessary for reliable and routine use is commercially available. In addition, there exists sufficient knowledge to establish and run new and efficient HTS laboratories.
4.3 Logistics
with the latest technologies. Technology watch and integration into already existing and well-running platforms require a significant amount of personnel capacity. Another important aspect is which of the necessary activities of the HTS field should be conducted in-house in pharmaceutical companies, and which should be derived from cooperations and alliances [12]. At present, there exists a remarkable trend toward outsourcing of defined areas of drug discovery, and big pharmaceutical companies have turned to biotechnology companies that specialize in equipment and services for HTS. Examples are the acquisition of samples, either from synthesis (see, e.g., Chapter 6) or from nature (see Chapter 5), access to recent results from biochemistry or genomics, new assay technologies, pharmacological profiling, etc. In addition, there is a high demand for alliances with niche companies with access to a key technology. Dove [13] summarizes a number of these major HTS products, their inventors, as well as their major partners/customers. However, all these outsourced activities have to be efficiently integrated and coordinated.
Compound collections (synthetics, pure natural products, and extracts of natural origin) are a key asset in the drug discovery efforts of pharmaceutical companies (see Chapter 7). These collections have to be stored under defined conditions and logistics. Each compound should be ready and available in time for the various needs of primary and secondary screening, as well as first pharmacological studies, either in defined plate formats (e.g., 96-, 384-, and 1536-well) or as a single probe. Therefore, companies are installing highly automated storage and retrieval systems for compound management [22] in which the samples and/or plates are uniquely identified by plate barcode and their position in the storage
plates/racks. For example, REMP (Switzerland) (Fig. 2) developed and estab-
lished storage and retrieval systems for the individual needs of the pharmaceuti-
cal and chemical industry. The systems are based on standardized storage racks
Figure 3. Specially designed storage and retrieval (cherry picking) plates in the 384-well format (REMP,
Switzerland; F. Hoffmann La Roche).
4.6 Bioassay automation and robotics

For a throughput of about 10,000 wells analyzed per day (HTS), well-established automation concepts that allow reliable routine applications are on the market [15]. Excellent instrumentation for routine low-volume liquid handling, sensitive detection, robotic plate handling, and more specialized devices, like washing stations, plate sealers, piercers, and incubators in the 96- and 384-well format, is now readily available and has been integrated into HTS campaigns in various set-ups worldwide. A list of suppliers of HTS equipment and services is presented in
[24]. In addition, technologies are continually being developed and suppliers are
teaming up to make available more integrated equipment and consumable sets.
Robotic systems used in biological screening can be divided into fully inte-
grated systems and workstations [25]. Robotic systems are defined as those sys-
tems that utilize articulated robotic arms to move items (e.g., plates) and usually
have access to all points in the three dimensions of the operation area of the robot-
ic arm. A "workstation" is defined as an independent system that is highly spe-
cialized to perform limited functions in most cases without highly-sophisticated
robotic arms. The question of whether a fully automated robotic system or work-
station best fit the needs of HTS depends on the individual priorities, although a
major trend is to use multiple workstation-type devices. However, the design of an
automated system should be as simple as possible as long as it performs the
desired task.
For robotic arms, positional accuracy is limited by the accuracy of the servo motors used. A positional accuracy of several tens of microns can be achieved, with precision of the same order. However, millimeter accuracy is probably sufficient for 96- and 384-well plates, and a tenth of that for 1536-well plates. As the well size decreases, positional accuracy becomes more important. Moreover, the liquid handling parameters have to be evaluated carefully because of problems with capillary forces, mixing, bubble formation, etc. in reaction volumes down to 1 to 5 µl [26]. In terms of overall speed, a conveyor belt system will move a plate faster than a robotic arm.
The major differences in the two systems, workstations and fully integrated
robots, arise in terms of hours of use, ease of use, repair, and cost. In a number of
cases the price of a workstation is significantly lower than that of a fully automated robotic system, allowing the purchase of multiple workstations. Therefore, failure of a fully automated robotic system that causes a work stoppage of several hours or days can be circumvented with a backup workstation. The redundancy afforded by a multiple-workstation concept also allows parallel processing of different assays, yielding reliable results day in and day out. The workstation concept provides greater flexibility and is easier to program and handle. Even though
robotic systems are touted as being fully automated "walk away" devices, the real-
ity is that they need "human" attention. Comparison of a robotic system versus the
use of workstations, especially concerning throughput, handling, personnel, and
price is discussed by K. R. Oldenburg [25]. As an example, a workstation devel-
oped by CyBio AG (Jena, Germany) is depicted (Fig. 4).
For testing more than 100,000 compounds a day, the Zymark Allegro™ robot system moves the bar upward to uHTS [27]. The system is a series of fully independent modules, each of which performs a specific task. The modules are essentially workstations interconnected by simple robot arms, and the system is fully enclosed, which allows control of the plate environment, i.e., temperature, humidity, or gas composition of the atmosphere. Because of its modularity, it can be quickly reconfigured to fulfill the needs of different assay types. A concept with a conveyor belt system (assembly-line style) has recently been presented by Thermo CRS with its Dimension4™. Furthermore, CyBio has developed the CyBi™-Screen machine for ultra-high-throughput screens with a compact design, modular approach, and format flexibility (96-, 384-, and 1536-well plates). At the core of the CyBi™-Screen machine is a newly developed transport mechanism that passes the plates between hardware components along a straight line while keeping to one level, i.e., following the shortest possible route [28].
4.7 Microplate technologies

Most assays have traditionally been run in the 96-well format. As reported in a recent study, the use of the 384-well format
will increase from ca. 55.6% in 2001 to ca. 60% in 2003, and assays performed in the 1536-well format will already account for about 12.2% in 2003 [4]. Aurora BioSciences offers a comprehensive uHTS system based on 3456-well plates. In addition, higher-density formats like 9,600 wells, or open but addressable formats, are at an experimental stage. The scope and impact of miniaturized methods as applied to HTS are summarized in detail by J. Burbaum [32].
Figure 5. CyBi™-Screen machine, an uHTS system from CyBio AG (Jena, Germany).
4.8 Detection
The FLIPR™ system, for example, allows both adding compounds simultaneously to the wells (96- and 384-format) and scanning the whole microplate for kinetic measurements [35]. This device is well suited to analyzing functional fluorescence-based assays in living cells. Main applications are the determination of intracellular Ca2+ content and pH value, as well as membrane potentials. Depending on the type of assay, a throughput of up to 100 microplates is possible in principle [36]. With the NightOWL system from EG&G Wallac, based on a CCD camera, luminescence and fluorescence applications are possible, while PE Biosystems offers a fluorometric microvolume assay technology (FMAT) that analyzes 1 mm2 areas in each well in formats up to 1536 wells.
Based on a single photon counting CCD camera with a special light intensifier, CyBio AG has developed the high-resolution luminescence imaging system CyBi™-Lumax 1536. This family of readers is not limited to a microplate format. Integration of a parallel liquid handling device (CyBi™-Lumax 1536 D/SD) enables measurement of flash luminescence, which can be used to determine intra- and extracellular calcium ion concentrations, e.g., via the photoprotein aequorin.
A novel screening system has been jointly developed by Carl Zeiss Jena GmbH and F. Hoffmann-La Roche, resulting in an ultra-high-throughput (uHTS) application with simultaneous 96-well parallel reading of fluorescence, luminescence, and absorption on the basis of 96 miniature lenses. A 1536-well plate can be read in 16 steps [23].
The move toward miniaturization, causing both a decrease in sample signal and an increase in background signal, will require the development of new technologies. Fluorescence correlation spectroscopy (FCS) allows binding properties to be determined at the level of single molecules in volumes of a few femtoliters and therefore enables lead discovery by miniaturized HTS (Evotec OAI, Hamburg, Germany) [37]. For single-molecule detection, both optical (single-molecule microscopy/spectroscopy) and scanning probe technology (microscopy, manipulation and force spectroscopy) might make their way to HTS. Apart from fluorescence, surface plasmon resonance (binding properties), infrared spectroscopy (non-invasive measurements in living cells), NMR (see Chapter 9), as well as colorimetric and amperometric methods are being adapted to HTS [13].
The need to understand how targets and potential hit compounds affect cell function in the early stages of drug discovery requires novel technologies for cell-based assays [38]. So-called high-content screening (HCS) has been introduced, which allows temporal and spatial cellular information to be gained about how a drug interacts with cells, based on multi-wavelength fluorescence measurements (e.g., microscope imagers) [39].
4.9 Chip technologies in drug discovery

A central development here are microfluidic chips, which use capillary and electrokinetic forces to move and mix liquids with
reagents, targets and/or test compounds [41]. Orchid Biocomputer (Princeton,
NJ), Caliper (Mountain View, CA) and others have developed special chips on the
basis of microfluidics systems. On the other hand, however, miniaturization toward chip technologies poses a number of its own problems. Evaporation results in significant variation of concentration in the sub-microliter region. Liquid dispensing in nanoliter drops is possible in principle, but poses problems in the pick-up mode [42]. Capillary action causes well-to-well contamination of small reaction cavities, and surface adhesion/chemistry as well as surface tension have to be circumvented [43].
Processing fluids rather than electrons, specialized microchips have been developed that allow chemical and biochemical analysis [44]. Microchannels created in the surfaces of chips allow non-mechanical pumping of liquids by both electrophoretic and electroosmotic principles. Miniaturizing and integrating liquid-handling and biochemical-processing devices such as pumps, valves, volume-measuring units, reactors, extractors, and separation units has led to novel drug discovery tools. For example, on-chip measurement of enzymatic reactions, organic synthesis, ion sensing, DNA amplification, and immunoassays, as well as separation procedures including electrophoresis, chromatography, and solid-phase biochemistry, have been demonstrated. Knowledge contributed from both genomics and proteomics could have far-reaching implications in such areas as biology and medicine, clinical diagnostics, high-throughput drug discovery, molecular toxicology, and pharmacodynamics modeling [44].
4.10 Data handling

Screening all available compounds against all possible targets will tremendously accelerate the growth of databases. A typical HTS program generated 200,000 data points a year in the early 1990s, while 5 years later 5 to 10 million, and in the year 2000, 50 million data points had to be handled by a pharmaceutical company [45]. This leads to one of the biggest obstacles in HTS: the collection, deconvolution, analysis, sorting, and storage of the information generated [46, 47]. The stored information itself, as well as the ability to mine the burgeoning number of databases by linking HTS data to chemistry and to hit follow-up data such as pharmacokinetic and toxicological experiments [1, 48], is key to success and of strong value as a corporate asset. Data-integration technologies are maturing rapidly [49], and the software in use has to provide enough flexibility [50]. Collecting the chemical, biological, preclinical, and clinical data in centralized databases is one strategy [51]. Another is to create integrated data systems for both target discovery and drug discovery [52, 53], and to combine data from relevant processes of drug discovery and development.
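A minimal sketch of such an integrated store is shown below; the schema, table names, and values are illustrative only and do not represent any particular vendor's system.

import sqlite3

# Toy centralized store linking HTS readouts to compound chemistry.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE compounds (id TEXT PRIMARY KEY, smiles TEXT);
    CREATE TABLE hts_results (compound_id TEXT, assay TEXT, activity REAL);
""")
db.execute("INSERT INTO compounds VALUES ('CPD-1', 'CCO')")
db.execute("INSERT INTO hts_results VALUES ('CPD-1', 'kinase-A', 87.5)")

# Mining across tables: active hits joined to their chemistry.
rows = db.execute("""
    SELECT c.id, c.smiles, r.assay, r.activity
    FROM hts_results r JOIN compounds c ON c.id = r.compound_id
    WHERE r.activity > 50
""").fetchall()
print(rows)   # [('CPD-1', 'CCO', 'kinase-A', 87.5)]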
4.11 Economics
In order to sustain the historical revenue growth of about 10% in the pharmaceutical industry, the major global players have to launch five to six new chemical entities (NCEs) with sales higher than US$350 million per year each in the near future [54] (see also Chapter 1). This is the pressure to fill the drug development pipeline with new and innovative leads from drug discovery. In this process, quality plays a crucial role, because a significant component of the overall cost of drug development is the high failure rate of clinical candidates that never reach the market place.

Increasing capacity and decreasing costs are the major arguments for HTS miniaturization and a higher degree of parallel processing. In any drug company, the return on investment in HTS and its use should properly be assessed in terms of the number of hits/leads identified. In general, less expensive assays and new technologies are expected to be the major factors in reducing the per-well cost and the overall cost of lead identification.
4.12 Future aspects

In the near future, most pharmaceutical companies will have access to a portfolio of HTS and uHTS technologies carefully adapted to their various needs. Conceivably, there will be an excess of screening capacity, particularly if uHTS is widely implemented. The challenge is to generate high-quality hits that can be developed into new drugs for the market, not merely to install and use new technologies and machines in HTS programs. The prospect of virtual screening may become more of a reality [55] (see also Chapters 8 and 10). New applications of technology developments are emerging in areas such as toxicology, lead optimization, pharmacokinetics, secondary screening, and compound profiling. The future of drug discovery lies in the convergence and co-development of informatics (both bioinformatics and chemoinformatics), medicinal chemistry, genomics, proteomics, and pharmacokinetics, as well as technology hardware [46, 56].

The question arises how to speed up and optimize the steps that follow drug discovery. The next bottlenecks are clearly visible: testing lead compounds to determine profiles for absorption, distribution, metabolism, and excretion (ADME, see Chapter 12) as well as toxicity, and the expensive and time-consuming animal tests [57] (see Chapter 13).
4.13 References
1 Elgen RM (1999) High throughput screening: Myths and future realities. J Biomol Screening 4: 179-181
2 Devlin JP (ed) (1997) High throughput screening. Marcel Dekker, New York
3 Oldenburg KR (1998) Current and future trends in high throughput screening for drug discovery. Ann Rep Med Chem 33: 301-311
4 a) Fox S, Farr-Jones S, Yund MA (1999) High throughput screening for drug discovery: Continually
transitioning into new technologies. J Biomol Screening 4: 183-186
b) Fox S, Farr-Jones S, Sopchak L, Wang H (2002) Fine-tuning the technology strategies for lead find-
ing. Drug Discovery World, Summer 2002, 23-30
5 Spence P (1998) Obtaining value from the human genome: A challenge for the pharmaceutical indus-
try. Drug Discovery Today 3: 179-188
6 Celis JE (1999) Proteomics: Key technology in drug discovery. Drug Discovery Today 3: 193-195
7 Fox S, Farr-Jones S, Yund MA (1998) High throughput screening: Strategies and suppliers. High Tech
Business Decisions, Moraga, CA
8 Kell D (1999) Screensavers: Trends in high-throughput analysis. Trends Biotechnol 17: 89-91
9 Studt T (1999) Drug development bottlenecks not cured by technology alone. Drug Discovery &
Development, Jan. 40-41
10 Archer R (1998) Towards the drug discovery factory. J Assoc Lab Automat 3: 4
11 Tomlinson n, Butler BT, Frezza J, Harris CO, Smith AA, Kwight WB (1997) The planning and estab-
lishment of a high throughput screening site. Proceedings of the International Symposium on
Laboratory Automation and Robotics (ISLAR) 20-25
12 Grindley JN (1998) Success rates for strategic alliances - are they good enough? Drug Discovery
Today 3: 145-146
13 Dove A (1999) Drug screening - beyond the bottleneck. Nature Biotechnology 17: 859-863
14 Borman S (1999) Reducing time to drug discovery. Chem & Engineer News March 8: 33-48
15 Thiericke R, Grabley S, Geschwill K (1999) Automation strategies in drug discovery. Drug discovery
from nature: 56-71, Springer Verlag, Berlin
16 Brown RK (1999) Content and discontent: A combinatorial chemistry status report. Modern Drug
Discovery July/Aug.: 63-71
17 Paululat T, Tang Y-Q, Grabley S, Thiericke R (1999) Combinatorial chemistry: The impact of natural
products. Chimica Oggi, Chemistry Today 17: 52-56
18 Grabley S, Thiericke R (eds): (1999) Drug discovery from nature. Springer Verlag, Berlin
19 Borchardt JK (1999) Combinatorial biosynthesis. Planning for pharmaceutical gold. Modern Drug
Discovery July/Aug.: 22-29
20 Thiericke R, Schmid I, Moore T, Ebert G (2000) Drug discovery from nature. A device for automated
sample preparation. Am Biotechnol Lab 18: (10) 66
21 Thiericke R (2000) Drug discovery from nature. Automated high-quality sample preparation. J
Automated Meth & Management in Chemistry 22: 149-157
22 Wedin R (1999) Taming the monster haystack. The challenge of compound management. Modern
Drug Discovery Jan./Feb.: 47-53
23 Fattinger C, Gwinner E, Gluch M (1999) Looking for new drugs with speed, precision and econom-
ics. BioWorld 4: 2-4
24 Fox S, Yund MA, Farr-Jones S (1998) Seeking innovation in high-throughput screening. Drug
Discovery & Development Nov.: 32-37
25 Oldenburg KR (1999) Automation basics: Robotics vs. workstations. J Biomol Screening 4: 53-56
26 Berg M, Undisz K, Thiericke R, Zimmermann P, Moore T, Posten C (2001) Evaluation of liquid-handling conditions in microplates. J Biomol Screening 6: 47-56
27 Wildey MJ, Homon CA, Hutchins B (1999) Allegro™: Moving the bar upwards. J Biomol Screening
4: 57-60
28 Thiericke R, Schmid I, Gropp T, Ebert G (2000) Automation for ultrahigh throughput screens. Genetic Engineering News 20: (14) 32
29 Ferragamo T (1999) The importance of microplate standardization. J Biomol Screening 4: 175
30 Marshall S (1999) Small Talk '99 focuses on automated assay systems. Drug Discovery &
Development Sept.: 34-35
31 Berg M, Undisz K, Thiericke R et al (1999) Miniaturization of an enzyme assay (β-galactosidase) in
the 384- and 1536-well plate format. J Assoc Lab Autom 4: 64-67
32 Burbaum JJ (1998) Miniaturization technologies in HTS: How fast, how small, how soon. Drug
Discovery Today 3: 313-322
33 Wedin R (1999) Bright ideas for high-throughput screening. Modern Drug Discovery May/June:
61-71
34 Karet G (1999) Microplate readers keep pace with miniaturization. Drug Discovery & Development
May: 44-48
35 Schroeder K, Naegele B (1996) FLIPR™: A new instrument for accurate, high-throughput optical
5 Natural products for lead identification: Nature is a valuable resource for providing tools
S. Grabley and I. Sattler
5.1 Introduction
5.2 The impact of natural products on the drug market
5.3 Natural sources for drug discovery
5.3.1 General aspects
5.3.2 Microorganisms
5.3.3 Plants
5.3.4 The marine environment
5.4 Approaches to exploit nature's structural diversity
5.4.1 The physico-chemical screening approach
5.4.2 The chemical screening approach
5.4.3 The biological screening approach
5.5 Methods and technologies to build high-quality test sample libraries
5.5.1 General aspects
5.5.2 Automated chromatographic solid-phase extraction
5.6 A comprehensive library for lead discovery: The natural products pool
5.7 Combinatorial libraries based on natural products
5.8 Conclusions and perspectives
5.9 References
5.1 Introduction
About 30% of the drugs on the worldwide market are natural products or are derived from natural products. A similar ratio applies to the clinical candidates currently under development. Though recombinant proteins and peptides account for an increasing market volume, the superiority of low-molecular mass compounds in the therapy of human diseases remains undisputed, mainly due to their more favorable compliance and bioavailability properties.
In the past, new therapeutic approaches often evolved from research involving nat-
ural products [1]. Prominent examples from medicine that impressively demon-
strate the innovative potential of natural compounds and their impact on progress
in drug discovery and development are listed in Table 1 (for chemical structures see Fig. 1).

Table 1. History of commercialization of modern drugs derived from nature (for chemical structures see Fig. 1).
Today, approaches to improve and accelerate the joint drug discovery and
development process are expected to arise from both innovation in drug target elu-
cidation and lead compound finding. Structural and functional analysis of the
human genome will provide access to a dramatically increased number of new
Figure 1. Chemical structures of the drugs listed in Table 1: morphine, salicin, acetylsalicylic acid, penicillin, artemisinin, lovastatin (R1 = H, R2 = CH3), mevastatin (R1 = R2 = H), simvastatin (R1 = R2 = CH3), pravastatin (R1 = H, R2 = OH), and fluvastatin.
potential drug targets that have to be evaluated (see Chapters 2, 3 and 8).
Improvements in high-throughput screening enable the testing of increasing num-
bers of targets and samples with the consequence that already 100,000 assay
Figure 2 (part). Chemical structures: FK 506 (tacrolimus), camptothecin, topotecan, irinotecan, lipstatin, orlistat.
points per day are possible (see Chapter 4). However, accelerated identification of
valuable lead compounds by random screening approaches can only be achieved
by new concepts to generate large varieties of structurally diverse test samples
(see also Chapters 6 and 7). Research on natural products shows that nature has
evolved a diversity of chemical structures that is not accessible even by the most
sophisticated synthetic concepts. Moreover, natural products have often opened
up completely new therapeutic approaches. They substantially contributed to
identifying and understanding novel biochemical pathways and, consequently,
proved to make available not only valuable drugs but also essential tools in bio-
chemistry and molecular cell biology [2, 3]. Table 2 summarizes selected natural
products currently evaluated as drug candidates (for their chemical structures see
Fig. 2).
In fact, natural products are currently undergoing a phase of reduced attention
because of the enormous effort which is necessary to isolate the active principles
from the complex matrices of the natural source and to elucidate their chemical
structures. However, we presume that comprehensive compound collections com-
prising pure natural substances, their derivatives and analogues as well as libraries
generated by standardized fractionation of extracts from natural origin represent
most valuable tools for current drug discovery programs. Moreover, improving access to large quantities of the compounds is a key prerequisite for detailed biological studies as well as for clinical development and commercialization.
Table 2. Selected natural products evaluated as drug candidates (without anti-infectives). (For references see [4-12]; for chemical structures see Fig. 2.)
O"R
i~". ~
," , .",00
o .
o OH
Epothilone A: (R~H)
Epothilone B: (R~Me) Fumagillin TNp·470
Dldemmn B. R~ Y
OH
1 1 J.. f
HO. A--
;.
0Y'T)'~ 1: l y :V
'\"'y OH NH2
OH Discodermolide
o H
l~~~
Ai /'IH')Y'"'~/~
0 0 --
I :7
The drugs used by ancient civilizations were extracts of plants or animal products
with a few inorganic salts. Especially plants have played an important role in medical use, e.g., in India, where the Ayurveda gave access to a broad variety of medicines from plants reported since around 1000 BC.

Figure 2 (continued). Chemical structures: ecteinascidin 743, eleutherobin, calanolide A, calanolide B (12-epimer of calanolide A), halichondrin B, squalamine.
Chinese medicine based on natural products date back to about 500 BC, and some
of the classical Chinese formulae handed down in the years between 25 and 220
are still in use. Moreover, with the focus on developing and commercializing improved drugs derived from TCM (Traditional Chinese Medicine) that meet the requirements of western health care standards, the identification, purification and structure elucidation of the bioactive principles are increasingly addressed.
In contrast to plants, microorganisms were not known to biosynthesize second-
ary metabolites (low-molecular mass compounds that derive from biosynthetic
pathways which are not required for maintenance and growth) useful for medici-
nal application until the discovery of the penicillins. However, the accidental dis-
covery of penicillin from the culture broth of Penicillium notatum in 1928 by
Alexander Fleming and its introduction in 1941/42 as an efficient antibacterial
drug revolutionized medicinal chemistry and pharmaceutical research by stimu-
lating completely new strategies in industrial drug discovery. In the following
5.3.2 Microorganisms
Among the bacteria, streptomycetes and myxobacteria [5] play a dominant role with respect to secondary metabolism. About 70% of the natural compounds from microbial sources derive from actinomycetes, mainly from strains of the genus Streptomyces, which can easily be isolated from soil samples. The genus Streptomyces comprises more than 500 species that display an outstanding diversity in secondary metabolism, yielding an ever increasing variety of new chemical structures. About 110 genera of the order Actinomycetales are known. So far, a number of these genera have been neglected with respect to the investigation of their potential for the biosynthesis of bioactive secondary metabolites.
Besides the prokaryotic actinomycetes and myxobacteria, fungi are certainly one of the most significant groups of organisms to be exploited for drug discovery purposes. In particular, fungi imperfecti have provided mankind with a wealth of different bioactive secondary metabolites, many of which have entered clinical application, such as the β-lactam antibiotics, griseofulvin, cyclosporin A, or lovastatin. Currently, most new natural products described in the literature are isolated from fungi [18]. Therefore, fungi are an outstanding source for the isolation of structurally diverse small molecules that are highly qualified to supplement compound libraries for drug discovery.
Many fungi are presumed to occupy unsuspected niches in nature including
cohabitation with larger life forms, such as higher plants, or the marine environ-
ment. Given the huge number and varieties of higher plants, the number of their
associated microfungi, mainly belonging to the Ascomycotina and
Deuteromycotina, is expected to be enormous. It is estimated that today only a
few percent of the world's fungi are known. With respect to drug discovery,
mycelium cultures of fungi are of major interest. Their metabolites are easily
accessible in large quantities by fermentation processes, thus providing sufficient
amounts for extended screening programs as well as pre-clinical and clinical stud-
ies.
Most microorganisms investigated for their secondary metabolism are het-
erotrophic. However, microalgae, an assemblage of prokaryotic (cyanophyta) and
eukaryotic microorganisms with oxygenic photosynthesis, have to be considered
as well. Until now, microalgae have been studied mainly with respect to their toxigenic potential and their impact on poisoning the animal and human food chain [19]. Due to inadequate sterile in vitro cultivation conditions, microalgae have been neglected with respect to the systematic investigation of their potential to biosynthesize structurally diverse non-toxic bioactive secondary metabolites. Therefore, we developed an apparatus for the parallel cultivation of 50 microalgal isolates at the 100 mL scale. Our setup guarantees defined growth conditions with respect to light, medium, temperature and carbon dioxide supply [20]. In order to obtain milligram quantities of microalgal metabolites, scale-up to 20 L glass vessel bioreactors with an inside illumination device was realized.
5.3.3 Plants
Secondary metabolites isolated from plant extracts are essential with respect to the genera-
tion of structural diversity for compound libraries used in drug discovery. Plant
metabolites such as alkaloids or terpenoids are structurally unique and modifica-
tions of the biosynthetic pathways yield a tremendous diversity of derivatives.
Studies addressing the variability of secondary metabolism in dependence on the place of origin have demonstrated the impact of the habitat. Therefore, current efforts focus on the investigation of plants from as yet unexplored locations. However, in the case of a hit, a number of problems arise concerning back-tracing and accessing sufficient quantities for more detailed biological or pharmacological studies. In the case of complex structures, multi-step chemical synthesis cannot solve the problem, as was shown by well-known examples of phytogenic drugs such as morphine, codeine, reserpine, vincristine, the cardiac glycosides or, more recently, paclitaxel and camptothecin.
However, in various cases valuable precursors have been made accessible from plants at moderate cost, thus contributing to improved manufacturing processes, supplemented by chemical synthesis, biocatalysis or biotransformation, respectively. Precursors are sometimes of considerable advantage because they also give access to unnatural analogues or derivatives valuable for both the generation of compound libraries for screening purposes and the optimization of biological properties.
Figure 3. Chemical structure of maitotoxin.
5.4 Approaches to exploit nature's structural diversity

Low-molecular mass natural products from bacteria, fungi, plants, and inverte-
brates, either from terrestrial or marine environments, represent unique structural
diversity. In order to get access to this outstanding molecular diversity, various
strategies like the (target-directed) biological, physico-chemical, or chemical
screening have been developed.
In contrast to biological screening, the physico-chemical and chemical screening approaches a priori provide no correlation to a defined biological effect. Here, the selection of promising new secondary metabolites from natural sources is based on physico-chemical properties (see also Chapter 12) or on chemical reactivity, respectively. In both strategies, the first step is a chromatographic separation of the compounds from the complex mixtures obtained from the natural source.
5.5.2 Automated chromatographic solid-phase extraction

Facing the need for high-quality test sample preparation as the basis for building libraries that fulfill the requirements of HTS programs, we developed a novel and efficient automated sample preparation method based on multistep fractionation by chromatographic solid-phase extraction (SPE). Our approach, which advantageously does not need HPLC techniques, evolved from a procedure for sample preparation from microbial broths with XAD-16 resins that had been developed for chemical screening. The advanced protocol with novel polystyrol-based resins allows chromatographic fractionation of mixtures through variation of the organic solvent content in the eluent. Our protocol allows the generation of single-step fraction samples as well as multiple-step elution fractionations from a single
source. The automated multi-step procedure which is performed with modified
RapidTrace® modules from Zymark GmbH (Idstein, Germany) shows highly reli-
able performance and requires only minor manual intervention [29]. In analogy to
the SEPBOX™ concept, our multi-step SPE process can be adapted to the separation and purification of mixtures deriving from combinatorial chemistry or any
chemical synthesis. In cooperation with CyBio Instruments GmbH (formerly
OPAL Jena GmbH; Jena, Germany), an apparatus is currently under development
that will provide samples in the 96-well microplate format.
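The logic of such a multi-step elution can be sketched in a few lines; the step percentages and volumes below are illustrative placeholders, not the published protocol of [29].

# Step-elution schedule: fractions collected at increasing methanol
# content in the water/methanol eluent (values are illustrative).
methanol_steps = [0, 20, 40, 60, 80, 100]   # % methanol in eluent
volume_ml = 5.0                             # elution volume per step

for i, pct in enumerate(methanol_steps, start=1):
    print(f"fraction {i}: elute with {pct}% methanol, {volume_ml} mL")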
The synergy of the analytical power of chemical screening and the enhanced quality of fractionated extracts allows significant physico-chemical information to be added to the "hit lists" from target-directed screening approaches. Therefore, we consider the integration of TLC analysis into secondary biological screening and hit verification a remarkable tool in lead structure finding strategies, e.g., for assigning the active principle through fast and efficient de-replication, and for speeding up isolation and purification procedures. More recently, coupling techniques with mass spectrometry, or even TLC-FID coupling, have been described which obviously can be integrated into our concepts.
Both the generation of natural compound libraries and access to a wanted bioactive principle from fractionated high-quality samples rely on the efficiency of isolation procedures resulting in pure compounds for further evaluation. Therefore, in cooperation with the HKI Pilot Plant for Natural Products, we addressed the transfer of our SPE fractionation process to pilot-scale chromatography in a preparative MPLC system on Amberchrom 161c. Fractionation characteristics of stepwise elution with water/methanol mixtures can be "translated" into water/methanol gradients with nearly identical separation characteristics.
5.6 A comprehensive library for lead discovery: The natural products pool
Today, chemical libraries consisting of more than 100,000 compounds are evaluated in biological assays, typically searching for inhibitory effects towards a particular target (see Chapter 4). In order to generate these huge numbers of compounds, the "classical" strategies of organic and medicinal chemistry have been overtaken by technologies commonly called "combinatorial" chemistry (solid phase, solution phase, as well as split and split-pool strategies, see Chapter 6). Recent concepts address the synthesis of medium-sized compound libraries consisting of single components.
Since its implementation several years ago, some distinct problems of combinatorial synthesis have surfaced, e.g.: i) the purity of the samples is often insufficient, ii) parallel processing often yields unexpected reaction products, iii) physico-chemical analysis of the products generated is necessary, iv) reproduction and purification, especially on larger scales for more detailed pharmacological studies, are often difficult and a logistical problem, and v) overall, structural diversity is not as broad as expected.
Regarding the enhancement of structural diversity, there is a need to develop combinatorial chemistry in the direction of more sophisticated synthesis concepts (e.g., multi-step synthesis, larger molecules, stereochemical approaches, synthesis of more reactive compounds, making use of complex templates and building blocks). Future success in combinatorial chemistry will substantially depend on both the quality and the structural diversity of the compound libraries submitted to HTS. With respect to the latter, exploiting natural sources for combinatorial synthesis strategies comes into focus [33, 34].
5.7 Combinatorial libraries based on natural products

In order to take advantage of molecular diversity from nature for combinatorial chemistry, two major strategies are addressed. On the one hand, combinatorial
synthesis allows efficient and systematic structural variation of a given natural
product which is used as a kind of template for synthetic "decoration" [33]. The
second strategy of integrating natural products into combinatorial chemistry
involves total synthesis approaches of natural products. This approach does not
depend on the availability of natural products and allows to generate broader
structure variation of the basic skeleton of a natural product via divergent synthe-
sis approaches and the use of various reagents and building blocks [33].
It should be highlighted that synthetic combinatorial libraries typically focus on low-molecular mass compounds ranging from 200 to 500 Dalton, the same molecular range in which the majority of known secondary metabolites from natural sources are found. A promising approach is to generate compound libraries on the basis of natural products ranging from 500 to 900 Dalton. These molecules are expected to be more suitable for targets involving protein-protein, protein-DNA,
and protein-RNA interaction which play important roles e.g., in cell regulation
and differentiation processes.
So far, a restricted number of natural products, e.g., steroids, have been used as templates for combinatorial synthesis [33]. A new approach for obtaining
increased diversity in the search for new lead compounds, kombiNATURik™, has
been co-developed by the German enterprises AnalytiCon AG (Potsdam,
Germany) and Jerini Bio Tools GmbH (Berlin, Germany). The kombiNATURik™
program starts from natural compounds which are further diversified by solid-sup-
port chemistry introducing, for example, peptide or carbohydrate moieties. Those
libraries generally comprise several hundred to thousands of single molecules
derived from multi-parallel synthesis [33]. Our own concept towards single com-
pound libraries of natural products derivatives addresses a similar approach.
However, the complexity of our libraries is restricted to around 300 members
obtained by automated solid phase synthesis in parallel. Scale-up to about 200 mg
quantities for hit validation and more detailed biological studies can easily be real-
ized.
Besides organic chemistry, biological methods for structure modification and
subsequent compound library generation can be a supplementary tool for deriva-
tization of structurally complex molecules from both natural and synthetic
sources. The various methods can be categorized into those employing the native
biosynthetic machinery of a producing organism, those involving a manipulation
of the biosynthetic pathways on the enzymatic or genetic level, and finally, the
application of individual biosynthetic enzymes. The experimental demands on the
application of the various methods range from "simple" feeding of biosynthetic
precursors into standard cultivations to more sophisticated approaches involving
genetic engineering of biosynthetic enzymes. Genetics are applied in the cell-
based combination of biosynthetic genes from different strains or the in vitro
reconstitution of biosynthetic pathways with over-expressed enzymes [35]. All of
the various methods of biological derivatization become possible due to a relaxed
substrate specificity of some of the biosynthetic enzymes, especially those of
microbial secondary metabolism.
Nature itself uses the principles of combinatorial synthesis to generate large and structurally diverse libraries by the combination of small biosynthetic building blocks (e.g., nucleic acids, amino acids, and other building blocks from primary metabolism like activated C2-, C3- and C4-carboxylic acids for polyketide-type assembly) [36].
Biomimetic combinatorial synthesis of polyketides starting from such simple building blocks represents a challenging route to compound libraries. First attempts toward the preparation of such libraries have been reported [37, 38].
If one compares the diversity of chemical structures found in nature with the narrow spectrum of structural variation of even the largest combinatorial library, it can be expected that natural products will regain their importance in drug discovery. Actinomycetes, fungi and higher plants in particular have been proven to biosynthesize secondary metabolites of seemingly unlimited structural diversity, which can be further enlarged by structure modification applying strategies of combinatorial chemistry. A variety of novel concepts in natural products research will probably be required to draw interest to incorporating natural sources into the HTS process.
Natural product libraries comprising only pure and structurally defined compounds will probably help compounds from natural sources compete more successfully within the industrial drug discovery process. Only a minority of the natural products known so far has been biologically characterized in detail. Therefore, any novel target-directed screening assay may result in identification of a new lead structure even from sample collections comprising already described compounds. Today, selected targets of interest are transferred to HTS with the aim of discovering a hit, and subsequently a lead structure, within a one- to two-month period. After that time, the corresponding assay system is replaced by a new one in the HTS laboratory. Thus, rapid characterization and structure elucidation of the active principles of interest from natural sources are critical for competing with synthetic libraries consisting of pure and structurally defined compounds. Efforts toward high-quality test sample preparation, as well as the elaboration of LC-MS and LC-NMR techniques to accelerate structure elucidation of bioactive principles from natural sources, are currently underway. Combination of these techniques with databases comprising a maximum of known natural compounds will probably contribute substantially to reviving interest in natural products for application in drug discovery.
Indications that only a small percentage of the organisms living in the biosphere has been described imply that there is an enormous reservoir of natural compounds still undiscovered. The United Nations Convention on Biological Diversity, adopted in Rio de Janeiro in 1992, sets the basic principles of access to and exploitation of global biological sources in the future. As a result, the convention introduces national ownership of biological resources.
Given the increasing efficiency and decreasing costs of HTS technologies, the basic limiting factor for finding new lead compounds will be the supply of structural diversity. The relevance of natural products in drug discovery will consequently depend strongly on the efficiency and cost of access to compounds of natural origin compared to the supply from synthetic sources [39].
5.9 References
1 Shu Y-Z (1998) Recent natural products based drug development: a pharmaceutical industry perspec-
tive. J Nat Prod 61: 1053-1071
2 Grabley S, Thiericke R (eds) (1999) Drug Discovery from Nature. Springer Verlag, Berlin
3 Grabley S, Thiericke R (1999) Bioactive agents from natural sources: trends in discovery and applica-
tion. Adv Biochem Engin/Biotechnol 64: 101-154
4 http://www.meb.uni-bonn.de/cancernet/600733.html
5 Reichenbach H, Höfle G (1999) Myxobacteria as producers of secondary metabolites. In: Grabley S,
Thiericke R (eds) Drug Discovery from Nature. Springer Verlag, Berlin, 149-179
6 http://cancertrials.nci.nih.gov/news/angio/table.html
7 Faulkner DJ (2000) Highlights of marine natural products chemistry (1972-1999). Nat Prod Rep 17:
1-6
8 Commission on Geosciences, Environment and Resources (1999) Marine-Derived Pharmaceuticals
and Related Bioactive Agents. In: From Monsoons to Microbes: Understanding the Ocean's Role in
Human Health. National Academy Press, 73-82
9 Rayl AJS (1999) Oceans: Medicine chests of the future. The Scientist 13: 1-5
10 http://www.pharmamar.es/english1.html
11 http://www.phc.vcu.edu/feature/epi/index2.html
12 http://www.alzforum.org/members/research/drugs/HuperzineA.html
13 Omura S (ed) (1992) The Search for Bioactive Compounds from Microorganisms. Brock/Springer,
New York
14 Gräfe U (1992) Biochemie der Antibiotika. Spektrum, Heidelberg
15 Grabley S, Thiericke R, Zeeck A (1994) Antibiotika und andere mikrobielle Wirkstoffe. In: Präve P, Faust U, Sittig W et al (eds): Handbuch der Biotechnologie, 4th ed. Oldenbourg Verlag, München, 663-702
16 Kuhn W, Fiedler H-P (eds) (1995) Sekundärmetabolismus bei Mikroorganismen, Beiträge zur Forschung (engl). Attempto, Tübingen
17 Silverman RB (1992) The Organic Chemistry of Drug Design and Drug Action. Academic Press,
London
18 Henkel T, Brunne RM, Müller H et al (1999) Statistical investigation into the structural complementarity of natural products and synthetic compounds. Angew Chem Int Ed Engl 111: 643-647
19 Luckas B (1996) Determination of algae toxins from seafood. CIT 4: 355-359
20 Ebert G (1999) Etablierung von Kultivierungs- und Bearbeitungstechniken zur Erschließung und Bewertung struktureller Diversität aus phototrophen Mikroorganismen für die Wirkstoffsuche. PhD Thesis, Technische Universität Berlin, Germany
21 Attaway DH, Zaborsky OR (eds) (1993) Marine Biotechnology, Vol.I Pharmaceutical and bioactive
natural products. Plenum Press, New York
22 König GM, Wright AD (1996) Marine natural products research: current directions and future potential. Planta Medica 62: 193-211
23 König GM, Wright AD (1999) Trends in Marine Biotechnology. In: Grabley S, Thiericke R (eds) Drug
Discovery from Nature. Springer Verlag, Berlin, 180-187
24 Jensen PR, Fenical W (1994) Strategies for the discovery of secondary metabolites from marine bac-
teria: Ecological perspectives. Annu Rev Microbiol 48: 559-584
25 Gusovsky F, Daly JW (1990) Maitotoxin: a unique pharmacological tool for research on calcium-
dependent mechanisms. Biochem Pharmacol 39: 1633-1639
26 Grabley S, Thiericke R, Zeeck A (1999) The Chemical Screening Approach. In: Grabley S, Thiericke
R (eds) Drug Discovery from Nature. Springer Verlag, Berlin, 124-148
27 Grabley S, Thiericke R (1999) Access to structural diversity via chemical screening. In: Diederichsen
U, Lindhorst TK, Westermann B et al (eds) Bioorganic Chemistry. Wiley-VCh, Weinheim, 409-417
28 Bindseil KU, God R, Gumm H et al (1997) Vollautomatische Isolierung von Naturstoffen. CIT Spezial Chromatographie 1: 19-21
29 Schmid I, Sattler I, Grabley S et al (1999) Natural products in high throughput screening: Automated
high-quality sample preparation. J Biomolecular Screening 4: 15-25
30 Koch C, Neumann T, Thiericke R et al (1999) A central natural product pool - new approach in drug
discovery strategies. In: Grabley S, Thiericke R (eds) Drug Discovery from Nature. Springer Verlag,
Berlin, 51-55
31 Koch C, Neumann T, Thiericke R et al (1997) Der Naturstoff-Pool - Neuer Ansatz für die Wirkstoffsuche. Nachr Chem Tech Lab 45: 16-18
32 Koch C, Neumann T, Thiericke R et al (1997) Der Naturstoff-Pool - Ein neuartiges Konzept für die Wirkstoffsuche bringt Wirtschaft und Wissenschaft zusammen. BIOspektrum 3: 43-45
33 Bertels S, Frormann S, Jas G et al (1999) Synergistic use of combinatorial and natural product chem-
istry. In: Grabley S, Thiericke R (eds) Drug Discovery from Nature. Springer Verlag, Berlin, 72-105
34 Paululat T, Tang YQ, Grabley S et al. (1999) Combinatorial chemistry: The impact of natural products.
Chimica Oggi/Chemistry Today May/June: 52-56
35 Sattler I, Grabley S, Thiericke R (1999) Structure modification via biological derivatization methods.
In: Grabley S, Thiericke R (eds) Drug Discovery from Nature. Springer Verlag, Berlin, 191-214
36 Rohr J (1995) Combinatorial biosynthesis - an approach in the near future. Angew Chem Int Ed Engl
34: 881-885
37 Reggelin M, Brenig V, Welcker R (1998) Towards polyketide libraries-II: Synthesis of chiral aracemic
di- and triketides on a solid support. Tetrahedron Lett 39: 4801-4804
38 Khosla C (1998) Combinatorial biosynthesis of "unnatural" natural products. In: EM Gordon, JF
Kerwin (eds) Combinatorial chemistry and molecular diversity in drug discovery. Wiley-Liss, New
York, 401-417
39 Beese K (1996) Pharmaceutical bioprospecting and synthetic molecular diversity. The convention on
biological diversity and the value of natural products in the light of new technological developments.
Institute for prospective technological studies - European Commission. Draft discussion paper, May
15: 1-43
Modern Methods of Drug Discovery 109
ed. by A. Hillisch and R. Hilgenfeld
© 2003 Birkhäuser Verlag/Switzerland
6 Combinatorial chemistry: compounds from amino acids and short peptides
Adel Nefzi, John M. Ostresh and Richard A. Houghten
Torrey Pines Institute for Molecular Studies, 3550 General Atomics Court, San Diego, CA 92121, USA
6.1 Introduction
6.2 Combinatorial chemistry techniques
6.3 Synthesis methods
6.3.1 Resin mixtures
6.3.2 Reagent mixtures
6.4 Deconvolution methods
6.4.1 Iterative deconvolution
6.4.2 Positional scanning deconvolution
6.5 Solid phase synthesis of acyclic and heterocyclic compounds from amino acids and short peptides
6.5.1 Peptidomimetics and acyclic compounds
6.5.2 Heterocyclic compounds
6.5.3 Solid phase synthesis of heterocyclic compounds from Cα-functionalized amino acids
6.5.4 Solid phase synthesis of heterocyclic compounds from resin-bound dipeptides
6.5.5 Solid phase synthesis of heterocyclic compounds from acylated dipeptides
6.5.6 Solid phase synthesis of heterocyclic compounds from resin-bound reduced acylated
dipeptides
6.5.7 Solid phase synthesis of bis heterocyclic compounds from resin-bound orthogonally
protected lysine
6.6 Conclusions
6.7 Acknowledgments
6.8 References
6.1 Introduction
plate combinatorial library [19, 31, 32]. A review providing a perspective on the synthesis and inherent strengths and weaknesses of mixture-based combinatorial libraries, as well as the large number of successful applications of this technology, has recently been published by our group [3].
Two synthetic approaches, involving either the mixing of multiple resins or the use of mixtures of incoming reagents, are now widely used to incorporate multiple functionalities at diverse positions within a synthetic combinatorial library (SCL).
The "divide, couple, and recombine" (DCR) synthesis method [13], also know as
the "split resin" method [14], was developed for use in the synthesis of peptide,
acyclic and heterocyclic SCLs. This synthesis method, illustrated in Figure 1,
involves the coupling of reactants to individual aliquots of resin followed by thor-
ough mixing of the resin. This method allows the generation of approximately
equimolar mixtures of compounds. Due to the statistical distribution of beads at
each step, care should be used in determining the appropriate amount of resin to
be used in the synthesis in order to ensure inclusion of all compounds in the
library [33, 34]. An important aspect to the DCR approach is that, due to the phys-
ical nature of dividing and mixing the resin beads, each resin bead contains only
one compound when the library is completed [14].
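The bookkeeping behind DCR is simple enough to simulate. The following Python sketch (a minimal illustration with hypothetical building blocks, not the published protocol) shows why split-and-pool yields every combination in approximately equimolar amounts, with each bead carrying a single compound:

```python
import random

def dcr_library(building_blocks_per_step, n_beads=10000, seed=0):
    """Simulate divide-couple-recombine: the pooled beads are mixed, split
    into one portion per building block, coupled, and recombined; each
    bead therefore carries exactly one compound at the end."""
    rng = random.Random(seed)
    beads = [[] for _ in range(n_beads)]      # growing sequence on each bead
    for blocks in building_blocks_per_step:
        rng.shuffle(beads)                    # recombine and mix thoroughly
        portion = len(beads) // len(blocks)   # split into equal portions
        for i, bead in enumerate(beads):
            bead.append(blocks[min(i // portion, len(blocks) - 1)])
    return ["-".join(bead) for bead in beads]

# Three diversity positions with hypothetical building blocks:
library = dcr_library([["A1", "A2", "A3"], ["B1", "B2"], ["C1", "C2", "C3"]])
print(len(set(library)))   # 18 distinct compounds (3 x 2 x 3)
```

With enough beads per expected compound, every combination is represented; this is the statistical consideration behind the resin amounts discussed above [33, 34].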
The reagent mixture method, in contrast, uses a mixture of incoming reagents containing predetermined amounts of all reagents, equimolar relative to the resin, in order to obtain a resin-bound mixture. The reaction is then repeated using excess reagents in order to drive the reaction of the remaining unreacted sites to completion. Reasonable results can be obtained for the addition of one position of diversity. A major disadvantage of this method, however, appears when more than one position of diversity is incorporated: even small differences in relative reaction rates cause incoming reagents to be consumed preferentially by particular resin-bound reactants. Repeated cycles multiply the problem, resulting in large deviations from equimolarity in the final products.
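This compounding effect can be illustrated numerically. The sketch below uses a crude first-order competition model with hypothetical relative rate constants (it ignores the subsequent excess-reagent step), but it shows how even a modest rate difference multiplies over repeated cycles:

```python
import numpy as np

def product_fractions(rel_rates, n_cycles):
    """Each reagent in an equimolar mixture captures a share of resin sites
    proportional to its relative rate; the bias multiplies per cycle."""
    shares = np.asarray(rel_rates, dtype=float)
    shares /= shares.sum()
    fractions = np.ones(1)
    for _ in range(n_cycles):
        fractions = np.outer(fractions, shares).ravel()  # all product combinations
    return fractions

f = product_fractions([1.0, 2.0], n_cycles=3)  # only a two-fold rate difference
print(f.max() / f.min())                       # 8.0: an eight-fold spread
```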
Three approaches are generally used for the structural deconvolution of active
compounds from assay data using nonsupport-bound SCLs: iterative deconvolu-
tion [13], positional scanning deconvolution [30], and tagging [42]. Each
approach has been used to identify active individual compounds in a wide variety
of SCLs and assays.
The positional scanning (PS) approach [30] is illustrated in the generic representation shown in Figure 2. It involves the screening of separate, single defined position SCLs to individually identify the most important functionalities at each position of diversity within a library. A complete PS-SCL having three positions of diversity consists of three sublibraries (designated OXX, XOX, and XXO), each of which has a single defined functionality at one position and a mixture of functionalities at each of the other two positions. The structure of individual compounds can be determined from such a screening since each compound is present in only one mixture of each sublibrary. In theory, if only one compound were active in the library, activity corresponding to that compound would be found in the one mixture of each sublibrary containing that compound.
Figure 1. The "divide, couple, and recombine" (DCR) synthesis method: aliquots of resin are coupled separately with individual reagents, then mixed and split into equal portions for the next coupling step.
Figure 2. Positional scanning deconvolution: screening each sublibrary identifies the most active groups at its defined position; individual compounds derived from all combinations of the most active groups from the three sublibraries are then synthesized.
When considered in concert, the defined functionality in each mixture can then be used to identify the individual active compound responsible for the activity. In practice, the same result is seen, but the activity is generally due to the sum of activities of more than one compound. Anomalous results occur if the activity is due to the sum of many weakly active compounds. PS-SCLs are generally prepared using the reagent mixture approach described above. Although the synthesis of PS-SCLs is theoretically possible with the DCR approach, in practice the labor involved makes the synthesis unfeasible. Freier and coworkers have performed a thorough examination of the theoretical and experimental aspects of iterative and positional scanning deconvolution [43, 44].
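The deconvolution logic itself is compact. The toy Python example below (hypothetical building blocks and an assumed single strongly active compound; real assays read sums of activities, as noted above) mimics how the defined position of the most active mixture in each sublibrary identifies the active individual compound:

```python
from itertools import product

def positional_scan(activity, blocks):
    """Screen a positional scanning SCL: each mixture fixes one building
    block at one position and pools all combinations at the others; the
    assay reads the summed activity of each mixture."""
    best = []
    for pos in range(len(blocks)):
        scores = {}
        for b in blocks[pos]:
            mixture = (c for c in product(*blocks) if c[pos] == b)
            scores[b] = sum(activity(c) for c in mixture)
        best.append(max(scores, key=scores.get))
    return tuple(best)

blocks = [list("abc"), list("def"), list("ghi")]
activity = lambda c: 100.0 if c == ("b", "f", "g") else 0.1  # one strong hit
print(positional_scan(activity, blocks))  # ('b', 'f', 'g')
```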
Schemes 1-6. Reaction schemes (structures not legible in this reproduction): Scheme 1 shows reduced, peralkylated and acylated peptide derivatives; Schemes 2-6 outline the solid phase syntheses referred to in the text below.
at the mu, delta, kappa and sigma opioid receptors. Following deconvolution of
this library, individual compounds having nM affinities at the mu or sigma recep-
tors were found (unpublished observations).
We have also developed an efficient method for the solid phase synthesis of
ureas, diureas and triureas from resin-bound monoamines, diamines and tri-
amines. The exhaustive reduction of solid support-bound polyamides with borane
in THF generated amines, which, following treatment with commercially avail-
able isocyanates and cleavage, provided the corresponding ureas in high purity
and good yields (Scheme 3).
Imines are often used as intermediates in organic synthesis and are the starting
point for chemical reactions such as cycloadditions, condensation reactions and
nucleophilic addition [7]. The formation of imines via condensation of amines
with aldehydes has been extensively used for the synthesis of a variety of hetero-
cyclic compounds.
The solid phase synthesis of a combinatorial library containing 43,000 tetrahy-
droisoquinolines has been reported by Griffith and coworkers [53]. The library
was synthesized using a three-step procedure. An imine was formed by reacting a
substituted benzaldehyde with an MBHA resin-bound amino acid. Imine forma-
tion was driven to completion using trimethylorthoformate as a dehydrating
reagent. The treatment of the imine with homophthalic anhydride provided the
desired tetrahydroisoquinoline (Scheme 4).
Isomerization to the more stable trans configuration was obtained following cleavage by treatment with 1 N NaOH. The tetrahydroisoquinoline library was
prepared using the DCR method with MBHA resin, 11 amino acid building
blocks, 38 aldehydes and 51 amines. The library has been tested in a number of
assays, including kappa and mu opioid radioreceptor binding assays and in a
sigma radioreceptor binding assay [53]. Using the "libraries from libraries" con-
cept [47, 50], a second copy of the isoquinoline library was reduced in the pres-
ence of borane in THF to generate a second library having different physical and
chemical properties.
The solid phase synthesis of 4-amino-3,4-dihydro-2(1H)-quinolinones was developed by Pei and coworkers through the rearrangement of β-lactam intermediates on the solid phase [54]. The reaction of a resin-bound amino acid with ortho-nitrobenzaldehyde resulted in an imine that, following treatment with a ketene, undergoes a [2 + 2] cycloaddition to afford a four-membered ring cis-β-lactam intermediate. The ketene was generated in situ from the corresponding phenoxyacetyl chloride in the presence of triethylamine. Following reduction of the nitro group with tin chloride, the β-lactam ring underwent an intramolecular rearrangement to afford the trans-3,4-dihydro-2(1H)-quinolinones through nucleophilic attack of the β-lactam moiety by the generated amine. Following HF cleavage and lyophilization, individual dihydroquinolinones were obtained in good yield. Using the tea-bag technology in combination with the DCR synthesis method, a library of 2,070 dihydroquinolinones (derived from 69 amino acids, 6 ortho-nitrobenzaldehydes and 5 acid chlorides) was produced.
Schemes 7-10. Reaction schemes (structures not legible in this reproduction): Scheme 7 outlines the hydantoin and branched hydantoin syntheses from resin-bound dipeptides and orthogonally protected diamino acids (diaminopropionic acid, diaminobutyric acid, ornithine, lysine); Scheme 8 the amino acid coupling, acylation and Bischler-Napieralski condensation toward imidazol-pyrido-indoles; Schemes 9 and 10 further heterocyclic syntheses referred to in the text.
We have developed a simple synthetic route for the solid phase synthesis of hydantoin and thiohydantoin SCLs from a resin-bound dipeptide SCL [58]. This involved the reaction of the N-terminal amino group of the resin-bound dipeptide with phosgene or thiophosgene, leading to the intermediate isocyanate or thioisocyanate that further reacted intramolecularly to form the five-membered ring hydantoin or thiohydantoin. Using 54 amino acids for the first site of diversity and 60 amino acids for the second site of diversity, and using four different alkylating reagents, a library of 38,880 compounds (54 x 60 x 3 x 4) has been synthesized (Scheme 7).
In order to increase the number and class of available compounds, we selec-
tively alkylated the resin-bound amide and then the remaining amides to generate
a dialkylated hydantoin library (Scheme 7).
We also synthesized a different class of hydantoin compounds called branched hydantoins. Thus, starting from a resin-bound, orthogonally protected diamino acid (including diaminopropionic acid, diaminobutyric acid, ornithine or lysine), and following deprotection of the α-amino group and coupling of a second amino acid, the resin was treated with carbonyldiimidazole or triphosgene to afford the highly active intermediate isocyanate, which undergoes an intramolecular cyclization leading to the hydantoin. The diamino acid side chain is then deprotected and the free amine group is acylated with a range of carboxylic acids to yield the desired hydantoins following cleavage from the resin with hydrogen fluoride (Scheme 7) [59].
We have also developed an efficient method for the solid-phase synthesis of imidazol-pyrido-indoles [60]. This class of compounds shows a broad range of biological activity. Starting from a resin-bound tryptophan or tryptophan analogs, a second amino acid was coupled and the resin-bound dipeptide was acylated with a variety of carboxylic acids. This acylated dipeptide was subjected to Bischler-Napieralski condensation using phosphorus oxychloride in 1,4-dioxane at 85°C (Scheme 8). Using 11 different functionalities derived from amino acids at R1, 25 different functionalities derived from amino acids at R2 and 92 different functionalities derived from carboxylic acids at R3, we generated a library of 25,300 imidazol-pyrido-indole derivatives in the positional scanning format. Prior to the library synthesis, we synthesized individual control compounds in which the building blocks at each individual position were varied while the other two positions remained fixed. Those compounds were used to determine whether the individual building blocks could be successfully incorporated into the library. The complete PS-SCL was composed of three sublibraries, each containing the same 25,300 individual compounds (11 x 25 x 92) [60].
Modified dipeptide SCLs having four positions of diversity were selected as starting materials for the solid phase synthesis of cyclic urea and thiourea libraries.
6.6 Conclusions
6.7 Acknowledgments
Work carried out in the authors' laboratory was funded in part by National Cancer
Institute Grant No. CA78040 (Houghten).
6.8 References
1 Anonymous (1998) Breakthrough of the year. Science 282: 2156-2161
2 Lebl M (1999) Parallel personal comments on "classical" papers in combinatorial chemistry. J Comb Chem 1: 3-24
3 Houghten RA, Pinilla C, Appel JR et al (1999) Mixture-based synthetic combinatorial libraries. J Med Chem 42: 3743-3778
4 Houghten RA (1985) General method for the rapid solid-phase synthesis of large numbers of peptides:
Specificity of antigen-antibody interaction at the level of individual amino acids. Proc Natl Acad Sci
USA 82: 5131-5135
5 Geysen HM, Meloen RH, Barteling SJ (1984) Use of a peptide synthesis to probe viral antigens for
epitopes to a resolution of a single amino acid. Proc Natl Acad Sci USA 81: 3998-4002
6 Frank R, Doring R (1988) Simultaneous multiple peptide synthesis under continuous flow conditions
on cellulose paper discs as segmental solid supports. Tetrahedron 44: 6031-6040
7 Nefzi A, Ostresh 1M, Houghten RA (1997) The current status of heterocyclic combinatorial libraries.
Chem Rev 97: 449-472
8 Früchtel JS, Jung G (1996) Organic chemistry on solid supports. Angew Chem Int Ed Engl 35: 17-42
9 Früchtel JS, Jung G (1996) Organic chemistry on solid supports - basic principles for combinatorial
chemistry. In: Jung G (ed) Combinatorial peptide and nonpeptide libraries. Verlag Chemie, Weinheim,
19-78
10 Zwick MB, Shen JQ, Scott J (1998) Phage-displayed peptide libraries. Curr Opin Biotechnol 9:
427-436
11 Koivunen E, Arap W, Valtanen H et al (1999) Tumor targeting with a selective gelatinase inhibitor.
Nature Biotechnol 17: 768-774
12 Geysen HM, Rodda SJ, Mason TJ (1986) A priori delineation of a peptide which mimics a discontin-
uous antigenic determinant. Mol Immunol 23: 709-715
13 Houghten RA, Pinilla C, Blondelle SE et al (1991) Generation and use of synthetic peptide combinatorial libraries for basic research and drug discovery. Nature 354: 84-86
14 Lam KS, Salmon SE, Hersh EM et al (1991) A new type of synthetic peptide library for identifying
ligand-binding activity. Nature 354: 82-84
15 Furka A, Sebestyen F, Asgedom M et al (1991) General method for rapid synthesis of multicomponent
peptide mixtures. Int J Pept Protein Res 37: 487-493
16 Dooley CT, Chung NN, Wilkes BC et al (1994) An all D-amino acid opioid peptide with central anal-
gesic activity from a combinatorial library. Science 266: 2019-2022
17 Lam KS, Lebl M, Krchnak V et al (1993) Discovery of D-amino-acid-containing ligands with selec-
tide technology. Gene 137: 13-16
18 Blondelle SE, Takahashi E, Weber PA et al (1994) Identification of antimicrobial peptides using combinatorial libraries made up of unnatural amino acids. Antimicrob Agents Chemother 38: 2280-2286
19 Eichler J, Lucka AW, Houghten RA (1994) Cyclic peptide template combinatorial libraries: Synthesis
and identification of chymotrypsin inhibitors. Pept Res 7: 300-307
20 Blondelle SE, Perez-Paya E, Houghten RA (1996) Synthetic combinatorial libraries: Novel discovery strategy for identification of antimicrobial agents. Antimicrob Agents Chemother 40: 1067-1071
21 Dooley CT, Chung NN, Wilkes BC et al (1994) An all D-amino acid opioid peptide with central anal-
gesic activity from a combinatorial library. Science 266: 2019-2022
22 Dooley CT, Kaplan RA, Chung NN et al (1995) Six highly active mu-selective opioid peptides identi-
fied from two synthetic combinatorial libraries. Pept Res 8: 124-137
23 Kramer TH, Toth G, Haaseth RC et al (1991) Influence of peptidase inhibitors on the apparent agonist
potency of delta selective opioid peptides In vitro. Life Sci 48: 882-886
24 Dooley CT, Houghten RA (1995) Identification of mu-selective polyamine antagonists from a syn-
thetic combinatorial library. Analgesia 1: 400-404
25 Blondelle SE, Houghten RA, Perez-Paya E (1996) Identification of inhibitors of melittin using non-support-bound combinatorial libraries. J Biol Chem 271: 4093-4099
26 Burton DR, Barbas CF, III, Persson MAA et al (1991) A large array of human monoclonal antibodies
to type I human immunodeficiency virus from combinatorial libraries of asymptomatic seropositive
individuals. Proc Natl Acad Sci USA 88: 10134-10137
27 Motti C, Nuzzo M, Meola A et al (1994) Recognition by human sera and immunogenicity of HBsAg
mimotopes selected from an M13 phage display library. Gene 146: 191-198
28 Pinilla C, Appel J, Blondelle SE et al (1995) A review of the utility of peptide combinatorial libraries.
Biopolymers (Peptide Science) 37: 221-240
29 Terret NK, Bojanic D, Brown D et al (1995) Bioorg Med Chem Lett 5: 917-922
30 Pinilla C, Appel JR, Blanc P et al (1992) Rapid identification of high affinity peptide ligands using
positional scanning synthetic peptide combinatorial libraries. Biotechniques 13: 901-905
31 Spatola AF, Crozet Y, DeWit D et al (1996) Rediscovering an endothelin antagonist (BQ-123): A self-
deconvoluting cyclic pentapeptide library. J Med Chern 39: 3842-3846
32 Szymonifka MJ, Chapman KT (1995) Magnetically Manipulable Polymeric Supports for Solid Phase
Organic Synthesis. Tetrahedron Lett 36: 1597-1600
33 Gallop MA, Barrett RW, Dower WJ et al (1994) Applications of combinatorial technologies to drug
discovery. 1. Background and peptide combinatorial libraries. J Med Chem 37: 1233-1251
34 White SP, Scott DL, Otwinowski Z et al (1990) Crystal structure of cobra-venom phospholipase A2 in
a complex with a transition-state analogue. Science 250: 1560-1563
35 Ostresh JM, Winkle JH, Hamashin VT et al (1994) Peptide libraries: Determination of relative reaction rates of protected amino acids in competitive couplings. Biopolymers 34: 1681-1689
36 Ostresh JM, Schoner CC, Hamashin VT et al (1998) Solid phase synthesis of trisubstituted bicyclic guanidines via cyclization of reduced N-acylated dipeptides. J Org Chem 63: 8622-8623
37 Nefzi A, Ostresh JM, Houghten RA (1999) Parallel solid phase synthesis of tetrasubstituted diethyl-
enetriamines via selective amide alkylation and exhaustive reduction of N-acylated dipeptides.
Tetrahedron 55: 335-344
38 Nefzi A, Ostresh JM, Meyer J-P et al (1997) Solid phase synthesis of heterocyclic compounds from
linear peptides: cyclic ureas and thioureas. Tetrahedron Lett 38: 931-934
39 Nefzi A, Ostresh JM, Giulianotti M et al (1999) Solid-phase synthesis of trisubstituted 2-imidazoli-
dones and 2-imidazolidinethiones. J Comb Chem 1: 195-198
40 Udaka K, Wiesmüller K-H, Kienle S et al (1995) Decrypting the structure of major histocompatibili-
ty complex class I-restricted cytotoxic T lymphocyte epitopes with complex peptide libraries. J Exp
Med 181: 2097-2108
41 Gras-Masse H, Ameisen JC, Boutillon C et al (1992) Synthetic vaccines and HIV-1 hypervariability: a mixotope approach. Pept Res 5: 211-216
42 Janda KD (1994) Tagged versus untagged libraries: Methods for the generation and screening of com-
binatorial chemical libraries. Proc Natl Acad Sci USA 91: 10779-10785
43 Konings DAM, Wyatt JR, Ecker DJ et al (1996) Deconvolution of combinatorial libraries for drug dis-
covery: Theoretical comparison of pooling strategies. J Med Chem 39: 2710-2719
44 Wilson-Lingardo L, Davis PW, Ecker DJ et al (1996) Deconvolution of combinatorial libraries for drug
discovery; experimental comparison of pooling strategies. J Med Chem 39: 2720-2726
45 Blomberg K, Granberg C, Hemmilä I et al (1986) Europium-labelled target cells in an assay of natural killer cell activity. II. A novel non-radioactive method based on time-resolved fluorescence. Significance and specificity of the method. J Immunol Meth 92: 117-123
46 Dorner B, Ostresh JM, Blondelle SE et al (1997) Peptidomimetic synthetic combinatorial libraries. In:
A Abell (ed): Advances in Amino Acid Mimetics and Peptidomimetics Vol 1. JAI Press, Greenwich, CT,
109-125
47 Ostresh JM, Dorner B, Houghten RA (1998) Peralkylation: "Libraries from libraries": Chemical trans-
formation of synthetic combinatorial libraries. In: Cabilly S (ed): Combinatorial peptide library pro-
tocols. Humana Press, Totowa, New Jersey, 41-49
48 Ostresh JM, Dorner B, Blondelle SE, Houghten RA (1997) Soluble combinatorial libraries of peptides,
peptidomimetics, and organics: fundamental tools for basic research and drug discovery. In: Wilson
TR, Czarnik AW (eds): Combinatorial Chemistry. John Wiley & Son, Inc., New York, 225-240
49 Dorner B, Husar GM, Ostresh JM et al (1996) The synthesis of peptidomimetic combinatorial libraries
through successive amide alkylation. Bioorg Med Chem 4: 709-715
50 Ostresh JM, Husar GM, Blondelle SE et al (1994) "Libraries from libraries": Chemical transformation
of combinatorial libraries to extend the range and repertoire of chemical diversity. Proc Natl Acad Sci
USA 91: 11138-11142
51 Ostresh JM, Blondelle SE, Dorner B et al (1996) Generation and use of nonsupport-bound peptide and
peptidomimetic combinatorial libraries. Methods Enzymol267: 220-234
52 Cuervo JH, Weitl F, Ostresh JM, Hamashin VT, Hannah AL, Houghten RA (1995) Polyalkylamine
chemical combinatorial libraries. In: Maia HLS (ed): Peptides 94: Proceedings of the 23rd European
Peptide Symposium. ESCOM, Leiden, 465-466
53 Griffith MC, Dooley CT, Houghten RA, Kiely JS (1996) Solid-phase synthesis, characterization, and
screening of a 43,000-compound tetrahydroisoquinoline combinatorial library. In: Chaiken IM, Janda
KD (eds): Molecular Diversity and Combinatorial Chemistry: Libraries and Drug Discovery.
American Chemical Society, Washington, DC, 50-57
54 Pei Y, Houghten RA, Kiely JS (1997) Synthesis of 4-amino-3,4-dihydro-2(1H)-quinolinones via β-lactam intermediates on the solid-phase. Tetrahedron Lett 38: 3349-3352
55 Nefzi A, Ostresh JM, Houghten RA (1997) Solid phase synthesis of 1,3,4,7-tetrasubstituted perhydro-1,4-diazepine-2,5-diones. Tetrahedron Lett 38: 4943-4946
56 Nefzi A, Giulianotti M, Houghten RA (1998) Solid phase synthesis of 2,4,5-trisubstituted thiomor-
pholin-3-ones. Tetrahedron Lett 39: 3671-3674
57 Nefzi A, Ong NA, Giulianotti MA et al (1999) Solid phase synthesis of 1,4-benzothiazepin-5-one
derivatives. Tetrahedron Lett 40: 4939-4942
58 Meyer J-P, Ostresh JM, Houghten RA (1999) Combinatorial libraries of hydantoin and thiohydantoin
derivatives, methods of making the libraries and compounds therein. U.S. Patent No. 5,859,190
59 Nefzi A, Ostresh JM, Giulianotti M et al (1998) Efficient solid phase synthesis of 3,5-disubstituted
hydantoins. Tetrahedron Lett 39: 8199-8202
60 Ostresh JM, Houghten RA (1999) Combinatorial libraries of imidazol-pyrido-indole and
imidazol-pyrido-benzothiophene derivatives, methods of making the libraries and compounds therein.
U.S. Patent No. 5,856,107
61 Nefzi A, Giulianotti M, Houghten RA (1999) Solid phase synthesis of 1,6-disubstituted 2,3-dike-
topiperazines and 1,2-disubstituted piperazines from N-acylated amino acids. Tetrahedron Lett 40:
8539-8542
62 Nefzi A, Ostresh JM, Houghten RA (2001) Solid-phase synthesis of mixture-based acyclic and hete-
rocyclic small molecule combinatorial libraries from resin-bound polyamides. Biopolymers (Peptide
Science) 60: 212-219
63 Blondelle SE, Crooks E, Ostresh JM et al (1999) Mixture-based heterocyclic combinatorial positional scanning libraries: discovery of bicyclic guanidines having potent antifungal activity against Candida albicans and Cryptococcus neoformans. Antimicrob Agents Chemother 43: 106-114
64 Nefzi A, Giulianotti MA, Houghten RA (2001) Solid-phase synthesis of bis-heterocyclic compounds
from resin-bound orthogonally protected lysine. J Comb Chem 3: 68-70
Modern Methods of Drug Discovery 125
ed. by A. Hillisch and R. Hilgenfeld
© 2003 Birkhäuser Verlag/Switzerland
7 Quantification of molecular diversity and design of compound libraries
Hans Matter
Aventis Pharma Deutschland GmbH, DI&A Chemistry, Molecular Modelling, Building G878, D-65926 Frankfurt am Main, Germany
7.1 Introduction
7.2 The similarity principle as basis for rational design
7.3 Molecular descriptors to quantify diversity
7.3.1 2D Descriptors
7.3.2 3D Descriptors
7.3.3 Alternative approaches to quantify molecular similarity
7.4 Subset selection approaches
7.4.1 Different types of libraries
7.4.2 Maximum dissimilarity methods
7.4.3 Cluster analysis
7.4.4 Cell-based methods
7.4.5 Other selection approaches
7.5 Descriptor validation studies
7.6 Comparing compound libraries
7.7 Design of combinatorial libraries
7.8 Design of drug-like libraries
7.9 Conclusions
7.10 Acknowledgements
7.11 References
7.1 Introduction
The concepts of molecular diversity and library design are increasingly important for the pharmaceutical industry due to the need to establish efficient technologies with the potential to shorten the drug discovery process. This paradigm shift, driven by cost pressure, went along with the advent of high-throughput methods and produced an exponential increase in data from combinatorial chemistry [1] (see Chapter 6) and high-throughput screening (HTS, Chapter 4). Following this technological change, theoretical concepts for effective experimental design and data analysis were summarized under the term "chemoinformatics" [2], defined as "mixing of information resources to transform data into information and information into knowledge, for the intended purpose of making better decisions faster in the arena of drug lead identification and optimization".
In this chapter, computational approaches to diversity analysis, library design and comparison, compound selection, and prediction of drug-likeness are reviewed. The similarity principle, which forms the conceptual basis for chemical diversity, is connected to molecular descriptors for an objective evaluation of diversity metrics. If valid descriptors are known, the next issue is the appropriate choice of a selection strategy to sample diverse or similar, but representative, subsets from either existing or "virtual" compound libraries. The discussion of compound libraries is then focused on concepts and practical considerations for their design, assessment and comparison [3].
The ability to synthesize huge libraries in various formats is unprecedented in medicinal chemistry. Although combinatorial chemistry (see Chapter 6) started with large libraries of mixtures followed by deconvolution strategies, automated parallel or combinatorial synthesis of smaller, single-compound libraries is today mainly applied, and the following sections refer to this library strategy [4]. For the identification of novel bioactive compounds, a combination of historic compound collections and combinatorial libraries is often utilized for high-throughput screening or focused biological testing [5]. Historic libraries are characterized by a limited number of templates from earlier projects. Hence, it is necessary to expand their structural diversity by directed compound acquisition or synthesis [6].
Although combinatorial chemistry allows the synthesis of large numbers of compounds, rational design is required to concentrate on the "best possible library" and to optimize resources for a successful discovery strategy. As experimental techniques are now sufficiently developed, it is possible to plan synthetic schemes for producing large populations of small, drug-like entities by systematic, repetitive connection of individual building blocks with varying chemical topology. Often more than 1000 reactants are commercially available for each randomization position, so appropriate selection strategies must be applied. The identification of "redundant" compounds based on molecular diversity considerations is a key requirement, as any reduction in the number of compounds that removes redundancy from a database without introducing voids should improve research efficiency [7]. Even with new miniaturization technologies (see Chapter 4), smaller subsets are essential to handle more assays in a given time frame.
However, it is not only molecular diversity that makes a synthetically feasible combinatorial library a good choice. The integration of drug development activities into early stages of discovery [8] is increasingly important to rapidly flag those molecules which are unlikely to become drugs. Hence, the incorporation of experimental knowledge into computational methods for filtering and prediction of "drug-likeness" is valuable, permitting the rapid and cost-effective elimination of candidates prior to synthesis [9]. Here, approaches towards the prediction of drug-likeness in a general sense, as well as of intestinal absorption through passive transport and of penetration of the blood-brain barrier, will be presented; these are increasingly incorporated into library design, following the earlier interest in Lipinski's "rule-of-five" [10] (see also Chapter 12).
7.2 The similarity principle as basis for rational design
It was stated earlier that general screening libraries should minimize redundancy using reactants with diverse structures, while for focused libraries exploring the neighborhood of a lead, monomers with features similar to individual building blocks should be employed [11]. This follows the similarity principle [12], stating that structurally similar molecules are likely to exhibit similar physicochemical and biological properties. Hence, it should be possible to predict biological properties of a structurally related compound given known activities for its nearest neighbors. Most approaches to quantify similarity [13] originate from methods like substructure and similarity searching [14], clustering [15] and quantitative structure-activity relationships (QSAR) [16]. The use of very similar molecules does not enhance the probability of finding novel structural classes for one target, while dissimilar molecules should enhance the probability of finding interesting leads on different targets. Such a diverse subset should cover a wide range of diverse, non-redundant compounds. Not only the risk of missing an interesting structural class should be low, but also the total number of compounds. When optimizing a hit, on the other hand, the use of similar molecules is essential to establish a sound structure-activity relationship (SAR), which differs significantly from lead finding.
While this seems to be a clear concept, reality is more complicated, as there is no objective definition of "molecular diversity". One should be aware that all descriptors are derived a priori from molecular structures, and two molecules might be similar given one descriptor while being dissimilar using another. Similar molecules sometimes exhibit significant differences in biological activities due to differences in protein-ligand interactions, flexibility, binding mode, mechanism of action and others [17]. Hence, for any valid descriptor the number of exceptions from the similarity principle should be minimized by considering only relevant properties.
While historic databases are characterized by clusters of molecules, an optimal distribution of molecules should avoid redundancies for primary screening. An optimal selection would pick only dissimilar compounds outside the smallest acceptable distance between two molecules (similarity radius), resulting in a diverse subset. Such a distribution may help to identify structurally different molecules being active on the same target, supposing that the interest lies in the differentiation of active versus inactive compounds regardless of a quantification of activity. In contrast, establishing a structure-activity relationship should be the subject of a lead optimization project starting from targeted libraries, enumerated within the similarity radius around known hits and enriched by known privileged structures.
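One way to realize such a similarity-radius selection is greedy sphere exclusion. In the minimal sketch below, random vectors stand in for descriptor values and Euclidean distance is assumed as the dissimilarity measure:

```python
import numpy as np

def sphere_exclusion(X, radius):
    """Accept a compound, then discard every remaining candidate that lies
    within the similarity radius of an accepted compound."""
    remaining = list(range(len(X)))
    selected = []
    while remaining:
        pick = remaining.pop(0)
        selected.append(pick)
        remaining = [i for i in remaining
                     if np.linalg.norm(X[i] - X[pick]) > radius]
    return selected

rng = np.random.default_rng(0)
X = rng.random((200, 4))              # 200 compounds, 4 descriptors
print(len(sphere_exclusion(X, 0.4)))  # size of the diverse subset
```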
string, while they also denote the presence of predefined functional groups, rings and atoms in 60 of a total of 988 bits. In contrast to fragment fingerprints, hashed fingerprints describe structures with a similar quality independent of a predefined fragment dictionary, which might be focused towards pharma- or agro-relevant molecules. They were originally combined with non-hierarchical clustering for similarity-based compound evaluations [30] and in numerous diversity studies. Critical assessments of 2D fingerprint-based similarity considerations have been reported [31], while compact representations for storing chemical information (mini-fingerprints) have recently been applied [32].
Encoding molecular structures in the form of bitstrings allows molecular similarity to be quantified using different functions, like the Tanimoto [33] or cosine coefficients [34], while for non-bitstring descriptors the Euclidean distance is often used. Both coefficients count the number of bits set to 1 that two structures have in common, in slightly different ways. The Tanimoto coefficient is widely used in database analysis, as it has properties that make work with larger datasets very efficient. A similarity coefficient of 0 means that both structures have no "1" bits in common.
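For fingerprints stored as sets of "on" bit positions, the Tanimoto coefficient c/(a + b - c), with a and b the bit counts of the two structures and c the number of common bits, reduces to a few lines (a minimal sketch with made-up bit positions):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: common on-bits divided by
    (on-bits in A + on-bits in B - common on-bits)."""
    common = len(fp_a & fp_b)
    return common / (len(fp_a) + len(fp_b) - common)

fp_a = {3, 17, 42, 98, 512}  # positions of bits set in fingerprint A
fp_b = {3, 17, 98, 640}      # positions of bits set in fingerprint B
print(tanimoto(fp_a, fp_b))  # 0.5; 0.0 would mean no "1" bits in common
```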
Atom pair fingerprints as 2D descriptors count the shortest path of bonds between two atom types [35]. Each bit in the bitstring then corresponds to a pair of atom types separated by a fixed number of bonds. This concept has been extended to represent atoms by physicochemical properties rather than element types [36]. Based on these pharmacophoric atom types, a method for detecting meaningful common topological substructures among sets of active compounds was described, based on a clique-based subgraph detection method to find the highest scoring common substructure for each pair of molecules [37]. Pharmacophore definition groups rather than atom types have also been successfully used by others for similarity and diversity investigations [11, 38].
A conceptually similar implementation of 2D pharmacophoric fingerprints within a 2D topological cross-correlation vector was successfully applied for identification of isofunctional molecules with significantly different backbones, thus demonstrating its potential for "scaffold-hopping" by topological pharmacophore searching [39]. The higher abstraction level, when introducing a pharmacophore concept into common descriptors, facilitates the identification of molecules with less similarity at the 2D structural diagram and 2D fingerprint level. Hence, for any objective assessment of a novel descriptor, not only the quantity of hits but also their quality needs to be assessed. A combination of 2D fragments with the medicinal chemistry concept of pharmacophoric groups should result in powerful and fast molecular descriptors.
2D or 3D autocorrelation vectors [40] have also been used as diversity descriptors, representing intramolecular 2D topologies or 3D distances. An autocorrelation coefficient is a sum over all atom pair properties separated by a predefined number of bonds (2D) or a distance (3D), while the entire vector represents a series of coefficients for all topological or cartesian distances. This transformation converts a vector of variable length into information of fixed length for the application of chemometrical comparison and evaluation approaches. Their drawback is that the transformation back into the original vector for chemical interpretation is not trivial. Atomic properties mapped onto individual atoms include hydrophobicity [41], partial atomic charges, hydrogen bonding properties and others. To reduce the number of variables and the model dimensionality prior to selection or analysis, a principal component analysis (PCA) [42] is often applied. Alternatively, the autocorrelation vectors can be projected into 2D space using a Kohonen network [43] as a nonlinear mapping approach. Three-dimensional autocorrelation vectors of properties based on distances calculated from 3D molecular surfaces [44] have also been applied to visually assess the diversity of different libraries [45].
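A minimal sketch of such a 2D autocorrelation vector, assuming a precomputed topological (bond-count) distance matrix and atomic property values for a toy four-atom chain:

```python
import numpy as np

def autocorrelation_2d(props, topo_dist, max_d=5):
    """For each bond distance d, sum the products p_i * p_j over all atom
    pairs separated by d bonds; the vector length is fixed regardless of
    molecule size."""
    vec = np.zeros(max_d + 1)
    n = len(props)
    for i in range(n):
        for j in range(i, n):
            d = topo_dist[i][j]
            if d <= max_d:
                vec[d] += props[i] * props[j]
    return vec

topo_dist = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
props = [0.20, -0.10, 0.40, 0.30]  # e.g., partial charges mapped onto atoms
print(autocorrelation_2d(props, topo_dist))
```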
BCUT values were developed by Pearlman [46] by generating an association matrix from the connection table of a particular molecule and then adding atomic charge, polarizability or hydrogen-bonding properties as diagonal elements, while non-diagonal elements encode intramolecular connectivities. For each matrix from each atomic property, the lowest and highest eigenvalues are extracted as descriptors, resulting in a low-dimensional chemistry space encompassing only six individual descriptors. The same approach has also been used to derive 3D descriptors by replacing connectivity with interatomic distances derived from a single 3D structure of a compound. These BCUT values are mainly used in combination with cell-based partitioning methods [46, 47] for compound selection and classification, while they have also been shown to be effective in QSAR studies [48].
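The eigenvalue extraction at the heart of BCUT descriptors can be sketched with numpy; the constant off-diagonal bond term and the toy connectivity below are illustrative assumptions, not Pearlman's exact parameterization:

```python
import numpy as np

def bcut_pair(connectivity, atomic_property, off_diag=0.1):
    """Place an atomic property on the diagonal of an association matrix,
    encode bonds off the diagonal, and keep the lowest and highest
    eigenvalues as a two-number descriptor for that property."""
    m = np.asarray(connectivity, dtype=float) * off_diag
    np.fill_diagonal(m, atomic_property)
    eigenvalues = np.linalg.eigvalsh(m)  # the matrix is symmetric
    return eigenvalues.min(), eigenvalues.max()

# Toy three-atom chain with partial charges as the diagonal property:
connectivity = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(bcut_pair(connectivity, [0.12, -0.30, 0.05]))
```

Repeating this for charge, polarizability and hydrogen-bonding properties yields the six-descriptor chemistry space mentioned above.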
Topological indices encode the molecular connectivity pattern [49] and quantify features like atom types, numbers of atoms and bonds, and the extent and position of branchings and rings in a set of single-valued real numbers. In addition, electrotopological state values [50], molecular shape indices [51] and topological symmetry indices are also used. A variety of different indices have been developed and successfully applied in QSAR studies, diversity studies and database comparisons [52]. Often the large number of correlated indices is condensed by principal component analysis to a few orthogonal principal properties.
Similar to topological indices are global property molecular descriptors like size, computed logP, molecular refractivity, and key functional group counts. These descriptors characterize molecules using a single number representing the underlying physicochemical property. It was reported earlier that a basis set of six out of 49 2D descriptors are only weakly correlated with each other, suggesting their use to quantify diversity [47]. Using a genetic algorithm [53], it was possible to derive an effective set of descriptors from 111 possible ones computed from a molecule's 2D structure for the classification of compounds into different biological activity classes. Interestingly, it was reported that only four descriptors, accounting for aromatic character, hydrogen bond acceptors, estimated polar van der Waals surface area, and a single structural key, resulted in the overall best performance, suggesting that only a few critical descriptors are needed to partition compounds according to their activity. Such holistic descriptors are attracting even more interest for incorporation into library filtering tools, as most of the properties captured by these descriptors correlate with pharmacokinetic observables or "drug-likeness" [54, 55].
correlate with three other proteins from the reference panel. Although those fingerprints can be used like conventional descriptors, they require additional experimental effort plus synthetic material for profiling. Dixon and Villar [71] used affinity fingerprints with Euclidean distances and a simple optimization algorithm to design diverse libraries from compound collections. They also showed that a wide range of structurally diverse active molecules can be detected using the affinity data of a first screening experiment.
Starting from this biochemical concept, in silico affinity fingerprints were computed [72] by estimating binding affinities for a series of molecules against known 3D protein cavities in the reference panel using docking programs and scoring functions [73]. Although it performs less well than 2D techniques, this approach led to an enrichment of structurally diverse hits at the 2D level. For the selection of proteins for the reference panel, optimization schemes were developed to increase the predictivity of in silico affinity fingerprints [74]. It was shown that this Flexsim-X approach is useful for detecting molecules with similar biological activities from different classes without prior knowledge of the target protein structure.
Most descriptors represent 2D or 3D molecular features in a linear bitstring or vector, as such a representation allows two molecules to be compared by a simple, rapid mapping between their vectors using a similarity coefficient (Tanimoto, cosine, Euclidean distance). Hence, essential information about the topological arrangement of features is only partially conserved or lost. This led to the development of the feature tree descriptor [38], which describes molecules by a tree structure. The nodes of this tree represent properties of individual molecular fragments, subdivided into steric and chemical features, while edges connect nodes in correspondence to fragments joined in the 2D chemical structure. For similarity evaluation, a mapping between parts of the trees must be computed, which is done by novel algorithms [38]. This descriptor introduces new concepts into descriptor technology. Due to its tree structure, the overall molecular structure is represented as a trade-off between exactness and speed. In contrast to 2D fragment-based methods, feature trees detect similarities if similar interaction patterns are possible, thus being an extension of the 2D topological pharmacophore approach discussed above [36, 39] with similar, and sometimes improved, performance compared to 2D fingerprints.
7.4 Subset selection approaches
There are several subset selection approaches for virtual libraries or compound collections, ranging from computationally demanding clustering [75] to fast maximum dissimilarity [18, 34, 76] and cell-based partitioning methods [46, 47]. A comparison of cell-based partitioning applications with similarity selections is given in [77] (see also Chapter 10), while reviews of this field have also appeared [3b, 19, 23]. Furthermore, stochastic methods like genetic algorithms and simulated annealing are becoming increasingly popular, mainly for the selection of focused libraries.
Selection strategies depend on the purpose of a library. For a novel, uncharacterized target, a generic screening library should encompass a larger number of molecules with many structural motifs. Such a generic screening library will also be used in general screening for other new targets in the search for new leads. Much attention has been drawn to rational design for enhancing the structural variation in those libraries while maintaining moderate library sizes [78]. Due to the lack of target classification, structural information and lead compounds, there is typically no information on constraints to be imposed during library design. Representative subsets should span the entire chemical property space of the parent library with a smaller number of preferably drug-like compounds. The intention to consider only drug-like compounds might be valuable to introduce at least some constraints into the design procedure. Applying such a design strategy requires additional screening iterations to test compounds similar to identified hits (analog libraries) for deriving structure-activity relationships. The ultimate goal of any rational design is to make sure that every new compound adds new information to the existing HTS compound collection [7]. For such databases an optimally diverse subset [7, 78] is designed, which includes as few compounds as possible but is still representative of the entire collection.
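A common way to construct such an optimally diverse subset is maximum-dissimilarity (MaxMin) selection; a minimal sketch on random descriptor vectors with Euclidean distances:

```python
import numpy as np

def maxmin_select(X, k):
    """Start from one compound, then repeatedly add the candidate whose
    minimum distance to the compounds already selected is largest."""
    selected = [0]
    min_dist = np.linalg.norm(X - X[0], axis=1)
    for _ in range(k - 1):
        pick = int(np.argmax(min_dist))  # most distant remaining candidate
        selected.append(pick)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[pick], axis=1))
    return selected

rng = np.random.default_rng(0)
X = rng.random((500, 6))     # 500 compounds in a 6D descriptor space
print(maxmin_select(X, 10))  # indices of a 10-compound diverse subset
```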
If more information on the target is known, like its classification into a protein family, the design should bias the library towards structural target features and privileged structures from initial hits. When no structure-activity relationship can be generated, global similarity considerations during the selection from the underlying virtual library should be used, leading to biased or targeted libraries. Those targeted libraries might be directed towards particular protein families like metalloproteinases, kinases, ion channels or G-protein coupled receptors (see also Chapter 10). Here, initial information can be collected from the literature and corporate projects and subsequently turned into constraints plus privileged structures for library design.
With more information (3D-QSAR models [79], pharmacophore hypotheses [60], 3D protein structure), global similarity should be replaced by local similarity in the design, i.e., 3D regions or functional groups which are correlated with biological activity should be considered. This leads to focused libraries, where the available information on target and lead structures is high and thus constrains the structural space.
Subsequently, for every validated primary screening hit from rationally designed libraries, the nearest neighbors must be identified using similarity searching [80], synthesized if not already available, and tested in a second iteration to develop an initial SAR (structure-activity relationship). If the nearest neighbor originates from a virtual library, there is a high probability that it can be made using the same validated reaction protocol, which increases efficiency.
Cluster analysis [75, 86], as an alternative method, offers more specific control by assigning every structure to a group that exhibits a high degree of both intra-cluster similarity and inter-cluster dissimilarity. A variety of different clustering techniques have been critically evaluated in the literature [33, 80]. There are no a priori guidelines as to which technique will be most appropriate for a given dataset, although some methods perform better at grouping similar compounds [26].
Cluster methods can be divided into hierarchical and non-hierarchical methods. The idea of hierarchical agglomerative clustering is to start with singleton clusters and to sequentially merge the two clusters with minimal inter-cluster distance. This approach does not require any assumption about the number of clusters to generate, but it is the computationally most expensive technique. Different definitions can be used to compute distances between clusters: the distance between the closest pair of data points in the two clusters (single linkage clustering), the distance between the most distant pair of data points in the two clusters (complete linkage clustering), the average over all pairwise distances between data points of the two clusters (average linkage), and the distance between the two cluster centroids (centroid linkage). For compound selection, typically one or more structures closest to the cluster centers are chosen to form a representative set, as in the sketch below.
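A minimal sketch of these linkage variants, assuming NumPy and SciPy are available; the descriptor matrix is a toy example.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# toy descriptor matrix: rows = compounds, columns = physicochemical descriptors
X = np.array([[1.0, 0.2], [1.1, 0.3], [5.0, 4.8], [5.2, 5.0], [9.0, 0.1]])
dists = pdist(X, metric="euclidean")       # condensed pairwise distance matrix

# the four inter-cluster distance definitions named above
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(dists, method=method)      # agglomerative merge tree
    print(method, fcluster(Z, t=3, criterion="maxclust"))

# representative selection: the compound closest to each cluster centroid
labels = fcluster(linkage(dists, method="average"), t=3, criterion="maxclust")
for c in np.unique(labels):
    members = np.where(labels == c)[0]
    centroid = X[members].mean(axis=0)
    rep = members[np.argmin(np.linalg.norm(X[members] - centroid, axis=1))]
    print(f"cluster {c}: representative compound index {rep}")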
Cell-based or partitioning methods [46, 47, 52] are tailored to work in low-dimen-
sional descriptor space by defining a small number of bins on each descriptor axis.
Those bins are used to partition a multidimensional space into smaller subspaces
(cells). Each compound to be analyzed in a virtual or existing library is assigned to one of the cells (squares in two dimensions, cubes in three, hypercubes in higher dimensions) according to its descriptor values, i.e., physicochemical properties. The underlying assumption, following the similarity principle, is that compounds in one cell share similar biological properties. Diverse subsets might be derived by choosing one or more representative molecules from each occupied cell, while for focused libraries all molecules from the cell containing the hit motif plus neighboring
cells should be selected. In a recent study [90], different binning schemes were
compared in terms of their ability to provide an even distribution of molecules
across the chemistry space and to maximize the number of active molecules for
hypothetical assay data.
This approach makes it possible to efficiently identify unoccupied regions in descriptor space and thus to guide rational compound acquisition, which is not feasible with purely distance-based selection techniques; a minimal binning sketch follows below.
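A minimal binning sketch in pure Python; the two descriptors, their ranges and the bin count are arbitrary illustrations of the cell-based idea.

def cell_index(descriptors, mins, maxs, bins_per_axis):
    """Map a descriptor vector onto the index of its (hyper)cube in a gridded space."""
    idx = []
    for d, lo, hi in zip(descriptors, mins, maxs):
        frac = (d - lo) / (hi - lo) if hi > lo else 0.0
        idx.append(min(int(frac * bins_per_axis), bins_per_axis - 1))
    return tuple(idx)

# toy 2D chemistry space (logP, molecular weight); all values are hypothetical
compounds = {"cpd1": (1.2, 250.0), "cpd2": (1.3, 260.0), "cpd3": (4.5, 480.0)}
mins, maxs = (0.0, 100.0), (5.0, 500.0)

cells = {}
for name, desc in compounds.items():
    cells.setdefault(cell_index(desc, mins, maxs, 5), []).append(name)

print(cells)  # occupied cells -> member compounds
empty = {(i, j) for i in range(5) for j in range(5)} - set(cells)
print(len(empty), "empty cells, i.e., unoccupied regions for acquisition")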
Selection techniques from statistical experimental design have also been used for subset selection, such as D-optimal design [91] to maximize diversity in reactant collections for combinatorial chemistry. This approach makes it possible to incorporate a variety of different descriptors into the design, although D-optimal design is known to select molecules at the edges of the multidimensional descriptor space. Another strategy by Wold et al. [92] uses experimental design (factorial design, fractional factorial design and D-optimal design [93]) as the key component for selecting diverse building blocks, whose combination should result in a diverse product library. Relevant descriptors are chosen specifically for the problem. Then a PCA (principal component analysis) is applied to reduce the dimensionality of the descriptor space, followed by statistical molecular design for selecting those building blocks which optimally represent the chemistry space defined by the PCA scores (principal properties). Although this approach works on building blocks rather than enumerated products, it led to diverse libraries. No difference in efficiency was found between building-block and product-space selections when the building block space was critically investigated and an appropriate number of reactants was selected. A toy version of this PCA-based selection is sketched below.
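The following toy version illustrates the idea of selecting building blocks from principal-property space; the extremes-based pick is a crude stand-in for a real D-optimal design, and all descriptor values are invented.

import numpy as np

# toy descriptor table for candidate building blocks (rows) vs. descriptors (columns)
D = np.array([[1.0, 200.0, 3.0],
              [1.2, 210.0, 2.0],
              [4.8, 450.0, 8.0],
              [0.5, 150.0, 1.0],
              [5.0, 460.0, 7.0]])

# standardize, then PCA via singular value decomposition
Z = (D - D.mean(axis=0)) / D.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T[:, :2]          # principal properties (first two PCs)

# crude stand-in for a D-optimal pick: building blocks spanning the PC space,
# here simply the extremes along each principal property
picked = {int(np.argmin(scores[:, k])) for k in range(2)} | \
         {int(np.argmax(scores[:, k])) for k in range(2)}
print("selected building blocks:", sorted(picked))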
However, it is still an open question whether a selection in reactant or product space leads to more diverse libraries [94]. Certainly the most diverse library can be obtained by an unconstrained diversity selection on the full product matrix, but this approach violates the combinatorial scheme and thus is not very efficient when addressing real chemistry problems. Although parallel synthesis schemes are compatible with this pure diversity selection once the logistic problems are solved [7], a more economical solution is preferable for most synthesis schemes.
In contrast to the results from Wold et al. [92], Gillet et al. [94] showed that product-based design using a genetic algorithm, combinatorial constraints and a fingerprint-based library diversity measure as fitness function results in more diverse libraries than a simple reactant-based design, although the maximum diversity can only be achieved by product-based pure diversity selection disregarding the combinatorial scheme (see Fig. 1). These conflicting results can be understood by looking at the different sets of descriptors and validation approaches employed in the two studies. Another study [95] comparing the efficiency of reagent-based selections versus product-based selections under combinatorial constraints revealed that the advantage of working in product space indeed depends on the descriptors: for some descriptors product-based approaches provide an advantage, whereas for others results from reactant pools are comparable. This suggests that for each descriptor the ability to relate diversity from reagent to product space should be evaluated.
Several stochastic optimization techniques have been applied to select appropriate subsets, mainly for focused libraries directed towards a particular target. Genetic algorithms [96] were utilized to identify a diverse set of compounds using topological similarity to a known lead structure as fitness function [97], while a direct optimization of experimental biological activities with a genetic algorithm was applied iteratively to identify trypsin inhibitors in a virtual combinatorial library [98].
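A stripped-down, mutation-only genetic algorithm for subset selection is sketched below to make the idea concrete; the fitness function (average pairwise distance) and all data are illustrative stand-ins for the published fingerprint- or activity-based fitness functions.

import random

def fitness(subset, dist):
    """Average pairwise distance within the subset (a simple diversity measure)."""
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    return sum(dist[a][b] for a, b in pairs) / len(pairs)

def mutate(subset, pool_size):
    """Swap one selected compound for a random unselected one."""
    child = subset[:]
    child[random.randrange(len(child))] = random.choice(
        [i for i in range(pool_size) if i not in child])
    return child

def ga_select(dist, subset_size, generations=200, pop_size=20):
    pool_size = len(dist)
    pop = [random.sample(range(pool_size), subset_size) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda s: fitness(s, dist), reverse=True)
        pop = pop[: pop_size // 2]                       # survival of the fittest
        pop += [mutate(random.choice(pop), pool_size) for _ in range(pop_size // 2)]
    return max(pop, key=lambda s: fitness(s, dist))

# toy symmetric distance matrix for 6 compounds
dist = [[abs(i - j) / 5 for j in range(6)] for i in range(6)]
print(ga_select(dist, subset_size=3))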
Figure 1. Comparison between pure diversity selection and selection under combinatorial constraints to obtain a representative subset [94]. The figure contrasts the full combinatorial product matrix (R1 + R2 -> Product(1,2)) with a diversity selection under combinatorial constraints and a pure diversity selection over the R1/R2 building blocks. Although pure diversity selection results in more diverse subsets, it is not economical, as most building blocks have to be used only once. With constrained selection a synthetically efficient and diverse combinatorial subset can be identified.
This variety of methods immediately raises the question of which approaches are useful for library design, comparison and subset selection. Any validation study addressing this issue must evaluate whether a method can discriminate between actives and inactives for a particular target. As one quantitative performance measure, the number of actives belonging to the same activity class as a reference compound can be used. Such an enrichment factor [38] is defined as the number of hits in the first percent of the database, sorted according to the similarity measure, divided by the number of hits expected from a random selection. Intuitively, the enrichment factor describes how much is gained by using a similarity measure for selecting molecules compared to a random selection.
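A minimal implementation of this enrichment factor, under the assumption that activity labels and similarity scores are already known; the data are invented for illustration.

def enrichment_factor(scores, is_active, top_fraction=0.01):
    """Hits in the top-ranked fraction relative to hits expected at random.

    scores    -- similarity of each database compound to the reference
    is_active -- parallel list of booleans (same activity class as reference?)
    """
    n = len(scores)
    n_top = max(1, int(round(n * top_fraction)))
    ranked = sorted(zip(scores, is_active), reverse=True)
    hits_found = sum(active for _, active in ranked[:n_top])
    hits_expected = sum(is_active) * n_top / n   # random-selection baseline
    return hits_found / hits_expected

# toy data: 10 compounds, 3 actives; the similarity measure ranks actives highly
scores =    [0.9, 0.8, 0.7, 0.4, 0.3, 0.3, 0.2, 0.2, 0.1, 0.1]
is_active = [True, True, False, True, False, False, False, False, False, False]
print(enrichment_factor(scores, is_active, top_fraction=0.2))  # 2 found vs. 0.6 expected -> 3.3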
Several physicochemical descriptors were investigated to uncover the relation-
ship between 2D/3D similarity and biological activity. The first study towards an
objective validation of descriptors was done by Brown and Martin [26], who
assessed the ability of 2D and 3D descriptors combined with hierarchical and non-
hierarchical clustering techniques to separate active from inactive compounds. An
active cluster contains at least one active compound for a particular assay, and the
active cluster subset refers to the total number of molecules (actives and inactives)
in all active clusters. Then the proportion of active structures in active clusters is computed and compared to the proportion of active structures in the entire dataset. Any increase in this proportion indicates a tendency to group active compounds. A superior behavior of 2D descriptors (MACCS keys, Daylight and UNITY fingerprints) compared to 3D descriptors was found, which was partially attributed to the less mature state of 3D descriptor development and to the problem of treating conformational flexibility. A subsequent study demonstrated that a variety of 2D descriptors, MACCS keys in particular, implicitly contain information about physicochemical properties known to be relevant for stabilizing protein-ligand complexes [21], thus suggesting why those simple descriptors might be successful in separating active from inactive compounds. The information content of each descriptor was assessed by its ability to accurately predict physicochemical properties from known values for related structures using similarity-based and cluster-based property prediction.
The correspondence between chemical and biological similarity was evaluated
using structurally similar compounds known to act on different biological targets.
Figure 2. Generation of neighborhood plots for descriptor validation [102]. For each of the n*(n-1)/2 pairs of n molecules, the descriptor difference (1 - Tanimoto coefficient) is plotted on the x-axis versus the difference in biological space (Act(a) - Act(b)) on the y-axis; the diagonal defines the maximum gradient. For a valid descriptor, a characteristic shape with only few outliers in the upper left triangle is observed. Examples for valid and invalid descriptors are plotted for 138 angiotensin-converting enzyme inhibitors. Left: steric fields; middle: 2D fingerprints; right: molecular weight [103].
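A minimal sketch of how the raw data for such a neighborhood plot can be generated; the fingerprints are toy sets of on-bits, the activities are hypothetical, and the outlier slope is an arbitrary choice rather than the published criterion.

from itertools import combinations

def neighborhood_pairs(fps, activities):
    """All n*(n-1)/2 pairs of (descriptor difference, activity difference)."""
    def tanimoto(a, b):
        return len(a & b) / len(a | b)
    return [(1.0 - tanimoto(fps[i], fps[j]), abs(activities[i] - activities[j]))
            for i, j in combinations(range(len(fps)), 2)]

# toy fingerprints (sets of on-bits) and pIC50-like activities, purely illustrative
fps = [{1, 2, 3}, {1, 2, 4}, {6, 7, 8}]
acts = [7.1, 6.9, 4.0]

for dx, dy in neighborhood_pairs(fps, acts):
    # a valid descriptor keeps points below some maximum gradient: small
    # descriptor differences should not come with large activity differences
    flag = "outlier" if dy > 4.0 * dx else "ok"   # slope 4.0 chosen arbitrarily
    print(f"descriptor diff {dx:.2f}  activity diff {dy:.2f}  {flag}")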
Bridging the gap between 90% and 100% biological representativity, however, requires many more compounds. Remarkably, lower numbers of selected molecules led to almost similar coverage of biological classes for the rational and the random approach.
Thus it was shown that designed subsets are superior in the sense of sampling more biological targets compared to randomly picked ensembles. Two-dimensional fingerprints as molecular descriptors were shown to be appropriate for designing subsets that represent the biological properties of parent databases. They perform better than other descriptors carrying 2D or 3D molecular information when their sampling properties are compared (see Fig. 3). All studies reveal that 2D fingerprints, alone or in combination with other metrics as primary descriptor, are well suited to handle global diversity.
In a recent validation study of diversity selection approaches in combination with representativity techniques (selection of central compounds in a cluster, Kohonen neural networks, nonlinear scaling of descriptors) and descriptors (topological and 3D fingerprints), the differences between diversity and representativity were explored [105]. The authors report that only a cluster analysis operating on fingerprints or whole-molecule descriptors outperforms a random selection in a database of diverse drug compounds.
Another validation concept, activity-seeded structure-based clustering, was introduced by Pearlman et al. [106]; it probes the assumption that active structures are grouped in chemistry space. The authors report that 74 active ACE inhibitors are found in only three clusters occupying less than 0.02% of the chemistry space of the entire MDDR database.
Those concepts for validating descriptors make it possible to apply them to the design and comparison of compound libraries. Such a comparison should be based on the degree of complementarity to a reference database in a common reference frame, for example when a candidate database is added to enhance the number of structural motifs within a corporate collection. Such a candidate database might be an external collection of available compounds or a virtual library for combinatorial synthesis.
Database self-similarity and comparison plots based on the distribution of pairwise distances of validated descriptors between both datasets are useful tools to monitor their diversity, representativity and complementarity [27, 78]. In a typical application [78], the Tanimoto coefficient of every reference database structure to its nearest neighbor is computed using 2D fingerprints and used to generate a histogram; the maximum Tanimoto coefficient represents the closest pair of any two compounds in the dataset (a minimal histogram sketch is given below). For an idealized database designed using a Tanimoto coefficient of 0.85 as similarity radius (Fig. 4, far left panel), defining the boundary between similar and diverse, a self-similarity plot (Fig. 4, central left column) with a peak on the far left side is expected, while the comparison with the parent virtual library should not reveal any voids, leading to a histogram (Fig. 4, central right column) with a peak and a cut-off at the right side of this similarity radius.
Figure 3. Upper panel: Cluster separation analysis [26] from hierarchical clustering of the IC93 database using 2D fingerprints (FP) and 3D pharmacophoric triplets (PDT) in comparison to an average of 10 analyses using random numbers (RandomAv) [61]. Tanimoto (Tan) and cosine (Cos) coefficients are both evaluated. The average proportion p on the y-axis is plotted versus the number of clusters generated at different levels of the dendrogram. This analysis evaluates whether a particular descriptor is able to group biologically similar compounds. Lower panel: Coverage analysis using the same database and descriptors [61]. This analysis evaluates the percentage of biological classes covered from the database in various subsets generated using maximum dissimilarity selection.
Figure 4. Database comparison histograms to illustrate a database selection generated using a sphere-exclusion algorithm with a predefined similarity radius [7, 78]. The left panel shows a simplified representation of this database as a distribution of molecules (filled circles) in an arbitrary 2D molecular property space. In the middle left panel idealized self-similarity histograms are shown, while the plots in the middle right panel are obtained by comparing the database subset to the entire database. The right column refers to plots obtained by comparison to a corporate database. Dotted lines indicate the similarity radius for exclusion.
Another approach is based on spatial diversity and evaluates the number of different functional groups within a certain 3D distance for a single database or a pair of databases [108]. Other ways of comparing databases and detecting complementary subsets involve profiling and categorizing based on a list of predefined 2D substructure features [109], ring systems, frameworks [110] or property distributions of relevant physicochemical descriptors, for which optimal ranges have been identified from statistical analysis of drug reference databases [55].
During the actual design, new compounds should be unique and thus add useful information to the library tailored for a particular purpose (primary screening library, targeted library or focused library). Such a library should exhibit a wide coverage of the physicochemical property space, and any member should be drug-like to add biological relevance to the screening collection. For the design of a primary screening library, the inclusion of two very similar compounds does not enhance the chance of finding different types of biological activity or novel structural motifs related to biological activity for the same target. It would be advantageous to replace one of them with a more dissimilar candidate, which indeed is the motivating factor for the design of diverse compound libraries [7, 78] using dissimilarity selection, sphere exclusion, clustering, cell-based selection or other validated selection techniques. Hence, diversity-based selections are utilized to choose an optimal generic library from the possible virtual library, in some cases under combinatorial constraints to increase synthetic efficiency. In contrast, reverse selection strategies need to be utilized for targeted libraries, while similar descriptors and algorithms can be applied. For example, after clustering one would typically choose all compounds in the same cluster as an initial motif. Similarity searching around hits helps to design biased or focused libraries, and this can be followed by a procedure to generate an optimal combinatorial library. The most detailed level of information can be incorporated by scoring a virtual library with 2D- or 3D-QSAR models, pharmacophore models, or empirical protein-ligand scoring functions [111] (see Chapter 10), which is time-consuming and requires a filtering strategy to reduce the virtual library size [112]. The fastest library design method reported to date is based on similarity searching of 3D shapes in combinatorial libraries [63-65]. Thus, the more information is known about a target, the more the choice of descriptors and selection techniques shifts towards successful QSAR or structure-based design descriptors, techniques and methods for introducing a bias into the library.
Another point is to evaluate which of several candidate libraries is more diverse or better suited for a particular screening application. Here, validated quantitative measures of diversity are routinely applied to assess diversity and representativity plus complementarity to any corporate compound collection. It must also be decided whether it is more effective to manufacture a sublibrary variation of an existing library or to establish a new synthesis scheme with a novel scaffold.
Library design typically starts with a new synthetic route or an accessible scaffold, followed by reaction validation to assess the compatibility of potential building blocks with the synthetic scheme. The undesirability of certain reagents, often drawn from reactant databases like the ACD (Available Chemical Directory), is defined by the presence of substructures known to be toxic and/or metabolically hazardous [113], as well as by the presence of groups that could interfere with or prevent the reaction. Reaction validation information has to be considered to exclude reaction-specific unwanted substructures, based on the likelihood of byproduct formation, inertness, reactivity of non-reacting functional groups, stability of products, or solubility; a hedged sketch of such a substructure filter is given below. Rejected and accepted compounds are stored in different files together with physicochemical descriptions, such that a later selection is possible at every stage. This is useful if, at a later stage, an educt is no longer commercially available. Such a refined assortment of reactive building blocks in accord with a reaction protocol is the essential starting point for design, while the final acquisition and maintenance of a building block collection is a resource-intensive activity. Hence, such a collection will be used more than once by integrating it into the design of different libraries around alternative scaffolds with alternative reaction schemes, ending up with a diverse portfolio of complementary libraries.
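A hedged sketch of such an unwanted-substructure filter, assuming RDKit is available; the SMARTS patterns shown are common textbook examples of reactive groups, not a validated exclusion list.

from rdkit import Chem

# illustrative "unwanted reagent" substructures; a real list would come from
# the reaction-validation step and from toxicity/reactivity knowledge
UNWANTED_SMARTS = {
    "acyl halide": "C(=O)[Cl,Br,I]",
    "isocyanate": "N=C=O",
    "epoxide": "C1OC1",
}
UNWANTED = {name: Chem.MolFromSmarts(s) for name, s in UNWANTED_SMARTS.items()}

def classify_reagent(smiles):
    """Return the list of unwanted-substructure flags for one reagent."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return ["unparsable"]
    return [name for name, patt in UNWANTED.items() if mol.HasSubstructMatch(patt)]

accepted, rejected = [], []
for smi in ["CC(=O)Cl", "CCN", "C1CO1"]:
    (rejected if classify_reagent(smi) else accepted).append(smi)
print("accepted:", accepted, "| rejected:", rejected)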
During the library planning phase one should decide whether to use commer-
cially available reagents only or to incorporate proprietary building blocks from
custom synthesis, which requires an initial effort to establish a tailored collection.
Many researchers suggest incorporating building blocks derived from historic medicinal chemistry knowledge (privileged structures) [114], which certainly depends on the library type and motivation, the information about the biological target, and the logistics and availability of suitable reactants.
If a product-based approach is used, acceptable reagents are combined into a virtual library. The theoretical number of individual compounds is given by the product of the numbers of building blocks available at each step (see the sketch below). Efficient searching first rejects compounds using unwanted-substructure queries and property ranges for molecular weight, computed logP, number of hydrogen bond donors and acceptors, and others (see Chapter 12). Then an assessment of the drug-likeness of individual members can be applied before diversity considerations are taken into account, again depending on the type of library to be generated.
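A small illustration of this size arithmetic and of a brute-force enumeration with Python's standard library; the reagent names are placeholders.

from itertools import product
from math import prod

# building blocks accepted per reaction step (toy counts)
step_reagents = [["R1a", "R1b", "R1c"], ["R2a", "R2b"], ["R3a", "R3b", "R3c", "R3d"]]

# the theoretical library size is the product of the block counts per step
print(prod(len(step) for step in step_reagents))   # 3 * 2 * 4 = 24 products

# full enumeration (only feasible for small virtual libraries)
virtual_library = list(product(*step_reagents))
print(virtual_library[0])                          # ('R1a', 'R2a', 'R3a')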
In silico property predictions complement in vitro assays like Caco-2 cell monolayers [118] for membrane permeability, or hepatocytes and liver microsomes to predict possible metabolic instabilities [119], because any in vitro profiling requires physical material, while in silico methods allow the rapid filtering of virtual compound libraries prior to synthesis. For a review on computer-aided prediction of drug toxicity and metabolism with respect to the design of drug-like screening libraries, see Chapter 13.
The importance of logP, the octanol/water partition coefficient, for the distribution of drugs, and approaches towards the estimation of this parameter, have recently been reviewed [120], while a computational study [121] comparing logP prediction methods revealed the superiority of fragmental over atom-based and molecular-property approaches. This and other simple physicochemical parameters were first used by Lipinski et al. [10] in a systematic comparison of several orally absorbed drugs versus molecules that are not expected to be absorbed. A set of very simple filters to flag compounds with respect to their ability to undergo passive transport, known as the "rule-of-five," resulted from this study. This rule for classifying compounds with respect to oral absorption has had a major impact on medicinal chemistry and library design (see Chapter 12); a hedged sketch is given below.
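A minimal sketch of such a rule-of-five check, assuming RDKit is available; the four thresholds follow Lipinski's published rule, while the example molecule is arbitrary.

from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def rule_of_five_violations(smiles):
    """Count Lipinski rule-of-five violations for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return sum([
        Descriptors.MolWt(mol) > 500,
        Descriptors.MolLogP(mol) > 5,      # Crippen clogP as the logP estimate
        Lipinski.NumHDonors(mol) > 5,
        Lipinski.NumHAcceptors(mol) > 10,
    ])

# aspirin passes without violations
print(rule_of_five_violations("CC(=O)Oc1ccccc1C(=O)O"))   # 0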
Approaches complementary to a retrospective analysis of drugs involve the generation of quantitative structure-property relationships (QSPR) using regression-based statistical methods. For the majority of drugs the preferred route of administration is oral. Oral bioavailability refers to the fraction of the oral dose that reaches systemic circulation, and it is influenced by intestinal absorption plus metabolism in the membrane or liver. These approaches utilize theoretical or experimental descriptors capturing global properties related to lipophilicity, partition coefficients, surface areas, hydrogen bonding, pKa, electrostatics and others.
Recently, the importance of a simple but physicochemically relevant parameter, the polar surface area (PSA), has been demonstrated in a variety of models to predict in vitro membrane permeability, intestinal absorption or blood-brain barrier penetration. The PSA is computed as the sum of the van der Waals or solvent-accessible surface areas of oxygen and nitrogen atoms, including attached polar hydrogens; it correlates with molecular hydrogen-bond accepting and donating properties [122-125]. A first study [123] on 17 diverse drugs resulted in a regression model correlating Caco-2 membrane permeabilities with molecular weight and PSA only, showing that an increase in PSA is detrimental to absorption. A subsequent study [126] using MolSurf descriptors from quantum chemical calculations in combination with PLS also suggests that a reduced hydrogen-bonding potential is favorable for absorption.
An extension led to the dynamic PSA [122, 127], in which multiple conformations of one molecule are averaged during the PSA calculation. However, single-conformer PSA values were able to give almost identical results, which allows this rapid descriptor to be applied for library filtering [124, 125]. Another study by Palm et al. [127] evaluated human absorption data for a set of 20 passively transported drugs covering a diverse range of human fractional absorption data (Fig. 5). A strong sigmoidal correlation between the dynamic PSA and the human fractional absorption was found. From this and another study [124], an upper PSA threshold
Figure 5. Sigmoidal correlation between human fractional absorption and dynamic polar surface area for 20 diverse drugs (data taken from [127]). The influence of the polar surface area on fractional absorption is illustrated by plotting the small static PSA of the orally available nordiazepam (99.0% mean fractional absorption, left) versus the large area of the non-available lactulose (0.6% mean fractional absorption, right).
of 140 Å2 for oral absorption was proposed, while compounds with a PSA lower than 60 Å2 were found to be favorably absorbed. Other studies using PSA include the successful prediction of blood-brain barrier penetration [125, 128] and the additional use of the non-polar surface area to establish predictive models [129]. These studies led to the conclusion that the polar molecular surface is the dominating determinant for oral absorption and brain penetration of drugs that are transported via the transcellular route. Some library design strategies now include filtering with PSA descriptors to enhance the hit-to-lead properties of focused libraries [130]; a rough sketch of such a filter follows.
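As a rough illustration of such a filter, the sketch below applies the 140/60 Å2 thresholds quoted above, assuming RDKit is available; note that RDKit's CalcTPSA computes Ertl's topological PSA, a fast single-conformer approximation rather than the dynamic, conformation-averaged PSA discussed in the text.

from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

def absorption_flag(smiles):
    """Classify a molecule by the PSA thresholds quoted above (140/60 A^2)."""
    psa = rdMolDescriptors.CalcTPSA(Chem.MolFromSmiles(smiles))
    if psa > 140:
        return psa, "poor oral absorption expected"
    if psa < 60:
        return psa, "good oral absorption expected"
    return psa, "intermediate"

# a low-PSA benzodiazepinone versus a polyhydroxylated sugar alcohol
for smi in ["O=C1CN=C(c2ccccc2)c2cc(Cl)ccc2N1", "OCC(O)C(O)C(O)C(O)CO"]:
    print(absorption_flag(smi))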
Recently, a novel set of alignment-free 3D descriptors named VolSurf was
developed, referring to several calculated molecular properties extracted from 3D
grid maps of molecular interaction energies with particular probe atoms [131].
These descriptors have been successfully correlated with bioavailability, blood-
brain partitioning, membrane transport and other properties [132, 133]. The gen-
eral conclusions of these models concerning the factors that influence permeability and absorption are in agreement with those reported above. However, due to the nature of the descriptors, this approach allows a better understanding of the underlying physicochemical requirements for a pharmacokinetic effect and its use in further design, for example in balancing lipophilic and hydrophilic parts in combination with size, volume and other effects.
7.9 Conclusions
7.10 Acknowledgements
7.11 References
1 Gordon EM, Kerwin JF (eds) (1998) Combinatorial chemistry and molecular diversity in drug discovery. Wiley, New York
2 Brown FK (1998) Chemoinformatics: what is it and how does it impact drug discovery. Annu Rep Med Chem 33: 375-384
3 Martin EJ, Critchlow RE, Spellmeyer DC et al (1998) Diverse approaches to combinatorial library design. In: van der Goot H (ed): Trends in drug research II. Elsevier, 133-146
4 Kubinyi H (1998) Combinatorial and computational approaches in structure-based design. Curr Opin Drug Disc Dev 1: 16-27
5 Ash JE, Warr WA, Willett P (eds) (1997) Chemical information systems. Ellis Horwood, Chichester
6 Warr WA (1997) Combinatorial chemistry and molecular diversity. An overview. J Chem Inf Comput Sci 37: 134-140
7 Ferguson AM, Patterson DE, Garr C et al (1996) Designing chemical libraries for lead discovery. J Biomol Screen 1: 65-73
8 Smith DA, van de Waterbeemd H (1999) Pharmacokinetics and metabolism in early drug discovery. Curr Opin Chem Biol 3: 373-378
9 Clark DE, Pickett SD (2000) Computational methods for the prediction of "drug-likeness". Drug Disc Today 5: 49-58
10 Lipinski CA, Lombardo F, Dominy BW et al (1997) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Delivery Rev 23: 3-25
11 Martin EJ, Blaney JM, Siani MA et al (1995) Measuring diversity: experimental design of combinatorial libraries for drug discovery. J Med Chem 38: 1431-1436
12 Maggiora GM, Johnson MA (eds) (1990) Concepts and applications of molecular similarity. Wiley, New York
13 Bures MG, Martin YC (1998) Computational methods in molecular diversity and combinatorial chemistry. Curr Opin Chem Biol 2: 376-380
14 Willett P, Barnard JM, Downs GM (1998) Chemical similarity searching. J Chem Inf Comput Sci 38:
983-996
15 Willett P (ed) (1987) Similarity and clustering in chemical information systems. Letchworth, Research
Studies Press
16 Hansch C, Leo A (eds) (1995) Exploring QSAR: Fundamentals and applications in chemistry and biol-
ogy. Am Chem Soc, Washington, DC
17 Kubinyi H (1998) Similarity and dissimilarity: a medicinal chemist's view. In: Kubinyi H, Folkers G,
Martin YC (eds): 3D-QSAR in drug design Vol 2. Kluwer, Dordrecht, 225-252
18 Lajiness MS (1997) Dissimilarity-based compound selection techniques. Perspect Drug Disc Design
7/8: 65-84
19 Martin YC, Brown RD, Bures MG (1998) Quantifying diversity. In: Gordon EM, Kerwin JF (eds):
Combinatorial chemistry and molecular diversity in drug discovery. Wiley, New York, 369-388
20 Brown RD (1997) Descriptors for diversity analysis. Perspect Drug Disc Design 7/8: 31-49
21 Brown RD, Martin YC (1997) The information content of 2D and 3D structural descriptors relevant to
ligand-receptor binding. J Chem Inf Comput Sci 37: 1-9
22 Barnard JM (1993) Substructure searching methods - old and new. J Chem Inf Comput Sci 33:
572-584
23 Gillet VJ (1999) Computational aspects of combinatorial chemistry. In: Miertus S, Fassina G (eds)
Comb Chem Technol. Dekker, 251-274
24 ISIS/Base 2.1.3, Molecular Design Ltd, 14600 Catalina Street, San Leandro, CA 94577
25 Adamson GW, Cowell J, Lynch MF, et al (1973) Strategic considerations in the design of a screening
system for substructure searches of chemical structure files. J Chem Doc 13: 153-157
26 Brown RD, Martin YC (1996) Use of structure-activity data to compare structure-based clustering
methods and descriptors for use in compound selection. J Chem Inf Comput Sci 36: 572-584
27 Matter H, Rarey M (1999) Design and diversity analysis of compound libraries for lead discovery. In:
Jung G (ed): Combinatorial chemistry. Wiley-VCH, Weinheim, 409-439
28 UNITY Chemical Information Software, Tripos Inc, 1699 S Hanley Road, St Louis, MO 63144, USA
29 Daylight Chemical Information Systems, Inc, 3951 Claremont Street, Irvine, CA 92714
30 Willett P, Winterman V, Bawden D (1986) Implementation of non-hierarchical cluster analysis meth-
ods in chemical information systems: Selection of compounds for biological testing and clustering of
substructure search output. J Chem Inf Comput Sci 26: 109-118
31 Godden JW, Xue L, Bajorath J (2000) Combinatorial preferences affect molecular similarity/diversity
calculations using binary fingerprints and Tanimoto coefficients. J Chem Inf Comput Sci 40: 163-166
32 Xue L, Godden JW, Bajorath J (1999) Database searching for compounds with similar biological activ-
ity using short binary bit string representations of molecules. J Chem Inf Comput Sci 39: 881-886
33 Willett P, Winterman V (1986) Comparison of some measures for the determination of intermolecular
structural similarity. Quant Struct-Act Relat 5: 18-25
34 Holliday JD, Ranade SS, Willett P (1995) A fast algorithm for selecting sets of dissimilar structures
from large chemical databases. Quant Struct-Act Relat 14: 501-506
35 Sheridan RP, Nachbar RB, Bush BL (1994) Extending the trend vector: The trend matrix and sample-
based partial least squares. J Comput Aided Mol Des 8: 323-340
36 Sheridan RP, Miller MD, Underwood DJ et al (1996) Chemical similarity using geometrical atom pair
descriptors. J Chem Inf Comput Sci 36: 128-136
37 Sheridan RP, Miller MD (1998) A Method for visualizing recurrent topological substructures in sets of
active molecules. J Chem Inf Comput Sci 38: 915-924
38 Rarey M, Dixon JS (1998) Feature trees: a new molecular similarity measure based on tree matching.
J Comput Aided Mol Des 12: 471-490
39 Schneider G, Neidhart W, Giller T et al (1999) "Scaffold-Hopping" by topological pharmacophore
search: a contribution to virtual screening. Angew Chem 111: 3068-3070
40 Moreau G, Turpin C (1996) Use of similarity analysis to reduce large molecular libraries to smaller
66 Todeschini R, Gramatica P (1998) New 3D molecular descriptors. The WHIM theory and QSAR applications. Perspect Drug Disc Design 9/10/11: 355-380
67 Bravi G, Wikel JH (2000) Application of MS-WHIM descriptors: 1. Introduction of new molecular surface properties and 2. Prediction of binding affinity data. Quant Struct-Act Relat 19: 29-38
68 Ekins S, Bravi G, Binkley S et al (1999) Three and four dimensional-quantitative structure activity relationship (3D/4D-QSAR) analyses of CYP2D6 inhibitors. Pharmacogenetics 9: 477-489
69 Bravi G, Wikel JH (2000) Application of MS-WHIM descriptors: 3. Prediction of molecular properties. Quant Struct-Act Relat 19: 39-49
70 Kauvar LM, Higgins DL, Villar HO et al (1995) Predicting ligand-binding to proteins by affinity fingerprinting. Chemistry & Biology 2: 107-118
71 Dixon SL, Villar HO (1998) Bioactive diversity and screening library selection via affinity fingerprinting. J Chem Inf Comput Sci 38: 1192-1203
72 Briem H, Kuntz ID (1996) Molecular similarity based on dock-generated fingerprints. J Med Chem 39: 3401-3408
73 Bohm H-J (1994) The development of a simple empirical scoring function to estimate the binding constant for a protein-ligand complex of known three-dimensional structure. J Comput Aided Mol Des 8: 243-256
74 Lessel UF, Briem H (2000) Flexsim-X: A method for the detection of molecules with similar biological activity. J Chem Inf Comput Sci 40: 246-253
75 Barnard JM, Downs GM (1992) Clustering of chemical structures on the basis of two-dimensional similarity measures. J Chem Inf Comput Sci 32: 644-649
76 Lajiness M, Johnson MA, Maggiora GM (1989) Implementing drug screening programs by using molecular similarity methods. In: Fauchere JL (ed): QSAR: Quantitative structure-activity relationships in drug design. Alan R Liss Inc, New York, 173-176
77 Van Drie JH, Lajiness MS (1998) Approaches to virtual library design. Drug Disc Today 3: 274-283
78 Matter H, Lassen D (1996) Compound libraries for lead discovery. Chim Oggi 14: 9-15
79 Kubinyi H (ed) (1993) 3D-QSAR in drug design: theory, methods and applications. ESCOM, Leiden
80 Willett P, Barnard JM, Downs GM (1998) Chemical similarity searching. J Chem Inf Comput Sci 38: 983-996
81 Holliday JD, Willett P (1996) Definitions of "dissimilarity" for dissimilarity-based compound selection. J Biomol Screen 1: 145-151
82 Snarey M, Terrett NK, Willett P et al (1997) Comparison of algorithms for dissimilarity-based compound selection. J Mol Graphics 15: 372-385
83 Clark RD (1997) OptiSim: an extended dissimilarity selection method for finding diverse representative subsets. J Chem Inf Comput Sci 37: 1181-1188
84 Clark RD, Langton WJ (1998) Balancing representativeness against diversity using optimizable K-dissimilarity and hierarchical clustering. J Chem Inf Comput Sci 38: 1079-1086
85 Hudson BD, Hyde RM, Rahr E et al (1996) Parameter based methods for compound selection from chemical databases. Quant Struct-Act Relat 15: 285-289
86 Downs GM, Willett P (1994) Clustering of chemical structure databases for compound selection. In: Van de Waterbeemd H (ed): Advanced computer-assisted techniques in drug discovery Vol 3. VCH, Weinheim, 111-130
87 Jarvis RA, Patrick EA (1973) Clustering using a similarity measure based on shared nearest neighbours. IEEE Trans Comput C-22: 1025-1033
88 Menard PR, Lewis RA, Mason JS (1998) Rational screening set design and compound selection: Cascaded clustering. J Chem Inf Comput Sci 38: 497-505
89 Reynolds CH, Druker R, Pfahler LB (1998) Lead discovery using stochastic cluster analysis (SCA): A new method for clustering structurally similar compounds. J Chem Inf Comput Sci 38: 305-312
90 Bayley MJ, Willett P (1999) Binning schemes for partition-based compound selection. J Mol Graphics Model 17: 10-18
91 Mitchell TJ (1974) An algorithm for the construction of "D-optimal" experimental designs. Technometrics 16: 203-210
92 Linusson A, Gottfries J, Lindgren F et al (2000) Statistical molecular design of building blocks for combinatorial chemistry. J Med Chem 43: 1320-1328
93 Box GEP, Hunter WG, Hunter JS (eds) (1978) Statistics for experimenters. Wiley, New York
94 Gillet VJ, Willett P, Bradshaw J (1997) The effectiveness of reactant pools for generating structurally diverse combinatorial libraries. J Chem Inf Comput Sci 37: 731-740
95 Jamois EA, Hassan M, Waldman M (2000) Evaluation of reagent-based and product-based strategies
in the design of combinatorial library subsets. J Chem Inf Comput Sci 40: 63-70
96 Weber L (1998) Applications of genetic algorithms in molecular diversity. Drug Disc Today 3: 379-385
97 Sheridan RP, Kearsley SK (1995) Using a genetic algorithm to suggest combinatorial libraries. J Chem Inf Comput Sci 35: 310-320
98 Weber L, Wallbaum S, Broger C et al (1995) Optimization of the biological activity of combinatorial libraries by a genetic algorithm. Angew Chem 34: 2281-2282
99 Brown RD, Martin YC (1997) Designing combinatorial library mixtures using a genetic algorithm. J Med Chem 40: 2304-2313
100 Gillet VJ, Willett P, Bradshaw J et al (1999) Selecting combinatorial libraries to optimize diversity and physical properties. J Chem Inf Comput Sci 39: 169-177
101 Agrafiotis DK (1997) Stochastic algorithm for molecular diversity. J Chem Inf Comput Sci 37: 841-851
102 Patterson DE, Cramer RD, Ferguson AM et al (1996) A useful concept for validation of molecular diversity descriptors. J Med Chem 39: 3049-3059
103 Matter H (1997) Selecting optimally diverse compounds from structure databases: A validation study of 2D and 3D molecular descriptors. J Med Chem 40: 1219-1229
104 Potter T, Matter H (1998) Random or rational design? Evaluation of diverse compound subsets from chemical structure databases. J Med Chem 41: 478-488
105 Bayada DM, Hamersma H, van Geerestein VJ (1999) Molecular diversity and representativity in chemical databases. J Chem Inf Comput Sci 39: 1-10
106 Pearlman RS, Smith KM (1999) Metric validation and the receptor-relevant subspace concept. J Chem Inf Comput Sci 39: 28-35
107 Pickett SD, Luttmann C, Guerin V et al (1998) DIVSEL and COMPLIB - Strategies for the design and comparison of combinatorial libraries using pharmacophoric descriptors. J Chem Inf Comput Sci 38: 144-150
108 Boyd SM, Beverley M, Norskov L et al (1995) Characterizing the geometric diversity of functional groups in chemical databases. J Comput Aided Mol Des 9: 417-424
109 Bemis GW, Murcko MA (1999) Properties of known drugs. 2. Side chains. J Med Chem 42: 5095-5099
110 Murcko MA, Bemis GA (1996) Properties of known drugs. 1. Molecular frameworks. J Med Chem 39: 2887-2893
111 Hirst JD (1998) Predicting ligand binding energies. Curr Opin Drug Disc Dev 1: 28-33
112 Walters WP, Stahl MT, Murcko MA (1998) Virtual screening - an overview. Drug Disc Today 3: 160-178
113 Rishton GM (1997) Reactive compounds and in vitro false positives in HTS. Drug Disc Today 2: 382-384
114 Lewell XQ, Judd DB, Watson SP et al (1998) RECAP - retrosynthetic combinatorial analysis procedure: A powerful new technique for identifying privileged molecular fragments with useful applications in combinatorial chemistry. J Chem Inf Comput Sci 38: 511-522
115 Walters WP, Ajay, Murcko MA (1999) Recognizing molecules with drug-like properties. Curr Opin Chem Biol 3: 384-387
116 Blake JF (2000) Chemoinformatics - predicting the physicochemical properties of "drug-like" molecules. Curr Opin Biotechnol 11: 104-107
117 Prentis RA, Lis Y, Walker SR (1988) Pharmaceutical innovation by the seven UK-owned pharmaceutical companies (1964-1985). Br J Clin Pharmacol 25: 387-396
118 Delie F, Rubas WA (1997) A human colonic cell line sharing similarities with enterocytes as a model to examine oral absorption: Advantages and limitations of the Caco-2 model. Crit Rev Ther Drug Carrier Syst 14: 221-286
119 Eddershaw PJ, Dickens M (1999) Advances in in vitro drug metabolism screening. Pharm Sci Technol Today 2: 13-19
120 Buchwald P, Bodor N (1998) Octanol-water partition: searching for predictive models. Curr Med Chem 5: 353-380
121 Mannhold R, Cruciani G, Dross K et al (1998) Multivariate analysis of experimental and computational descriptors of molecular lipophilicity. J Comput Aided Mol Des 12: 573-581
122 Palm K, Luthman K, Ungell A-L et al (1998) Evaluation of dynamic molecular surface area as predictor of drug absorption: Comparison of other computational and experimental predictors. J Med Chem 41: 5382-5392
123 van de Waterbeemd H, Camenisch G, Folkers G et al (1996) Estimation of Caco-2 cell permeability
using calculated molecular descriptors. Quant Struct-Act Relat 15: 480-490
124 Clark DE (1999) Rapid calculation of polar molecular surface area and its application to the predic-
tion of transport phenomena. I. Prediction of intestinal absorption. J Pharm Sci 88: 807-814
125 Kelder J, Grootenhuis PDJ, Bayada DM et al (1999) Polar molecular surface as a dominating deter-
minant for oral absorption and brain penetration of drugs. Pharm Res 16: 1514-1519
126 Norinder U, Osterberg T, Artursson P (1997) Theoretical calculation and prediction of Caco-2 cell
permeability using MolSurf parametrization and PLS statistics. Pharm Res 14: 1786-1791
127 Palm K, Stenberg P, Luthman K et al (1997) Polar molecular surface properties predict the intestinal
absorption of drugs in humans. Pharm Res 14: 568-571
128 Clark DE (1999) Rapid calculation of polar molecular surface area and its application to the predic-
tion of transport phenomena. 2. Prediction of blood-brain penetration. J Pharm Sci 88: 815-821
129 Stenberg P, Luthman K, Ellens H et al (1999) Prediction of intestinal absorption of endothelin recep-
tor antagonists using three theoretical methods of increasing complexity. Pharm Res 16: 1520-1526
130 Pickett SD, McLay IM, Clark DE (2000) Enhancing the hit-to-lead properties of lead optimization
libraries. J Chem Inf Comput Sci 40: 263-272
131 Cruciani G, Pastor M, Clementi S (2000) Handling information from 3D grid maps for QSAR stud-
ies. In: Gundertofte K, Jorgensen FS (eds): Molecular modelling and prediction of bioactivity.
Plenum Press, New York
132 Guba W, Cruciani G (2000) Molecular field-derived descriptors for the multivariate modelling of
pharmacokinetic data. In: Gundertofte K, Jorgensen FS (eds): Molecular modelling and prediction of
bioactivity. Plenum Press, New York
133 Alifrangis LH, Christensen IT, Berglund A, et al (2000) Structure-property model for membrane par-
titioning of oligopeptides. J Med Chem 43: 103-113
134 Wessel MD, Jurs PC, Tolan JW et al (1998) Prediction of human intestinal absorption of drug com-
pounds from molecular structure. J Chem Inf Comput Sci 38: 726-735
135 Norinder U, Osterberg T, Artursson P (1999) Theoretical calculation and prediction of intestinal
absorption of drugs in humans using MolSurf parametrization and PLS statistics. Eur J Pharm Sci 8:
49-56
136 Norinder U, Sjoberg P, Osterberg T (1998) Theoretical calculation and prediction of brain-blood par-
titioning of organic solutes using MolSurf parametrization and PLS statistics. J Pharm Sci 87:
952-959
137 Sugawara M, Takekuma Y, Yamada H et al (1998) A general approach for the prediction of the intes-
tinal absorption of drugs: Regression analysis using the physicochemical properties and drug-mem-
brane electrostatic interactions. J Pharm Sci 87: 960-966
138 Winiwarter S, Bonham NM, Ax F, et al (1998) Correlation of human jejunal permeability (in vivo) of
drugs with experimentally and theoretically derived parameters. A multivariate data analysis
approach. J Med Chem 41: 4939-4949
139 Ghuloum AM, Sage CR, Jain AN (1999) Molecular Hashkeys: A novel method for molecular char-
acterization and its application for predicting important pharmaceutical properties of molecules. J
Med Chem 42: 1739-1748
140 Sadowski J, Kubinyi H (1998) A scoring scheme for discriminating between drugs and nondrugs. J
Med Chem 41: 3325-3329
141 Ajay, Walters WP, Murcko MA (1998) Can we learn to distinguish between "drug-like" and "nondrug-like" molecules? J Med Chem 41: 3314-3324
142 Wagener M, van Geerestein VJ (2000) Potential drugs and nondrugs: prediction and identification of
important structural features. J Chem Inf Comput Sci 40: 280-292
143 Wang J, Ramnarayan K (1999) Toward designing drug-like libraries: A novel computational approach
for prediction of drug feasibility of compounds. J Comb Chem I: 524-533
8 The role of protein 3D-structures in the drug discovery process

A. Hillisch and R. Hilgenfeld
8.1 Introduction
8.2 Techniques for protein structure determination and prediction
8.2.1 X-ray crystallography
8.2.2 NMR spectroscopy
8.2.3 Comparative modeling
8.3 Sources for protein 3D-structure information
8.4 Structural genomics
8.5 Success stories of protein structure-based drug design
8.5.1 X-ray structure-based drug design
8.5.2 Homology model-based drug design
8.6 Applications of protein structure information in the drug discovery process
8.7 Conclusions
8.8 Acknowledgements
8.9 References
8.1 Introduction
The majority of drugs available today were discovered either by chance observa-
tions or by screening synthetic or natural products libraries. In many cases, a trial-
and-error based approach of chemical modification of lead compounds led to an
improvement with respect to potency and reduced toxicity. Since this approach is labor- and time-intensive, researchers in the pharmaceutical industry are constantly
developing methods to increase the efficiency of the drug finding process. Simply
put, two directions have evolved from these efforts. The "random" approach
involves the development of high-throughput screening assays and the testing of
a large number of compounds. Combinatorial chemistry is used to satisfy the need
for huge substance libraries. The "rational," structure-based approach relies on an
iterative procedure of structure determination of the target protein, prediction of
hypothetical ligands by molecular modeling, specific chemical synthesis and bio-
logical testing of compounds (the structure-based drug design cycle). It is becoming evident that the future of drug discovery does not lie solely in either of these approaches, but rather in their intelligent combination. In this chapter, we will concentrate
on the protein structure-based drug discovery approach and discuss possible over-
laps with complementary technologies.
Figure 1. a) Schematic representation of the process of protein structure determination by x-ray crystal-
lography. The structure elucidation of A21Gly-B31,B32Arg2 insulin (HOE 901, Lantus®), a long-acting
insulin derivative, is depicted [161]. b) Schematic representation of the process of protein structure eluci-
dation by multidimensional NMR spectroscopy. The determination of the solution structure of the plas-
minogen-activator protein staphylokinase is shown as an example [162]. Two NMR-active nuclei (A, red
and B, blue) are depicted in the 3D-spectrum, with the set of distance restraints (green lines) and the final
structure. c) Schematic representation of the process of protein structure prediction by comparative mod-
eling. Structure modeling of the bacterial transcriptional repressor CopR is shown [163]. Although the model is based on a fairly low sequence identity of only 13.8% to the P22 c2 repressor, several experimental methods support this prediction.
[Figure 1, panels: a) x-ray crystallography (crystallization, x-ray diffraction, model building); b) NMR spectroscopy; c) comparative modeling (protein of unknown structure, fold assignment, sequence-structure alignment, model building of side chains and loops, model refinement, model assessment).]
i) multiple isomorphous replacement (MIR): heavy-atom compounds (containing mercury, gold, or other salts) are soaked into the crystal lattice; these heavy-atom derivatives of the native protein crystal give rise to specific changes in the intensities of the diffracted x-rays, from which the phases can be estimated; ii) multi-wavelength anomalous dispersion (MAD) phasing: the so-called anomalous diffraction of atoms such as selenium or bromine is exploited for phase determination by collecting diffraction data at different wavelengths at a synchrotron; these atoms are incorporated into the native protein by replacing methionine with selenomethionine, or simply by soaking potassium bromide into the crystals; iii) molecular replacement: the atomic coordinates of a homologous protein with a known three-dimensional structure are used to derive the orientation of the molecule under study in the unit cell of the crystal. This latter approach is the method of choice for determining the structure of a protein with a series of different ligands, as frequently done in structure-based drug design.
After successful phase determination, the polypeptide chain is modeled into the
electron density according to the known amino acid sequence of the protein. An
iterative process of adjusting the model and recomputing the phases leads to a
gradual improvement of the agreement between the observed intensities and those
calculated from the model. This process is called refinement and leads to the final
structural model of the biopolymer.
In contrast to other methods of structure determination, diffraction techniques
such as x-ray crystallography provide a direct image of the molecule under study,
in an exact mathematical process where the Fourier transformation can be com-
pared to the lens of a microscope. The major disadvantage of the method is the
need for crystals, which can be sometimes difficult or even impossible to obtain.
In the past, there have occasionally been concerns that contacts between neigh-
boring molecules in the crystal might influence the conformation of proteins and
lead to artifacts. However, we now know that this is extremely rare and that, if
it occurs, this phenomenon is restricted to the chain termini or to flexible loops on
the surface of proteins, which are typically remote from the active sites [1, 2]. The reason is that protein crystals typically contain 35-70% solvent (mostly water) and molecules in the crystal are highly hydrated, behaving as if in aqueous solution and maintaining their catalytic activity. Comparisons between protein structures determined by x-ray crystallography and by NMR spectroscopy generally show good
agreement [3-5]. A disadvantage of the method is that, unless the resolution (the smallest interplanar spacing of the diffraction pattern for which data have been collected) is less than 1.2 Å, the positions of hydrogen atoms cannot be resolved by x-ray crystallography and must be modeled. Also, the exact orientation of the Asn, Gln and His side chains is usually only inferred from the protein environment (i.e., hydrogen bonds to neighboring amino acids).
8.2.2 NMR spectroscopy

With NMR spectroscopy, distances and torsion angles in molecules are measured and translated into structure information using computer programs (see Fig. 1b). As with x-ray crystallography, multi-milligram amounts of pure, homogeneous protein are needed. In addition, 13C- and 15N-labeled protein [7] should be available, and the protein must be stable up to 30-40°C and soluble to a concentration of about 1 mM.
The principle of NMR is based on the quantum mechanical property of nuclear
spin. The magnetic moments of certain nuclei occurring in biomacromolecules, such as 1H, 31P and, after labeling, 13C and 15N, adopt a parallel or anti-parallel orientation when placed in an external magnetic field provided by an NMR spec-
trometer. These nuclei precess about the external magnetic field with the so-called
Larmor frequency. The different orientations (parallel versus anti-parallel) of
these "small bar magnets" in the external field correspond to different energy
states, which are unequally distributed according to Boltzmann's distribution law.
Spins with parallel orientation are of slightly lower energy than those with the
anti-parallel one, leading to a net magnetization in the external magnetic field.
This net magnetization can be measured in an NMR experiment. In order to
induce transitions between the different energy states and hence a net magnetiza-
tion, the spins in an NMR sample are irradiated with a radio-frequency that cor-
responds to the Larmor frequency. The resulting magnetization, inducing an elec-
tric current in the detection coil of the NMR spectrometer, returns to its equilibri-
um condition. This signal is called free induction decay (FID) and can be record-
ed as a function of time. The time-domain NMR signal is converted into the fre-
quency domain by applying a Fourier transformation (FT). A one-dimensional
NMR spectrum is obtained by plotting the absorbed irradiation versus the fre-
quency of the irradiation. The utility of NMR originates from the fact that the fre-
quency at which absorption occurs is not only dependent on the type of nucleus,
but also on the molecular environment.
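The two relations just described can be written compactly; the following display is a standard textbook formulation, added here only for illustration (symbols: gamma is the gyromagnetic ratio, B_0 the static field strength, h Planck's constant, k_B Boltzmann's constant, T the temperature):

\begin{align}
  \nu_0 &= \frac{\gamma B_0}{2\pi} \\
  \frac{N_{\mathrm{anti}}}{N_{\mathrm{par}}} &= \exp\!\left(-\frac{h\nu_0}{k_B T}\right)
\end{align}

The first line gives the Larmor frequency at which the spins precess; the second gives the Boltzmann population ratio of the anti-parallel and parallel states, whose small difference produces the measurable net magnetization.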
The protein structure determination with NMR spectroscopy typically involves
four steps: 1) recording of NMR spectra, 2) assigning of all resonances in the
spectrum (which means establishing correlations between atoms in the protein
and resonance peaks in the NMR spectrum), 3) determination of distances and
angles between atoms in the protein (restraints) and 4) calculation of the three-
dimensional structure from a set of distance and angle restraints (see Fig. 1b).
The major difficulty in studying the three-dimensional structures of biopolymers by means of NMR spectroscopy is spectral overlap (i.e., limited spectral resolution). A peptide of 50 amino acids has approximately 300 protons, which can result in a one-dimensional proton NMR spectrum with more than 600 peaks. Assignment of one-dimensional spectra becomes impossible in this case. Solutions to this problem are the use of higher magnetic field strengths (up to 900 MHz proton resonance frequency) and the spreading of the spectrum into two or more dimensions. A two-dimensional NMR spectrum can be obtained by applying, for example, two radio-frequency pulses temporally separated by a certain interval. The magnetization of a nucleus, resulting from the first pulse, is transferred to a second nucleus and is again allowed to oscillate in the static magnetic field at a frequency characteristic of the second nuclear spin. The signal is recorded as a function of two time variables.
A further limitation concerns the size of molecules that can be studied. De novo protein structure determination with NMR is possible today for proteins with molecular weights up to about 30 kDa. However, recent developments such as TROSY (transverse relaxation-optimized spectroscopy) [12] and CRINEPT (cross-correlated relaxation-enhanced polarization transfer) [13, 14] have proven useful for studying multimeric proteins with molecular weights above 100 kDa. Progress in the development of spectrometers with higher magnetic field strengths of up to 21 Tesla (corresponding to 900 MHz proton resonance frequency) will also contribute to NMR structure determination.
8.2.3 Comparative modeling
Generally, homology modeling involves four steps: fold assignment, sequence
alignment, model building and model refinement (see Fig. 1c). The first step aims at identifying proteins with a known three-dimensional structure (so-called template structures) which are related to the polypeptide sequence with unknown
structure (the target sequence). A sequence database of proteins with known struc-
tures (e.g., the PDB-sequence database) is searched with the target sequence using
sequence similarity search algorithms. Programs like BLAST [16] and FASTA
[17] are mainly used for these purposes (for a description of sequence searches see
Chapter 3). These sequence-based methods are complemented with so-called fold
recognition or threading methods that rely explicitly on the known structures of
the template proteins (see also Chapter 3). Once a clear similarity (>30%
sequence identity) of the target protein with one of the proteins with known 3D
structure is identified, both sequences are aligned. This is to establish the best pos-
sible correspondence between residues in the template and the target sequence.
Extending this pairwise alignment by considering further homologues in a multi-
ple sequence alignment normally improves the quality of the final structure pre-
diction. In the third step a model of the target protein is built by exchanging amino
acids in the 3D structure of the template protein and introducing insertions and/or
deletions according to the sequence alignment.
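The >30% identity criterion mentioned above can be illustrated with a minimal sketch: given a pairwise alignment (gap characters as '-'), the percent identity over aligned columns decides whether a template is usable or whether fold recognition is needed. The two sequences are toy data, and the fixed 30% cutoff is the rule of thumb from the text, not a universal constant:

def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over gap-free columns of a pairwise alignment."""
    assert len(aligned_target) == len(aligned_template)
    columns = [(a, b) for a, b in zip(aligned_target, aligned_template)
               if a != '-' and b != '-']
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)

target   = "MKV-LASGTWQE"     # invented target fragment
template = "MKVALQSGSWQE"     # invented template fragment (known structure)
pid = percent_identity(target, template)
print(f"{pid:.0f}% identity:",
      "usable template" if pid > 30 else "try fold recognition/threading")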
Loop modeling
Loops at the surface of proteins are generally less conserved and can adopt dif-
ferent conformations in even closely related homologues. The goal of loop modeling is to predict the conformation of the backbone fixed at both ends of structurally conserved regions. Two general approaches exist. Ab initio methods involve the generation of a large number of possible loop conformers and the subsequent evaluation with respect to energetic and other criteria [18-20]. In the second, knowledge-based approach, a database of loop fragments in experimentally
determined protein structures (subset of PDB, see below) is searched to fit on the
fixed backbone endpoints. The fragments are identified either on the basis of
sequence relationship or of geometric criteria such as the distance between the
amino and carboxyl termini of the conserved regions flanking the loop in the
model [21-24].
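A minimal sketch of the geometric criterion used in the knowledge-based approach: a fragment from the loop database is accepted only if the distance between its terminal C-alpha atoms matches the gap between the conserved anchor residues in the model. All coordinates, fragment distances and the tolerance are invented for illustration:

import numpy as np

def endpoint_distance(ca_n: np.ndarray, ca_c: np.ndarray) -> float:
    """Distance (angstroms) between the C-alpha atoms flanking the loop."""
    return float(np.linalg.norm(ca_c - ca_n))

anchor_n = np.array([12.1, 3.4, 7.8])     # toy anchor coordinates
anchor_c = np.array([18.9, 5.0, 9.2])
gap = endpoint_distance(anchor_n, anchor_c)

# Hypothetical loop-fragment database: (fragment id, endpoint distance in A)
fragments = [("loopA", 5.1), ("loopB", 7.0), ("loopC", 7.2), ("loopD", 11.3)]
tolerance = 0.5
hits = [name for name, d in fragments if abs(d - gap) <= tolerance]
print(f"gap = {gap:.1f} A, geometric candidates: {hits}")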
Model refinement
After loop modeling and adjusting the side-chains, the resulting homology model
is subjected to an energy minimization protocol using force field methods to
remove remaining steric clashes. Molecular dynamics simulations using explicit solvent environments are also applied for these purposes and can help characterize the plasticity of binding pockets. Finally, the model is evaluated by considering stereochemical (e.g., Ramachandran plot, χ dihedral angles, H-bond geometries) and energetic (e.g., van der Waals clashes) criteria. Several programs, such
as PROCHECK [33] and WHATCHECK [34], are used for these purposes (see
also [35]). Model refinement and evaluation are often applied iteratively to obtain
the final model.
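One of the stereochemical checks mentioned above, the Ramachandran analysis, reduces to computing backbone torsion angles from four consecutive atom positions (phi from C(i-1)-N-CA-C, psi from N-CA-C-N(i+1)). The sketch below implements a generic torsion angle on made-up coordinates; sign conventions differ between programs:

import numpy as np

def dihedral(p0, p1, p2, p3) -> float:
    """Torsion angle p0-p1-p2-p3 in degrees."""
    b0, b1, b2 = p1 - p0, p2 - p1, p3 - p2
    n0, n1 = np.cross(b0, b1), np.cross(b1, b2)
    m = np.cross(n0, b1 / np.linalg.norm(b1))
    return float(np.degrees(np.arctan2(np.dot(m, n1), np.dot(n0, n1))))

atoms = [np.array(p) for p in
         [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0), (3.4, 1.6, 1.0)]]
print(f"torsion = {dihedral(*atoms):.1f} deg")
# A model checker would flag residues whose phi/psi pair falls far outside
# the allowed regions of the Ramachandran plot.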
In addition, model evaluation helps to decide what structure information can be
drawn from the model. For drug design purposes, the exact positions and rotamers
of the amino acids defining the binding pocket are often crucial. The necessary
accuracy can be achieved with current homology modeling programs for target
sequences showing >50% sequence identity to a template protein for which an
experimental high-resolution structure is available. However, even models based
on 30-50% sequence identity can yield important hints for structure-activity relationships of ligands, if the binding pocket is structurally conserved. Caution should be exercised with models based on lower sequence similarity. Due to the inher-
ent uncertainties, such models can be misleading with respect to ligand design.
The particular advantage of homology models is the short time needed to con-
struct such models. Preliminary models may be built and "refined" within a few
hours. Of course, and this is the major disadvantage, one is restricted to proteins
which show relatively high sequence identity to experimental structures, and sur-
prises with respect to conformational changes with even small sequence differ-
ences cannot be excluded.
Figure 2. a) Growth in the number of experimental biopolymer structures deposited in the Protein Data
Bank (PDB). By early 2002, about 17,000 structures of biopolymers were publicly accessible in the PDB. b) Structure information on 540,000 protein sequences in the Swiss-Prot and TrEMBL databases. For about 1% of these sequences (~6,200 different proteins), experimental structure information (x-ray, NMR) is available. Comparative models can be built for up to 44% (237,000) and protein fold information can be obtained for an additional ~12% of the sequences. For the remaining 43% of the proteins, no fold or other structure information can yet be assigned.
About 50,000 different binding sites are covered by databases of protein-ligand complexes and can be analyzed using, for example, programs like ReliBase [37] and PDBsum [38] or the HIV
Protease Database [39].
It is highly likely that structure determination of target proteins will undergo the
same development that DNA sequencing and chemical synthesis and testing of
compounds (see Chapters 4 and 6) underwent several years ago. The linear
approach to structure elucidation will in many cases be replaced by massively par-
allel efforts for protein production [43], purification, crystallization [44], crystal
handling, data collection and analysis [45], spectra interpretation and model gen-
eration [46]. These structural genomics efforts, which aim at solving or modeling
all (or a large fraction) of the soluble proteins in an organism [47], will lead to an
enormous increase in structure information for drug target proteins. Several com-
panies and publicly funded projects which focus on structural genomics have been
initiated during recent years [48]. In many of these projects, it is planned to select
the proteins for experimental structure determination based on sequence analysis
[49]. Those proteins which are predicted to have sufficiently novel structures from
sequence analyses are preferred targets for protein expression, purification and
structure determination. Remaining proteins not covered by experimental struc-
ture determination methods will be predicted by homology modeling. The exper-
imental techniques will solve new structures while homology modeling fills the gaps.

The success rate of a drug discovery method is generally measured by the number of compounds that reach the market. Based on current drug development
times of about 12-15 years (see Chapter 1) and the enormous increase in avail-
able protein 3D structures at the beginning of the 1990s, we will certainly see
more drugs in the future where protein structure-based design will have played a
more or less important part in the lead discovery and/or optimization process.
Figure 3 shows the structures of such compounds. The first success story in pro-
tein structure-based drug design was Captopril, an angiotensin-converting enzyme
(ACE) inhibitor. ACE inhibitors block the conversion of the decapeptide
angiotensin I to the potent pressor substance angiotensin II (an octapeptide), lead-
ing to decreased blood pressure. Initial lead compounds were discovered in the
1960s by screening mixtures of peptides isolated from the venom of the snake
Bothrops jararaca (see Chapter 5). Based on these observations, the nonapeptide teprotide was synthesized and tested in humans by intravenous administration. It was
found to lower blood pressure in patients with essential hypertension. The search
for orally active ACE inhibitors was dominated by a rational, structure-based strat-
egy. Cushman et al. [54] from the Squibb Institute for Medical Research designed
several series of orally active nonpeptidic ACE inhibitors. Based on the predicted
similarity of the structure of the zinc proteinase carboxypeptidase A [55] with
ACE, they derived a binding site model of the ACE active site. Since it was known
that D-benzylsuccinic acid inhibited carboxypeptidase A, they hypothesized that
ACE could be blocked by succinyl amino acids that correspond in length and
shape to the dipeptide cleaved by ACE. This hypothesis proved to be true and led
to the synthesis of carboxy alkanoyl and mercapto alkanoyl derivatives [56]. The
ACE inhibitor with the lowest IC50 in this series was developed as captopril and is marketed as Lopirin®/Capoten®. Fifteen additional ACE inhibitors with differ-
ent pharmacodynamic and pharmacokinetic properties have been developed since
and are employed worldwide now [57].
The carbonic anhydrase (CA) inhibitor dorzolamide is a further example of the successful application of structure-based drug design. Carbonic anhydrases
are ubiquitous zinc enzymes that convert carbon dioxide into bicarbonate.
Fourteen isozymes are presently known in higher vertebrates. In the ciliary
processes of the eye, carbonic anhydrases mediate the formation of large amounts of aqueous humor; inhibition of this process lowers the intraocular pressure, which is exploited in glaucoma therapy.

Another prominent example is the HIV protease. Its inhibitors were optimized by, among other strategies, varying the properties (size, lipophilicity) of the substituents filling the lipophilic pockets of the substrate-binding site. In 1996, the first HIV protease inhibitor reached the market and was followed by four further compounds (see Fig. 3).
Another non-peptidic HIV-protease inhibitor, tipranavir [61], is likely to enter the
market soon. The availability of HIV-protease inhibitors revolutionized the thera-
py of AIDS. Given in combination with HIV reverse transcriptase inhibitors, the
viral load of patients can be reduced to undetectable levels, prolonging the lives of these individuals, and partially restoring immune functions. However, the current therapy is far from achieving viral eradication.

Figure 3. Examples of compounds designed on the basis of protein structure that reached the market, including captopril (Capoten®, Squibb), dorzolamide (Trusopt®, Merck), saquinavir (Invirase®/Fortovase®, Hoffmann-La Roche) and ritonavir (Norvir®, Abbott).
A further example of direct structure-based design is provided by influenza neuraminidase inhibitors. Sialidases or neuraminidases catalyze the removal of sialic acid (several analogues of N-acetyl neuraminic acid) from various glycoconjugates. In ani-
mals, these enzymes fulfil several roles in the immune system and in apoptosis by
regulating the surface sialic acid profile of cells. The location of sialic acids at the
termini of various carbohydrates associated with animal cells is exploited by a
number of pathogenic viruses such as influenza virus. These viruses have
proteins that recognize sialic acid for cell attachment, and many have acquired
sialidases to aid in their pathogenesis.
Influenza virus neuraminidase is a viral coat protein that catalyzes the hydrol-
ysis of the linkage joining a terminal sialic acid residue to a D-galactose or a
D-galactosamine residue. The role of neuraminidase is not completely under-
stood, but it is thought to facilitate the liberation of progeny virions from the cell
surface and to assist the virus in penetrating through mucin of the infection site.
Since the inhibition of neuraminidase leads to aggregation of virus particles at the
cell surface [62], it was hypothesized that blocking this enzyme could be a prom-
ising strategy against influenza infection. Synthesis of transition state analogues
led to the discovery of DANA (Neu5Ac2en). The fluorinated compound FANA
showed only little improvement. A crucial breakthrough in the discovery of potent
inhibitors was the determination of the x-ray crystal structure of influenza virus
neuraminidase in 1983 [63]. The analysis of the interactions of DANA in the sub-
strate binding pocket revealed a negatively charged sub-pocket (involving Glu119
and Glu227) which could be filled with positively charged moieties attached to
DANA. The resulting compound 4-guanidino-Neu5Ac2en designed and synthe-
sized by von Itzstein et al. [64] showed a Ki of ~0.1 nM, about four orders of mag-
nitude lower than Neu5Ac2en, and a high selectivity for influenza A and B. This
compound was developed under the name zanamivir by GlaxoWellcome Inc. and introduced to the market as Relenza® in 1999. The drug is delivered using a dry pow-
der inhaler. Structure-based design at Gilead Sciences Inc. resulted in another
potent neuraminidase inhibitor, a carbocyclic transition state analogue with a
lipophilic sidechain replacing the glycerol moiety in DANA [65] (see Fig. 3). In
contrast to zanamivir, this compound is orally bioavailable. Oseltamivir was
approved for the oral prevention and treatment of influenza A and B in 2000 and
is marketed as Tamiflu® by Hoffmann-La Roche.
In many other drug discovery projects the structure of target proteins proved to
be useful in the discovery and optimization of lead compounds. A comprehensive
list of target proteins with known 3D structure and examples of successful appli-
cation of structure-based drug design is given in Table 1. Of course, this list does not claim to be complete. In this table, review articles are listed along with
recent examples of structure-based design.
Table 1. List of drug target proteins with known 3D structure and examples of successful application of
structure-based drug design
β-lactamase [66,67]
acetylcholinesterase [68]
aldose reductase [69,70]
angiotensin converting enzyme [71]
carbonic anhydrase type II [59,72]
catechol O-methyltransferase [73,74]
cathepsins [75,76]
cyclooxygenases [77]
cytochromes P450 [78]
dihydrofolate reductases [79-81]
elastase [82,83]
estrogen receptors [84,85]
factor Xa [86-88]
farnesyltransferase [89]
FK506-binding protein [90,91]
G protein-coupled receptors [92,93]
hepatitis virus C protease [94]
HIV gp41 [95]
HIV integrase [96]
HIV protease [59,61,97,98]
HIV reverse transcriptase [99,100]
influenza virus neuraminidase [64,65, 101]
interleukin-1 beta converting enzyme [102]
matrix metalloproteases [103-106]
nitric oxide synthases [107]
peroxisome proliferator-activated receptors [108]
phospholipase A2 [109, 110]
progesterone receptor [111]
protein-tyrosine kinases [112-116]
protein-tyrosine phosphatase 1B [117-120]
purine nucleoside phosphorylase [121, 122]
renin [123-125]
retinoic acid receptors [52,126]
rhinoviral coat protein [127]
rhinovirus 3C protease [128, 129]
telomerase [130]
thrombin [131, 132]
thymidylate synthase [59, 133]
thyroid hormone receptor [134]
trypanosomatid glyceraldehyde-3-phosphate dehydrogenase [135, 136]
tubulin [137]
urokinase [138, 139]
There are numerous examples where protein homology models have supported
the discovery and the optimization of lead compounds. A homology model of a
cysteine proteinase of the malaria parasite Plasmodium falciparum led to the discovery of an inhibitor (IC50 = 6 µM) using 3D-database searches [140] (see also Chapter 10). The predicted binding mode is supported by the fact that the lead compound could be improved (IC50 = 0.2 µM) by homology model-based design
and subsequent synthesis [141].
Several compounds that inhibit trypanothione reductase were designed on the
basis of a homology model of the target enzyme [142]. From a homology model
of a malarial dihydrofolate reductase/thymidylate synthase mutant protein,
inhibitors were designed and subsequently synthesized that overcome
pyrimethamine resistance [143]. Homology model-based drug design has been
applied to numerous kinases such as herpes simplex virus type 1 thymidine kinase
[144], EGF-receptor protein tyrosine kinase [145], Bruton's tyrosine kinase [146],
Janus kinase 3 [115] and cyclin-dependent kinase 1 [147]. Using a muscarinic M1
receptor homology model, a potent selective muscarinic M1 agonist was designed
and subsequently synthesized [148]. Structure-activity relationships in a series of hydroxyflutamide-like compounds could be explained with a homology model of the androgen receptor [149].
It is clear that protein structure information (e.g., generated within the framework of structural genomics initiatives) will contribute to many stages of the drug discovery process (see Fig. 4). Since structure is associated with protein function [150], the
hope is that high-throughput structure analysis will reveal functions of proteins
relevant for therapeutic intervention, thus providing novel drug targets. There are
successful examples of inferring function from structure [150-153], but the utili-
ty of this approach for target identification has yet to be proven.
The analysis of ligand binding sites on the surface of proteins will probably
contribute to drug target validation. If, for example, inhibition of protein-protein
interactions is regarded as a valuable therapeutic principle, proteins with deeply
buried interaction sites are clearly superior to those with shallow hydrophilic
interfaces. In addition, the design of ligands for proteins of known/predicted struc-
ture but unknown function could also contribute to target validation.
Protein structure information is frequently applied to identify lead compounds
through virtual screening, 3D-database searching and interactive ligand design
(see above). These methods also support the optimization of lead compounds with
regard to potency. Rapidly growing structure information on entire target protein
families will also enable the design of selective ligands. Once the structures of many or all members of a target protein family are known, differences and common features of the binding pockets can be identified and compounds can be designed that contact differing amino acids in the binding niche.

Figure 4. Applications of protein structure information in the drug discovery process. The enormous pool of protein structure data currently available may not only support lead compound identification and optimization, but also many other steps in the drug discovery process, such as target identification, target validation, selection of appropriate animal test models and prediction of metabolic stability or drug-drug interactions.

If the paralogous target proteins (paralogues are evolutionarily related proteins that perform different but related functions within one organism) show sufficiently different tissue distribution or
expression characteristics, drugs with increased selectivity and/or reduced side-
effects may be designed [77]. This is especially true for the design of antiparasitic
drugs, which should not interact with human orthologues [79-81, 142], but show broad-spectrum effects against similar microorganisms. Detailed structural knowl-
edge of ligand binding sites of target proteins could also facilitate the selection of
animal models for in vivo tests. Animals which possess target proteins with sig-
nificantly different binding sites compared to human orthologues could be exclud-
ed as test models.
A further application of structural genomics in lead optimization is the pre-
diction of drug metabolism (see also Chapter 13). Structure determination in
combination with comparative modeling has shed light on the structure of many
cytochrome P450 enzymes and other drug metabolizing enzymes. Virtual screen-
ing and docking of compounds to these enzymes provide a rational basis for the
prediction of metabolic stability and drug-drug interactions [78, 154].
8.7 Conclusions
Several drugs designed with the help of protein 3D-structures have reached the market. We conclude that today the question is not if protein structure-based design
works, but if protein structures are available at the beginning of a drug discovery
project in order to support medicinal chemistry. Decades of protein structure elu-
cidation, recent structural genomics initiatives and advances in homology model-
ing techniques have already generated a wealth of structure information. Today,
homology models of varying quality for up to 40% of all sequences can be generated. The 3D-structures of exciting drug targets such as ion channels (a potassium channel [155, 156]), membrane receptors (nicotinic receptors [157], a G protein-coupled receptor [158]), HMG-CoA reductase [159] and a bacterial ribosome [160] have recently been elucidated. This entire pool of structure data
may not only support lead compound identification or optimization, but also many
other steps in the drug discovery process, such as target identification, target val-
idation, selection of appropriate animal test models and prediction of metabolic
stability, side-effects or drug-drug interactions.
8.8 Acknowledgements
The authors thank Oliver Ohlenschläger (IMB-Jena), Tom Sicker (IMB-Jena) and
Ursula Kaulmann (HKI-Jena) for their contributions to Figure 1, and Walter Elger
(EnTec GmbH) for stimulating discussions.
8.9 References
1 Zimmerle CT, Alter GM (1983) Crystallization-induced modification of cytoplasmic malate dehydro-
genase structure and function. Biochemistry 22: 6273-6281
2 Tanenbaum DM, Wang Y, Williams SP et al (1998) Crystallographic comparison of the estrogen and
progesterone receptor's ligand binding domains. Proc Natl Acad Sci USA 95: 5998-6003
3 Lu J, Lin CL, Tang C et al (1999) The structure and dynamics of rat apo-cellular retinol-binding pro-
tein II in solution: comparison with the X-ray structure. J Mol Biol 286: 1179-1195
4 Billeter M, Kline AD, Braun Wet al (1989) Comparison of the high-resolution structures of the alpha-
amylase inhibitor tendamistat determined by nuclear magnetic resonance in solution and by X-ray dif-
fraction in single crystals. J Mol Biol 206: 677-687
5 Billeter M (1992) Comparison of protein structures determined by NMR in solution and by X-ray dif-
fraction in single crystals. Q Rev Biophys 25: 325-377
6 Wüthrich K (1989) Determination of three-dimensional protein structures in solution by nuclear mag-
netic resonance: an overview. Methods Enzymol 177: 125-131
7 McIntosh LP, Dahlquist FW (1990) Biosynthetic incorporation of 15N and 13C for assignment and
interpretation of nuclear magnetic resonance spectra of proteins. Q Rev Biophys 23: 1-38
8 Karplus M (1963) Vicinal proton coupling in nuclear magnetic resonance. J Am Chem Soc 85:
2870-2871
9 Kuntz ID, Thomason JF, Oshiro CM (1989) Distance geometry. Methods Enzymol 177: 159-204
10 Scheek RM, van Gunsteren WF, Kaptein R (1989) Molecular dynamics simulation techniques for determination
of molecular structures from nuclear magnetic resonance data. Methods Enzymol 177: 204-218
11 Brunger AT (1997) X-ray crystallography and NMR reveal complementary views of structure and
dynamics. Nat Struct Biol 4 Suppl: 862-865
12 Pervushin K, Riek R, Wider G et al (1997) Attenuated T2 relaxation by mutual cancellation of dipole-
dipole coupling and chemical shift anisotropy indicates an avenue to NMR structures of very large bio-
logical macromolecules in solution. Proc Natl Acad Sci USA 94: 12366-12371
13 Riek R, Wider G, Pervushin K et al (1999) Polarization transfer by cross-correlated relaxation in solu-
tion NMR with very large molecules. Proc Natl Acad Sci USA 96: 4918-4923
14 Riek R, Pervushin K, Wüthrich K (2000) TROSY and CRINEPT: NMR with large molecular and
supramolecular structures in solution. Trends Biochem Sci 25: 462-468
15 Lesk AM, Chothia CH (1986) The response of protein structures to amino-acid sequence changes.
Philos Trans R Soc London Ser B 317: 345-356
16 Altschul SF, Gish W, Miller W et al (1990) Basic local alignment search tool. J Mol Biol 215: 403-410
17 Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad
Sci USA 85: 2444-2448
18 Zheng Q, Kyle DJ (1996) Accuracy and reliability of the scaling-relaxation method for loop closure:
an evaluation based on extensive and multiple copy conformational samplings. Proteins 24: 209-217
19 Rapp CS, Friesner RA (1999) Prediction of loop geometries using a generalized Born model of solva-
tion effects. Proteins 35: 173-183
20 Fiser A, Do RK, Sali A (2000) Modeling of loops in protein structures. Protein Sci 9: 1753-1773
21 Jones TA, Thirup S (1986) Using known substructures in protein model building and crystallography.
EMBO J 5: 819-822
22 Claessens M, Van Cutsem E, Lasters I et al (1989) Modelling the polypeptide backbone with 'spare
parts' from known protein structures. Protein Eng 2: 335-345
23 Li W, Liu Z, Lai L (1999) Protein loops on structurally similar scaffolds: database and conformation-
al analysis. Biopolymers 49: 481-495
24 Wojcik J, Mornon JP, Chomilier J (1999) New efficient statistical sequence-dependent structure pre-
diction of short to medium-sized protein loops based on an exhaustive loop classification. J Mol Biol
289: 1469-1490
25 Vasquez M (1996) Modeling side-chain conformation. Curr Opin Struct Biol 6: 217-221
26 Dunbrack RLJ, Karplus M (1994) Conformational analysis of the backbone-dependent rotamer pref-
erences of protein sidechains. Nat Struct Biol 1: 334-340
27 Bower MJ, Cohen FE, Dunbrack RLJ (1997) Prediction of protein side-chain rotamers from a back-
bone-dependent rotamer library: a new homology modeling tool. J Mol Biol 267: 1268-1282
28 Dunbrack RLJ (1999) Comparative modeling of CASP3 targets using PSI-BLAST and SCWRL.
Proteins Suppl 3: 81-87
29 De Maeyer M, Desmet J, Lasters I (1997) All in one: a highly detailed rotamer library improves both
accuracy and speed in the modelling of sidechains by dead-end elimination. Fold Des 2: 53-66
30 Cheng B, Nayeem A, Scheraga HA (1996) From secondary structure to three-dimensional structure:
Improved dihedral angle probability distribution function for use with energy searches for native struc-
tures of polypeptides and proteins. J Comp Chem 17: 1453-1480
31 Shenkin PS, Farid H, Fetrow JS (1996) Prediction and evaluation of side-chain conformations for pro-
tein backbone structures. Proteins 26: 323-352
32 Lovell SC, Word JM, Richardson JS et al (2000) The penultimate rotamer library. Proteins 40:
389-408
33 Laskowski RA, MacArthur MW, Moss DS et al (1993) PROCHECK: a program to check the stereo-
chemical quality of protein structures. J Appl Cryst 26: 283-291
34 Hooft RW, Vriend G, Sander C et al (1996) Errors in protein structures. Nature 381: 272
35 EU 3-D Validation Network (1998) Who checks the checkers? Four validation tools applied to eight
atomic resolution structures. J Mol Biol 276: 417-436
36 Murzin AG, Brenner SE, Hubbard T et al (1995) SCOP: a structural classification of proteins database
for the investigation of sequences and structures. J Mol Biol 247: 536-540
37 Hendlich M (1998) Databases for protein-ligand complexes. Acta Crystallogr D Biol Crystallogr 54:
1178-1182
38 Laskowski RA (2001) PDBsum: summaries and analyses of PDB structures. Nucleic Acids Res 29:
221-222
39 Vondrasek J, van Buskirk CP, Wlodawer A (1997) Database of three-dimensional structures of HIV
proteinases. Nat Struct Biol 4: 8
40 Sanchez R, Pieper U, Mirkovic N et al (2000) MODBASE, a database of annotated comparative pro-
tein structure models. Nucleic Acids Res 28: 250-253
41 Sali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol
Biol 234: 779-815
42 Peitsch MC, Schwede T, Guex N (2000) Automated protein modelling-the proteome in 3D.
Pharmacogenomics 1: 257-266
43 Edwards AM, Arrowsmith CH, Christendat D et al (2000) Protein production: feeding the crystallog-
raphers and NMR spectroscopists. Nat Struct Biol 7 Suppl: 970-972
44 Abola E, Kuhn P, Earnest T et al (2000) Automation of x-ray crystallography. Nat Struct Biol 7 Suppl:
973-977
45 Lamzin VS, Perrakis A (2000) Current state of automated crystallographic data analysis. Nat Struct
Biol 7 Suppl: 978-981
46 Montelione GT, Zheng D, Huang YJ et al (2000) Protein NMR spectroscopy in structural genomics.
Nat Struct Biol 7 Suppl: 982-985
47 Burley SK (2000) An overview of structural genomics. Nat Struct Biol 7 Suppl: 932-934
48 Service RF (2000) Structural genomics offers high-speed look at proteins. Science 287: 1954-1956
49 Brenner SE (2000) Target selection for structural genomics. Nat Struct Biol 7 Suppl: 967-969
50 Marti-Renom MA, Stuart AC, Fiser A et al (2000) Comparative protein structure modeling of genes
and genomes. Annu Rev Biophys Biomol Struct 29: 291-325
51 Schafferhans A, Klebe G (2001) Docking Ligands onto Binding Site Representations Derived from
Proteins built by Homology Modelling. J Mol Biol 307: 407-427
52 Schapira M, Raaka BM, Samuels HH et al (2000) Rational discovery of novel nuclear hormone recep-
tor antagonists. Proc Natl Acad Sci USA 97: 1008-1013
53 Spencer TA, Li D, Russel JS et al (2001) Pharmacophore analysis of the nuclear oxysterol receptor
LXRalpha. J Med Chem 44: 886-897
54 Cushman DW, Cheung HS, Sabo EF et al (1977) Design of potent competitive inhibitors of
angiotensin-converting enzyme. Carboxyalkanoyl and mercaptoalkanoyl amino acids. Biochemistry
16: 5484-5491
55 Lipscomb WN, Reeke GNJ, Hartsuck JA et al (1970) The structure of carboxypeptidase A. 8. Atomic
interpretation at 0.2 nm resolution, a new study of the complex of glycyl-L-tyrosine with CPA, and
mechanistic deductions. Philos Trans R Soc Lond B Biol Sci 257: 177-214
56 Petrillo EWJ, Ondetti MA (1982) Angiotensin-converting enzyme inhibitors: medicinal chemistry and
biological actions. Med Res Rev 2: 1-41
57 Goodman LS, Gilman A (1996) Goodman & Gilman's The pharmacological basis of therapeutics.
McGraw-Hill, New York
58 Eriksson AE, Jones TA, Liljas A (1988) Refined structure of human carbonic anhydrase II at 2.0 A res-
olution. Proteins 4: 274-282
59 Greer J, Erickson JW, Baldwin JJ et al (1994) Application of the three-dimensional structures of pro-
tein target molecules in structure-based drug design. J Med Chem 37: 1035-1054
60 Wlodawer A, Miller M, Jaskolski M et al (1989) Conserved folding in retroviral proteases: crystal
structure of a synthetic HIV-1 protease. Science 245: 616-621
61 Thaisrivongs S, Strohbach JW (1999) Structure-based discovery of Tipranavir disodium (PNU-
140690E): a potent, orally bioavailable, nonpeptidic HIV protease inhibitor. Biopolymers 51: 51-58
62 Palese P, Schulman JL, Bodo G et al (1974) Inhibition of influenza and parainfluenza virus replication
in tissue culture by 2-deoxy-2,3-dehydro-N-trifluoroacetylneuraminic acid (FANA). Virology 59:
490-498
63 Varghese JN, Laver WG, Colman PM (1983) Structure of the influenza virus glycoprotein antigen neu-
raminidase at 2.9 A resolution. Nature 303: 35-40
64 von Itzstein M, Wu WY, Kok GB et al (1993) Rational design of potent sialidase-based inhibitors of
influenza virus replication. Nature 363: 418-423
65 Kim CU, Lew W, Williams MA et al (1997) Influenza neuraminidase inhibitors possessing a novel
hydrophobic interaction in the enzyme active site: Design, synthesis, and structural analysis of carbo-
cyclic sialic acid analogues with potent anti-influenza activity. JAm Chem Soc 119: 681-690
66 Mascaretti OA, Danelon GO, Laborde M et al (1999) Recent advances in the chemistry of beta-lactam
compounds as selected active-site serine beta-lactamase inhibitors. Curr Pharm Des 5: 939-953
67 Heinze-Krauss I, Angehrn P, Charnas RL et al (1998) Structure-based design of beta-lactamase
inhibitors. 1. Synthesis and evaluation of bridged monobactams. J Med Chem 41: 3961-3971
68 Kryger G, Silman I, Sussman JL (1999) Structure of acetylcholinesterase complexed with E2020
(Aricept): implications for the design of new anti-Alzheimer drugs. Structure Fold Des 7: 297-307
69 Wilson DK, Petrash JM, Quiocho FA (1997) Structural studies of aldose reductase inhibition. In:
Structure-based drug design, edited by P. Veerapandian, pp. 229-246. Marcel Dekker, Inc., New York,
Basel, Hong Kong
70 Iwata Y, Arisawa M, Hamada R et al (2001) Discovery of novel aldose reductase inhibitors using a pro-
tein structure-based approach: 3D-database search followed by design and synthesis. J Med Chem 44:
1718-1728
71 Ondetti MA, Rubin B, Cushman DW (1977) Design of specific inhibitors of angiotensin-converting
enzyme: new class of orally active antihypertensive agents. Science 196: 441-444
72 Bohacek RS, McMartin C, Guida WC (1996) The art and practice of structure-based drug design: a
molecular modeling perspective. Med Res Rev 16: 3-50
73 Vidgren J (1998) X-ray crystallography of catechol O-methyltransferase: perspectives for target-based
drug development. Adv Pharmacol42: 328-331
74 Masjost B, Ballmer P, Borroni E et al (2000) Structure-based design, synthesis, and in vitro evaluation
of bisubstrate inhibitors for catechol O-methyltransferase (COMT). Chemistry 6: 971-982
75 Katunuma N, Murata E, Kakegawa H et al (1999) Structure based development of novel specific
inhibitors for cathepsin L and cathepsin S in vitro and in vivo. FEBS Lett 458: 6-10
76 Katunuma N, Matsui A, Inubushi T et al (2000) Structure-based development of pyridoxal propionate
derivatives as specific inhibitors of cathepsin K in vitro and in vivo. Biochem Biophys Res Commun
267: 850-854
77 Bayly CI, Black WC, Leger S et al (1999) Structure-based design of COX-2 selectivity into flurbipro-
fen. Bioorg Med Chem Lett 9: 307-312
78 Szklarz GD, Halpert JR (1998) Molecular basis of P450 inhibition and activation: implications for
drug development and drug therapy. Drug Metab Dispos 26: 1179-1184
79 Gschwend DA, Sirawaraporn W, Santi DV et al (1997) Specificity in structure-based drug design:
identification of a novel, selective inhibitor of Pneumocystis carinii dihydrofolate reductase. Proteins
29: 59-67
80 Zuccotto F, Brun R, Gonzalez PD et al (1999) The structure-based design and synthesis of selective
inhibitors of Trypanosoma cruzi dihydrofolate reductase. Bioorg Med Chem Lett 9: 1463-1468
81 Rosowsky A, Cody V, Galitsky N et al (1999) Structure-based design of selective inhibitors of dihy-
drofolate reductase: synthesis and antiparasitic activity of 2,4-diaminopteridine analogues with a
bridged diarylamine side chain. J Med Chem 42: 4853-4860
82 Cregge RJ, Durham SL, Farr RA et al (1998) Inhibition of human neutrophil elastase. 4. Design, syn-
thesis, X-ray crystallographic analysis, and structure-activity relationships for a series of P2-modified,
orally active peptidyl pentafluoroethyl ketones. J Med Chem 41: 2461-2480
83 Filippusson H, Erlendsson LS, Lowe CR (2000) Design, synthesis and evaluation of biomimetic affin-
ity ligands for elastases. J Mol Recognit 13: 370-381
84 Tedesco R, Thomas JA, Katzenellenbogen BS et al (2001) The estrogen receptor: a structure-based
approach to the design of new specific hormone-receptor combinations. Chem Biol 8: 277-287
85 Stauffer SR, Coletta CJ, Tedesco R et al (2000) Pyrazole ligands: structure-affinity/activity relation-
ships and estrogen receptor-alpha-selective agonists. J Med Chem 43: 4934-4947
86 Phillips G, Davey DD, Eagen KA et al (1999) Design, synthesis, and activity of 2,6-diphenoxypyri-
dine-derived factor Xa inhibitors. J Med Chem 42: 1749-1756
87 Arnaiz DO, Zhao Z, Liang A et al (2000) Design, synthesis, and in vitro biological activity of indole-
based factor Xa inhibitors. Bioorg Med Chem Lett 10: 957-961
88 Han Q, Dominguez C, Stouten PF et al (2000) Design, synthesis, and biological evaluation of potent
and selective amidino bicyclic factor Xa inhibitors. J Med Chem 43: 4398-4415
89 Kaminski JJ, Rane DF, Snow ME et al (1997) Identification of novel farnesyl protein transferase
inhibitors using three-dimensional database searching methods. J Med Chem 40: 4103-4112
90 Navia MA (1996) Protein-drug complexes important for immunoregulation and organ transplantation.
Curr Opin Struct Biol 6: 838-847
91 Burkhard P, Hommel U, Sanner M et al (1999) The discovery of steroids and other novel FKBP
inhibitors using a molecular docking program. J Mol Biol 287: 853-858
92 van Neuren AS, Muller G, Klebe G et al (1999) Molecular modelling studies on G protein-coupled
receptors: from sequence to structure? J Recept Signal Transduct Res 19: 341-353
93 Muller G (2000) Towards 3D structures of G protein-coupled receptors: a multidisciplinary approach.
Curr Med Chem 7: 861-888
94 Martin F, Dimasi N, Volpari C et al (1998) Design of selective eglin inhibitors of HCV NS3 proteinase.
Biochemistry 37: 11459-11468
95 Debnath AK, Radigan L, Jiang S (1999) Structure-based identification of small molecule antiviral
compounds targeted to the gp41 core structure of the human immunodeficiency virus type 1. J Med
Chem 42: 3203-3209
96 Hong H, Neamati N, Wang S et al (1997) Discovery of HIV-l integrase inhibitors by pharmacophore
searching. J Med Chem 40: 930-936
97 Wlodawer A, Vondrasek J (1998) Inhibitors of HIV-l protease: a major success of structure-assisted
drug design. Annu Rev Biophys Biomol Struct 27: 249-284
98 De Lucca GV, Erickson-Viitanen S, Lam PY (1997) Cyclic HIV protease inhibitors capable of dis-
placing the active site structural water molecule. Drug Discov Today 2: 6-18
99 Arnold E, Das K, Ding J et al (1996) Targeting HIV reverse transcriptase for anti-AIDS drug design:
structural and biological considerations for chemotherapeutic strategies. Drug Des Discov 13: 29-47
100 Mao C, Sudbeck EA, Venkatachalam TK et al (2000) Structure-based drug design of non-nucleoside
inhibitors for wild-type and drug-resistant HIV reverse transcriptase. Biochem Pharmacol 60:
1251-1265
101 Wade RC (1997) 'Flu' and structure-based drug design. Structure 5: 1139-1145
102 Shahripour AB, Plummer MS, Lunney EA et al (2002) Structure-based design of nonpeptide
inhibitors of interleukin-1 beta converting enzyme (ICE, caspase-1). Bioorg Med Chem 10: 31-40
103 Zask A, Levin JI, Killar LM et al (1996) Inhibition of matrix metalloproteinases: structure based
design. Curr Pharm Des 2: 624-661
104 Brown PD (1998) Matrix metalloproteinase inhibitors. Breast Cancer Res Treat 52: 125-136
105 Matter H, Schwab W, Barbier D et al (1999) Quantitative structure-activity relationship of human neu-
trophil collagenase (MMP-8) inhibitors using comparative molecular field analysis and X-ray struc-
ture analysis. J Med Chem 42: 1908-1920
106 Cheng M, De B, Pikul S et al (2000) Design and synthesis of piperazine-based matrix metallopro-
teinase inhibitors. J Med Chem 43: 369-380
107 Huang H, Martasek P, Roman LJ et al (2000) Synthesis and evaluation of peptidomimetics as selec-
tive inhibitors and active site probes of nitric oxide synthases. J Med Chem 43: 2938-2945
108 Oliver WRJ, Shenk JL, Snaith MR et al (2001) A selective peroxisome proliferator-activated receptor
delta agonist promotes reverse cholesterol transport. Proc Natl Acad Sci USA 98: 5306-5311
109 Schevitz RW, Bach NJ, Carlson DG et al (1995) Structure-based design of the first potent and selec-
tive inhibitor of human non-pancreatic secretory phospholipase A2. Nat Struct Bioi 2: 458-465
110 Mihelich ED, Schevitz RW (1999) Structure-based design of a new class of anti-inflammatory drugs:
secretory phospholipase A(2) inhibitors, SPI. Biochim Biophys Acta 1441: 223-228
111 Bursi R, Groen MB (2000) Application of (quantitative) structure-activity relationships to progesta-
gens: from serendipity to structure-based design. Eur J Med Chem 35: 787-796
112 al-Obeidi FA, Wu JJ, Lam KS (1998) Protein tyrosine kinases: structure, substrate specificity, and
drug discovery. Biopolymers 47: 197-223
113 Shakespeare W, Yang M, Bohacek R et al (2000) Structure-based design of an osteoclast-selective,
nonpeptide src homology 2 inhibitor with in vivo antiresorptive activity. Proc Natl Acad Sci USA 97:
9373-9378
114 Toledo LM, Lydon NB, Elbaum D (1999) The structure-based design of ATP-site directed protein
kinase inhibitors. Curr Med Chem 6: 775-805
115 Sudbeck EA, Liu XP, Narla RK et al (1999) Structure-based design of specific inhibitors of Janus
kinase 3 as apoptosis-inducing antileukemic agents. Clin Cancer Res 5: 1569-1582
116 Furet P, Garcia-Echeverria C, Gay B et al (1999) Structure-based design, synthesis, and X-ray crys-
tallography of a high-affinity antagonist of the Grb2-SH2 domain containing an asparagine mimetic.
J Med Chem 42: 2358-2363
117 Burke TRJ, Zhang ZY (1998) Protein-tyrosine phosphatases: structure, mechanism, and inhibitor dis-
covery. Biopolymers 47: 225-241
118 Doman TN, McGovern SL, Witherbee BJ et al (2002) Molecular docking and high-throughput screen-
ing for novel inhibitors of protein tyrosine phosphatase-1B. J Med Chem 45: 2213-2221
119 Yao ZJ, Ye B, Wu XW et al (1998) Structure-based design and synthesis of small molecule protein-
tyrosine phosphatase 1B inhibitors. Bioorg Med Chem 6: 1799-1810
120 Iversen LF, Andersen HS, Branner S et al (2000) Structure-based design of a low molecular weight,
nonphosphorus, nonpeptide, and highly selective inhibitor of protein-tyrosine phosphatase 1B. J Biol
Chem 275: 10300-10307
121 Ealick SE, Babu YS, Bugg CE et al (1991) Application of crystallographic and modeling methods in
the design of purine nucleoside phosphorylase inhibitors. Proc Natl Acad Sci USA 88: 11540-11544
122 Montgomery JA (1994) Structure-based drug design: inhibitors of purine nucleoside phosphorylase.
Drug Des Discov 11: 289-305
123 Dhanaraj V, Cooper JB (1997) Rational design of renin inhibitors. In: Structure-based drug design,
edited by P. Veerapandian, pp. 321-342. Marcel Dekker, Inc., New York, Basel, Hong Kong
124 Hutchins C, Greer J (1991) Comparative modeling of proteins in the design of novel renin inhibitors.
Crit Rev Biochem Mol Biol 26: 77-127
125 Rahuel J, Rasetti V, Maibaum J et al (2000) Structure-based drug design: the discovery of novel non-
peptide orally active inhibitors of human renin. Chem Biol 7: 493-504
150 Thornton JM, Todd AE, Milburn D et al (2000) From structure to function: approaches and limita-
tions. Nat Struct Biol 7 Suppl: 991-994
151 Zarembinski TI, Hung LW, Mueller-Dieckmann HJ et al (1998) Structure-based assignment of the
biochemical function of a hypothetical protein: a test case of structural genomics. Proc Natl Acad Sci
USA 95: 15189-15193
152 Kleywegt GJ (1999) Recognition of spatial motifs in protein structures. J Mol Biol 285: 1887-1897
153 Xu LZ, Sanchez R, Sali A et al (1996) Ligand specificity of brain lipid-binding protein. J Biol Chem
271: 24711-24719
154 Lewis DF (2000) Modelling human cytochromes P450 for evaluating drug metabolism: an update.
Drug Metabol Drug Interact 16: 307-324
155 Doyle DA, Morais CJ, Pfuetzner RA et al (1998) The structure of the potassium channel: molecular
basis of K+ conduction and selectivity. Science 280: 69-77
156 Armstrong N, Sun Y, Chen GQ et al (1998) Structure of a glutamate-receptor ligand-binding core in
complex with kainate. Nature 395: 913-917
157 Brejc K, van Dijk WJ, Klaassen RV et al (2001) Crystal structure of an ACh-binding protein reveals
the ligand-binding domain of nicotinic receptors. Nature 411: 269-276
158 Palczewski K, Kumasaka T, Hori T et al (2000) Crystal structure of rhodopsin: A G protein-coupled
receptor. Science 289: 739-745
159 Istvan ES, Deisenhofer J (2001) Structural mechanism for statin inhibition of HMG-CoA reductase.
Science 292: 1160-1164
160 Ban N, Nissen P, Hansen J et al (2000) The complete atomic structure of the large ribosomal subunit
at 2.4 A resolution. Science 289: 905-920
161 Berchtold H, Hilgenfeld R (1999) Binding of phenol to R6 insulin hexamers. Biopolymers 51:
165-172
162 Ohlenschläger O, Ramachandran R, Gührs KH et al (1998) Nuclear magnetic resonance solution
structure of the plasminogen-activator protein staphylokinase. Biochemistry 37: 10635-10642
163 Steinmetzer K, Hillisch A, Behlke J et al (2000) Transcriptional repressor CopR: structure model-
based localization of the deoxyribonucleic acid binding motif. Proteins 38: 393-406
9 NMR-based screening methods for lead discovery

M. Vogtherr and K. Fiebig
9.1 Introduction
9.2 Target-based methods-SAR by NMR
9.3 Ligand-based methods
9.3.1 Dynamic NMR-fast and slow exchange
9.3.2 Methods based on relaxation properties
9.3.3 Diffusion methods
9.3.4 "Combination" methods
9.4 Compound libraries for NMR-based screening
9.5 Summary and conclusions
9.6 References
9.1 Introduction
Traditionally the role of nuclear magnetic resonance (NMR) in drug discovery has
been as an analytical tool to aid chemists in characterizing small molecule com-
pounds and identifying novel natural products (see Chapter 5). Today, due to the
development and massive application of x-ray crystallography and NMR to deter-
mine atomic-resolution structures of proteins, nucleic acids, and their complexes,
NMR has established itself as a key method in structure-based drug design [1] (see
also Chapter 8). Unfortunately, structure-based drug design by NMR becomes
more difficult with increasing molecular weight of the target. Thus, although
rational drug design has proven to be effective for several targets, there has been
a major shift to high throughput screening (HTS) of large libraries of compounds.
HTS methods only require a sufficiently robust assay which, when miniaturized,
is used to screen hundreds of thousands of small molecule compounds.
The recent emphasis on high throughput screening (HTS) methods has also
greatly influenced the role of NMR in drug discovery. Since obtaining a robust
HTS assay system for drug activity can be difficult, NMR has emerged as a pow-
erful and versatile screening technique for ligand binding [2, 3]. Furthermore, a
significant fraction of NMR based screening strategies is independent of the
molecular weight of the target molecule. NMR also provides a generic binding
assay, which is independent of the target's biological function. This may enable
screening of proteins with unknown function, which now are identified at an ever-
increasing rate through genomic sequence analysis (see Chapter 1).
NMR employs the same feature to detect binding as other spectroscopic meth-
ods, by monitoring changes in spectral properties brought about by the binding
process [4-6]. Compared to UV-based screening methods NMR is rather insensi-
tive. However, there have been significant advances in sensitivity due to instru-
mentation design such that NMR can be used at much reduced sample concentra-
tions. In spite of its inferior sensitivity NMR possesses several advantages over
other techniques. First, NMR is a method with extremely high analytical value.
Using NMR techniques, it has been possible to elucidate virtually any molecular
structure up to 3D structures of proteins and nucleic acids. Second, there is often
a multitude of spectral properties, which change due to the binding of a ligand.
Third, modern NMR techniques combined with isotopic labeling strategies can
record changes in spectral properties of the desired compounds virtually without
background.
NMR has been used in countless studies to explore molecular binding and
interaction processes (reviewed in [4-6]). Many of these date back to the early
days of NMR well before the era of high-resolution protein structure determina-
tion by NMR. In most of these studies a well-defined set of interacting partners
was examined, and no attempt was made to find novel ligands through "screen-
ing" of large libraries of compounds. However, many of these studies have used
the same features of NMR spectra and their specific changes caused by binding of
a ligand as modern screening methods do. In particular, broadening or shifting of
resonance lines can be monitored very easily using simple 1D NMR techniques.
Most screening experiments in pharmaceutical research investigate the interac-
tion between a macromolecular target and numerous potential inhibitors or lig-
ands either individually or as mixtures. For NMR based screening, both the target
and the ligand display NMR signals, which may be perturbed by an interaction.
Hence there arises a natural classification: methods which monitor the NMR sig-
nals of the target protein, and methods which monitor those of the ligands. As will
be seen in the next two sections, this distinction is sensible both conceptually and
methodologically.
9.2 Target-based methods-SAR by NMR

One of the first and perhaps best-known screening methods utilizing NMR is the "SAR by NMR" (structure-activity relationship by NMR) approach developed by
Fesik et al. at Abbott Laboratories [7-10]. This technique monitors all amide chemical shifts of a uniformly 15N-labeled target protein, which can be produced quite cost-effectively by overexpression in bacteria. Chemical shifts of both 1H- and 15N-amide atoms are recorded as peaks in a two-dimensional spectrum, the 15N-1H-HSQC (heteronuclear single quantum correlation) spectrum. A small region of such a 15N-1H-HSQC spectrum is shown in Figure 1a (black contours). In such a spectrum each amino acid residue except proline gives rise to one single
peak that can be assigned using standard high-resolution multidimensional NMR
techniques [11].
Figure 1. Monitoring binding activity by inspection of 2D heteronuclear correlation spectra. (a) Shown is a small region of 15N-1H-HSQC spectra of uniformly 15N-labeled FKBP (0.25 mM) without (black) and with (white) 2-phenylimidazole at a concentration of 1.0 mM. (b) Amide resonances, which shift upon
binding of 2-phenylimidazole, are mapped onto the structure of FKBP (black). FK506 is shown binding
to the active site as a reference.
If a small molecule binds to the protein, the nuclear spins of the nitrogen and hydrogen atoms in the binding site are affected and the corresponding peaks in the 15N-1H-HSQC spectrum shift (Fig. 1a, white contours). If the ligand interacts with the protein backbone, then these perturbations are especially strong. Due to the good amide chemical shift dispersion of folded proteins, most resonances in the 2D 15N-1H-HSQC are well resolved such that even small perturbations in only one of the dimensions can be detected. If amide 15N and 1H chemical shifts have been assigned previously, and if the structure of the target protein is known, then the ligand binding site can be identified and mapped onto the surface of the structure (Fig. 1b).
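A common way to turn the peak movements described above into numbers is a combined 1H/15N chemical shift perturbation, with the 15N change scaled down because the nitrogen chemical shift range is much wider. The formula is a widely used heuristic rather than a prescription from this chapter, and the residue labels and shift values below are illustrative (the names echo peaks visible in Figure 1a):

import math

def csp(delta_h_ppm: float, delta_n_ppm: float, n_scale: float = 0.2) -> float:
    """Combined amide chemical shift perturbation in ppm."""
    return math.sqrt(delta_h_ppm**2 + (n_scale * delta_n_ppm)**2)

# residue: (1H shift change, 15N shift change) between free and bound spectra
shifts = {"Q53": (0.02, 0.10), "Y26": (0.11, 0.90), "V24": (0.09, 1.20)}
threshold = 0.05      # flag perturbations above this as binding-site candidates

for residue, (dh, dn) in shifts.items():
    d = csp(dh, dn)
    flag = "  <- likely binding site" if d > threshold else ""
    print(f"{residue}: {d:.3f} ppm{flag}")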
15N-1H-HSQC spectra can be acquired in 10 to 30 min for moderately concentrated protein solutions (approximately 0.3 mM). Thus, up to 1000 compounds can be screened per day if mixtures of 10 are evaluated simultaneously (pool-of-10 approach). This includes time spent on mixtures with binding compounds, which
need to be deconvoluted by performing experiments on each of the single com-
pounds.
Prior to SAR by NMR, several published reports had used 2D NMR spectroscopy to detect and localize ligand binding (e.g., [12, 13]). However, SAR by NMR was the first method to propose a 15N-1H-HSQC-based screening technique to identify ligands from a library of compounds. The crucial and new point in the SAR by NMR concept was to use information about the
and new point in the SAR by NMR concept was to use information about the
localization of the binding sites of different ligands to synthesize new higher affin-
ity ligands as shown in (Fig. 2a). First, ligand I and the corresponding binding site
I are identified and optimized (i). Second, screening in the presence of an excess
of ligand 1 is carried out to identify and then optimize the neighboring ligand 2
(ii). Third, experimentally derived structural information is used to guide the
design of linkers between ligand 1 and ligand 2 (iii). Although the individuallig-
and fragments may only bind weakly due to their small size, tethering of the two
fragments will result in a molecule displaying a binding affinity equal to the prod-
uct of the two fragment binding affinities plus the affinity gained by the linker.
This principle is well known in chemistry as the "chelate effect." In Figure 2b sev-
eral other strategies are shown that may be used to optimize binding affinities of
initially found leads. Fragment optimization (i) and fragment linking (ii) have
already been discussed. Generation of a directed library (iii) based on an initial
fragment may be a viable alternative if second site ligands cannot be identified. If
several ligands can be identified which bind to the same site, then fragment-merg-
ing (iv) using classical SAR approaches may be used. Typically similar fragments
are superimposed and docked into the binding site using the differential chemical
shift perturbation they cause in the protein 15N-1H-HSQC. Furthermore, a strate-
gy of fragment optimization via disassembly, shown in Figure 2b (v), can be
employed to enhance the binding of larger molecules found by other screening
techniques.
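The energetic logic behind fragment linking can be made concrete: because binding free energies are roughly additive, the dissociation constants of the fragments multiply, up to a linker-dependent factor. The sketch below uses a 1 M reference term for that factor (an assumption made purely to keep the units consistent) and invented fragment affinities:

def linked_kd(kd1_molar: float, kd2_molar: float,
              linker_term_molar: float = 1.0) -> float:
    """Crude estimate of the Kd of two linked fragments (in molar units)."""
    return kd1_molar * kd2_molar / linker_term_molar

kd = linked_kd(2e-3, 1e-4)   # two weak fragments: Kd of 2 mM and 100 uM
print(f"estimated Kd of linked compound ~ {kd:.0e} M")   # ~2e-07 M, i.e., 200 nM

Two weak, millimolar-to-micromolar fragments can thus, in the ideal case, combine into a sub-micromolar binder, which is why even very weak initial hits are valuable in this strategy.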
In the initial SAR by NMR publication [7] the Abbott group demonstrated
using the FK506 binding protein FKBP-12 that several fragments with micromo-
lar to millimolar binding affinity could be identified and linked to form a nanomo-
lar inhibitor of FKBP-12.

Figure 2. (a) Schematic principle of the SAR by NMR approach. (b) Several approaches to design enhanced ligands by evaluation of binding sites. In each case, information about the localization of the bound ligand within the protein is utilized to streamline synthesis of tight-binding ligands.

Although the resulting FKBP-12 inhibitors were of lit-
tle therapeutic value, demonstration of the SAR by NMR concept set the stage for
subsequent studies on more interesting targets: the matrix metalloproteinase
stromelysin [8], the E2 protein from the human papilloma virus (HPV E2) [9], and
the Erm methyltransferase [10]. Three important conclusions can be drawn from these studies: 1) all studies employ 15N-1H-HSQC-based screening to identify initial weakly bound ligands with KD values in the millimolar range, which are then optimized using SAR. Several of these initial compounds would not have been detected using a high throughput screen. 2) Linking of optimized fragments of neighboring binding sites produced inhibitors in the low nanomolar and micromolar range for stromelysin and HPV E2, respectively. For neither of these targets did high throughput screening of more than 100,000 compounds of the Abbott corporate library produce inhibitors with superior activities. 3) Diverse lead structures can be discovered and optimized rapidly because chemical synthesis efforts can be highly directed. In the case of stromelysin a 15 nanomolar inhibitor was found in less than 6 months [8]. For HPV E2, micromolar inhibitors which bind to the DNA-binding site were discovered [9]. Remarkably, with only little prior pharmacological knowledge, complex macromolecular DNA-protein interactions could be inhibited.
A large advantage of the 15N-1H-HSQC screening method is that it is highly sensitive for detecting even weakly bound species. Dalvit et al. [14] were able to
detect shifts in FKBP caused by weakly associated DMSO (which in most stud-
ies of this kind is used for the preparation of ligand stock solutions). Based upon
this finding, the authors could not only localize the binding site, but also could
design sulfoxide-based inhibitors that bind to FKBP much more tightly. This
study elegantly emphasizes the power of NMR screening for low affinity ligands
that are optimized in a second step for tighter binding.
Currently SAR by NMR is being applied to a multitude of low molecular weight (<30 kDa) targets and certainly many suitable targets remain to be discovered. Nevertheless, protein target selection for SAR by NMR has been severely limited by two factors: 1) the size of the target protein for which 15N-1H-correlated spectra can be recorded and 2) the relatively large concentrations of protein and ligand required.
To overcome the first limitation, transverse relaxation optimized spectroscopy
[15] (TROSY) in combination with large magnetic fields is expected to extend the
molecular weight barrier past 100 kDa (see Chapter 8). Pellecchia et al. [16] have
successfully used TROSY technology to map the binding interactions in the
74 kDa complex of the pilus chaperone FimC with the adhesin FimH. In such
large proteins, spectral overlap may become a severe problem since the number of
peaks increases linearly with the protein size. To evaluate such spectra, a combi-
nation of selective labeling techniques and smart analysis software will be essen-
tial. Pattern recognition approaches will be needed to detect the minute changes
of the amide resonances in a small fraction of the protein's amino acid residues.
The second major limitation, the solubility and protein availability limitation,
can only be overcome by increasing the sensitivity of NMR instrumentation itself.
New helium-cooled cryogenic NMR probes can enhance the performance of high
resolution NMR up to four times by cooling the radio frequency (RF) detection
coil to 25 K (-248°C). Recently Hajduk et al. have applied this technology to SAR
by NMR to achieve sensitivity gains of a factor of 2.4 when compared to conven-
tional probe technology [17]. This allows a reduction in protein and ligand con-
centrations to 50 µM, which shifts the stringency of the screen toward tighter
binding molecules. As a consequence, the number of compounds per mixture can
be increased from 10 to 100. Using cryogenically cooled NMR probes, libraries of
200,000 compounds can be tested in less than a month with much reduced protein
consumption. Throughput of 15N-1H-HSQC-based screening is thus greatly
enhanced and may now be comparable to the throughput of other HTS techniques
(see Chapter 4).
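To put these numbers in perspective (the per-sample measuring time is assumed here purely for illustration): 200,000 compounds at 100 compounds per mixture correspond to 2,000 HSQC samples, and at roughly 15 min of spectrometer time per sample this gives

\[ 2{,}000 \times 15\ \mathrm{min} = 30{,}000\ \mathrm{min} \approx 21\ \mathrm{days}, \]

consistent with the less-than-a-month figure quoted above.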
Up to now, the SAR by NMR method has mainly been applied to enzymes. In contrast to conventional assay-based HTS methods, however, SAR by NMR can easily be applied to other types of proteins, to nucleic acids, or to protein-protein and protein-nucleotide complexes, since no enzymatic activity is required. The power of SAR by NMR lies in the direct assignment of binding activity to regions within the protein, which would not be possible by other methods. One disadvantage of 15N-1H-HSQC-based screening is that it gives no information about which ligand has bound. This information is usually obtained by a deconvolution step, which consists of recording 15N-1H-HSQC spectra of the individual compounds. As an alternative to deconvolution, the ligand-based NMR methods described in the next section may be used.
Figure 3. (a) The distinction between slow and fast exchange on the NMR timescale. If the release of the ligand is slower than the frequency difference between the signals, then two signals are observed. If the release is faster than the frequency difference, then only an averaged signal can be acquired. (b) A binding study that was conducted by measuring T2 relaxation rates through observation of line broadening. Spectra were recorded at varying concentrations of an unknown 200 kDa protein from industrial research (0.0 to 2.0 mg/ml) to which a mixture of 10 ligands, each at 1 mM concentration, was added. Shown is a region of the 1D spectrum containing signals of a binding (outer lines) and of a nonbinding ligand (inner multiplet).
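In quantitative terms, two separate signals are seen when the exchange rate is small compared with the frequency separation (in Hz) of the free and bound resonances, while in the opposite, fast-exchange limit the observed shift is the population-weighted average

\[ \delta_{\mathrm{obs}} = p_{\mathrm{free}}\,\delta_{\mathrm{free}} + p_{\mathrm{bound}}\,\delta_{\mathrm{bound}}, \qquad p_{\mathrm{free}} + p_{\mathrm{bound}} = 1. \]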
Figure 4. The transferred NOE effect. (a) The principles of this technique: In the fast exchange limit, the signals for bound and unbound ligand add up, and the sum is dominated by the much larger negative NOE effect arising from the bound ligand. (b) A TrNOE study (adapted from [37]). NOESY spectra of samples, which contain four isomeric methylated β-methylgalactosides, at concentrations of 1 mM each. When a galactose-binding lectin from elderberry is added to one of the samples (right spectrum), signals of two of the ligands change their sign, indicating that two of the four isoforms bind to the lectin protein. Positive peaks are displayed as white contours, while negative peaks are shown as black contours.
For E-selectin, a SAR by NMR study would have been impossible since the molecular weight of E-selectin at 220 kDa is far beyond the present limit for the SAR by NMR screening technique.
The main requirement for the application of TrNOEs is the existence of pairs
of ligand protons that are both close in space and well separated in the NMR spec-
trum to avoid spectral overlap with diagonal peaks. To overcome this overlap
problem the TrNOE effect can also be used as a filter in multidimensional spectra
such as the 3D TOCSY-TrNOESY spectra [26]. Such spectra are rather time-con-
suming and thus are not appropriate for fast screening applications.
Besides its value in screening, TrNOE spectra can yield important information about the structure of the bound ligand [23]. This may be crucial when attempting to
superimpose several flexible ligands for the construction of a pharmacophore
model. A whole arsenal of other NMR methods such as isotope-filter methodolo-
gy [27] exists to obtain detailed information about the binding mode of the ligand.
Isotope-filters and isotope edited NOE spectra can yield direct distance con-
straints between ligand and protein and therefore will allow precise docking of the
ligand into the binding site. These methods cannot be termed "screening" any
more, but provide a direct connection to structure-based drug design [1-6], a link
between structural and functional methodology, which is unique in biophysical
chemistry.
The use of pulsed field gradients in high-resolution NMR readily allows the meas-
urement of diffusion rates of solutes in liquids [28]. The concepts inherent in this
technique are illustrated in Figure 5. Application of a pulsed field gradient causes
a spatial change in the Bo magnetic field of the NMR instrument and results in a
spatially encoded dephasing of magnetization. After switching off the gradient all
nuclear spins are dephased and thus would add up to zero signal. Fortunately the
process of dephasing is reversible, and applying exactly the same gradient with
opposite sign will rephase spins. Due to the spatial encoding of the phases the
amount of magnetization that can be restored by the second gradient pulse is
inversely proportional to the diffusion that has taken place in the meantime. The
less a molecule has moved, the more magnetization can be recovered by this so-
called gradient echo. This principle is easily applied to screening. If a small mol-
ecule is bound to a large protein, its diffusion will be slowed down due to time
spent in the bound state and the much slower diffusion rate of the protein. Thus
less magnetization will be recovered for non-binding molecules, which diffuse
more rapidly, than for ones that bind, which diffuse more slowly.
Diffusion is measured by systematically varying diffusion time, gradient
length, or gradient strength. The relationships between these parameters and the
Stokes-Einstein diffusion coefficient D can then be used to obtain the values of D
from several experiments. For screening purposes, this would still mean rather
long measuring times. It is therefore easier to compare relative signal intensities
from diffusion experiments with and without protein. In a difference spectrum of
ligand signals recorded with and without protein, only the signals of bound lig-
ands should remain visible [23, 30-33]. The diffusion filter building block can
also be incorporated into 2D pulse sequences. The TOCSY spectrum, in which correlation peaks between all protons belonging to the same spin system are observed, is especially well suited to this approach.
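For reference, the recovered signal in such a pulsed-gradient echo follows the Stejskal-Tanner relation [28], and D is connected to molecular size through the Stokes-Einstein equation:

\[ S(g) = S(0)\,\exp\!\left[-\gamma^{2} g^{2} \delta^{2}\left(\Delta - \frac{\delta}{3}\right) D\right], \qquad D = \frac{k_{B}T}{6\pi\eta\, r_{H}}, \]

where g and δ are the gradient strength and duration, Δ the diffusion delay, γ the gyromagnetic ratio, η the solvent viscosity and r_H the hydrodynamic radius. A ligand that spends part of its time bound to a slowly diffusing protein therefore shows a smaller effective D and weaker echo attenuation than a non-binder.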
Figure 5. Principle of diffusion-based NMR spectroscopy. A dephasing gradient pulse encodes each vol-
ume element along the magnet axis. Fast diffusion as in (A) spreads this encoded magnetization over a
larger volume than slow diffusion (B). By a second, rephasing gradient only the signals of molecules that
did not change their position during the diffusion time are recovered. This scheme can be incorporated into 2D sequences or more complex 1D schemes such as the NOE pumping experiment discussed below (Fig. 7a).
In the comparatively simple methods discussed so far only one NMR parameter is
exploited to detect binding. The more powerful "combination" methods discussed
subsequently employ a more complex strategy. The Saturation Transfer
Difference (STD) [34] technique depicted in Figure 6a is an example of such a combined method, which uses selective irradiation of the protein in combination
with magnetization transfer effects due to ligand binding.
In the STD technique a long and selective irradiation pulse (several seconds) is
used to saturate protein spins, resulting in the equilibration of the populations of the α and β spin states. Thus saturation can be viewed as the bleaching out of protein
Figure 6. Saturation transfer difference (STD) NMR. (a) Illustration of the technique: The protein magnetization is saturated ("bleached out") via selective radio frequency (RF) irradiation of protein resonances. Saturation then spreads through the whole protein by spin diffusion. Saturation only affects bound ligands, but not free ligands. This effect can be seen in a difference spectrum, in which only signals of bound ligands survive. Panels (b) and (c) show 800 MHz STD NMR spectra of 1 mM β-methylglucoside, 1 mM β-methylgalactoside and 0.02 mM of a galactose-binding lectin from elderberry. (b) 1D STD NMR spectrum (top) as compared to a reference 1D spectrum (bottom). (c) A 2D STD-TOCSY (white) is overlayed on top of a standard 2D TOCSY (black) spectrum. In the black spectrum spin systems of all ligands are visible, whereas in the white STD-filtered TOCSY spectrum only spin systems of bound ligands are seen. Measuring time was 12 min for the 1D STD spectrum in (b) and 12 h for the 2D STD-TOCSY shown in (c).
resonances, since the difference in the populations of the α and β spin states is directly proportional to the NMR signal. It is important to carefully select the frequen-
cy of the saturation pulse by centering it on a few isolated resonances of the pro-
tein such that signals of unbound ligands are not bleached out as well. Although only a few resonances of the protein are hit initially, saturation spreads out through
the whole protein mediated by NOEs between spatially close protons. This
process is called spin diffusion and increases in efficiency with the protein's mole-
cular size. Naturally, bound ligand will also be saturated. Upon dissociation, sat-
uration of the bound ligand is transferred to the free ligand population, which
slightly attenuates signals of ligands that bind to the target protein. The small
decrease in intensity of a bound ligand can be detected by subtracting the spec-
trum from a reference spectrum. In the resulting difference spectrum, only signals
of bound ligands survive. Moreover, this concept can be incorporated into virtu-
ally any modern NMR experiment. As can be seen in Figure 6b and c, 1D and 2D
STD homonuclear correlation spectra allow the unique identification of the bound
β-methylgalactoside from a mixture of β-methylgalactoside and β-methylgluco-
side. Furthermore, STD spectroscopy can be combined with high-resolution
magic angle spinning (HR-MAS) techniques, commonly used for solid state NMR
purposes, to allow screening using an immobilized protein. Immobilized protein
may significantly aid recovery of precious ligands. As a proof of principle, HR-
MAS in combination with STD NMR was utilized to detect a lectin-sugar inter-
action [35] when the lectin was tethered to controlled porous glass.
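In practice the STD spectrum is obtained as the difference

\[ I_{\mathrm{STD}} = I_{\mathrm{off}} - I_{\mathrm{on}}, \]

where I_on is recorded with on-resonance saturation of the protein and I_off is the off-resonance reference; resonances of non-binders are identical in both spectra and cancel exactly.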
Very similar to STD spectroscopy is the NOE pumping experiment [36] illus-
trated in Figure 7a. In this technique protein magnetization is selected not by spin
diffusion, but via a diffusion filter similar to the one described above for diffusion-
based NMR screening. The resulting protein magnetization is transferred onto a
bound ligand by intermolecular NOEs. Subsequently ligand magnetization is
shuttled to the solution upon dissociation of the ligand-protein complex. In con-
trast, molecules that do not bind to the protein target will not build up magnetiza-
tion in solution. Therefore, only magnetization from bound ligands will be detect-
ed in a 1D experiment.
The basic idea of selecting protein signals using the protein's spectroscopic
properties can also be applied to exclusively select ligand signals. This principle,
which is complementary to the other two NOE based methods, has been termed
"reverse NOE pumping" (RNP) [37] and functions as illustrated in Figure 7b.
First, ligand signals are selected by a simple T2 filter, which utilizes the much faster T2 relaxation of a protein as compared to a ligand. After selection of the lig-
and magnetization, those molecules that bind to the protein will lose some of their
magnetization to the (unmagnetized) protein scaffold. Consequently, signals of
ligands that bind to the protein will be attenuated, whereas signals of non-binding
small molecules will remain unchanged. In a difference spectrum, again only res-
onances of bound ligands remain.
The common scheme of these methods is that they consist of two steps. In the
first step either ligand or protein magnetization is selected by utilizing the specif-
ic NMR properties of the ligand or protein. In the second step, magnetization is
Figure 7. Simple and reverse NOE pumping experiments. (a) The NOE pumping experiment starts with
initial magnetization uniformly distributed among all molecules (1). From this magnetization only protein
magnetization is selected (2) via the slow diffusion properties of the protein (diffusion filter). This mag-
netization is then transferred to bound ligands (3). Ligand signals of bound ligands can be detected, where-
as unbound ligands remain invisible. No subtraction is necessary, which reduces experiment time and
avoids subtraction artifacts. (b) The reverse NOE pumping experiment (RNP) also starts from an initial uniform magnetization of protein and ligands (1). Then, only the magnetization of the ligands is selected
(2) by utilizing the more favorable relaxation properties of the ligands (relaxation filter). Bound ligands
lose magnetization while they are associated with the protein, which is not the case for free ligands. The
lost magnetization can be detected by subtraction from a reference spectrum.
transferred to the binding partner by intermolecular NOEs. In the case of STD and
RNP, subtraction from a reference spectrum follows. All three methods end up
with 1H magnetization solely of bound ligands, which can subsequently be used
for a whole range of 2D-NMR techniques. However, multidimensional NMR
spectra utilizing NOE pumping or RNP have not yet been reported in the litera-
ture.
An instructive example is NMR-based prescreening with the SHAPES strategy [41], in which weakly binding compounds identified via diffusion-edited spectroscopy are used to guide computational clustering of compound databases and thereby significantly bias the set of compounds that then undergoes high-throughput screening. This NMR-based prescreening strategy resulted in a fourfold increase in HTS hit rates when compared to a randomly selected set of compounds.
9.5 Conclusions

The diversity and robustness of NMR-based screening methods make these techniques highly attractive as tools for drug discovery. Although not all of the screening techniques discussed here may be applicable to any given target, there is a good chance that at least one of the described methods will prove productive in finding
several medium affinity ligands. A comparison of each of the methods is given in
Table 1. For drug targets of molecular weight <30 kDa SAR by NMR appears to
be the method of choice since it yields detailed information about the location of
the binding site. It remains to be seen whether 15N-1H-TROSY-based screening
techniques will prove useful for larger protein targets, especially considering the
added effort needed for spectral assignment and the increased complexity due to
spectral overlap. Nevertheless, with the application of new cryo-cooled NMR
probes, 15N-1H-HSQC-based screening can now be considered a high throughput
method.
Ligand-based NMR screening methods can be used for protein targets of virtu-
ally any size, but are restricted in the ligand's binding affinity range. Because suf-
ficient ligand-protein dissociation rates are needed, only binding of ligands with
low (millimolar) to intermediate (micromolar) affinities is detectable. It is expect-
ed that cryo-cooled NMR probe technology will also advance ligand detected
NMR screening to the high throughput level. Certainly protein and ligand con-
centrations can be lowered drastically and experiment times can be shortened with
increased sensitivity. However, spectral overlap will be of major concern when
mixtures of up to 100 compounds are to be screened. For such applications only
techniques for which the signals of bound ligands survive will be useful, and
sophisticated software will be needed to deconvolute the spectra of multiple
bound ligands. Although only ligands with medium to low affinities can be found,
ligand based NMR screening has been used as an effective prescreening tool for
assay based high throughput screening. Identifying a large ensemble of medium
affinity ligands may not only aid in building a binding site pharmacophore model
(see Chapter 11), but also may yield crucial information for overcoming tissue
availability, toxicity, or even intellectual property related problems.
Although NMR based screening is only one of the more recent additions to the
bag of tools used in drug discovery [1, 2], its simplicity and wide range of applications (including protein-protein and protein-nucleic acid interactions) have attracted much attention. Advances in NMR instrumentation and methodology have
already paved the road for NMR based screening to become a high throughput
technique. In addition to this, NMR is exceptional in the amount of detailed struc-
Table 1. Comparison between the screening techniques discussed in the text. Given concentrations are
those stated in the cited references. The cryo-cooled probe technology was not taken into account for the
ligand-based methods, where it would also facilitate detection of binding at much lower concentrations.
tural information it can provide. Not only can NMR readily reveal the binding site (15N-1H-HSQC screening) or the conformation of the bound ligand (transferred NOE), but it can also supply information that enables precise docking of the lig-
and to the protein's binding pocket (isotope-filtered NOESY). NMR data can
therefore provide a natural connection between experimental HTS and combina-
torial chemistry techniques with computational methods such as 3D-database
searching (see Chapter 10), virtual screening (docking) and structure-based ligand
design (see also Chapter 8).
9.6 References
1 Roberts GCK (1999) NMR spectroscopy in structure-based drug design. Curr Opin Biotech 10: 42-47
2 Moore JM (1999) NMR screening in drug discovery. Curr Opin Biotech 10: 54-58
3 Roberts GCK (2000) Applications of NMR in drug discovery. Drug Discovery Today 5: 230-240
4 Feeney J, Birdsall B (1993) NMR studies of protein-ligand interactions. In: Roberts GCK (ed): NMR of macromolecules. Oxford University Press, Oxford, 183-215
5 Jardetzky O, Roberts GCK (1981) NMR in molecular biology. Academic Press, San Diego
6 Craig DJ, Higgins KA (1998) NMR studies of ligand-macromolecule interactions. Annu Rep NMR Spectrosc 22: 61-138
7 Shuker SB, Hajduk PJ, Meadows RP et al (1996) Discovering high-affinity ligands for proteins: SAR by NMR. Science 274: 1531-1534
8 Hajduk PJ, Sheppard G, Nettesheim DG et al (1997) Discovery of potent nonpeptide inhibitors of stromelysin using SAR by NMR. J Am Chem Soc 119: 5818-5827
9 Hajduk PJ, Dinges J, Miknis GF et al (1997) NMR-based discovery of lead inhibitors that block DNA binding of the human papillomavirus E2 protein. J Med Chem 40: 3144-3150
10 Hajduk PJ, Dinges J, Schkeryantz JM et al (1999) Novel inhibitors of Erm methyltransferases from NMR and parallel synthesis. J Med Chem 42: 3852-3859
11 Sattler M, Schleucher J, Griesinger C (1999) Heteronuclear multidimensional NMR experiments for the structure determination of proteins in solution employing pulsed field gradients. Prog NMR Spectrosc 34: 93-158
12 Rizo J, Liu ZP, Gierasch LM (1994) 1H and 15N resonance assignments and secondary structure of cellular retinoic acid-binding protein with and without bound ligand. J Biomol NMR 4: 741-760
13 Hensmann M, Booker GW, Panayotou G et al (1994) Phosphopeptide binding to the N-terminal SH2 domain of the p85α subunit of PI 3'-kinase: A heteronuclear NMR study. Protein Science 3: 1020-1030
14 Dalvit C, Floersheim P, Zurini M et al (1999) Use of organic solvents and small molecules for locating binding sites on proteins in solutions. J Biomol NMR 14: 23-32
15 Pervushin K, Riek R, Wider G et al (1997) Attenuated T2 relaxation by mutual cancellation of dipole-dipole coupling and chemical shift anisotropy indicates an avenue to NMR structures of very large biological macromolecules in solution. Proc Natl Acad Sci USA 94: 12366-12371
16 Pellecchia M, Sebbel P, Hermanns U et al (1999) Pilus chaperone FimC-adhesin FimH interactions mapped by TROSY-NMR. Nat Struct Biol 6: 336-339
17 Hajduk PJ, Gerfin T, Boehlen JM et al (1999) High-throughput nuclear magnetic resonance-based screening. J Med Chem 42: 2315-2317
18 Lian LY, Roberts GCK (1993) Effects of chemical exchange on NMR spectra. In: Roberts GCK (ed): NMR of macromolecules. Oxford University Press, Oxford, 153-182
19 van de Ven FJM (1995) Multidimensional NMR in liquids. VCH, Weinheim
20 Limmer S, Vogtherr M, Nawrot B et al (1997) Specific recognition of a minimal model of aminoacylated tRNA by the elongation factor Tu of bacterial protein biosynthesis. Angew Chem Int Ed Engl 36: 2485-2489
21 Scherf T, Anglister J (1993) A T1ρ-filtered two-dimensional transferred NOE spectrum for studying antibody interactions with peptide antigens. Biophys J 64: 754-761
22 Hajduk PJ, Olejniczak ET, Fesik SW (1997) One-dimensional relaxation- and diffusion-edited NMR methods for screening compounds that bind to macromolecules. J Am Chem Soc 119: 12257-12261
23 Ni F (1994) Recent developments in transferred NOE methods. Prog NMR Spectrosc 26: 517-606
24 Meyer B, Weimar T, Peters T (1997) Screening mixtures for biological activity by NMR. Eur J Biochem 246: 705-709
25 Henrichsen D, Ernst B, Magnani JL et al (1999) Bioaffinity NMR spectroscopy: Identification of an E-selectin antagonist in a substance mixture by transfer NOE. Angew Chem Int Ed Engl 38: 98-102
26 Herfurth L, Weimar T, Peters T (2000) Application of 3D TOCSY-TrNOESY for the assignment of bioactive ligands from mixtures. Angew Chem Int Ed Engl 39: 2097-2099
27 Breeze AL (2000) Isotope-filtered NMR methods for the study of biomolecular structure and interactions. Prog NMR Spectrosc 36: 323-372
28 Stejskal EO, Tanner JE (1965) Spin diffusion measurement: Spin echoes in the presence of a time-dependent field gradient. J Chem Phys 42: 288-292
29 Lin M, Shapiro MJ (1996) Mixture analysis in combinatorial chemistry. Application of diffusion-resolved NMR spectroscopy. J Org Chem 61: 7617-7619
30 Lin M, Shapiro MJ, Wareing JR (1997) Screening mixtures by affinity NMR. J Org Chem 62: 8930-8931
31 Lin M, Shapiro MJ, Wareing JR (1997) Diffusion-edited NMR-affinity NMR for direct observation of molecular interactions. J Am Chem Soc 119: 5249-5250
32 Anderson RC, Lin M, Shapiro MJ (1999) Affinity NMR: Decoding DNA binding. J Comb Chem 1: 69-72
33 Bleicher K, Lin M, Shapiro MJ et al (1998) Diffusion edited NMR: Screening compound mixtures by affinity NMR to detect binding ligands to vancomycin. J Org Chem 63: 8486-8490
34 Mayer M, Meyer B (1999) Characterization of ligand binding by saturation transfer difference NMR spectra. Angew Chem Int Ed Engl 38: 1784-1788
35 Klein J, Meinecke R, Mayer M et al (1999) Detecting binding affinity to immobilized receptor proteins in compound libraries by HR-MAS STD NMR. J Am Chem Soc 121: 5336-5337
36 Chen A, Shapiro MJ (1998) NOE pumping: A novel NMR technique for identification of compounds with binding affinity to macromolecules. J Am Chem Soc 120: 10258-10259
37 Shapiro MJ, Chen A (2000) NOE pumping. 2. A high-throughput method to determine compounds with binding affinity to macromolecules by NMR. J Am Chem Soc 122: 414-415
38 Vogtherr M, Peters T (2000) Application of NMR based binding assays to identify key hydroxy groups for intermolecular recognition. J Am Chem Soc 122: 6093-6099
39 Bemis GW, Murcko MA (1996) The properties of known drugs. 1. Molecular frameworks. J Med Chem 39: 2887-2893
40 Bemis GW, Murcko MA (1999) Properties of known drugs. 2. Side chains. J Med Chem 42: 5095-5099
41 Fejzo J, Lepre CA, Peng JW et al (1999) The SHAPES strategy: An NMR-based approach for lead generation in drug discovery. Chem Biol 6: 755-769
10 Structure-based design of combinatorial libraries
10.1 Introduction
10.2 3D database searching in lead discovery
10.3 3D database searching in combinatorial library design
10.4 Construction and filtering of virtual libraries
10.4.1 Enumeration
10.4.2 Filtering of virtual libraries
10.5 Design of directed libraries
10.5.1 SAR-driven design
10.5.2 Protein-structure-driven design
10.6 Conclusions
10.7 Acknowledgements
10.8 References
10.1 Introduction
During the 1980s, modern drug discovery was dominated by the idea of "struc-
ture-based drug design." The promise was, given an x-ray structure of the protein
target, one could design the ideal protein ligand in a small number of iterations. If
our computational methods could predict accurately which ligands would bind a
protein best, this approach would be wildly successful; however, our methods are
not that accurate, and in the end many experimental iterations have been required
(for a review, see [1]). It was during this era that a small number of groups began
experimenting with novel approaches to discovering leads in silico, now generi-
cally called "3D database searching." Here, either a pharmacophore or a protein
pocket was used to sift through a database of conformers of molecules, and those
molecules which met the criteria were submitted to a biological assay (for a
review, see [2]).
In the late 1980s and early 1990s, combinatorial chemistry emerged, with the
promise of making billions of molecules (see also Chapter 6). The pendulum had
swung away from the notion of designing the one perfect molecule and had moved
to the antipodal position: make billions of molecules, screen them all to look for
the hot molecule (see Chapter 4). Although the initial efforts in combi-chem relied
on peptides [3], it was not until the first organic libraries were prepared by Bunin
and Ellman [4] and DeWitt et al. at Parke-Davis [5] that this became a technolo-
gy enthusiastically embraced by the pharmaceutical industry.
10.2 3D database searching in lead discovery

Our own recent experiences have focused less on the application of 3D database searching in lead discovery, and more on its use in combinatorial library design. Nonetheless, it is worthwhile to examine the experiences of other groups that have been reported in the literature on 3D database searching in lead discovery. These applications were last reviewed comprehensively in 1995 [2]. Since that review, a number of new applications of 3D database searching to lead discovery have appeared, and the pace of such publications is accelerating. In the following, we do not aspire to make a comprehensive review of that literature; the reports highlighted here are a subset of the literature of pharmacophoric 3D database
searching which has appeared since that 1995 review. For a review on "virtual
screening," see Walters et al. [23] and Shoichet and Bussiere [8].
G. W. Milne's laboratory at the NCI has been exceptionally prolific in the application of pharmacophoric 3D database searching to lead discovery, and unusually methodical in probing the assumptions underlying the use of these methods. Subsequent to their work already cited in the 1995 review, they have reported on the discovery of novel HIV-1 protease inhibitors [9]. A pharmacophore was derived from the x-ray co-crystal structures, containing three features (two hydroxyls, one carbonyl, with defined geometric constraints). They performed a search of the open NCI database (206,000 compounds), retrieving 2,400 hits, which were further filtered by novelty and the presence of an additional hydrophobic feature, to a final 50 compounds submitted to biological testing. The two best hits were shown to possess sub-µM activity against the enzyme; one showed antiviral activity in a cell-based assay at 12 µM.
Milne's laboratory also reported on the discovery of HIV-1 integrase inhibitors by 3D database searching [10]. Using known HIV-1 integrase inhibitors, a pharmacophore was developed, containing three features (chosen from two hydroxyls and two carbonyl oxygens). Searching the open NCI database, 340 hits were found. This led to 10 novel, structurally distinct HIV-1 integrase inhibitors, four of which were active below 30 µM.
A second report from Milne's laboratory on the discovery of HIV-1 integrase inhibitors via pharmacophoric 3D database searching described a different tack [11]. Natural product inhibitors of this enzyme were used to construct two three-feature pharmacophores, each containing two oxygens and one hydroxyl with differing geometric constraints. These were both used to search the open NCI database, yielding 800 unique hits from the two pharmacophore searches combined. Twenty-seven hits showed inhibition better than 100 µM.
Kaminski et al. [12] discovered novel farnesyl protein transferase (FPT) inhibitors via 3D database searching. Using known FPT inhibitors, a five-feature pharmacophore was constructed in an automated fashion using the Hypothesis Generation module of Catalyst [13]. Four of these five features were "hydrophobic" features, the fifth an H-bond acceptor, exhibiting what many have argued is a flaw in the Hypothesis Generation procedure: a tendency to exaggerate the contribution of hydrophobic features in a pharmacophore. Nonetheless, despite this non-selective, "greasy" pharmacophore, Kaminski et al. identified 718 hits in the Schering-Plough corporate database, five of which showed in vitro FPT inhibition better than 5 µM. This led to a novel series of dihydrobenzothiophene FPT inhibitors.
Llorens et al. [14] describe the discovery of amygdalin as a CD4 receptor ligand via pharmacophoric 3D database searching. The structure-activity relationship (SAR) of analogs of peptide T led them to propose a four-feature pharmacophore for the CD4-peptide T interaction (a phenyl, an amide, a carbonyl, and a hydroxyl in a specific geometric arrangement). Searching the NCI, Derwent, Maybridge, and Biobyte databases, they retrieved an unspecified number of hits, one of which was amygdalin, which was found to bind to CD4 with a sub-µM IC50.
Marriott et al. [15] describe the discovery of novel muscarinic (M3) antagonists via pharmacophoric 3D database searching. Using known SAR against the M3 receptor, they developed a four-feature pharmacophore (two H-bond donors, one H-bond acceptor, and a basic, tertiary amine, in a specific geometric arrangement) using the automated procedure of DISCO [16]. With this pharmacophore, they performed a 3D search against their corporate database. They retrieved 177 hits, three of which were found to have potent M3 antagonist activity, many structurally distinct from known M3 antagonists.
Kiyama et al. [17] report the discovery of novel angiotensin II receptor antagonists, by finding replacements for the biphenyltetrazole moiety of DuP 753. Using a three-feature pharmacophore (two aryl groups, and the third group one of: carboxyl, ester, amide, or tetrazole, in a specific geometric arrangement), they performed a search of the MDDR database of 94,000 compounds, retrieving 139 hits, which ultimately led to a series of novel AII antagonists containing a tricyclic dibenzocycloheptene system, with affinities in the range 0.29-12 nM.
One of the most ingenious applications of pharmacophoric 3D database searching was described by Greenidge et al. [18]. Beginning with the x-ray co-crystal structure of the thyroid hormone bound to its receptor, they define a pharmacophore consisting of seven features (five hydrophobic features, two H-bond acceptors) augmented critically by ~100 exclusion spheres to represent the steric boundaries of the active site. Searching the Maybridge database of 47,000 compounds, this pharmacophore retrieved only one hit, a compound whose IC50 against the thyroid hormone receptor was 69 µM (by contrast, the pharmacophore devoid of exclusion constraints retrieved four hits). Those authors appear to express surprise that the Catalyst 3D database searching software was capable of employing numerous exclusion spheres. It should be noted that this software originally had been designed with such a capability, based on the experiences with ALADDIN [19], which similarly allowed the description of steric constraints with an unlimited number of exclusion spheres.
An impressive contribution was made by a group at the Georgetown Institute for Cognitive and Computational Science. Wang et al. [20] report on the discovery of a novel dopamine transporter inhibitor via 3D database searching. Their pharmacophore contained three features (phenyl, carbonyl oxygen, secondary or tertiary basic amine, in a specific geometric arrangement). Using this pharmacophore to search the NCI database, they retrieved 4,000 hits, which were filtered heuristically to 385 hits, one of which was used as a lead for further optimization, ultimately leading to a compound whose binding affinity was 11 nM.
To our knowledge, the non plus ultra of lead discovery via 3D database searching in the 1990s was the little-noticed contribution of Leysen and Kelder in the discovery of Org-8484, an exceptionally selective 5-HT2c agonist [21]. Beginning with the known structure-activity relationship of various serotonin (5-HT) ligands, they first report the result of 2D substructure searching, next the result of applying the principles of bioisosterism, and finally the result of pharmacophoric 3D database searching. Their 3D pharmacophore was quite simple: two features (phenyl ring, basic amine), with geometric constraints based on the putative recep-
tor site point with which the amine interacts. Using this pharmacophore to search
their corporate database, one hit was especially noteworthy: Org-8484, showing
6 nM activity against 5-HT2c and 1600 nM against 5-HT2a. This selectivity for
5-HT2c over 5-HT2a is unparalleled in the serotonin literature.
10.3 3D database searching in combinatorial library design

Our own recent interest has focused on the use of 3D database searching and related techniques in combinatorial library design. The process we follow proceeds in three stages:
Stage I is the domain of the synthesis chemist. A scaffold is chosen, usually
based on its prior success at leading to active molecules against the given target;
this scaffold is frequently found via screening. The chemist chooses the scheme
for performing combi-chem on that scaffold, which leads to the criteria by which
reagents can be selected from reagent libraries. For example, the scheme selected
by Kick and Ellman for their cathepsin D library ([22], see Fig. 1) led to the iden-
tification of 700 amines, and 1900 acylating agents which could be used to con-
struct molecules.
Figure 1. Combinatorial library scheme of Kick, Roe et al. used in the design of cathepsin D inhibitors.
10.4 Construction and filtering of virtual libraries

10.4.1 Enumeration
Given a scaffold, a description of the types of chemistry that can be performed combinatorially, and sets of reagents for each point of elaboration, enumeration is the computational process of constructing each individual molecule that could be made with that library. Broadly, one may characterize the methods for enumeration into two classes: "chemical" enumeration, in which the rules of synthetic chemistry are considered as the reagents are attached to the scaffold; and "plastic model" enumeration, in which molecules are put together as one puts plastic models together. Both can be directed to yield the same result. Chemical enumeration tends to be easier to use, especially by the synthesis chemist, while plastic-model enumeration tends to be much faster, and better suited for the generation of libraries in excess of one million compounds.
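Neither Afferent nor MOE is reproduced here; the following Python sketch, built on the open-source RDKit toolkit, merely illustrates the "chemical" flavor of enumeration for a single amide-coupling step (the reaction SMARTS and the tiny reagent lists are illustrative assumptions, not the chapter's actual chemistry):

    from rdkit import Chem
    from rdkit.Chem import AllChem

    # Amide coupling written as reaction SMARTS:
    # acid + primary/secondary amine -> amide.
    amide_coupling = AllChem.ReactionFromSmarts(
        "[C:1](=[O:2])[OX2H1].[NX3;H2,H1:3]>>[C:1](=[O:2])[N:3]")

    acids = [Chem.MolFromSmiles(s) for s in ("OC(=O)c1ccccc1", "OC(=O)CC1CC1")]
    amines = [Chem.MolFromSmiles(s) for s in ("NCc1ccncc1", "NC1CCCC1")]

    virtual_library = set()
    for acid in acids:
        for amine in amines:
            for products in amide_coupling.RunReactants((acid, amine)):
                product = products[0]
                Chem.SanitizeMol(product)                       # valence/aromaticity check
                virtual_library.add(Chem.MolToSmiles(product))  # canonical SMILES de-duplicates

    print(len(virtual_library), "enumerated products")

A plastic-model enumerator would instead splice reagent fragments onto the attachment points of the scaffold without consulting reaction rules, which is why it scales more readily to multi-million-member libraries.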
For chemical enumeration, we rely on the PC-based Afferent software [30].
One simply specifies the scaffold, the chemistry, and provides the reagent lists as
.sd files, and Afferent will construct the virtual library. It typically takes about 1 h to construct a 10,000-molecule library. One feature of Afferent that stands out is
its ability to handle multi-center reactions, such as a Diels-Alder reaction (though
these tend not to be common in combi-chem). Afferent can write the output either
to an .sd file, or to Oracle tables.
For plastic-model enumeration, we rely on MOE (Molecular Operating
Environment) [31]. It can enumerate a 200,000-molecule library in 1 h. It writes
the output as an .sd file. These capabilities are comparable to most molecular
modelling packages, which now routinely include plastic-model enumeration
capabilities. As shall be seen, MOE allows ready customization, and interfaces
easily with other components of our multi-vendor software environment.
10.4.2 Filtering of virtual libraries

In our setting, we have two filtering steps for virtual libraries. The first filtering step, PUREOS-1, allows one to reject reagents with objectionable properties. The second filtering step, PUREOS-2, examines the whole molecule, and allows one to reject products with objectionable properties.
PUREOS-1 can filter SD files generated from the ACD, CRCD, or other chemical reagent databases. Twenty-one filters are contained in this program, including the number of special atoms (Cl, Br, I, metals), the number of functional groups (NH2, NO2, COCl, CONH2, SO2NH2, double bonds, triple bonds, esters), and physical properties such as molecular weight and log P(o/w). Each filter represents a particular functional group or physical-chemical property of a chemical compound. Users can select a set of filters for a particular project and define the range for each filter. Using PUREOS-1, chemists can eliminate reagents containing unwanted functional groups or with unfavorable physical-chemical properties before actual
library enumeration. PUREOS-1 has been found to be a very useful tool in the design of a library of inhibitors against a kinase of therapeutic interest. In that case, using PUREOS-1, 7,000 out of 9,000 amines from the ACD were eliminated.
PUREOS-2 is a program developed to screen virtual combinatorial libraries according to Lipinski's "rule of five" and similar rules. It filters a virtual library to eliminate virtual compounds with undesirable predicted physico-chemical properties. The current filters for PUREOS-2 include molecular weight (Weight), hydrophobicity (log P(o/w)), number of rotatable bonds (n_rotatable), number of hydrogen bond donors (HB_don), number of hydrogen bond acceptors (HB_acc), and molecular polar surface area. In the design of several combinatorial libraries, it has been found that nearly 50% of the virtual library compounds can be eliminated through PUREOS-2.
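PUREOS-2 itself is an in-house MOE implementation and is not shown here; purely as a sketch of a product filter of this kind, the following Python/RDKit fragment applies rule-of-five-style cutoffs (the thresholds and the file name are illustrative assumptions):

    from rdkit import Chem
    from rdkit.Chem import Crippen, Descriptors, Lipinski, rdMolDescriptors

    def passes_filters(mol):
        """Rule-of-five-style whole-molecule filter, analogous in spirit to PUREOS-2."""
        return (Descriptors.MolWt(mol) <= 500
                and Crippen.MolLogP(mol) <= 5
                and Lipinski.NumHDonors(mol) <= 5
                and Lipinski.NumHAcceptors(mol) <= 10
                and Descriptors.NumRotatableBonds(mol) <= 10
                and rdMolDescriptors.CalcTPSA(mol) <= 140)   # polar surface area

    # Read an enumerated virtual library from an SD file and keep only compounds
    # with acceptable predicted physico-chemical properties.
    supplier = Chem.SDMolSupplier("virtual_library.sdf")
    kept = [mol for mol in supplier if mol is not None and passes_filters(mol)]
    print(len(kept), "compounds survive the filter")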
Both PUREOS-1 and PUREOS-2 were implemented under the Molecular
Operating Environment, MOE [31]. This implementation is quite flexible, making
it easy to expand the types of constraints one wishes to apply at each step.
Both PUREOS-1 and PUREOS-2 are fast, typically processing 100,000 molecules per hour.
10.5 Design of directed libraries

10.5.1 SAR-driven design

In cases where no x-ray structure is available, one can still develop empirical models based on the structure-activity relationship (SAR), and use these models to direct the library design. One may term this process SAR-driven design. We describe here three methods for this type of design: 1) using 2D molecular similarity, 2) using "binary QSAR," and 3) using pharmacophore discovery and pharmacophoric 3D database searching.
By 2D similarity
The simplest method for SAR-driven design is borrowed from our experiences in processing the hits from high-throughput screening. There, one commonly performs similarity searches on each hit and submits those similar molecules to biological testing. Analogously, in a combi-chem setting, one may construct a large virtual library, synthesize a small subset, and submit those for biological testing. Similarity searches can be performed on the virtual library against the actives, and new combinatorial libraries can be built focused on these presumed "islands of activity" (see Fig. 2). This is the quickest and easiest method for SAR-driven design. Such similarity searches on 100,000-compound virtual libraries typically take less than 1 min in Pharmacia's Cousin database system (which runs on a multiprocessor P6 computer). Cousin's similarity metric is a modified Tanimoto coefficient (see references in [26]). The disadvantage of this method is that the design is based solely on 2D structural similarity, and focuses the libraries narrowly on a small number of templates and a small number of reagents.
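Cousin and its modified Tanimoto metric are proprietary; the sketch below conveys the generic idea using standard Morgan fingerprints and the plain Tanimoto coefficient in Python/RDKit (the compounds and the 0.7 cutoff are illustrative assumptions):

    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    def fingerprint(smiles):
        # 2D circular (Morgan) fingerprint standing in for Cousin's 2D descriptors
        return AllChem.GetMorganFingerprintAsBitVect(
            Chem.MolFromSmiles(smiles), radius=2, nBits=2048)

    actives = [fingerprint("CC(=O)Nc1ccc(O)cc1")]                # known active(s)
    virtual_library = ["O=C(Nc1ccc(O)cc1)Cc1ccccc1", "CCCCCO"]   # enumerated products

    # Keep library members that fall inside a Tanimoto "island" around any active.
    hits = []
    for smiles in virtual_library:
        fp = fingerprint(smiles)
        best = max(DataStructs.TanimotoSimilarity(fp, a) for a in actives)
        if best >= 0.7:   # the cutoff is an assumption, tuned per project
            hits.append((smiles, round(best, 2)))
    print(hits)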
Figure 2. Schematic representation of the simplest "structure-based" method for library design, in which
molecules are selected based on the 2D similarity of their structure to known actives.
Using B-QSAR
Binary QSAR is a QSAR-like method developed by Paul Labute which correlates
structures, using molecular descriptors, with a binary expression of biological
activity (i.e., 1 for active and 0 for inactive) and calculates a probability distribu-
tion for active and inactive compounds in a training set [32]. This binary QSAR
model then can be used to predict the probability for a new compound to be active.
It is distinct from the typical Hansch-style QSAR, in that a regression line is not
computed; B-QSAR relies strictly on Bayesian inference.
Binary QSAR estimates the probability density Pr(Y = 1 | X = x) from a training set with biological activity Y (0 for inactive and 1 for active) and molecular descriptors X. A principal components analysis (PCA) is conducted on the training set to calculate a p by n linear transform, Q, and an n-vector, u, such that the random p-vector Z = Q(X − u) has zero mean and variance equal to the p by p identity matrix, where p is called the number of principal components. The original molecular descriptors, X, are transformed by Q and u to obtain a decorrelated and normalized set of descriptors. The probability density is then estimated by Bayes' theorem, assuming that the transformed descriptors are independent:
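Under these assumptions the estimate takes the standard naive-Bayes form:

\[ \Pr(Y=1 \mid Z=z) \;=\; \frac{\Pr(Y=1)\,\prod_{i=1}^{p}\Pr(Z_i=z_i \mid Y=1)}{\Pr(Y=1)\,\prod_{i=1}^{p}\Pr(Z_i=z_i \mid Y=1) \;+\; \Pr(Y=0)\,\prod_{i=1}^{p}\Pr(Z_i=z_i \mid Y=0)}, \]

where the one-dimensional densities Pr(Z_i = z_i | Y) are estimated from the training set.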
By pharmacophore discovery

DANTE begins by perceiving pharmacophoric features in the active molecules, such as H-bond donor, H-bond acceptor, hydrophobe, aromatic ring, and positive and negative charge. Following the method of Mayer et al. [39],
DANTE identifies candidate pharmacophores as those sets of geometric arrange-
ments of features common to most or all of the actives. Unique to DANTE, these
candidate pharmacophores are ranked by their selectivity, a mathematical expres-
sion reflecting the probability that such a pattern could emerge at random. Finally,
also unique to DANTE, all molecules are superimposed using the most selective
pharmacophore, and the "shrink-wrap" procedure is applied to infer the steric
boundaries of the binding site.
Figure 3 depicts a typical DANTE pharmacophore, one for benzodiazepine
CCK antagonists based on the data of Evans et al. [40]. Four chemical features are
used to superimpose all conformers shown, constraining their conformation
according to the geometric constraints attached to each feature. In addition,
unique to DANTE is the presence of a full description of the steric constraints
associated with that SAR: the core of the pharmacophore is surrounded in all
directions by opaque patches of surface symbolizing a putative steric boundary on
the receptor.
Figure 4 shows how a benzodiazepine combinatorial library could be designed
with such a pharmacophore. Once a 3D database has been constructed from the
virtual library, that database can be searched using that pharmacophore, with those
reagents leading to steric clashes being rejected during the search. The final,
designed library consists of those molecules which were hits during this 3D data-
base search.
Even more important than avoiding steric clashes is the notion of exploring terra incognita. A common misperception in the modelling literature is that the
Figure 3. "Shrink-wrap" representation of the inferred binding site for CCK antagonists based on the data of Evans et al., constructed using the DANTE software.
Figure 4. How the shrink-wrap surface may be used in the design of a benzodiazepine combinatorial
library.
10.5.2 Protein-structure-driven design

In those cases where the directed combi-chem library is against a target for which we have a macromolecular structure, we have the greatest potential for designing libraries which have a high proportion of actives. Initially, we began by using 3D database searching, in a manner similar to what was described in the previous section. Those experiences were not satisfying, initiating a quest for the optimal protein-structure-driven combinatorial library design tool. This may be a Holy Grail. We describe here our experiences thus far with three software tools: CombiBUILD, Omega, and SLIDE.
CombiBUILD
CombiBUILD is software developed by Diana Roe in the Kuntz lab at UCSF [41],
an outgrowth of earlier work on a de novo design program BUILDER [42].
CombiBUILD is the software used in the most remarkable published success to
date in protein-structure-based combi-chem, the design of cathepsin D inhibitors
[43]. In this completely prospective study, synthesis chemist Ellen Kick in Jon
Ellman's lab at UC Berkeley devised the synthetic combinatorial scheme shown
in Figure 1. Facing the prospect of more than 3 × 10^9 molecules, she approached
the Kuntz group for assistance in designing a smaller library. In response, Diana
Roe in the Kuntz group developed CombiBUILD. They chose 1000 molecules for
synthesis that CombiBUILD had deemed to optimally complement the active site
of cathepsin D, whose x-ray structure had at that time recently been published
[44]. Of those 1000 molecules designed by CombiBUILD, 67 had sub-flM activ-
ity, 23 had <330 nM activity, and seven had < 100 nM activity. This stimulated
another round of combi-chem, after which four molecules emerged with <20 nM
activity.
CombiBUILD functions by performing its own enumeration and performing
nearly-exhaustive, recursive conformational exploration in the active site, with the
initial placement of the molecule determined by overlaying the scaffold to user-
defined points. (By contrast, CombiDOCK, also from the Kuntz group [45], docks
each molecule to determine the initial binding geometry). Each torsional angle is
rotated through a set of predefined ideal values based on the nature of the bond,
and rigid body minimization is performed in the active site for each conformation.
Special bump grids are employed to prune the conformational space. The inter-
molecular energy, as determined by a simplified AMBER force-field, is used to
compute a score, by which each member of the library is ranked.
We have yet to duplicate the stunning success of CombiBUILD against a target
of interest to us, though it has generated libraries superior to those constructed by other means.
Omega
Omega (Optimized Molecular Ensemble Generation Application) is software
recently made available by Matt Stahl at OpenEye Software [46] for performing
exhaustive, unconstrained conformational search, and for performing active-site-
constrained conformational search. Though the approach is unpublished, Omega
has an extensive lineage, as both an updated version of the WIZARD conforma-
tional analysis software [47,48] and as an updated version of the MONGOOSE
protein-structure-based combinatorial library design software [23]. Verbal
accounts of MONGOOSE and its successor Skizmo indicate that it has been used
with success in protein-structure-based combinatorial library design at Vertex.
Omega has the ability to do gas-phase and restricted active-site searches. The
gas-phase ensemble generation uses a deterministic breadth-first, best-first torsion
search of conformational space, using an energy cutoff to exclude high-energy
branches of the search tree. Discrete torsion values (defined by user-modifiable
rules) are used; there is no variation of bond lengths, angles, or ring torsions. The
restricted active site searching (restricted docking) uses an initial guess (provided
by the user) about the orientation of a common ligand substructure in the active
site to orient the gas phase search results, which are then evaluated using a mod-
ified version of the scoring function of Eldridge et al. [49], with the position of the
top-scoring conformations being optimized.
Gas-phase conformational searching generates structures of lower quality than would be generated with a typical all-degree-of-freedom energy minimization procedure, but with that tradeoff is able to perform full conformational searches very
quickly. Full searches on drug-like molecules typically take a few tenths of a sec-
ond; highly flexible molecules may take up to a few seconds. Molecules with
extremely high flexibility (e.g., > 17 rotatable bonds) are typically excluded, as the
results would likely be meaningless in any case, but Omega prunes the search and
limits the total number of conformations being considered (to typically 200,000 at
any given time). The quality of the generated structures is dependent on the accu-
racy of the rules used to generate rotamers, and also depends on the quality of the
initial 3D structure provided as input (which defines bond lengths, angles, and
ring geometries). The torsion rules are contained in a customizable file, with sub-
structures defined as SMARTS strings followed by the torsion values to be sam-
pled. The ability to generate full conformational models for thousands of com-
pounds per hour allows small virtual libraries to be processed interactively, and
large libraries overnight. Also, since Omega is starting with fully-enumerated
libraries, and all structures are evaluated independently, the search can be readily
split over multiple processors.
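For illustration only (Omega's actual file syntax is not reproduced here), such a rule might pair a SMARTS pattern with the angles to sample, e.g., a line of the form [!#1][CX4]!@[CX4][!#1] 60 180 300, instructing the search to try the two gauche rotamers and the anti rotamer about an acyclic sp3-sp3 bond.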
The active-site search constitutes a restricted flexible docking method, the
restriction being the requirement that the user have a hypothesis about the binding
mode of the compounds in question, and that they share some common substruc-
ture which can be used to establish an initial approximate location within the pro-
tein active site. The active-site search is also very fast, nearly as fast as the gas-
phase search alone. In contrast to CombiBUILD, Omega does not enumerate the
virtual library in the course of doing a search through conformational/chemical
space. Virtual libraries are enumerated externally and an initial 3D structure is
generated as is the case for a gas-phase search. Each compound to be docked first
undergoes a gas-phase conformational search; the resultant low-energy structures
are then oriented into the active site and scored following a rigid-body geometri-
cal optimization against the intermolecular scoring function. The best n structures
are then written out. If the scoring function is to be used directly, a single confor-
mation/compound suffices, but if further processing (e.g., energy minimization,
scoring by another method, etc.) is envisioned then more structures can/should be
written. Each structure written out has both an intermolecular pseudoenergy from
the scoring function and an intramolecular strain energy from the gas-phase
search. No attempt is made to combine the two energies, as it would be unclear
how to properly scale them. The lack of a true minimum-energy docked con-
former is ameliorated by the rigid-body optimization and rather soft potentials of
the scoring function, with the result that good-scoring conformations are general-
ly quite reasonable. Searches on ligands from crystal structures will typically find
a docked conformation which is very similar to the crystallographic result, but it
will not necessarily be the best-scoring conformation. Due to the limitations on
both the conformational generation (discrete, unminimized conformations) and
docking (rigid body optimization, scoring function limitations), as well as the req-
uisite assumption about docking mode, differences of a few kcal/mole are not
likely to be distinguished.
Overall, our impression of Omega is incomplete, but at this point it appears to
be fast (4,000 structures/h) and uses as input a virtual library, which is convenient
in our setting. Our experiences lead us to be concerned about scoring functions
here as well.
SLIDE
Mark Twain, speaking about the weather, once complained "Everybody talks
about it, but nobody does anything about it." The same has been true about the
issue of the flexibility of the protein in performing protein-structure-based ligand
design. We have seen the importance of this on numerous occasions in a variety
of protein-structure-based design projects. In one example, the chemists had synthesized a novel, potent compound, and we could not determine how it fit into
the protein based on the x-ray structure we had. Once this compound appeared in
a co-crystal structure, we saw that the protein had flexed to accommodate this lig-
and, and with this x-ray structure it was easy for any computational docking pro-
cedure to place the ligand in the proper binding orientation. DeGrado and col-
leagues have reported similar observations [50]. There are many tools to deal with
the conformational flexibility of the ligand (e.g., CombiBUILD, Omega, FlexX
[51]), while none deal with the more difficult problem of dealing with the confor-
mational flexibility of the protein.
Until recently, none did. SLIDE (Screening for Ligands by Induced-Fit Docking) is a
program developed by Volker Schnecke and Leslie Kuhn at Michigan State
University that is designed for searching large libraries to discover ligands best
complementing a protein active site [52-54]. While we are still preparing to apply
this software to the design of combinatorial libraries, it is already apparent from
their published data that it should have the speed and other attributes that we need
as a tool for protein-structure-based combinatorial library design.
The binding site of the protein is represented by surface residues, water mole-
cules, and a set of favorable interaction points above the protein surface. SLIDE
docks ligands into the site using geometric hashing techniques to evaluate shape
and chemical complementarity. Induced flexibility of the protein side chains and
ligand are modeled with equal status, applying directed minimal rotations to the
rotatable single bonds in the interface. Lowest-cost conformational changes in
both ligand and protein that generate a shape-complementary surface guide the
search. The backbone of the protein is held fixed; only side-chain induced-fit is
considered.
Astonishingly, they report execution times on libraries in the range of
50,000-100,000 molecules ranging from under 1 h to 24 h. As we are just arrang-
ing to put this software to the test in our own environment, we do not yet have any
independent experience to report on the use of SLIDE applied to combinatorial
library design. Nonetheless, their reported experience, and the published algorith-
mic details, leave us sanguine about its potential.
10.6 Conclusions
10.7 Acknowledgements
Unless otherwise noted, all timings were performed on a single SGI R10K proces-
sor. As all the operations described here are highly-parallelizable, and readily
spread across multiple processors, the actual elapsed time is much smaller.
10.8 References
1 Erickson JW, Fesik SW (1992) Macromolecular x-ray crystallography and NMR as tools for structure-
based drug design. Annu Rep Med Chern 27: 271-289
2 van Drie JH (1995) 3D database searching in drug discovery:
http://www.netsci.org/Science/Cheminformlfeature06.html
3 Furka A, Sebestyen F, Asgedom M, Dibo G (1991) General method for rapid synthesis of multicom-
ponent peptide mixtures. Int J Peptide Protein Res 37: 487-493
4 Bunin BA, Ellman JA (1992) A general and expedient method for the solid-phase synthesis of 1,4-benzodiazepine derivatives. J Am Chem Soc 114: 10997-10998
5 DeWitt SH, Kiely JS, Stankovic CJ et al (1993) "Diversomers": an approach to nonpeptide, nono-
ligomeric chemical diversity. Proc Natl Acad Sci USA 90: 6909-6913
6 Martin EJ, Blaney JM, Siani MA et al (1995) Measuring diversity: experimental design of combinatorial libraries for drug discovery. J Med Chem 38: 1431-1436
7 Van Drie JH, Nugent RA (1997) Addressing the challenges posed by combinatorial chemistry: 3D databases, pharmacophore recognition and beyond. SAR QSAR Environ Res 9: 1-21
8 Shoichet BK, Bussiere DE (2000) The role of macromolecular crystallography and structure for drug
discovery: Advances and caveats. Curr Opin Drug Disc Dev 3: 408-422
9 Wang S, Milne GW, Yan X et al (1996) Discovery of novel, non-peptide HIV-1 protease inhibitors by pharmacophore searching. J Med Chem 39: 2047-2054
10 Hong H, Neamati N, Wang S et al (1997) Discovery of HIV-1 integrase inhibitors by pharmacophore searching. J Med Chem 40: 930-936
11 Neamati N, Hong H, Mazumder A et al (1997) Depsides and depsidones as inhibitors of HIV-1 integrase: discovery of novel inhibitors through 3D database searching. J Med Chem 40: 942-951
12 Kaminski JJ, Rane DF, Snow ME et al (1997) Identification of novel farnesyl protein transferase inhibitors using three-dimensional database searching methods. J Med Chem 40: 4103-4112
13 Anonymous (1993) Hypotheses in Catalyst. This is the sole documentation on the theoretical and
mathematical basis of Hypothesis Generation, privately distributed by BioCAD, unpublished
14 Llorens O, Filizola M, Spisani S et al (1998) Amygdalin binds to the CD4 receptor as suggested from molecular modeling studies. Bioorg Med Chem Lett 8: 781-786
15 Marriott DP, Dougall IG, Meghani P et al (1999) Lead generation using pharmacophore mapping and three-dimensional database searching: application to muscarinic M(3) receptor antagonists. J Med Chem 42: 3210-3216
16 Martin YC, Bures MG, Danaher EA et al (1993) A fast new approach to pharmacophore mapping and its application to dopaminergic and benzodiazepine agonists. J Comput Aided Mol Des 7: 83-102
17 Kiyama R, Honma T, Hayashi K et al (1995) Novel angiotensin II receptor antagonists. Design, synthesis, and in vitro evaluation of dibenzo[a,d]cycloheptene and dibenzo[b,f]oxepin derivatives. Searching for bioisosteres of biphenylyltetrazole using a three-dimensional search technique. J Med Chem 38: 2728-2741
18 Greenidge PA, Carlsson B, Bladh LG et al (1998) Pharmacophores incorporating numerous excluded volumes defined by X-ray crystallographic structure in three-dimensional database searching: application to the thyroid hormone receptor. J Med Chem 41: 2503-2512
19 Van Drie JH, Weininger D, Martin YC (1989) ALADDIN: an integrated tool for computer-assisted molecular design and pharmacophore recognition from geometric, steric, and substructure searching of three-dimensional molecular structures. J Comput Aided Mol Des 3: 230-255
20 Wang S, Sakamuri S, Enyedy IJ et al (2000) Discovery of a novel dopamine transporter inhibitor, 4-hydroxy-1-methyl-4-(4-methylphenyl)-3-piperidyl 4-methylphenyl ketone, as a potential cocaine antagonist through 3D-database pharmacophore searching. Molecular modeling, structure-activity relationships, and behavioral pharmacological studies. J Med Chem 43: 351-360
21 Leysen D, Kelder J (1998) Ligands for the 5-HT2c receptor as potential antidepressants and anxiolytics. In: v.d. Goot H (ed): Trends in Drug Research II. Elsevier, 49-60
22 Kick EK, Ellman JA (1995) Expedient method for the solid-phase synthesis of aspartic acid protease inhibitors directed toward the generation of libraries. J Med Chem 38: 1427-1430
23 Walters PW, Stahl MT, Murcko MA (1998) Virtual screening: an overview. Drug Discovery Today 3: 160-178
24 Lipinski CA, Lombardo F, Dominy BW et al (1997) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev 23: 3-25
25 Hansch C, Leo A (1995) Exploring QSAR: Fundamentals and applications in chemistry and biology. ACS Publications, Washington DC; Hansch C, Bjorkroth JP, Leo A (1987) Hydrophobicity and central nervous system agents: On the principle of minimal hydrophobicity in drug design. J Pharm Sci 76: 663-687
26 Van Drie JH, Lajiness MS (1998) Approaches to virtual library design. Drug Discovery Today 3: 274-283
27 Lewell XQ, Judd DB, Watson SP et al (1998) RECAP-retrosynthetic combinatorial analysis procedure: a powerful new technique for identifying privileged molecular fragments with useful applications in combinatorial chemistry. J Chem Inf Comput Sci 38: 511-522
28 Ajay A, Walters WP, Murcko MA (1998) Can we learn to distinguish between "drug-like" and "nondrug-like" molecules? J Med Chem 41: 3314-3324
29 Sadowski J, Kubinyi H (1998) A scoring scheme for discriminating between drugs and nondrugs. J Med Chem 41: 3325-3329
30 Afferent Systems, see: http://www.afferent.com
31 Chemical Computing Group, see: http://www.chemcomp.com
32 Labute P (1999) Binary QSAR: A new method for the determination of quantitative structure-activity
relationships. In: Altman R.B, Dunker AK, Hunter L et al (eds): Pacific Symposium on Biocomputing
'99. World Scientific, New Jersey, 444-455
33 Gao H, Bajorath J (1999) Comparison of binary and 2D QSAR analyses using inhibitors of human carbonic anhydrase II as a test case. Mol Divers 4: 115-130
34 Lauri G, Lynch D, Brown RD: Application of B-QSAR to Pharmacopeia HTS data; unpublished work
35 Van Drie JH (1996) An inequality for 3D database searching and its use in evaluating the treatment of
conformational flexibility. J Comput Aided Mol Des 10: 623-630
36 Van Drie JH (1997) Strategies for the determination of pharmacophoric 3D database queries. J Comput
Aided Mol Des 11: 39-52
37 Van Drie JH (1997) "Shrink-wrap" surfaces: a new method for incorporating shape into pharmacophoric 3D database searching. J Chem Inf Comput Sci 37: 38-42
38 Catalyst is software originally developed at BioCAD (1990-1994), and is now marketed by MSI, see:
http://www.msi.com
39 Mayer D, Naylor CB, Motoc I et al (1987) A unique geometry of the active site of ACE consistent with
structure-activity studies. J Comput Aided Mol Des 1: 3-16
40 Evans BE, Rittle KE, Bock MG et al (1988) Methods for drug discovery: development of potent, selective, orally effective cholecystokinin antagonists. J Med Chem 31: 2235-2246
41 Roe DC, Kuntz ID (1995) BUILDER v.2: improving the chemistry of a de novo design strategy. J Comput Aided Mol Des 9: 269-282. (The CombiBUILD software is available from UCSF, after purchase of the DOCK 4.0 suite)
42 Lewis RA, Roe DC, Huang C et al (1992) Automated site-directed drug design using molecular lat-
tices. J Mol Graph 10: 66-78
43 Kick EK, Roe DC, Skillman AG et al (1997) Structure-based design and combinatorial chemistry yield low nanomolar inhibitors of cathepsin D. Chem Biol 4: 297-307
44 Baldwin ET, Bhat TN, Gulnik S et al (1993) Crystal structures of native and inhibited forms of human cathepsin D: implications for lysosomal targeting and drug design. Proc Natl Acad Sci USA 90: 6796-6800
45 Sun Y, Ewing TJ, Skillman AG et al (1998) CombiDOCK: structure-based combinatorial docking and library design. J Comput Aided Mol Des 12: 597-604
46 OpenEye, see: http://www.eyesopen.com
47 Dolata DP, Leach AR, Prout K (1987) WIZARD: AI in conformational analysis. J Comput Aided Mol Des 1: 73-85
48 Dolata DP, Walters WP (1993) MOUSE: a teachable program for learning in conformational analysis.
J Mol Graph 11: 106-111
49 Eldridge MD, Murray CW, Auton TR et al (1997) Empirical scoring functions I: The development of
a fast empirical scoring function to estimate the binding affinity of ligands in receptor-ligand com-
plexes. J Comput Aided Mol Des 11: 425-445
50 Rockwell A, Melden M, Copeland RA et al (1996) Complementarity of combinatorial chemistry and structure-based ligand design: Application to the discovery of novel inhibitors of matrix metalloproteinases. J Am Chem Soc 118: 10337-10338
51 Rarey M, Kramer B, Lengauer T et al (1996) A fast flexible docking method using an incremental construction algorithm. J Mol Biol 261: 470-489
52 Schnecke V, Swanson CA, Getzoff ED et al (1998) Screening a peptidyl database for potential ligands to proteins with side-chain flexibility. Proteins 33: 74-87
53 Schnecke V, Kuhn LA (1999) Database screening for HIV protease ligands: the influence of binding-site conformation and representation on ligand selectivity. In: Lengauer T, Schneider R, Bork P et al (eds): Proceedings ISMB 99, 7th International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, 242-251
54 Schnecke V, Kuhn LA (1999) Flexibly screening for molecules interacting with proteins. In: Thorpe MF, Duxbury PM (eds): Applications in Rigidity Theory. Plenum Publishing, New York, 385-400
11 3D QSAR in modern drug design

G.E. Kellogg and S.F. Semus

11.1 Introduction
11.2 Computational technologies
11.2.1 True 3D methods
11.2.2 Pseudo-3D methods
11.3 The 3D QSAR alignment problem and pharmacophore hypotheses
11.4 Applications of 3D QSAR
11.4.1 CoMFA of experimental HIV-I protease complexes
11.4.2 Arsenal of 3D QSAR methods
11.4.3 Recent success stories of 3D QSAR
11.5 Conclusions
11.6 Acknowledgements
11.7 References
11.1 Introduction
The belief that there is a direct relationship between chemical structure and bio-
logical activity of therapeutic agents is fundamental to the field of medicinal
chemistry. Indeed, the efforts of early medicinal chemists focused on well-defined
structural modifications to active lead compounds as molecules were developed
into drugs. The relationship between chemical properties like solvent partitioning
and biological activity was recognized over a century ago [1]. Almost 40 years
ago, Hansch, Fujita and co-workers invented the field of Quantitative Structure-
Activity Relationships (QSAR) [2]. In this approach whole molecule parameters
such as log Po/w (the partition coefficient for 1-octanol/water partitioning) [3],
molar refractivity, shape and topology indices [4], etc. for groups of related compounds are statistically correlated with measures of biological activity to obtain a
QSAR equation. This equation relates easily measured (or predicted/calculated) values for molecules to the more difficult-to-measure biological activities. Once a
QSAR is obtained, verified and found to be "predictive," the biological activity of
new chemical entities that may or may not yet exist can easily be predicted.
Numerous success stories from QSAR over the past four decades validate the fun-
damental relationship between structure and activity.
One frustrating aspect of traditional (2D) QSAR is the difficulty of "inverting"
QSARs into better drugs. That is, given an equation relating several whole-molecule parameters to activity, there is no direct route from the equation back to new, improved structures.

11.2 Computational technologies

11.2.1 True 3D methods

In grid-based 3D QSAR methods such as CoMFA [5], each pre-aligned molecule is instead characterized by its interactions with a probe (test) atom placed at regularly spaced grid points (Fig. 1), typically a steric (Lennard-Jones) field (1), an electrostatic (Coulomb) field (2) and, optionally, a hydrophobic field (3):

E_steric = Σ_i ε_it [(R_i + R_t)^12 / r^12 − 2 (R_i + R_t)^6 / r^6]   (1)

E_elec = Σ_i 332.17 Q_i Q_t / (D_it r)   (2)

H_hydrophobic = Σ_i a_i S_i f(r)   (3)
1 Definitions for symbols: ε_it is the van der Waals constant, r is the distance between the test atom (t) and
the molecular atom (i), R_i and R_t are van der Waals radii, 332.17 is a unit-conversion constant, Q_i and Q_t
are partial atomic charges, D_it is the value of the dielectric function for i and t, a_i is the hydrophobic atom
constant and S_i is the solvent-accessible surface area for i.
Figure 1. A molecule in a cage of test atoms at predefined grid points. The map value at each test atom is
the sum of all measured "interactions" between that test atom and all atoms in the molecule. The interac-
tions are measured or scored by the type of physical property being obtained, utilizing a functional rela-
tionship between atomic properties of the test and molecular atoms as varied by distance.
For CoMFA and related technologies, the "field" around the molecule is of more
importance than the actual structure of the molecule. In fact, grid points that are
located within the van der Waals surface of the molecule are truncated at predefined values for two significant reasons. First, this eliminates the computationally
unpleasant possibility of division by zero if a grid point coincidentally lands on an
atom (cf. Eqs. (1) and (2)). Second, and more important, these maps provide
3D profiles of their associated properties in a manner that may correspond
to receptor recognition. Clearly, macromolecular receptors are keyed more to molecular shape (steric field) and to electrostatic and lipophilic potentials than to
whole-molecule descriptors or to specific atoms or functional groups. Other field
types describing hydrogen bonding, electronic (molecular orbital) properties,
polarizability, topology etc. are possible [8-10].
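As a concrete illustration, here is a minimal Python/NumPy sketch of evaluating and truncating steric (6-12) and electrostatic (Coulomb) maps of the kind given in Eqs. (1) and (2) on a regular lattice (the two-atom "molecule", probe parameters and 30 kcal/mol cutoff are invented for illustration):

import numpy as np

# Hypothetical molecule: coordinates, partial charge q, vdW radius R per atom.
atoms = [
    {"xyz": np.array([0.0, 0.0, 0.0]), "q": -0.40, "R": 1.52},
    {"xyz": np.array([1.2, 0.0, 0.0]), "q": 0.40, "R": 1.70},
]
PROBE_R, PROBE_Q, EPS = 1.70, 1.0, 0.1   # carbon-like probe with unit charge
CUTOFF = 30.0                            # truncation value near/inside atoms

def fields_at(p):
    # Sum probe-atom interactions over all molecule atoms at one grid point.
    e_steric = e_elec = 0.0
    for a in atoms:
        r = max(float(np.linalg.norm(p - a["xyz"])), 0.5)  # guard exact overlap
        rm = a["R"] + PROBE_R                              # 6-12 minimum distance
        e_steric += EPS * ((rm / r) ** 12 - 2 * (rm / r) ** 6)
        e_elec += 332.17 * a["q"] * PROBE_Q / (r * r)      # dielectric D = r
    # Truncate: points on or inside the vdW surface would otherwise explode.
    return min(e_steric, CUTOFF), max(min(e_elec, CUTOFF), -CUTOFF)

# Sample a regular 2 A lattice of test-atom positions around the molecule.
grid = [np.array([x, y, z])
        for x in np.arange(-4.0, 6.0, 2.0)
        for y in np.arange(-4.0, 6.0, 2.0)
        for z in np.arange(-4.0, 6.0, 2.0)]
maps = np.array([fields_at(p) for p in grid])
print(maps.shape)  # (125, 2): one steric and one electrostatic value per point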
The HASL method of Doweyko [11] is similar. In it, each grid point within the
van der Waals surface of the molecule represents a triad of counters noting
hydrophobes, hydrogen bond donors and hydrogen bond acceptors.
In CoMSIA (Comparative Molecular Similarity Indices Analysis) [12] the
implementation of the fields differs from CoMFA, although the underlying
methodology is consistent. In CoMSIA five different similarity fields are calcu-
lated: steric, electrostatic, hydrophobic, hydrogen bond donor and hydrogen bond
acceptor. These fields were selected to cover the commonly accepted major con-
tributions to ligand binding. Similarity indices are calculated at regularly spaced
grid points for the pre-aligned molecules. A comparison of the relative shapes of
CoMSIA and CoMFA fields is shown below (Fig. 2) [6, 12]. For the distance
dependence between the probe atom and the molecule atoms a Gaussian function
is used.

Figure 2. Graph showing distance-dependent behavior of energy functions: Coulomb potential, Lennard-Jones potential and Gaussian (adapted from CoMSIA documentation, Tripos, Inc.).

Because of the different shape of the Gaussian function, the similarity
indices can be calculated at all grid points, both inside and outside the molecular
surface. One consequence of this alternative approach is the ability to display the
resultant fields directly onto the molecules being studied rather than (only) pro-
jecting them into the enveloping space.
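A sketch of such a Gaussian-attenuated similarity index follows (the two-atom "molecule", its property values and the attenuation factor of 0.3 are illustrative assumptions, not Tripos parameters):

import numpy as np

ALPHA = 0.3  # Gaussian attenuation factor (an assumed, commonly quoted value)

# Hypothetical molecule with a per-atom property value w (e.g., hydrophobicity).
atoms = [{"xyz": np.array([0.0, 0.0, 0.0]), "w": 0.5},
         {"xyz": np.array([1.2, 0.0, 0.0]), "w": -0.3}]

def similarity_index(grid_point, probe_w=1.0):
    # The Gaussian stays finite at r = 0, so the index is well defined at all
    # grid points, inside as well as outside the molecular surface.
    total = 0.0
    for a in atoms:
        r2 = float(np.sum((grid_point - a["xyz"]) ** 2))
        total -= probe_w * a["w"] * np.exp(-ALPHA * r2)
    return total

print(similarity_index(np.array([0.0, 0.0, 0.0])))  # evaluable even on an atom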
At this point, we generally are dealing with biological activities as functions of
3D descriptor arrays. The statistical techniques to solve this problem treat this data
as linear arrays of descriptors that can then be solved with multilinear regression,
partial least squares (PLS) [5, 13], genetic algorithms (GA) [14], etc. In effect the
3D information is maintained external to the statistics engine. This is a significant
point that reveals why the superimposition of the molecules in the data set is crit-
ical (vide infra). In order for there to be correspondence between 3D descriptors
of different molecules in the data set, the structures of the molecules must be
aligned in space and sampled in the same cage of grid points. We focus on the
"alignment problem" in the next section. The end result is a single equation relat-
ing biological activity to the calculated values at grid points. Thus, the biological
activity of new molecules that can be adequately superimposed on molecules of
the learning data set can, in principle, be predicted. Statistical validation tech-
niques such as bootstrapping or cross-validation or the use of test sets of known
compounds are used to verify the predictiveness of 3D-QSAR models.
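The unfolding step can be made concrete with a short sketch (assuming NumPy and scikit-learn; the field values and activities below are random stand-ins, so the statistics are meaningless except to show the mechanics):

import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Each of n molecules, pre-aligned in the same grid cage, yields a
# (n_points, n_fields) map; unfold each map into one long descriptor row so
# the statistics engine sees only a linear array per compound.
rng = np.random.default_rng(0)
n_mols, n_points, n_fields = 20, 125, 2
field_maps = rng.normal(size=(n_mols, n_points, n_fields))  # stand-in fields
X = field_maps.reshape(n_mols, -1)                          # 3D info held externally
y = rng.normal(size=n_mols)                                 # stand-in activities

pls = PLSRegression(n_components=2).fit(X, y)
y_fit = pls.predict(X).ravel()
r2 = 1 - np.sum((y - y_fit) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"fitted r2 = {r2:.3f}")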
11.2.2 Pseudo-3D methods

Methods such as HQSAR and EVA [15] yield models that correlate two- or three-dimensionally dependent descriptors, respectively, with biological activity. In fact, the former method employs hashed fingerprints that do
not require information pertaining to the three-dimensional nature of the ligands.
The latter, like any method that utilizes descriptors dependent upon
the three-dimensional structure, is highly influenced by the conformation of the
molecules.
molecules. EVA is a vector descriptor based on EigenVAlues corresponding to
individual molecular vibrational modes. This approach makes use of values
derived from semi-empirical quantum chemical calculations that are related to
spectroscopic data. Normal coordinate frequencies are calculated and the result-
ing eigenvalues are projected onto a vibrational profile. A Gaussian smoothing
function is applied to each vibration and the resultant profile is sampled at fixed
intervals to provide a set of values that describes each molecule. This sampling step
averts the problem that the number of normal modes varies with the number of
atoms in the molecule, and thus ensures a constant number of descriptors across
the data set.
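A sketch of the EVA procedure (the 10 cm^-1 Gaussian width and 5 cm^-1 sampling interval are typical-looking assumptions, not prescribed values; the frequencies are invented):

import numpy as np

def eva_descriptor(frequencies_cm, sigma=10.0, lo=0.0, hi=4000.0, step=5.0):
    # Sum a Gaussian of width sigma at each vibrational frequency, then sample
    # the smoothed profile at fixed intervals; molecules with different numbers
    # of normal modes thereby yield descriptor vectors of identical length.
    x = np.arange(lo, hi + step, step)
    profile = np.zeros_like(x)
    for f in frequencies_cm:
        profile += np.exp(-((x - f) ** 2) / (2 * sigma ** 2))
    return profile

d1 = eva_descriptor([520.0, 1600.0, 3050.0])           # 3 modes
d2 = eva_descriptor([480.0, 900.0, 1580.0, 2900.0, 3100.0])  # 5 modes
assert d1.shape == d2.shape  # constant descriptor count across the data set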
The premise of HQSAR [16] is that, since the structure of a molecule is encoded within its 2D fingerprint and structure is the key determinant of all molecular properties (including biological activity), it should be possible to predict the activity of a molecule from its fingerprint. It is claimed that this assump-
tion is valid because of the high degree of success of 2D similarity searching of
chemical databases in which the similarity measure is the Tanimoto coefficient
between fingerprints [17] (for a detailed discussion on 2D descriptors see Chapter
7). HQSAR uses an extended form of fingerprint, known as a Molecular
Hologram, which encodes more information, for example, branched and cyclic
fragments as well as stereochemistry, than the traditional 2D fingerprint. The key
difference, however, is that a Molecular Hologram contains all possible molecu-
lar fragments within a molecule, including overlapping fragments, and maintains
a count of the number of times each unique fragment occurs. The authors main-
tain that this process of incorporating information about each fragment, and each
of its constituent sub-fragments, implicitly encodes 3D structural information.
While this approach may provide information as to the relative arrangements of
fragments within the molecule it clearly does not provide any spatial parameters.
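The counting-and-folding step behind a Molecular Hologram can be sketched as follows (the fragment lists and the hologram length of 59 are illustrative; real holograms come from systematic fragmentation of the connection table, including branched and cyclic fragments):

import hashlib
from collections import Counter

def hologram(fragments, length=59):
    # Fold occurrence counts of (possibly overlapping) fragments into a
    # fixed-length vector; unlike a binary fingerprint, counts accumulate.
    bins = [0] * length
    for frag, count in Counter(fragments).items():
        h = int(hashlib.md5(frag.encode()).hexdigest(), 16)  # deterministic hash
        bins[h % length] += count
    return bins

# Invented fragment enumerations for two small molecules.
mol_a = ["C", "CC", "CC", "CCC", "CCO", "CO"]
mol_b = ["C", "CC", "CC", "CCC", "CN"]
print(hologram(mol_a) != hologram(mol_b))  # distinct holograms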
There are a number of descriptors available that do encode such information and
have been successfully applied in QSAR studies. These may depend upon the
internal coordinates of the molecule alone or on the absolute orientation. An
example of a conformation independent descriptor would be the magnitude of the
dipole moment, whereas the x, y or z component of the dipole moment would be
influenced by the orientation. Examples of 3D descriptors include principal
moment of inertia, radius of gyration, volume, surface area and polar solvent
accessible surface area.
Finally, in an attempt to capture 3D information pertaining to size, shape, sym-
metry and atomic distribution, Weighted Holistic Invariant Molecular (WHIM)
descriptors [18] have been developed. The method consists of performing a prin-
cipal component analysis (PCA) on the centered molecular coordinates using a
weighted covariance matrix that is obtained by applying different weighting
schemes to the constituent atoms. The atoms may be weighted by their number,
mass, volume, electronegativity, polarizability or electrotopological index.
Directional or non-directional WHIM descriptors are then calculated dependent
on whether one includes information related to the principal axes. The use of
WHIM descriptors has proven of value in the development of 3D QSAR models;
however, their principal disadvantage is the lack of feedback information that can
be applied to further drug design. As with the topological indices from which they
may be derived, it is essentially impossible to employ the descriptor information
in the design of future analogs or to readily interpret the variation of descriptor
value with structure.
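The core of the WHIM computation, a weighted principal component analysis of the centered atomic coordinates, can be sketched in a few lines (the coordinates and mass weights below are invented; real WHIM descriptors derive further directional and non-directional statistics from the principal-axis scores):

import numpy as np

def whim_eigenvalues(coords, weights):
    # Eigenvalues of the weighted covariance matrix of the centered coordinates:
    # they summarize the size/shape of the weighted atom distribution along the
    # principal axes and are invariant to rotation and translation.
    coords = np.asarray(coords, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    centered = coords - np.sum(coords * w[:, None], axis=0)
    cov = (centered * w[:, None]).T @ centered   # 3 x 3 weighted covariance
    return np.sort(np.linalg.eigvalsh(cov))[::-1]

xyz = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.2, 1.1, 0.0]]
print(whim_eigenvalues(xyz, weights=[12.0, 12.0, 16.0]))  # mass weighting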
11.3 The 3D QSAR alignment problem and pharmacophore hypotheses

By the very nature of the technique, the most crucial step in this 3D approach
is the relative orientation of the test molecules in space. That is, the chosen align-
ment of the compounds in the training set is going to have the most profound
impact on the predictive ability of the model. We have already noted the plethora
of methods available to us for structural superimposition. From a purely drug
design perspective, in those instances where one has no knowledge of the three-
dimensional shape of the receptor, a set of alignment rules based upon a pharma-
cophore hypothesis will probably be the most valuable. Indeed, perhaps the prin-
cipal value of this methodology is the evaluation of such alignments based upon
the predictive power of the derived model. One may conclude that the greater the
predictive power of the model the more the alignment reflects the bioactive con-
formation of the molecules.
Thus, the real value of this approach is probably not in the evaluation over a
single congeneric series, but more where one is trying to align molecules that do
not possess a common core structure. In such a case, as in the development of
pharmacophore hypotheses, one is often trying to identify features from the mol-
ecules of separate series that may present common interaction points with the
receptor. For example, the presence of H-bond donors or acceptors, the placement
of hydrophobic moieties, or the presence or absence of heteroatoms, may form a
basis of such a pharmacophore alignment. In addition, for mathematical reasons
in any QSAR method, each molecule needs to have a similarly developed set of
descriptors; that is, each molecule must be described with the same number of
independent variables. This is a simple matter in grid-based 3D QSAR: each
molecule is analyzed by fields of the same dimensions and resolution. The
researcher will develop a number of alternate pharmacophore hypotheses, by the
manual manipulation of the data set or by automated means, such that the evalu-
ation of each of the models may be achieved by the use of such grid-based meth-
ods.
Can a limited model be extended to a non-congeneric series of molecules? A
large number of research groups have clearly shown that it can. A number of years
ago, we demonstrated that CoMFA could be used to predict the biological activi-
ty of a series of classical and non-classical cannabinoids [21]. Although their
structures are not dramatically different and the derivation of the non-classical
Scheme 1. Structures of Δ9-THC and CP55,940.
11.4 Applications of 3D QSAR

Based on the above discussion, perhaps the most crucial step in designing and
evaluating three-dimensional quantitative structure-activity relationships for drug
discovery is the designation of a model for superimposing the molecules of the
study. Sophisticated methods are often necessary to define the superposition. The
only experimental approach is to rely on x-ray crystallographic determinations of
ligand-receptor complexes (see Chapter 8), but often only one or just a few com-
plex structures are known. In these cases, the coordinate data from the ligand of
one complex is used as a template upon which other congeneric ligands in the data
set can be superimposed with a fairly high level of confidence. There have also
been 3D QSAR studies where the crystal structures of multiple complexes were
known. The most complete of these (1993-1994), by Garland Marshall et al. of
Washington University in St. Louis, examined data and structures for HIV-1 protease inhibitors from several laboratories [25-27].
In this section we wish to review the Marshall 3D QSAR study, and also present for comparison some recent results we have obtained using the entire range of
3D QSAR techniques on another series of HIV-1 inhibitors published more
recently by Holloway et al. of Merck Laboratories [28]. The models for these
complexes were obtained by analogy to crystal structures of a handful of actual
HIV-1/inhibitor complexes. The drug design and development efforts based on
these two landmark modeling studies have contributed significantly to the development of the currently marketed class of HIV-1 protease inhibitors (for a list of
HIV-1 protease inhibitors which have reached the market see Chapter 8). Our plan
in the latter case is to didactically describe the use of these techniques as we present the results. This will put us in a position to comment on the relative merits of
the methods. The last portion of this section will review a small number of recent
success stories that feature 3D QSAR.
graphically-determined) ligand will be mimicked in the field fit, but absent in sim-
ple site-constrained minimizations. Third, while the steric and electrostatic fields
of CoMFA can represent much of the free energy of binding, there are other con-
tributions, e.g., from solvation and hydrophobicity, that are being ignored. The
addition of this type of data to the CoMFA model was one of the foci of the third
part [27] of the study. Fourth, creation of a satisfactory internally self-consistent
QSAR for the learning set is not sufficient to validate a 3D QSAR model. It must
be tested on an external test set of compounds that were not part of the learning set
[26]. Selection of good test sets involves a) considering the quality and compatibility of the biological data, b) ensuring that the range of activities is comparable to, but does not exceed, that of the learning set, and c) balancing active
and inactive compounds.
In fact, responses to many of these observations were incorporated into the
more contemporary 3D QSAR methods. The three critical factors are: alignment
rule, mathematical modeling and rational selection of physically and chemically
meaningful "field" properties, and selection of an intelligent and statistically valid
data set. Below, we look at another data set of HIV-1 protease inhibitors with sev-
eral modern 3D QSAR methods.
Table 1. Structures and HIV protease binding affinities of molecules employed in the study. (Columns: Compound, R1, R2, R3, IC50 (nM).)

Scheme 2.

Activity was correlated linearly with the computed interaction energy, pIC50 = a · Einter + b, where Einter is the sum of intermolecular van der Waals and electrostatic interactions between the ligand and enzyme. The model gave r2 = 0.797 and a standard
error of 0.675. This is a surprisingly simple and yet robust model considering the
complexity of the molecules. In fact, this model is superior to that obtained with a
standard CoMFA model (see Tab. 2): thirty-one compounds bound to HIV protease were analyzed in a standard CoMFA model.2,3
This model yielded a cross-validated correlation (q2) = 0.297, with a standard
error of prediction of 1.249 with two components. A fitted PLS regression
employing the same number of components yielded a higher r2 of 0.623, with a
standard error of estimate of 0.919.
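For reference, the cross-validated q2 quoted throughout this discussion can be computed from leave-one-out predictions as q2 = 1 − PRESS/SS, where PRESS is the predictive residual sum of squares; a sketch (assuming NumPy and scikit-learn, with random stand-in data):

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut

def loo_q2(X, y, n_components=2):
    # PRESS: squared errors of predictions for compounds left out in turn;
    # SS: total sum of squares of the activities about their mean.
    press = 0.0
    for train, test in LeaveOneOut().split(X):
        model = PLSRegression(n_components=n_components).fit(X[train], y[train])
        press += ((y[test] - model.predict(X[test]).ravel()) ** 2).item()
    ss = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - press / ss

rng = np.random.default_rng(1)
X, y = rng.normal(size=(31, 50)), rng.normal(size=31)
print(f"q2 = {loo_q2(X, y):.3f}")  # random data gives q2 near or below zero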
Comparison of CoMFA to CoMSIA on the same data set reveals similar results,
in line with previously published observations.4 There is a nominal improvement
in the steric/electrostatic model upon incorporation of the hydrophobic and hydrogen bond field descriptors, although it is interesting to note that the hydrophobic
field alone is of greater predictive power.
The HINT intermolecular field analysis (HIFA) [24] model of the ligands in the
receptor complex yielded a cross-validated correlation (q2) = 0.696, with a standard error of prediction of 0.958 with five components.5 A fitted PLS regression
employing the same number of components yielded a higher r2 of 0.912, with a
standard error of estimate of 0.477. The superiority of this model is a consequence
of the additional structural information that is derived from measurement of the
interaction between the ligands and their receptor, unlike CoMFA and CoMSIA,
which are performed via measurement of only the properties of the ligand molecules.
The hologram QSAR method (HQSAR) does provide a better analysis than the
usual grid-based 3D approaches.4 The researcher is able to regulate the type of fingerprint that may be generated and subsequently employed in the analysis.
2 Computational studies were initiated on the x-ray coordinates of HIV-1 protease complexed with the
Merck inhibitor, L-689,502 [28], utilizing the Sybyl suite of molecular modeling programs [6]. All crystallographic waters, with the exception of the tightly bound water (residue 407), were removed. The oxygen of this water, together with the protein heavy atoms, was kept rigid during geometry optimizations.
Hydrogen atoms were added to the protein-water-inhibitor complexes and the structures were subsequently optimized with the Merck Molecular Force Field (MMFF) [30]. All binding affinities were
expressed as pIC50. A 3D region defining the boundaries of the grid-based calculations was constructed
that surrounds the ligand-binding domain, with spacing between sampling points of 2 Å.
3 For the CoMFA study, the optimized ligands were extracted from the protein binding site and assigned
charges by the Gasteiger-Hückel method. The optimum model was determined by the standard PLS algorithm with leave-one-out cross-validation, using up to ten components. A CoMFA was performed on
the 31 ligands using an sp3 carbon probe bearing a unit positive charge, employing the described 3D region.
4 As for CoMFA, the optimized ligands were extracted from the protein and assigned Gasteiger-Hückel
charges for the CoMSIA and HQSAR studies. CoMSIA was performed in a similar manner to CoMFA,
using the standard steric-electrostatic, donor-acceptor and hydrophobic fields. HQSAR was performed
using the default hologram lengths [53, 59, 61, 71, 83, 97, 151, 199] and selecting the "best" model that
gave the highest cross-validated r2.
5 In the HIFA study the HIV protease-water complex was partitioned in HINT employing the protein dictionary method of determination, using all hydrogens. The ligand was partitioned in HINT by the explicit calculation method, also using all hydrogens. Hydropathic interaction maps were calculated using the
HINT intermolecular protocol, specifying all interactions constrained by the described 3D region. The
resultant maps were imported into Sybyl using the same region definition. A QSAR data table was constructed containing the molecules that are reported in Table 1 and a column generated for the map files
as a CoMFA field-type, using the externally created fields.
Table 2. 3D QSAR results for the HIV-1 protease inhibitor data set
Features that may be considered in unique fingerprint generation are atom type
(a), bond type (b), atomic connection (c), hydrogen bond donor or acceptor (d),
chirality (ch) and inclusion of hydrogen atoms (h). Thus, in the case where all
options are suppressed, an identical fingerprint would be generated for benzene
and pyridine. However, unique fingerprints would be obtained if one employs
either the atom type specifier, where the difference in the C and N atom types would
be observed, or the inclusion of hydrogen atoms, where one would now count
the six hydrogen atoms in benzene compared to the five in pyridine.
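The benzene/pyridine behavior can be mimicked with a toy fragment generator in which an atom-type flag is switched on or off (the fragment scheme below is invented and far cruder than the real hologram options):

def ring_fragments(atom_seq, use_atom_type):
    # All contiguous 1-3 atom paths around a 6-membered ring; with atom typing
    # off, every atom collapses to the same generic symbol "A".
    symbols = atom_seq if use_atom_type else ["A"] * len(atom_seq)
    ring = symbols + symbols  # duplicate so paths can wrap around the ring
    return {"".join(ring[i:i + n]) for i in range(len(atom_seq)) for n in (1, 2, 3)}

benzene, pyridine = ["C"] * 6, ["C"] * 5 + ["N"]

# All options suppressed: the two rings give identical fragment sets.
print(ring_fragments(benzene, False) == ring_fragments(pyridine, False))  # True
# Atom-type flag on: the C/N difference now distinguishes them.
print(ring_fragments(benzene, True) == ring_fragments(pyridine, True))    # False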
From Table 2 it can be seen that a modest improvement in the HQSAR
analysis can be obtained by addition of the hydrogen bond donor/acceptor and
hydrogen atom count flags. The chirality flag appears to have little effect, which is entirely predictable from the data set (Tab. 1), since there is only one diastereomeric pair
and there is little separation in potency.
besides predicting the activities of untested molecules, is to provide hints about
what molecular fragments may be important contributors to activity. In HQSAR,
following the calculation of atom contributions to activity, the molecule display is
color coded to reflect the individual atomic contributions. In this example, the col-
ors at the red end of the spectrum (red, red orange, and orange) reflect favorable (or
positive) contributions, while colors at the green end (yellow, green blue, and green)
reflect unfavorable (negative) contributions. Atoms with intermediate contributions
are colored white. In the case of the inactive analog 2PCM shown below (Fig. 3),
the phenyl ring projects into a region that the corresponding CoMSIA derived
hydrophobic field associates with poor activity. In fact, there is generally good
agreement in the graphical output of these two methods. The highly active com-
pound 2PIO occupies steric regions that are associated with good binding affinity
and matches well to the electrostatic pattern (not shown) determined by CoMSIA.
It is particularly interesting to note that HQSAR, an inherently 2D method, provides a more predictive model than its standard 3D grid-based counterparts. This
begs the question of why this should be. In part, it suggests that, at a
superficial level, the variation in biological activity is largely encoded within the
two-dimensional structures of the ligands.

Figure 3. Projection of steric (green/red) and hydrophobic (orange/blue) CoMSIA maps onto the HQSAR color-coded models of A) 2PCM and B) 2PIO.

11.4.3 Recent success stories of 3D QSAR

It should be pointed out, if it is not already obvious, that it is, at least, several years after the
completion of a pharmaceutical drug discovery project before the strategies, etc.
behind the project are disclosed by publication. We are thus certain that published
studies of 3D QSAR are only the tip of the iceberg with respect to the actual usage
of this paradigm for drug discovery and development in industry. We have chosen,
for brief review here, a few recent examples that we find particularly interesting
for a variety of reasons. In particular, we highlight studies that effectively use a
combination of 2D and/or 3D techniques.
In a study where there is little information about the receptor, a 3D QSAR was
created and used in a drug design project at Astra involving muscarinic ligands
[31]. To initiate the study, a potent, conformationally rigid muscarinic agonist,
dihydro-syn-4'-methylspiro[1-azabicyclo[2.2.1]heptane-3,5'(4'H)-furan]-3'-one
[31], was examined with calculated electrostatic and GRID [32] maps. These data
afforded information as to the optimal position of potential interaction sites
between the ligand and its receptor. Consequently, the requisite groups were orien-
tated in space to achieve maximal interaction with the compound and thus provid-
ed a pseudo-receptor structure. For example, as suggested by GRID, a carboxylate
group was placed adjacent to the protonated ring nitrogen of the rigid compound.
Similarly, aliphatic hydroxyl groups were constructed to maximally interact with
both the ring ether and carbonyl oxygen atoms. The complex structure was then
geometry optimized. A diverse subset of muscarinic agonists were "docked" into
the pseudo receptor site model defined by these studies and the resultant alignment
was employed in the 3D QSAR. A CoMFA was performed on 80 ligands modeled
in this pseudo receptor site, including an extensive sample of acyl hydrazones,
hydrazides and oximes. The model obtained had a cross-validated r2 of 0.556,
employing six components, with a final r2 of 0.900 and a standard error of 0.278 [31].
This model was further refined with the use of alternative probe atoms of differing
charge, the use of smaller grid spacing and the incorporation of hydropathic fields
calculated by HINT [7]. The final model was employed in the design and affinity
prediction of a number of potential muscarinic receptor ligands [31].
In another study, the inhibitory activity values (IC50) of 54 phenyltropanes at
the dopamine (DA), serotonin (5-HT) and noradrenaline (NA) transporters were
quantitatively examined by use of different QSAR techniques and PLS [33]. The
use of 36 physicochemical and quantum mechanical parameters resulted in
informative 2D models that were simple to interpret. For the purpose of compar-
ison, additional CoMFA models using the standard fields were constructed from
a molecular alignment maximizing the similarity of shape and electrostatic
potential. Highly significant models with good fitting and predictive abilities
were developed for each transporter. The major structural requirements revealed
from the CoMFA analyses were in good agreement with the findings of the indi-
vidual 2D analyses. The models obtained suggest a close similarity between the
examined transporters in terms of binding interaction. A comparison of the
CoMFA contour maps provided a rational basis for the design of transporter-
selective ligands.
Pajeva and Wiese have recently reported molecular modeling studies on mul-
tidrug resistance (MDR) modifiers based on phenothiazines and related com-
pounds [34, 35]. MDR is a major impediment to the treatment of metastatic can-
cers. It is a broad-spectrum invulnerability to chemotherapy, involving a number
of mechanisms. A number of drugs have been identified, called MDR reversing
agents or modifiers, that are not cytotoxic by themselves but can reverse MDR.
The authors performed QSAR and CoMFA studies on phenothiazines, thioxanthenes and related compounds to identify and quantitatively assess structural features important for anti-MDR activity. The key factor of membrane penetration
and interaction, expected to be related to hydrophobicity, was demonstrated in
CoMFA models with cross-validated statistics (q2) as high as 0.93 when the HINT
field was combined with the steric and electrostatic fields [35]. In effect,
hydrophobicity is a molecular property that significantly influences MDR rever-
sals. Furthermore, describing hydrophobicity as a space-directed molecular prop-
erty, i.e., a field, is preferable to the use of scalar (log Po/w) representations of
hydrophobicity.
Predictive correlations have been obtained for angiotensin-converting enzyme
inhibitors in a number of different protocols [36-38], including CoMFA and
Catalyst [39] over the past several years. However, certain inadequacies of the
CoMFA technique have been noted, primarily that the standard steric and electro-
static fields alone do not fully characterize the zinc-ligand interaction. This has
been partially rectified by the inclusion of indicator variables into the QSAR table
to designate the class of zinc-binding ligand. Also, using molecular orbital (MO)
fields derived from semi-empirical calculations as additional descriptors in the
QSAR table produced predictive correlations based on CoMFA and MO fields
alone [40].
11.5 Conclusions
The ultimate purpose of a QSAR study is to learn from a set of existing compounds so that new ones with better properties can be found.
In that context, it is important to design QSAR studies so that chemically mean-
ingful information can be extracted-the kind of information that can be directly
applied to drug design. Since the development of pharmacophore hypotheses is
fundamental to the researcher in the absence of structural information, the ability
to assess the relevance and predictability of such models is crucial. Grid based 3D
approaches, in particular, not only provide valuable scoring data, but also deliver
graphical information correlating structural modification with biological activity.
Finally, to reiterate the conclusions of Oprea, Waller and Marshall [25], there is a
difference between internally-consistent QSAR models and externally-predictive
QSAR models. In many ways this represents the errors that occur when extrapo-
lating from a model rather than interpolating within a model.
11.6 Acknowledgements
The authors acknowledge the support of their organizations: VCU and Wyeth-
Ayerst. We thank Tripos, Inc. for their help over the past decade of our research in
developing and using 3D QSAR applications. Finally, we are grateful to Derek J.
Cashman (VCU) for help with manuscript graphics.
References
1 Meyer H (1899) Zur Theorie der Alkoholnarkose; erste Mittheilung: Welche Eigenschaft der
Anaesthetica bedingt ihre narkotische Wirkung? Arch Exp Pathol Pharmakol 42: 109-118
2 Hansch C, Fujita T (1964) ρ-σ-π analysis. A method for the correlation of biological activity and
chemical structure. J Am Chem Soc 86: 1616-1626
3 Rekker RF (ed) (1977) The hydrophobic fragmental constant. Its derivation and applications. A means
of characterizing membrane systems. Elsevier, New York
4 Hall LH, Kier LB (1991) The molecular connectivity chi indexes and kappa shape indexes in struc-
ture-property relations. In: Boyd D, Lipkowitz K (eds): Reviews of computational chemistry. VCH
Publishers, Weinheim, Germany, 367-422
5 Cramer RD III, Patterson DE, Bunce JD (1988) Comparative molecular field analysis. 1. Effect of
shape on binding of steroids to carrier proteins. J Am Chem Soc 110: 5959-5967
6 Sybyl molecular modeling suite, Tripos Inc., 1699 S. Hanley Road, St. Louis MO 63144, USA
7 Kellogg GE, Semus SF, Abraham DJ (1991) HINT - a new method of empirical hydrophobic field calculation for CoMFA. J Comput Aided Mol Des 5: 545-552
8 Vaz R (1997) Use of electron densities in comparative molecular field analysis (CoMFA): A quantitative structure activity relationship (QSAR) for electronic effects of groups. Quant Struct-Act Relat 16:
303-308
9 Kellogg GE, Kier LB, Gaillard P et al (1996) The E-State fields. Application to 3D QSAR. J Comput
Aided Mol Des 10: 513-520
10 Kellogg GE (1997) Finding optimum field models for 3D QSAR. Med Chem Res 7: 417-427
11 Woolfrey JR, Avery MA, Doweyko AM (1998) Comparison of 3D quantitative structure-activity rela-
tionship methods: Analysis of the in vitro antimalarial activity of 154 artemisinin analogues by hypo-
thetical active-site lattice and comparative molecular field analysis. J Comput Aided Mol Des 12:
165-181
12 Klebe G, Abraham U, Mietzner T (1994) Molecular similarity indexes in a comparative-analysis
(CoMSIA) of drug molecules to correlate and predict their biological-activity. J Med Chem 37:
4130-4146
13 Wold S, Johansson E, Cocchi M (1993) PLS - Partial Least Squares projections to latent structures.
In: Kubinyi H (ed): 3D QSAR in drug design. Theory, methods and applications. Escom, Leiden,
523-550
14 Rogers D, Hopfinger AJ (1994) Application of genetic function approximation to quantitative structure-activity relationships and quantitative structure-property relationships. J Chem Inf Comput Sci
34: 854-866
15 Ferguson AM, Heritage T, Jonathon P et al (1997) EVA: A new theoretically based molecular descriptor for use in QSAR/QSPR analysis. J Comput Aided Mol Des 11: 143-152
16 Tong W, Lowis DR, Perkins R et al (1998) Evaluation of quantitative structure-activity relationship
methods for large-scale prediction of chemicals binding to the estrogen receptor. J Chem Inf Comput
Sci 38: 669-677
17 Willet P, Winterman V, Bawden D (1986) Implementation of nearest-neighbor searching in an online
chemical structure search system. J Chem Inf Comput Sci 26: 36-41
18 Todeschini R, Gramatica P (1997) 3D-modelling and prediction by WHIM descriptors. Part 5. Theory
development and chemical meaning of WHIM descriptors. Quant Struct-Act Relat 16: 113-119
19 Martin YC, Bures MG, Danaher EA et al (1993) A fast new approach to pharmacophore mapping and
its application to dopaminergic and benzodiazepine agonists. J Cornput Aided Mol Des 7: 83-102
20 Klebe G, Abraham U (1993) On the prediction of binding properties of drug molecules by compara-
tive field analysis. J Med Chern 36: 70-80
21 Thomas BF, Compton DR, Martin BR et al (1991) Modeling the cannabinoid receptor: A three-dimen-
sional quantitative structure activity analysis. Mol Pharmacol 40: 656-665
22 Pertwee R (1997) Cannabis and cannabinoids: Pharmacology and rationale for clinical use. Pharm Sci
3: 539-545
23 Baker D, Pryce G, Croxford JL et al (2000) Cannabinoids control spasticity and tremor in a multiple
sclerosis model. Nature 404: 84-87
24 Semus SF (1999) A novel hydropathic intermolecular field analysis (HIFA) for the prediction of ligand-receptor binding affinities. Med Chem Res 9: 535-550
25 Waller CL, Oprea TI, Giolitti A et al (1993) Three-dimensional QSAR of human immunodeficiency virus
(1) protease inhibitors. 1. A CoMFA study employing experimentally-determined alignment rules. J
Med Chem 36: 4152-4160
26 Oprea TI, Waller CL, Marshall GR (1994) Three-dimensional quantitative structure-activity relationship of
human immunodeficiency virus (1) protease inhibitors. 2. Predictive power using limited exploration
of alternate binding modes. J Med Chem 37: 2206-2215
27 Oprea TI, Waller CL, Marshall GR (1994) 3D-QSAR of human immunodeficiency virus (1) protease
inhibitors. III. Interpretation of CoMFA results. Drug Design and Discovery 12: 29-51
28 Holloway MK, Wai JM, Halgren TA et al (1995) A priori prediction of activity for HIV-1 protease
inhibitors employing energy minimization in the active site. J Med Chem 38: 305-317
29 Wei DT, Meadows JC, Kellogg GE (1997) Effects of entropy on QSAR equations for HIV-1 protease:
1. Using hydropathic binding descriptors. 2. Unrestrained complex structure optimizations. Med Chem
Res 7: 259-270
30 Halgren TA (1996) Merck molecular force field. V. Extension of MMFF94 using experimental data,
additional computational data and empirical rules. J Comput Chem 17: 616-641
31 Wu ESC, Kover A, Semus SF (1998) Synthesis and modeling studies of a potent conformationally
rigid muscarinic agonist: 1-azabicyclo[2.2.1]heptanespirofuranone. J Med Chem 41: 4181-4185
32 Goodford PJ (1985) A computational procedure for determining energetically favorable binding sites
on biologically important macromolecules. J Med Chem 28: 857-864
33 Muszynski IC, Scapozza L, Kovar K-A et al (1999) Quantitative structure-activity relationships of
phenyltropanes as inhibitors of three monoamine transporters: classical and CoMFA studies. Quant
Struct-Act Relat 18: 342-353
34 Pajeva IK, Wiese M (1997) QSAR and molecular modeling of catamphiphilic drugs able to modulate
multidrug resistance in tumors. Quant Struct-Act Relat 16: 1-10
35 Pajeva IK, Wiese M (1998) Molecular modeling of phenothiazines and related drugs as multidrug
resistance modifiers: A comparative molecular field analysis study. J Med Chem 41: 1815-1826
36 DePriest SA, Mayer D, Naylor CB et al (1993) 3D-QSAR of angiotensin-converting enzyme and thermolysin inhibitors: A comparison of CoMFA models based on deduced and experimentally determined
active site geometries. J Am Chem Soc 115: 5372-5384
37 DePriest SA, Shands EFB, Dammkoehler RA et al (1991) 3D-QSAR: Further studies on inhibitors of
angiotensin-converting enzyme. Pharmacochem Libr 16: 405-416
38 Sjoestroem M, Eriksson L, Hellberg S et al (1989) Peptide QSARs: PLS modeling and design in principal properties. Prog Clin Biol Res 291: 313-317
39 Sprague PW (1995) Automated chemical hypothesis generation and database searching with Catalyst.
Persp Drug Discovery Design 3: 1-20
40 Waller CL, Marshall GR (1993) 3D-QSAR three dimensional quantitative structure activity relationship of angiotensin-converting enzyme and thermolysin inhibitors. II. A comparison of CoMFA models incorporating molecular orbital fields and desolvation free energies based on active-analog and
complementary-receptor-field alignment rules. J Med Chem 36: 2390-2403
12 Physicochemical concepts in drug design

Han van de Waterbeemd

Pfizer Global Research and Development, PDM, Sandwich, Kent CT13 9NJ, UK
12.1 Introduction
12.2 Experimental approaches
12.2.1 Solubility
12.2.2 Lipophilicity
12.2.3 Absorption models
12.3 Computational approaches
12.4 Design of combinatorial libraries
12.5 Selection of subsets for HTS and hit validation
12.6 Lead optimization and candidate selection
12.7 Examples of the use of physicochemical properties in drug discovery
12.7.1 HIV inhibitors
12.7.2 GABA uptake inhibitors
12.7.3 Antifungal fluconazole
12.7.4 Endothelin antagonists
12.7.5 H1-antihistaminics
12.7.6 5-HT1B/1D receptor agonists as anti-migraine agents
12.7.7 Renin inhibitors
12.7.8 Modulation of multidrug resistance (MDR)
12.8 References
12.1 Introduction
A successful drug candidate has the right attributes to reach and bind its molecu-
lar target and has the desired duration of action. Binding to the target can be opti-
mised by designing the proper three-dimensional arrangement of functional
groups. Each chemical entity also has, through its structure, physicochemical and
biopharmaceutical properties. These are generally related to processes such as dis-
solution, oral absorption, uptake into the brain, plasma protein binding, distribu-
tion, and metabolism. Therefore fine-tuning of the physicochemical properties has
an important place in lead optimization.
In this chapter recent developments around some of the key physicochemical
properties and how they are used in various phases of the discovery process are
addressed.
244 H. van de Waterbeemd
12.2 Experimental approaches

12.2.1 Solubility
The first step in the process to enter the body from a formulation for most drugs
is dissolution. Traditional dissolution and solubility measurements are quite
tedious. In today's discovery environment higher throughput demands are extend-
ed to physicochemical measurements, such as assessment of solubility.
Turbidometric or kinetic solubility evaluation has been reported as a reasonable
first approximation to detect low-solubility compounds [1, 2]. An automated
potentiometric titration method for solubility measurement and solubility-pH
profiles has been introduced [3]. In addition to these new experimental approach-
es various efforts have been reported to estimate solubility from molecular struc-
ture (see section 12.3). However, knowledge of aqueous solubility and dissolution
rate still may be insufficient to predict in vivo performance, since food effects may
have considerable and variable impact [4].
12.2.2 Lipophilicity
Figure 1. Experimental methods to measure lipophilicity (modified after [7]). Key of abbreviations - log Poct: 1-octanol/water partition coefficient; log Pliposomes: partition coefficient between liposomes and buffer; log Phexane: 1-hexane/water partition coefficient; log PPGDP: propyleneglycol dipelargonate/water partition coefficient; log Pheptane/glycol: a non-aqueous partitioning system; log PDCE: 1,2-dichloroethane/water partition coefficient; SF: shake-flask; pH-metric: log P determination based on potentiometric titration in water and octanol/water; CPC: centrifugal partition chromatography; RP-HPLC: reversed-phase high-performance liquid chromatography; TLC: thin-layer chromatography; ODS: octadecylsilane; ABZ: end-capped silica RP-18 column; ODP: octadecylpolyvinyl packing; IAM: immobilized artificial membrane; ILC: immobilized liposome chromatography; MEKC: micellar electrokinetic capillary chromatography.
Along the same lines several groups have suggested that studying partitioning
into liposomes may produce relevant information related to membrane uptake and
absorption [8-15].
Another development using distribution or partition coefficients was to look at
the difference between two solvent scales, particularly octanol/water and alkane/water. The Δlog P values thus obtained appear to encode the H-bonding
capacity of a solute. H-bonding itself was demonstrated to be an important prop-
erty for membrane crossing. Further research showed that H-bonding can be more
conveniently assessed by computational methods (see below).
H-bonding terms also feature in computational estimates of key properties; for example, aqueous solubility has been modeled with a solvation equation of the form:

log Sw = 0.52 − 1.00 R2 + 0.77 π2H + 2.17 Σα2H + 4.24 Σβ2H − 3.36 Σα2H·Σβ2H − 3.99 Vx
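A direct implementation of the equation above is straightforward (the descriptor values in the example call are invented for illustration):

def log_sw(R2, pi2H, sum_alpha2H, sum_beta2H, Vx):
    # Estimate log aqueous solubility from solvation descriptors using the
    # coefficients given in the text: excess molar refraction (R2),
    # dipolarity/polarizability (pi2H), H-bond acidity and basicity sums,
    # and molecular volume (Vx).
    return (0.52 - 1.00 * R2 + 0.77 * pi2H + 2.17 * sum_alpha2H
            + 4.24 * sum_beta2H - 3.36 * sum_alpha2H * sum_beta2H - 3.99 * Vx)

# Illustrative (made-up) values for a small, weakly H-bonding solute.
print(f"log Sw = {log_sw(0.80, 0.90, 0.30, 0.60, 1.20):.2f}")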
12.6 Lead optimization and candidate selection

Once a lead has been discovered in terms of sufficient binding to the molecular
target and having appropriate chemical attractiveness, a number of properties will
need to be fine-tuned. Generally this includes further improvements in binding
and selectivity, as well as optimization of biopharmaceutical and pharmacokinet-
ic properties. It is in this stage that physicochemical profiling and absorption esti-
mation may guide a project towards a small set of suitable candidates.
The optimal candidate meets a number of pre-set criteria. These may vary
according to the desired application form. For oral absorption it is obviously necessary to have
a good estimate of predicted human absorption and bioavailability. For once- or
twice-a-day dosing, reliable predictions of clearance in man need to be available. The
role of physicochemistry and structure in metabolism and pharmacokinetics is
reasonably well understood [41, 42]. An important task is the metabolic evalua-
tion of possible candidates (see Chapter 13). There are now well-established in
vitro techniques for assessing the role of specific P450 isoenzymes in the metab-
olism of drugs and to evaluate the potential for drug-drug interactions.
12.7 Examples of the use of physicochemical properties in drug discovery

12.7.1 HIV inhibitors

Marketed inhibitors of the human immunodeficiency virus (HIV) are all relative-
ly large in molecular size, e.g., saquinavir (MW 671) and indinavir (MW 614).
Increasing solubility and lowering H-bonding potential were recognized as favorable for
increasing the bioavailability of this class of compounds, while the role of molecular
size remains unclear [43]. The incorporation of an ionizable center, such
as an amine or similar function, into a template can bring a number of benefits
including water solubility. A key step [44] in the discovery of indinavir was the
incorporation of a basic amine (and a pyridine) into the backbone of hydroxyeth-
ylene transition state mimic compounds (Fig. 2a) to enhance solubility and poten-
cy (for structure-based design efforts on indinavir see Chapter 8).
In another series of potential HIV inhibitors, attempts were made to introduce specific
functional groups to improve solubility and bioavailability [45]. It was concluded
that structural features regarded as beneficial for oral uptake in one series of com-
pounds may not enhance bioavailability in other types of inhibitors. Rather the
overall structural features of a compound govern the oral bioavailability.
Figure 2. a) Structures of lead compound L-685,434 (left) and indinavir (right), which incorporates basic
functions aiding water solubility. b) Structures of the antifungal agents fluconazole and ketoconazole. c)
Replacement of amide with acetyl in a series of amidothiophenesulfonamide endothelin-A antagonists to
improve oral bioavailability. d) Structure of serotonin (5-hydroxytryptamine) and some 5-HT1B/1D agonists (sumatriptan, zolmitriptan, naratriptan, eletriptan). e) Examples of proposed MDR modulators.
Figure 3. Removal of H-bonds as a synthetic strategy for a series of azole-containing endothelin antagonists aimed at improving bioavailability by lowering H-bonding potential [47]. Stepwise replacement of the X, Y and Z positions increased exposure:

Compound   X     Y    Z     i.d. AUC (µg·min/mL)
A          NH    NH   NH    0.3
B          NH    O    NH    20.8
C          NH    O    O     48.9
D          NMe   O    O     110.3
12.7.5 H1-antihistaminics

Lipophilicity and H-bonding are important parameters for the uptake of drugs in the
brain [49]. Their role has, for example, been studied in a series of structurally
diverse sedating and non-sedating histamine H1-receptor antagonists [50]. From
these studies a decision tree guideline for the development of non-sedative anti-
histamines was designed (see Fig. 4).
Figure 4. Decision tree for the design of non-sedative H1-antihistaminics; branch points include Δlog P > 4 and Δlog P < 2, the latter indicating good brain penetration. Log D is measured at pH 7.4, while Δlog P refers to compounds in their neutral state.
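Only the two Δlog P branch points survive legibly from Figure 4; a toy rendering of just those branches follows (the treatment of intermediate cases and of log D is left open here, since the full tree is not reproduced):

def brain_penetration(delta_log_p):
    # Large octanol-alkane difference (strong H-bonding) predicts poor CNS
    # entry; small difference predicts good entry. The published tree also
    # uses log D at pH 7.4, which is not reconstructed in this sketch.
    if delta_log_p > 4:
        return "poor brain penetration (favorable for a non-sedating agent)"
    if delta_log_p < 2:
        return "good brain penetration (sedation risk)"
    return "intermediate: consult additional properties (e.g., log D at pH 7.4)"

print(brain_penetration(4.5))
print(brain_penetration(1.0))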
12.7.6 5-HT1B/1D receptor agonists as anti-migraine agents

A number of serotonin receptor 5-HT1B/1D agonists, such as sumatriptan, are marketed as anti-migraine drugs (Fig. 2d). SAR of 5-substituted tryptamines made
clear that acceptable levels of oral absorption can only be obtained when the compounds are small and not too hydrophilic [52]. It is believed that such molecules
can be absorbed through aqueous pores using the para-cellular pathway. The first-in-class compound sumatriptan has a log D of -1.2 [7] and only 14% bioavailability, while zolmitriptan has an estimated log D of -1.0 and is 40% bioavailable [52].
More recent compounds such as naratriptan (CLOGP = 1.696) and eletriptan
(CLOGP = 3.08) have higher CLOGP values than sumatriptan (CLOGP = 0.863),
and are designed to improve oral absorption.
Figure 6. Iso-molecular weight curves (MW 300-800; renin inhibitors indicated) showing the influence of molecular size on membrane permeability, plotted against log D at pH 7.4, with increasing lipophilicity [58].
lipophilicity were better absorbed from the intestine [54]. Both absorption and the rate of biliary elimination are MW-dependent: lower MW gives better absorption and less bile excretion. The combined influence of molecular size and lipophilicity on
absorption of a series of renin inhibitors can be seen in Figure 6. The observed iso-
size curves are believed to be part of a general sigmoidal relationship between per-
meability and lipophilicity [56, 57].
compounds appear to be better P-gp inhibitors than the smaller and more hydrophilic compounds in that series.
References
1 Lipinski CA, Lombardo F, Dominy BW et al (1997) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Revs 23: 3-25
2 Kansy M, Kratzat K, Parrilla I et al (2000) Physicochemical high throughput screening (pC-HTS): Determination of membrane permeability, partitioning and solubility. In: K Gundertofte, FS Jørgensen (eds): Molecular Modeling and Prediction of Bioactivity. Kluwer Academic/Plenum Publishers, 237-243
3 Avdeef A (1998) pH-metric solubility. I. Solubility-pH profiles from Bjerrum plots. Gibbs buffer and pKa in the solid state. Pharm Pharmacol Commun 4: 165-178
4 Stella VL, Martodhardjo S, Rao VM (1999) Aqueous solubility and dissolution rate does not adequately predict in vivo performance: a probe utilizing some N-acyloxymethyl phenytoin prodrugs. J Pharm Sci 88: 775-779
5 Van de Waterbeemd H, Carter RE, Grassy G et al (1997) Glossary of terms used in computational drug design (IUPAC Recommendations 1997). Pure & Appl Chem 68: 1137-1152
6 Stewart BH, Chan OH (1998) Use of immobilized artificial membrane chromatography for drug transport applications. J Pharm Sci 87: 1471-1478
7 Van de Waterbeemd H (2000) Intestinal Permeability: Prediction from Theory. In: JB Dressman, H Lennernäs (eds): Oral Drug Absorption: Prediction and Assessment. Marcel Dekker, New York, 31-49
8 Ottiger C, Wunderli-Allenspach H (1997) Partition behaviour of acids and bases in a phosphatidylcholine liposome-buffer equilibrium dialysis system. Eur J Pharm Sci 5: 223-231
9 Ottiger C, Wunderli-Allenspach H (1999) Immobilized artificial membrane (IAM)-HPLC for partition studies of neutral and ionized acids and bases in comparison with the liposomal partition system. Pharm Res 16: 643-650
10 Austin RP, Barton P, Davis AM et al (1998) The effect of ionic strength on liposome-buffer and 1-octanol-buffer distribution coefficients. J Pharm Sci 87: 599-607
11 Austin RP, Davis AM, Manners CN (1995) Partitioning of ionizing molecules between aqueous buffers and phospholipid vesicles. J Pharm Sci 84: 1180-1183
12 Barton P, Davis AM, McCarthy DJ et al (1997) Drug-phospholipid interactions. 2. Predicting the sites of drug distribution using n-octanol/water and membrane/water distribution coefficients. J Pharm Sci 86: 1034-1039
13 Fruttero R, Caron G, Fornatto E et al (1998) Mechanisms of liposomes/water partitioning of (p-methylbenzyl)alkylamines. Pharm Res 15: 1407-1413
14 Pauletti GM, Wunderli-Allenspach H (1994) Partition coefficients in vitro: artificial membranes as a standardized distribution model. Eur J Pharm Sci 1: 273-282
15 Balon K, Riebesehl BU, Muller BW (1999) Determination of liposome partitioning of ionizable drugs by titration. J Pharm Sci 88: 802-806
16 Artursson P, Palm K, Luthman K (1996) Caco-2 monolayers in experimental and theoretical predictions of drug transport. Adv Drug Del Rev 22: 67-84
17 Irvine JD, Takahashi L, Lockhart K et al (1999) MDCK (Madin-Darby canine kidney) cells: a tool for membrane permeability screening. J Pharm Sci 88: 28-33
18 Gres MC, Julian B, Bourrie M et al (1998) Correlation between oral drug absorption in humans, and apparent drug permeability in TC-7 cells, a human epithelial intestinal cell line: comparison with the parental Caco-2 cell line. Pharm Res 15: 726-733
19 Raeissi SD, Hidalgo IJ, Segura-Aguilar J et al (1999) Interplay between CYP3A-mediated metabolism and polarized efflux of terfenadine and its metabolites in intestinal epithelial Caco-2 (TC7) cell monolayers. Pharm Res 16: 625-632
20 Johnson MD, Anderson BD (1999) In vitro models of the blood-brain barrier to polar permeants: Comparison of transmonolayer flux measurements and cell uptake kinetics using cultured cerebral capillary endothelial cells. J Pharm Sci 88: 620-625
21 Borchardt RT, Smith PL, Wilson G (eds) (1996) Models for assessing drug absorption and metabolism. Plenum Press, New York
22 Abraham MH, Le J (1999) The correlation and prediction of the solubility of compounds in water using an amended solvation energy relationship. J Pharm Sci 88: 868-880
23 Huuskonen J, Salo M, Taskinen J (1998) Aqueous solubility prediction of drugs based on molecular topology and neural network modeling. J Chem Inf Comput Sci 38: 450-456
24 Leo AJ (1993) Calculating log Poct from structure. Chem Rev 93: 1281-1306
25 Buchwald P, Bodor N (1998) Octanol-water partitioning: Searching for predictive models. Curr Med Chem 5: 353-380
26 Mannhold R, Van de Waterbeemd H (2001) Substructure and whole molecule approaches for calculating log P. J Comput Aided Mol Des 15: 337-354
27 Palm K, Luthman K, Ungell AL et al (1998) Evaluation of dynamic polar molecular surface area as predictor of drug absorption: Comparison with other computational and experimental predictors. J Med Chem 41: 5382-5392
28 Clark DE (1999) Rapid calculation of polar molecular surface area and its application to the prediction of transport phenomena. 1. Prediction of intestinal absorption. J Pharm Sci 88: 807-814
29 Van de Waterbeemd H, Kansy M (1992) Hydrogen bonding capacity and brain penetration. Chimia 46: 299-303
30 Clark DE (1999) Rapid calculation of polar molecular surface area and its application to the prediction of transport phenomena. 2. Prediction of blood-brain barrier penetration. J Pharm Sci 88: 815-821
31 Raevsky OA, Schaper KJ (1998) Quantitative estimation of hydrogen bond contribution to permeability and absorption processes of some chemicals and drugs. Eur J Med Chem 33: 799-807
32 Norinder U, Oesterberg T, Artursson P (1999) Theoretical calculation and prediction of intestinal absorption of drugs in humans using MolSurf parametrization and PLS statistics. Eur J Pharm Sci 8: 49-56
33 Winiwarter S, Bonham NM, Ax F et al (1998) Correlation of human jejunal permeability (in vivo) of drugs with experimentally and theoretically derived parameters. A multivariate data analysis approach. J Med Chem 41: 4939-4949
34 Camenisch G, Alsenz J, Van de Waterbeemd H et al (1998) Estimation of permeability by passive diffusion through Caco-2 cell monolayers using the drugs' lipophilicity and molecular weight. Eur J Pharm Sci 6: 313-319
35 Norinder U, Sjoeberg P, Oesterberg T (1998) Theoretical calculation and prediction of brain-blood partitioning of organic solutes using MolSurf parametrization and PLS statistics. J Pharm Sci 87: 952-959
36 Van de Waterbeemd H (1996) Design of bioactive compounds. In: H Van de Waterbeemd (ed): Structure-Property Correlations in Drug Research. Academic Press & R.G. Landes Comp., Austin, 1-9
37 Walters WP, Stahl MT, Murcko MA (1998) Virtual screening - an overview. Drug Disc Today 3: 160-178
38 Agrafiotis DK, Myslik JC, Salemme FR (1999) Advances in diversity profiling and combinatorial series design. Mol Diversity 4: 1-22
39 Finn PW (1996) Computer-based screening of compound databases for the identification of novel leads. Drug Disc Today 1: 363-370
40 Keighley WW, Nabney IT, Van de Waterbeemd H et al (2003) Data-handling for high throughput screening. In: E Murray (ed): The Principle and Practice of High Throughput Screening. Blackwell Science, London
41 Smith DA, Jones BC, Walker DK (1996) Design of drugs involving the concepts and theories of drug metabolism and pharmacokinetics. Med Res Revs 16: 243-266
42 Van de Waterbeemd H, Smith DA, Jones BC (2001) Lipophilicity in PK design: methyl, ethyl, futile. J Comput Aided Mol Des 15: 273-286
43 Kempf DJ (1994) Progress in the discovery of orally bioavailable inhibitors of HIV protease. Perspect Drug Disc Des 2: 427-436
44 Vacca JP, Dorsey BD, Schleif WA et al (1994) L-735,524: an orally bioavailable human immunodeficiency virus type 1 protease inhibitor. Proc Natl Acad Sci 91: 4096-4100
45 Lehr P, Billich A, Charpiot B et al (1996) Inhibitors of human immunodeficiency virus type 1 protease containing 2-aminobenzyl-substituted 4-amino-3-hydroxy-5-phenylpentanoic acid: synthesis, activity, and oral bioavailability. J Med Chem 39: 2060-2067
46 N'Goka V, Schlewer G, Linget JM et al (1991) GABA-uptake inhibitors: Construction of a general pharmacophore model and successful prediction of a new representative. J Med Chem 34: 2547-2557
47 Von Geldern TW, Hoffman DJ, Kester JA et al (1996) Azole endothelin antagonists. 3. Using Δlog P as a tool to improve absorption. J Med Chem 39: 982-991
48 Wu C, Chan MF, Stavros F et al (1997) Discovery of TBC 11251, a potent, long acting, orally active endothelin receptor-A selective antagonist. J Med Chem 40: 1690-1697
49 Van de Waterbeemd H, Camenisch G, Folkers G et al (1998) Estimation of blood-brain barrier crossing of drugs using molecular size and shape, and H-bonding descriptors. J Drug Target 6: 151-165
50 Ter Laak AM, Tsai RS, Donne-Op den Kelder GM et al (1994) Lipophilicity and hydrogen-bonding capacity of H1-antihistaminic agents in relation to their central sedative side-effects. Eur J Pharm Sci 2: 373-384
51 Testa B, Pagliara A, Carrupt PA (1997) The molecular behaviour of cetirizine. Clin Exp Allerg 27: S13-S18
52 Glen RC, Martin GR, Hill AP et al (1995) Computer-aided design and synthesis of 5-substituted tryptamines and their pharmacology at the 5-HT1D receptor: discovery of compounds with potential anti-migraine properties. J Med Chem 38: 3566-3580
53 Boyd SA, Fung AKL, Baker WR et al (1994) Nonpeptide renin inhibitors with good intraduodenal bioavailability and efficacy in dog. J Med Chem 37: 2991-3007
54 Hamilton HW, Steinbaugh BA, Stewart BH et al (1995) Evaluation of physicochemical parameters important to the oral bioavailability of peptide-like compounds: implications for the synthesis of renin inhibitors. J Med Chem 38: 1446-1455
55 Chan OH, Stewart BH (1996) Physicochemical and drug-delivery considerations for oral drug bioavailability. Drug Disc Today 1: 461-473
56 Camenisch G, Folkers G, Van de Waterbeemd H (1996) Review of theoretical passive drug absorption models: Historical background, recent developments and limitations. Pharm Acta Helv 71: 309-327
57 Camenisch G, Folkers G, Van de Waterbeemd H (1998) Shapes of membrane permeability-lipophilicity curves: Extension of theoretical models with an aqueous pore pathway. Eur J Pharm Sci 6: 321-329
58 Van de Waterbeemd H (1997) Application of physicochemical methods to oral drug absorption estimation. Eur J Pharm Sci 5 Suppl 2: S26-S27
59 Hiessböck R, Wolf C, Richter E et al (1999) Synthesis and in vitro multidrug resistance modulating activity of a series of dihydrobenzopyrans and tetrahydroquinolines. J Med Chem 42: 1921-1926
60 Tmej C, Chiba P, Huber M et al (1998) A combined Hansch/Free-Wilson approach as predictive tool in QSAR studies on propafenone-type modulators of multidrug resistance. Arch Pharm Pharm Med Chem 331: 233-240
13 Computer-aided prediction of drug toxicity and metabolism

M.T.D. Cronin

Liverpool John Moores University, School of Pharmacy and Chemistry, Byrom Street, Liverpool, L3 3AF, UK
13.1 Introduction
13.2 Toxicity tests
13.2.1 Acute toxicity
13.2.2 Repeated dose toxicity assays
13.2.3 Testing for carcinogenicity
13.2.4 Tests for reproductive toxicity and teratogenicity
13.2.5 Alternatives to toxicity testing
13.3 Computer-aided prediction of toxicity and metabolism
13.3.1 Quantitative structure-activity relationships
13.3.2 Expert systems for toxicity prediction
13.3.3 Utility of QSARs and expert systems
13.3.4 Problems and drawbacks of using computer-aided toxicity prediction methods
13.4 Conclusions and perspectives
13.5 References
13.1 Introduction
All medicines must be safe for human use. This simple statement belies the complexity of the regulations, both national and international, as well as the time and financial resources required to ensure drug safety. To establish drug safety, "hazard" is ascertained first, following which risk, or conversely drug safety, may be confirmed. Toxicity tests, normally using surrogate animal species, are required to ascertain drug hazard, or otherwise. If a drug is determined to be too hazardous, i.e., too toxic for its intended use, then it will not be developed further. Considering that the drug development process may take up to 8-10 years and cost hundreds of millions of dollars (see Chapter 1), from discovery to registration, it is essential that drug toxicity be identified as early as possible. The aim of this chapter is to review
briefly the more common toxicological assays, and then to assess the progress
made by computer-aided toxicity and metabolism prediction to identify toxic drug
entities.
13.2 Toxicity tests

13.2.1 Acute toxicity
Acute toxicity tests are normally performed in rats and mice, and occasionally in the rabbit and guinea pig. The purpose of the assay is ultimately to determine the LD50, i.e., the dose of drug that will be lethal to 50% of a population. The time period of the test is short (normally 24 h and occasionally up to 48 h), with observation following dosing of up to 14 days. Dosing of the animals may be via a number of routes including intravenous, intraperitoneal, dermal, oral and inhalation. Organization for Economic Co-operation and Development (OECD) guidelines provide experimental details for oral (OECD Guideline 401) and dermal dosing (OECD Guideline 402) as well as the inhalation route (OECD Guideline 403; details of the OECD guidelines can be obtained from their internet site listed in Table 1). Whilst the acute toxicity test is relatively simple to perform, it can provide the experienced toxicologist with a wealth of information far beyond the basic LD50, including information on observational and physiological effects. The LD50 test is now not acceptable in many countries and has been replaced by the fixed dose, up-and-down, and acute toxic class procedures.
timeframe. Short-term repeated dose studies (OECD Guideline 407) last between
14 and 28 days. Dosing is graded in 3-4 concentrations, with the highest dose designed to cause some toxicity, but not lethality. Normally between 5 and 10 rats of
each sex (though mice and dogs are also utilized) are tested per group. At the com-
pletion of the test a whole host of clinical and histological evaluations are record-
ed, including experimental observations and whole body and individual organ
analysis. Such information will clearly enhance that gathered from acute toxicity
studies. Other subchronic toxicity studies are maintained for up to 90 days (OECD
Guideline 408). Again animals are exposed to the drug continuously and poten-
tially via a number of different routes. This provides much information regarding
organ toxicity, but the advantages of the 90 day test over the 28 day test remain
debatable.
Table 1. Internet addresses of software vendors and other useful sites. (Please note whilst the author has
taken every measure to ensure the accuracy of the information and addresses, he is not responsible for
changes of addresses, or the information that may be placed on these sites)
• Organization for Economic Co-operation and Development (OECD). This site provides information on (and links to) testing guidelines as well as an on-line bookshop: http://www.oecd.org/ehs/
• The Registry of Toxic Effects of Chemical Substances (RTECS). This site provides background information on the RTECS database: http://www.cdc.gov/niosh/rtecs.html
• MicroQSAR. This site provides brief information and purchase details: http://www.tds-tds.com/ind_num2.htm#MICROQSAR
• ECOSAR. This site provides basic background information and the opportunity to download the software: http://www.epa.gov/opptintr/newchms/21ecosar.htm
• TOPKAT (and other Accelrys products). This site provides basic background information and samples of its output: http://www.accelrys.com/products/topkat/
• CASE, MultiCASE, META etc. This site provides basic background information and examples of their use: http://www.multicase.com
• LHASA (DEREK, StAR, METEOR). This site provides basic background information, examples of their use and a description of the collaborative agreement: http://www.chem.leeds.ac.uk/luk/
• HazardExpert and MetabolExpert. This site gives basic descriptions of these programs and all others marketed by CompuDrug: http://www.compudrug.com/
• OncoLogic. This site provides background information: http://www.logichem.com/
• SciVision. This site provides background information on TOXsys, ONCOis, and all their other products: http://www.scivision.com/
attempts to invoke negligible target organ toxicity and will not decrease the life-
span of the animals (other than by the production of cancers). Large numbers of
animals are used, commonly at least 50 animals per sex at each dose. During and after the testing, animals are subjected to a plethora of clinical, observational and histological assessments; for the last of these, at least 30 tissues may be examined. An
assessment of the capability of a drug to act specifically as a genotoxic (as opposed
to non-genotoxic) carcinogen is also made. Such assessment is normally made in
vitro. In 1997, the International Conference on Harmonization (ICH) of Technical Requirements for Registration of Pharmaceuticals for Human Use recommended the use of a minimum battery of in vitro genotoxicity assays [5]. The battery approach
attempts to ensure that all potential mechanisms of genotoxicity can be observed.
The ICH recommendations include an in vitro assay for gene mutation and the use
of both an in vitro and in vivo assay for chromosomal aberration.
The gene mutation assay is performed most commonly with Salmonella
typhimurium (the so-called Ames test), and to a lesser extent with Escherichia
coli. The Ames test is normally performed according to OECD Guideline 471. It
is reliant upon a mutant strain of S. typhimurium that requires histidine for growth.
The test is performed on agar that is histidine deficient (and thus will not support
growth of the mutant bacterium). Exposure to a genotoxic chemical may cause
mutation in the bacterium causing a reversion to the wild type, which is then
observed to grow on the histidine deficient agar. A number of modifications can
be made to the Ames test using different strains of bacteria, and the inclusion of
metabolic enzymes.
In vitro systems for chromosomal damage include the mammalian cytogenetic test, performed to OECD Guideline 473. In this assay cell cultures from man, or
from standard cell lines, are exposed to the drug compound. Human lymphocytes
are commonly used in this assay. After the exposure period, the metaphase cells
are examined microscopically for evidence of chromosome aberration. In vivo
tests for chromosomal damage include the micronucleus test (performed to OECD Guideline 474). This assay assesses the damage in immature erythrocytes of rat
bone marrow cells after exposure to the drug.
at each dose level. The drug is administered during the development of the major
organs in the foetus (e.g., 6-15 days after conception). The animals are dosed via
their drinking water or by gavage. The test is terminated on the day prior to normal delivery, and foetal development, in terms of both the number of live foetuses and any malformations present, is assessed.
For the more accurate assessment of effects of a drug on species fecundity and
between generations, a multigeneration toxicity test is required. A good example
is the two-generation reproduction toxicity test (OECD Guideline 416), although
these tests may also be maintained for three or more generations. Such assays are
performed usually with rats at three dose levels. The highest dose level is typically one-tenth of the LD50; the lowest should cause no sub-chronic toxicity. Males
and females are treated with the drug for 60 days, after which they are allowed to
mate. Subsequent generations are assessed for a wide variety of endpoints includ-
ing: number of live births; abnormalities at birth; gender and weight at birth; his-
tological examinations, etc.
It can be concluded from this brief review of toxicological methods that the exper-
imental assessment of drug toxicity is a time-consuming, expensive, yet essential,
part of the pharmaceutical research and development process. The consequence of
a drug being found to be toxic at a late stage of development could be immense-
ly costly to a company. With this in mind, methods are constantly being sought to
determine drug toxicity as early and as cheaply as possible. Much effort has been put into the development of in vitro assays, cell culture techniques and, most recently, DNA arrays as replacements for toxicity testing (the latter are likely to see an enormous growth in use in the future; see Rockett and Dix [6] for more information). In addition, a number of computer-aided toxicity prediction methods are
available. These are based on the fundamental premise that the toxicological activ-
ity of a drug will be a function of the physico-chemical and/or structural proper-
ties of the substance. Once such a relationship has been established, further chem-
icals with similar properties can be predicted to be toxic. The remainder of this
chapter will review computer-aided methods to predict toxicity and metabolism.
molecules, or account for effects such as ionization or steric hindrance of reactive cen-
ters. It should be noted that such a tiered approach still requires more effort in the
measurement of toxicological activity and modeling.
The second caveat relates to the nature of the biological activity itself. It should
be no surprise that the data have been obtained from an in vitro toxicological
assay. Such data are quickly and cheaply determined (and can be obtained in one
laboratory, as in this case). Undoubtedly for drug toxicity a model based on mammalian LD50 data would be preferable. However, very few such QSARs are avail-
able. The reason for this is believed to be that there are not sufficient "quality" tox-
icity data on which to develop these QSARs. Whilst a large number of toxicity
data may be available on databases such as the Registry of Toxic Effects of
Chemical Substances (RTECS; see internet site listed in Table 1), there is no con-
sistency in the data, and many attempts to model such data are clearly hindered by
the excessive error and inter-laboratory variation that may exist. Johnson and Jurs
[16] report a predictive model for the acute oral mammalian toxicity for 115 ani-
lines using data retrieved from the RTECS database. The model utilizes a neural
network based on five physico-chemical descriptors. It is reported to predict the
toxicity of an external validation set well. Further unpublished analysis of the tox-
icity data suggests that there may, however, be considerable variation in the data,
following a comparison of the toxicity data for positional isomers in the data set.
Other workers have investigated the acute toxicity data from RTECS to rats and
mice (based on an arbitrary categoric scale) to develop a model based on a deci-
sion tree approach for aliphatic alcohols [17].
QSARs have been developed for a whole range of other toxicity endpoints,
especially those that provide a quantitative determination of toxicity. An example
is the modeling of mutagenicity data from the Ames test. Mutagenicity is a toxic-
ity reliant on the capability of a molecule to react covalently with DNA. Often,
such reactivity can be described and modeled by the use of calculated molecular
orbital properties. Hatch and Colvin [18], for instance, demonstrated the use of
whole molecule parameters, as well as those based on the nitrenium ion, to model
the mutagenic potency of aromatic and heterocyclic amines.
Furthermore, Hansch et al. [19] have reviewed the use of QSARs to predict a
number of teratological endpoints. A number of QSARs are presented, some of
which have been re-evaluated in a mechanistic light. For instance, the embryo toxicity of 4-substituted phenols (ET) was found to be well correlated to the Hammett constant for the substituent (σ):
parameters, the best being found with a measure of the shape similarity of the penicillin molecules to benzylpenicillin (Sim.BP). The following relationship suggests that a receptor binding phenomenon is important for this endpoint:
QSARs can also be used to model the absorption, distribution, metabolism and
excretion (ADME) properties of drugs. Cronin et al. [21] demonstrated that the
permeability coefficients (Kp) of 107 heterogeneous compounds through the skin
in vitro could be well modeled by hydrophobicity and molecular size (as parameterized by molecular weight (MW)). It should be noted that in this approach a passive
diffusion process is assumed; active and facilitated transport are little addressed
by QSAR modeling. Thus the combination of the following (and similar for other
membranes) equation with the Lipinski "rule of five" [22] (see Chapters 10 and
12) may provide a powerful tool to determine drug uptake:
QSAR models for metabolism are also possible and may take a number of forms.
Hansch and Zhang [23] have reviewed the capability of substrates and inducers to
bind to the cytochrome P-450 enzymes and have developed mechanistically based
QSARs for these data. Typically the data sets analyzed are small (fewer than 20 compounds), which illustrates the paucity of data for modeling in this area. One of the larger data sets analyzed is for the induction by miscellaneous compounds of CYP1A2 (I), which reveals the following significant QSAR with hydrophobicity:
log t1/2 = 0.17 Ω − 10.1 QC + 0.11 log P − 3.81

where Ω is the inaccessible solid angle around the carbonyl sp2 oxygen (a measure of steric hindrance) and QC is the calculated charge on the carbonyl carbon.
This result exemplifies the problems associated with using compilations of human
data, namely that a number of half-life values did not fit the model. Fortunately
the authors took a pragmatic approach and developed a model on parameters that
agree with the mechanism of hydrolysis by carboxylesterases.
A further approach to the prediction of metabolism is the qualitative identifica-
tion of molecular features that dictate whether or not a molecule may enter a par-
ticular pathway. A good example was the use of non-linear mapping to determine
the properties of benzoic acids and congeners which were necessary for glucuronic acid or glycine conjugation [25].
13.3.2 Expert systems for toxicity prediction

For the terms of reference in this chapter "expert system" is taken to mean a com-
puter-assisted approach to predict toxicity. Again the reader is referred elsewhere
for further details on expert systems [9-11, 26, 27].
capture the knowledge of an expert for utilization by a non-expert. The power and
utility of these systems are reliant upon two items: firstly, adequate software is
required to comprehend and interpret chemical structures; and secondly knowl-
edge is needed to form the rule-base of the expert systems. The former is well
developed and a number of software packages are commercially available as
detailed herein; the latter, for many toxicological endpoints, is still at a rudimen-
tary level.
The software packages developed by LHASA Ltd (see internet site listed in
Tab. 1) provide a good illustration of the systems available to predict toxicologi-
cal and metabolic endpoints. LHASA Ltd. itself is a unique company amongst the
expert system providers. LHASA Ltd. has charitable status and is the coordinator
for a collaborative group of "customers" who purchase its products (in particular
DEREK (Deductive Estimation of Risk from Existing Knowledge) for
Windows™). The collaborative group includes members from the pharmaceutical,
agrochemical, and personal product industries, as well as regulatory agencies,
from Europe and North America. At the time of writing there are over 20 mem-
bers in the collaborative group. Members of the group are given the opportunity
to contribute their own knowledge to the development of new rules [33].
Probably the most developed product from LHASA is the DEREK for
Windows™ software, which provides qualitative predictions of toxicity from its
rulebase. As with all such systems, the concept is simple: namely, that if a particular molecular fragment is known to cause toxicity in one compound, then, when it is found in another compound, the same toxicity is expected to be observed. The system is driven by
the LHASA software originally written in the CHMTRN language for the predic-
tion of chemical synthesis and reactions. Examples of the rules (i.e., toxicity asso-
ciated with a particular molecular fragment) contained within the knowledge-base
are listed in Table 2. The knowledge base contains rules for a number of endpoints
including: skin sensitization, irritancy, mutagenicity, carcinogenicity and many
others. It should be noted that there appears to be considerable variety in the num-
ber and quality of rules for different endpoints; skin sensitization for instance has
a well developed set of rules, whereas other endpoints such as anticholinesterase
activity are less developed, and may comprise merely a handful of rules. The rules
themselves again provide the power and accuracy of the predictions. The DEREK
rulebase has not recently been published and thus is not open for inspection.
Examples of the rules are provided, however, by Sanderson and Earnshaw [34]
who list 49 rules for a wide variety of toxicities and Barratt et al. [35] who list 40
rules specifically for skin sensitization. Other toxicological prediction software
developed by LHASA includes StAR (Standardised Argument Report), which is
designed to provide the user with more information following a "Logic of
Argumentation" process.
The METEOR program from LHASA is a rule-based system for the prediction
of metabolism [33]. It consists of a knowledge base of biotransformations that
describe the metabolic reactions catalyzed by specific enzymes, or that are relat-
ed to specific substrates. Again, the knowledge is utilized in a form such that if a
particular biotransformation is associated with a molecular fragment, and that
molecular fragment is found in another compound, then that biotransformation
will be assumed to occur. There is currently little open information regarding the
number or quality of the rules in the METEOR knowledge base.
HazardExpert has been developed and marketed by CompuDrug Chemistry
Ltd. (see Tab. 1 for internet site). At its heart is a rule-based expert system that pre-
dicts toxicity. Predictions may be made for a variety of toxicity endpoints includ-
ing mutagenicity, carcinogenicity, irritancy and several others. The knowledge has
been developed from a number of sources, including the open scientific literature
and the reports of the U.S. E.P.A. The software is able to identify molecular frag-
ments within chemical structures. Additionally it takes account of other factors
such as species and exposure duration, dose and route. Furthermore, it automati-
cally calculates molecular weight, pKa, log P and log D (the distribution coeffi-
cient). Combining physico-chemical information with the rule-base allows it to
make predictions for risk, taking account of the properties and type of exposure.
HazardExpert is linked to a second CompuDrug product MetabolExpert. The
latter program attempts to make qualitative predictions of metabolites of com-
pounds. These predictions are again founded upon a molecular fragment-led rule
base approach. Linking the two packages also allows the user to perform predictive toxicological risk assessment on the potential metabolites of a drug substance, as well as on the parent molecule.
Another rule-based expert system worthy of note is OncoLogic. This system
provides predictions of the carcinogenicity of chemicals, and is marketed by
LogiChem Inc. (see Tab. 1 for the internet address). The development of this sys-
tem is again unique, in that it has been developed by the U.S. E.P.A. Office of
Pollution Prevention and Toxics, i.e., a regulatory agency [36]. It is thus provid-
ed, on a commercial basis, to allow chemical producers and industries access to
the knowledge on which the U.S. E.P.A. will make their decisions. As such, the
OncoLogic software contains knowledge based on data for more than 10,000
chemicals, which is formed into a matrix of over 40,000 discrete rules. The clear
advantage of this system is its depth of knowledge, and that it allows industrial
users to follow and observe the predictions that will be made by regulatory agen-
cies.
A further approach to the prediction of toxicity and metabolism is provided by
the Computer-Optimised Molecular Parametric Analysis of Chemical Toxicity
(COMPACT) technique of Lewis and co-workers [37]. Rather than being a
methodology, COMPACT may be thought of as a philosophy. The aim of the phi-
losophy is to be able to predict, from structural considerations, whether a mole-
cule has the potential to act as a substrate for one or more of the cytochromes P450
(CYPs), or the ability to promote peroxisome proliferation. Whilst oxidative
metabolism by CYPs normally results in detoxification, metabolism by, for instance, CYP1 may result in the formation of epoxides.
The COMPACT approach has been built up over a number of years, and is
based upon the premise that there are certain, basic, structural requirements of a
molecule that make it susceptible to oxidative metabolism. Firstly molecules
capable of binding to CYPs are considered to be planar in nature. There is no
direct physico-chemical assessment of planarity, although it may be estimated fol-
lowing molecular modeling and 3-D geometry optimization. Thus, molecular pla-
narity is considered to be a function of molecular cross-sectional area (a) and
molecular depth (d). It may be defined thus:
COMPACT radius = √[(a/d² − 9.5)² + (ΔE − 7.8)²]

where ΔE is the difference between the frontier orbital energies, a measure of electronic activation energy.
The COMPACT radius thus provides a "rule" which may be used to predict toxi-
city following metabolism, the rule being based upon easily-calculable physico-
chemical parameters. It is possible that extensions of this approach will provide
predictive methodologies for the substrates of other enzymes. The prediction rates
from the COMPACT system were found to be considerably improved when com-
bined with the HazardExpert system [39], demonstrating the utility of the battery
approach to prediction.
The COMPACT philosophy has been extended to include the molecular (protein) modeling of the enzymes themselves. This has successfully utilized homol-
ogy modeling following the determination of the sequences, and isolation of the
enzymes, for rat and man [40].
A further method to enable prediction of potential metabolites is the META
system, which is part of the suite of programs developed by Klopman et al. [41,
42]. The software utilized is similar to that of the other CASE programs (see
Tab. 1 for internet site). Rather than developing the rule-base from a statistical
approach, rules have been taken from standard biochemical sources. Examples of
such rules are provided in Table 3.
Other expert systems to predict toxicity include TOXsys and ONCOis from
SciVision (see Tab. 1 for internet details).
Table 3. Examples of rules from the META metabolism knowledge base (information extracted from
Talafous et al. [42])
From → To (the example rule structures did not survive extraction)
* X is any halogen
the user can search the database on which the prediction was made, to gain an
idea, if only subjectively, of how the prediction was made. Similarly the DEREK
software, and other rule-based systems, contain a large amount of material which
the user can search to obtain information regarding the prediction. All such activ-
ities will lengthen considerably the time taken to make a prediction. As such they
may compromise the utility of these systems to make adequate predictions for
large combinational libraries. Another practical aspect to predictive toxicology is
the ultimate compatibility of the hardware and software platforms. This is little
addressed, especially in the area of combinatorial chemistry library design (see Chapter 7). Approaches such as the integration of TOPKAT and Q-TOX into the
TSAR molecular spreadsheet are, however, encouraging, though it should be obvious that only strategic alliances between companies, such as combinatorial chemistry and expert system providers, will drive this area of science forward.
13.3.4 Problems and drawbacks of using computer-aided toxicity prediction methods

The drawback with using QSARs and expert systems to predict toxicity is simple
to define, yet much more complex to understand and to fix. The drawback is sim-
ple: these techniques cannot make adequate predictions for all compounds and for
all endpoints! There are many reasons for this:
• Many models are poorly developed in many chemical areas. This is due to there
being a paucity of available toxicity data either to build the models, or for their
validation. Not only are more data required for modeling, but those data need
to be of a high standard (e.g., measured by the same protocol in one laborato-
ry) to provide reliable predictive models. An example of the gaps that are pres-
ent in the training sets of the systems is that it is only OncoLogic which is able
to make predictions for constituents such as fibers, polymers and inorganic-
containing compounds. The problem of gaps in the data will be exacerbated by
the novel chemistries that are being created by combinatorial chemistry. It is
unlikely that a molecular fragment rule-based approach will be able to predict
reliably the toxicity of completely novel compounds.
• A considerable amount of expertise is required to interpret and validate the
results. The basic premise of an expert system is that it presents the knowledge
of an expert for use by a non-expert. It is the contention of the author that users
of these systems should not, however, be non-experts. Users require an ade-
quate level of toxicological training and expertise.
• Rule-based expert systems for predicting toxicity may be over-predictive. An
example of this was the prediction of skin sensitization by the DEREK soft-
ware, which predicted a number of non-sensitizers to be sensitizers as these
compounds contained a structural alert. Reasons for the over-prediction include
lack of knowledge concerning the effect of modulating factors on particular
functional groups, and lack of permeability assessment [44]. Whilst the latter
13 Computer-aided prediction of drug toxicity and metabolism 275
point may be, at least partially, addressed in the StAR and HazardExpert sys-
tems, more work is required to predict membrane permeability. Similarly, rule-
based systems for the prediction of metabolism will simply provide an idea of
the metabolites formed, but not a quantitative estimation of the quantity of each
that may be formed.
• There is clearly an issue with the role of mechanisms of action in making tox-
icological prediction. Systems such as DEREK, HazardExpert and OncoLogic
have rule bases developed specifically from a mechanistic viewpoint. Other
systems such as TOPKAT and CASE are less, if at all, mechanistically based.
The problems of this lack of mechanistic basis to the prediction have never,
however, been adequately addressed or quantified.
• The commercial environments in which the systems are placed, effectively as competitors, do little or nothing to assist in the recognition of the strengths or weaknesses of each of the systems. It is likely, for instance, that an integrated
battery approach to the use of these systems (see next section) will reap the
most rewards.
• In many systems there is only limited ability to include proprietary data into the
rule-base or predictive system. Some vendors, such as those of CASE and TOPKAT, will model proprietary data. In other systems, there are opportunities to
influence the rule-base either by contributing openly to the rule base, or by the
development of proprietary rule bases. Generally, though, these systems require
mechanisms to allow users more freely to expand and contribute to the rule-
bases.
13.4 Conclusions and perspectives

As described in this chapter, there are a variety of methods and techniques avail-
able with which to make predictions of toxicity and metabolism. Before their use
a decision is required as to where they will be useful. There are a number of poten-
tial uses in the drug development process, including at the library design stage
(see also Chapter 7), or in the optimization of a lead compound. As one progress-
es through the development of a drug the time and expense that may be expended
on computer-aided prediction will increase. Concomitantly so will the expected
accuracy of the model, and thus its likely complexity. A generalized screen is pre-
sented in Figure 1. In this, libraries could be screened to remove the most toxic
compounds using basic structure-activity rules and high-throughput screening
techniques (see Chapter 4). At this stage it is probable that only "gross structure-
based rules" would be of any use, for instance those that will identify toxic com-
pounds with a great degree of certainty (e.g., presence of an aliphatic epoxide).
This would allow a large number of compounds to proceed to screening, which
will, of course, still require toxicological assessment. Likely hits could be
screened in more detail to rank them according to 'desirability'. The top hits could
then be more fully assessed with a fuller battery of predictive methods, including
molecular modeling and in vitro tests. Clearly the key to this whole approach is
276 M.T.D. Cronin
[Figure 1 flowchart: design of chemical libraries → gross structure-based toxicity rules and high-throughput tox screens (e.g., Ames, cytotoxicity) → identify hits → screen hits with more refined structure-activity rules, QSARs, PK properties, formulations]
Figure 1. Scheme for the integration of computer-aided predictions into the drug discovery process.
one of integration, not only of the predictive techniques, but also of predictive and
in vitro modeling.
13.5 References
1 Gorrod JW (ed) (1981) Testing for toxicity. Taylor and Francis, London
2 Griffin JP (1998) The evolution of human medicines control from a national to an international perspective. Adverse Drug React Toxicol Res 17: 19-50
3 Hodgson E (1997) Toxicity testing and risk assessment. In: E Hodgson, PE Levi (eds): A textbook of modern toxicology. 2nd Edition. Appleton and Lange, Stamford, Connecticut, 285-338
4 Kroes R (1995) Toxicity testing and human health. In: CJ van Leeuwen, JLM Hermens (eds): Risk assessment of chemicals: an introduction. Kluwer, Dordrecht, The Netherlands, 147-174
5 van Deun K, van Cauteren H, Vandenberghe J et al (1997) Review of alternative methods of carcinogenicity testing and evaluation of human pharmaceuticals. Adverse Drug React Toxicol Res 16: 215-233
6 Rockett JC, Dix DJ (1999) Application of DNA arrays to toxicology. Environ Health Persp 107: 681-685
7 Karcher W, Devillers J (eds) (1990) Practical applications of quantitative structure-activity relationships (QSAR) in environmental chemistry and toxicology. Kluwer, Dordrecht
8 Martin YC (1978) Quantitative drug design. Marcel Dekker, New York
9 Benfenati E, Gini G (1997) Computational predictive programs (expert systems) in toxicology. Toxicol 119: 213-225
10 Benigni R, Richard AM (1998) Quantitative structure-based modeling applied to characterization and prediction of chemical toxicity. Meth Enzymol 14: 264-276
11 Cronin MTD (1998) Computer-aided prediction of drug toxicity in high-throughput screening. Pharm Pharmacol Commun 4: 157-163
12 Cronin MTD, Dearden JC (1995) QSAR in toxicology 1. Prediction of aquatic toxicity. Quant Struct-
38 Parke DV, Ioannides C, Lewis DFV (1990) The safety evaluation of drugs and chemicals by the use of computer optimized molecular parametric analysis of chemical toxicity (COMPACT). ATLA 18: 91-102
39 Brown SJ, Raja AA, Lewis DFV (1994) A comparison between COMPACT and HazardExpert evaluations for 80 chemicals tested by the NTP/NCI rodent bioassay. ATLA 22: 482-500
40 Lewis DFV, Lake BG, George SG et al (1999) Molecular modelling of CYP1 family enzymes CYP1A1, CYP1A2, CYP1A6 and CYP1B1 based on sequence homology with CYP102. Toxicol 139: 53-79
41 Klopman G, Dimayuga M, Talafous J (1994) META. 1. A program for the evaluation of metabolic transformation of chemicals. J Chem Inf Comput Sci 34: 1320-1325
42 Talafous J, Sayre LM, Mieyal JJ et al (1994) META. 2. A dictionary model of mammalian xenobiotic metabolism. J Chem Inf Comput Sci 34: 1326-1333
43 Benigni R (1997) The first US National Toxicology Program exercise on the prediction of rodent carcinogenicity: definitive results. Mut Res 387: 35-45
44 Barratt MD, Basketter DA (1994) Structure-activity relationships for skin sensitization: an expert system. In: A Rougier, AM Goldberg, HI Maibach (eds): Alternative methods in toxicology. Vol 10. In vitro skin toxicology. Irritation, phototoxicity, sensitisation. Mary Ann Liebert, New York, 293-301
Index
eleutherobin 91
endothelin antagonist 110, 251
enrichment factor 139
ENTREZ database 43, 49
epibatidine 91
epothilone 91
EST (expressed sequence tag) 51
EST clustering 51
Euclidean distance 133
european medicines evaluation agency (EMEA) 260
EVA (EigenVAlues) descriptor 226, 227
evolutionary distance 36
expert system for toxicity prediction 267
expert system, rule-based approach 268
expressed sequence tag (EST) 51
expression analysis 7
expression pharmacogenetics 15
extracellular protein 58
extreme value distribution 35
eye irritation 260
farnesyl protein transferase (FPT) 205
FASTA 35, 40, 163
fast-exchange limit 189
FCS (fluorescence correlation spectroscopy) 9, 80
FDA (food and drug administration) 260
feature tree descriptor 133
field fit minimization 231
field similarity fit method 228
fingerprint 227
2D fingerprint 128, 133, 141, 144
FK 506 (tacrolimus) 88
FK 506 binding protein 186
flavopiridol 91
Flexsim-X 133
FLIPR 80
fluconazole 249
fluorescein 80
fluorescence 80
fluorescence correlation spectroscopy (FCS) 9, 80
fluorescence polarization (FP) 9, 80
fluorescence resonance energy transfer (FRET) 9, 80, 165
fluorometric microvolume assay technology (FMAT) 80
fluvastatin 88
focused library 127, 134
fold assignment 163
fold recognition 55, 163
folding, de novo 56
food and drug administration (FDA) 260
four-feature pharmacophore 206
Fourier transformation (FT) 158
four-point pharmacophore concept 131
fragment based homology modeling 55
fragment linking 186
fragment optimization 186
2D fragment-based method 133
fragmentation, for peptide mass fingerprinting 23
fragment-merging 186
free induction decay (FID) 161
frequency table 36
FRET (fluorescence resonance energy transfer) 9, 80, 165
Fujita 223
fully integrated robot 78
fumagillin 91
fungi, as a source of bioactive metabolite 95
fusion phage 110
GABA uptake inhibitor 249
gain of function study 26
gap creation penalty 37
gap extension penalty 37
gap model 35
Gasteiger-Hückel method 234
Gaussian function 225, 226
gene characterization 49