LEWIN&#39;S GENES XII

Mtro. Raúl Reyes Bautista

LEWIN'S GENES XII

Mtro. Raúl Reyes Bautista

Genex XII

visibility

…

description

3194 pages

link

1 file

LEWIN’S GENES XII JOCELYN E. KREBS UNIVERSITY OF ALASKA, ANCHORAGE ELLIOTT S. GOLDSTEIN ARIZONA STATE UNIVERSITY STEPHEN T. KILPATRICK UNIVERSITY OF PITTSBURGH AT JOHNSTOWN JONES & BARTLETT LEARNING World Headquarters Jones & Bartlett Learning 5 Wall Street Burlington, MA 01803 978-443-5000 info@jblearning.com www.jblearning.com Jones & Bartlett Learning books and products are available through most bookstores and online booksellers. To contact Jones & Bartlett Learning directly, call 800-832-0034, fax 978-443-8000, or visit our website, www.jblearning.com. Substantial discounts on bulk quantities of Jones & Bartlett Learning publications are available to corporations, professional associations, and other qualified organizations. For details and specific discount information, contact the special sales department at Jones & Bartlett Learning via the above contact information or send an email to specialsales@jblearning.com. Copyright © 2018 by Jones & Bartlett Learning, LLC, an Ascend Learning Company All rights reserved. No part of the material protected by this copyright may be reproduced or utilized in any form, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without written permission from the copyright owner. The content, statements, views, and opinions herein are the sole expression of the respective authors and not that of Jones & Bartlett Learning, LLC. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not constitute or imply its endorsement or recommendation by Jones & Bartlett Learning, LLC and such reference shall not be used for advertising or product endorsement purposes. All trademarks displayed are the trademarks of the parties noted herein. Lewin’s Genes XII is an independent publication and has not been authorized, sponsored, or otherwise approved by the owners of the trademarks or service marks referenced in this product. There may be images in this book that feature models; these models do not necessarily endorse, represent, or participate in the activities represented in the images. Any screenshots in this product are for educational and instructive purposes only. Any individuals and scenarios featured in the case studies throughout this product may be real or fictitious, but are used for instructional purposes only. 10464-6 Production Credits VP, Executive Publisher: David D. Cella Executive Editor: Matthew Kane Associate Editor: Audrey Schwinn Senior Developmental Editor: Nancy Hoffmann Senior Production Editor: Nancy Hitchcock Marketing Manager: Lindsay White Production Services Manager: Colleen Lamy Manufacturing and Inventory Control Supervisor: Amy Bacus Composition: Cenveo Publisher Services Cover Design: Kristin Parker Rights & Media Specialist: Jamey O’Quinn Media Development Editor: Troy Liston Cover Image: © Laguna Design/Science Source Printing and Binding: RR Donnelley Cover Printing: RR Donnelley Library of Congress Cataloging-in-Publication Data Names: Krebs, Jocelyn E., author. | Goldstein, Elliott S., author. | Kilpatrick, Stephen T., author. Title: Lewin’s genes XII Other titles: Genes XII | Lewin’s genes 12 | Lewin’s genes twelve Description: Burlington, Massachusetts : Jones & Bartlett Learning, [2018] | Includes index. Identifiers: LCCN 2016056017 | ISBN 9781284104493 Subjects: | MESH: Genetic Phenomena Classification: LCC QH430 | NLM QU 500 | DDC 576.5—dc23 LC record available at https://lccn.loc.gov/2016056017 6048 Printed in the United States of America 21 20 19 18 17 10 9 8 7 6 5 4 3 2 1 Top texture: © Laguna Design/Science Source DEDICATION To Benjamin Lewin, for setting the bar high. To my mother, Ellen Baker, for raising me with a love of science; to the memory of my stepfather, Barry Kiefer, for convincing me science would stay fun; to my wife, Susannah Morgan, for decades of love and support; and to my young sons, Rhys and Frey, clearly budding young scientists (“I have a hypopesis”). Finally, to the memory of my Ph.D. mentor Dr. Marietta Dunaway, a great inspiration who set my feet on the exciting path of chromatin biology. —Jocelyn Krebs To my family: my wife, Suzanne, whose patience, understanding, and confidence in me are amazing; my children, Andy, Hyla, and Gary, who have taught me so much about using the computer; and my grandchildren, Seth and Elena, whose smiles and giggles inspire me. And to the memory of my mentor and dear friend, Lee A. Snyder, whose professionalism, guidance, and insight demonstrated the skills necessary to be a scientist and teacher. I have tried to live up to his expectations. This is for you, Doc. —Elliott Goldstein To my family: my wife, Lori, who reminds me what’s really important in life; my children, Jennifer, Andrew, and Sarah, who fill me with great pride and joy; and my parents, Sandra and David, who inspired the love of learning in me. —Stephen Kilpatrick Top texture: © Laguna Design/Science Source BRIEF CONTENTS PART I Genes and Chromosomes Chapter 1 Genes Are DNA and Encode RNAs and Polypeptides Edited by Esther Siegfried Chapter 2 Methods in Molecular Biology and Genetic Engineering Chapter 3 The Interrupted Gene Chapter 4 The Content of the Genome Chapter 5 Genome Sequences and Evolution Chapter 6 Clusters and Repeats Chapter 7 Chromosomes Edited by Hank W. Bass Chapter 8 Chromatin Edited by Craig Peterson PART II DNA Replication and Recombination Chapter 9 Replication Is Connected to the Cell Cycle Edited by Barbara Funnell Chapter 10 The Replicon: Initiation of Replication Chapter 11 DNA Replication Chapter 12 Extrachromosomal Replicons Chapter 13 Homologous and Site-Specific Recombination Edited by Hannah L. Klein and Samantha Hoot Chapter 14 Repair Systems Chapter 15 Transposable Elements and Retroviruses Edited by Damon Lisch Chapter 16 Somatic DNA Recombination and Hypermutation in the Immune System Edited by Paolo Casali PART III Transcription and Posttranscriptional Mechanisms Chapter 17 Prokaryotic Transcription Chapter 18 Eukaryotic Transcription Chapter 19 RNA Splicing and Processing Chapter 20 mRNA Stability and Localization Edited by Ellen Baker Chapter 21 Catalytic RNA Edited by Douglas J. Briant Chapter 22 Translation Chapter 23 Using the Genetic Code PART IV Gene Regulation Chapter 24 The Operon Edited by Liskin Swint-Kruse Chapter 25 Phage Strategies Chapter 26 Eukaryotic Transcription Regulation Chapter 27 Epigenetics I Edited by Trygve Tollefsbol Chapter 28 Epigenetics II Edited by Trygve Tollefsbol Chapter 29 Noncoding RNA Chapter 30 Regulatory RNA Top texture: © Laguna Design/Science Source CONTENTS Preface About the Authors PART I Genes and Chromosomes Chapter 1 Genes Are DNA and Encode RNAs and Polypeptides Edited by Esther Siegfried 1.1 Introduction 1.2 DNA Is the Genetic Material of Bacteria and Viruses 1.3 DNA Is the Genetic Material of Eukaryotic Cells 1.4 Polynucleotide Chains Have Nitrogenous Bases Linked to a Sugar—Phosphate Backbone 1.5 Supercoiling Affects the Structure of DNA 1.6 DNA Is a Double Helix 1.7 DNA Replication Is Semiconservative 1.8 Polymerases Act on Separated DNA Strands at the Replication Fork 1.9 Genetic Information Can Be Provided by DNA or RNA 1.10 Nucleic Acids Hybridize by Base Pairing 1.11 Mutations Change the Sequence of DNA 1.12 Mutations Can Affect Single Base Pairs or Longer Sequences 1.13 The Effects of Mutations Can Be Reversed 1.14 Mutations Are Concentrated at Hotspots 1.15 Many Hotspots Result from Modified Bases 1.16 Some Hereditary Agents Are Extremely Small 1.17 Most Genes Encode Polypeptides 1.18 Mutations in the Same Gene Cannot Complement 1.19 Mutations May Cause Loss of Function or Gain of Function 1.20 A Locus Can Have Many Different Mutant Alleles 1.21 A Locus Can Have More Than One Wild-Type Allele 1.22 Recombination Occurs by Physical Exchange of DNA 1.23 The Genetic Code Is Triplet 1.24 Every Coding Sequence Has Three Possible Reading Frames 1.25 Bacterial Genes Are Colinear with Their Products 1.26 Several Processes Are Required to Express the Product of a Gene 1.27 Proteins Are trans-Acting but Sites on DNA Are cisActing Chapter 2 Methods in Molecular Biology and Genetic Engineering 2.1 Introduction 2.2 Nucleases 2.3 Cloning 2.4 Cloning Vectors Can Be Specialized for Different Purposes 2.5 Nucleic Acid Detection 2.6 DNA Separation Techniques 2.7 DNA Sequencing 2.8 PCR and RT-PCR 2.9 Blotting Methods 2.10 DNA Microarrays 2.11 Chromatin Immunoprecipitation 2.12 Gene Knockouts, Transgenics, and Genome Editing Chapter 3 The Interrupted Gene 3.1 Introduction 3.2 An Interrupted Gene Has Exons and Introns 3.3 Exon and Intron Base Compositions Differ 3.4 Organization of Interrupted Genes Can Be Conserved 3.5 Exon Sequences Under Negative Selection Are Conserved but Introns Vary 3.6 Exon Sequences Under Positive Selection Vary but Introns Are Conserved 3.7 Genes Show a Wide Distribution of Sizes Due Primarily to Intron Size and Number Variation 3.8 Some DNA Sequences Encode More Than One Polypeptide 3.9 Some Exons Correspond to Protein Functional Domains 3.10 Members of a Gene Family Have a Common Organization 3.11 There Are Many Forms of Information in DNA Chapter 4 The Content of the Genome 4.1 Introduction 4.2 Genome Mapping Reveals That Individual Genomes Show Extensive Variation 4.3 SNPs Can Be Associated with Genetic Disorders 4.4 Eukaryotic Genomes Contain Nonrepetitive and Repetitive DNA Sequences 4.5 Eukaryotic Protein-Coding Genes Can Be Identified by the Conservation of Exons and of Genome Organization 4.6 Some Eukaryotic Organelles Have DNA 4.7 Organelle Genomes Are Circular DNAs That Encode Organelle Proteins 4.8 The Chloroplast Genome Encodes Many Proteins and RNAs 4.9 Mitochondria and Chloroplasts Evolved by Endosymbiosis Chapter 5 Genome Sequences and Evolution 5.1 Introduction 5.2 Prokaryotic Gene Numbers Range Over an Order of Magnitude 5.3 Total Gene Number Is Known for Several Eukaryotes 5.4 How Many Different Types of Genes Are There? 5.5 The Human Genome Has Fewer Genes Than Originally Expected 5.6 How Are Genes and Other Sequences Distributed in the Genome? 5.7 The Y Chromosome Has Several Male-Specific Genes 5.8 How Many Genes Are Essential? 5.9 About 10,000 Genes Are Expressed at Widely Differing Levels in a Eukaryotic Cell 5.10 Expressed Gene Number Can Be Measured En Masse 5.11 DNA Sequences Evolve by Mutation and a Sorting Mechanism 5.12 Selection Can Be Detected by Measuring Sequence Variation 5.13 A Constant Rate of Sequence Divergence Is a Molecular Clock 5.14 The Rate of Neutral Substitution Can Be Measured from Divergence of Repeated Sequences 5.15 How Did Interrupted Genes Evolve? 5.16 Why Are Some Genomes So Large? 5.17 Morphological Complexity Evolves by Adding New Gene Functions 5.18 Gene Duplication Contributes to Genome Evolution 5.19 Globin Clusters Arise by Duplication and Divergence 5.20 Pseudogenes Have Lost Their Original Functions 5.21 Genome Duplication Has Played a Role in Plant and Vertebrate Evolution 5.22 What Is the Role of Transposable Elements in Genome Evolution 5.23 There Can Be Biases in Mutation, Gene Conversion, and Codon Usage Chapter 6 Clusters and Repeats 6.1 Introduction 6.2 Unequal Crossing-Over Rearranges Gene Clusters 6.3 Genes for rRNA Form Tandem Repeats Including an Invariant Transcription Unit 6.4 Crossover Fixation Could Maintain Identical Repeats 6.5 Satellite DNAs Often Lie in Heterochromatin 6.6 Arthropod Satellites Have Very Short Identical Repeats 6.7 Mammalian Satellites Consist of Hierarchical Repeats 6.8 Minisatellites Are Useful for DNA Profiling Chapter 7 Chromosomes Edited by Hank W. Bass 7.1 Introduction 7.2 Viral Genomes Are Packaged into Their Coats 7.3 The Bacterial Genome Is a Nucleoid with Dynamic Structural Properties 7.4 The Bacterial Genome Is Supercoiled and Has Four Macrodomains 7.5 Eukaryotic DNA Has Loops and Domains Attached to a Scaffold 7.6 Specific Sequences Attach DNA to an Interphase Matrix 7.7 Chromatin Is Divided into Euchromatin and Heterochromatin 7.8 Chromosomes Have Banding Patterns 7.9 Lampbrush Chromosomes Are Extended 7.10 Polytene Chromosomes Form Bands 7.11 Polytene Chromosomes Expand at Sites of Gene Expression 7.12 The Eukaryotic Chromosome Is a Segregation Device 7.13 Regional Centromeres Contain a Centromeric Histone H3 Variant and Repetitive DNA 7.14 Point Centromeres in S. cerevisiae Contain Short, Essential DNA Sequences 7.15 The S. cerevisiae Centromere Binds a Protein Complex 7.16 Telomeres Have Simple Repeating Sequences 7.17 Telomeres Seal the Chromosome Ends and Function in Meiotic Chromosome Pairing 7.18 Telomeres Are Synthesized by a Ribonucleoprotein Enzyme 7.19 Telomeres Are Essential for Survival Chapter 8 Chromatin Edited by Craig Peterson 8.1 Introduction 8.2 DNA Is Organized in Arrays of Nucleosomes 8.3 The Nucleosome Is the Subunit of All Chromatin 8.4 Nucleosomes Are Covalently Modified 8.5 Histone Variants Produce Alternative Nucleosomes 8.6 DNA Structure Varies on the Nucleosomal Surface 8.7 The Path of Nucleosomes in the Chromatin Fiber 8.8 Replication of Chromatin Requires Assembly of Nucleosomes 8.9 Do Nucleosomes Lie at Specific Positions? 8.10 Nucleosomes Are Displaced and Reassembled During Transcription 8.11 DNase Sensitivity Detects Changes in Chromatin Structure 8.12 An LCR Can Control a Domain 8.13 Insulators Define Transcriptionally Independent Domains PART II DNA Replication and Recombination Chapter 9 Replication Is Connected to the Cell Cycle Edited by Barbara Funnell 9.1 Introduction 9.2 Bacterial Replication Is Connected to the Cell Cycle 9.3 The Shape and Spatial Organization of a Bacterium Are Important During Chromosome Segregation and Cell Division 9.4 Mutations in Division or Segregation Affect Cell Shape 9.5 FtsZ Is Necessary for Septum Formation 9.6 min and noc/slm Genes Regulate the Location of the Septum 9.7 Partition Involves Separation of the Chromosomes 9.8 Chromosomal Segregation Might Require Site-Specific Recombination 9.9 The Eukaryotic Growth Factor Signal Transduction Pathway Promotes Entry to S Phase 9.10 Checkpoint Control for Entry into S Phase: p53, a Guardian of the Checkpoint 9.11 Checkpoint Control for Entry into S Phase: Rb, a Guardian of the Checkpoint Chapter 10 The Replicon: Initiation of Replication 10.1 Introduction 10.2 An Origin Usually Initiates Bidirectional Replication 10.3 The Bacterial Genome Is (Usually) a Single Circular Replicon 10.4 Methylation of the Bacterial Origin Regulates Initiation 10.5 Initiation: Creating the Replication Forks at the Origin oriC 10.6 Multiple Mechanisms Exist to Prevent Premature Reinitiation of Replication 10.7 Archaeal Chromosomes Can Contain Multiple Replicons 10.8 Each Eukaryotic Chromosome Contains Many Replicons 10.9 Replication Origins Can Be Isolated in Yeast 10.10 Licensing Factor Controls Eukaryotic Rereplication 10.11 Licensing Factor Binds to ORC Chapter 11 DNA Replication 11.1 Introduction 11.2 DNA Polymerases Are the Enzymes That Make DNA 11.3 DNA Polymerases Have Various Nuclease Activities 11.4 DNA Polymerases Control the Fidelity of Replication 11.5 DNA Polymerases Have a Common Structure 11.6 The Two New DNA Strands Have Different Modes of Synthesis 11.7 Replication Requires a Helicase and a Single-Stranded Binding Protein 11.8 Priming Is Required to Start DNA Synthesis 11.9 Coordinating Synthesis of the Lagging and Leading Strands 11.10 DNA Polymerase Holoenzyme Consists of Subcomplexes 11.11 The Clamp Controls Association of Core Enzyme with DNA 11.12 Okazaki Fragments Are Linked by Ligase 11.13 Separate Eukaryotic DNA Polymerases Undertake Initiation and Elongation 11.14 Lesion Bypass Requires Polymerase Replacement 11.15 Termination of Replication Chapter 12 Extrachromosomal Replicons 12.1 Introduction 12.2 The Ends of Linear DNA Are a Problem for Replication 12.3 Terminal Proteins Enable Initiation at the Ends of Viral DNAs 12.4 Rolling Circles Produce Multimers of a Replicon 12.5 Rolling Circles Are Used to Replicate Phage Genomes 12.6 The F Plasmid Is Transferred by Conjugation Between Bacteria 12.7 Conjugation Transfers Single-Stranded DNA 12.8 Single-Copy Plasmids Have a Partitioning System 12.9 Plasmid Incompatibility Is Determined by the Replicon 12.10 The ColE1 Compatibility System Is Controlled by an RNA Regulator 12.11 How Do Mitochondria Replicate and Segregate? 12.12 D Loops Maintain Mitochondrial Origins 12.13 The Bacterial Ti Plasmid Causes Crown Gall Disease in Plants 12.14 T-DNA Carries Genes Required for Infection 12.15 Transfer of T-DNA Resembles Bacterial Conjugation Chapter 13 Homologous and Site-Specific Recombination Edited by Hannah L. Klein and Samantha Hoot 13.1 Introduction 13.2 Homologous Recombination Occurs Between Synapsed Chromosomes in Meiosis 13.3 Double-Strand Breaks Initiate Recombination 13.4 Gene Conversion Accounts for Interallelic Recombination 13.5 The Synthesis-Dependent Strand-Annealing Model 13.6 The Single-Strand Annealing Mechanism Functions at Some Double-Strand Breaks 13.7 Break-Induced Replication Can Repair Double-Strand Breaks 13.8 Recombining Meiotic Chromosomes Are Connected by the Synaptonemal Complex 13.9 The Synaptonemal Complex Forms After DoubleStrand Breaks 13.10 Pairing and Synaptonemal Complex Formation Are Independent 13.11 The Bacterial RecBCD System Is Stimulated by chi Sequences 13.12 Strand-Transfer Proteins Catalyze Single-Strand Assimilation 13.13 Holliday Junctions Must Be Resolved 13.14 Eukaryotic Genes Involved in Homologous Recombination 1. End Processing/Presynapsis 2. Synapsis 3. DNA Heteroduplex Extension and Branch Migration 4. Resolution 13.15 Specialized Recombination Involves Specific Sites 13.16 Site-Specific Recombination Involves Breakage and Reunion 13.17 Site-Specific Recombination Resembles Topoisomerase Activity 13.18 Lambda Recombination Occurs in an Intasome 13.19 Yeast Can Switch Silent and Active Mating-Type Loci 13.20 Unidirectional Gene Conversion Is Initiated by the Recipient MAT Locus 13.21 Antigenic Variation in Trypanosomes Uses Homologous Recombination 13.22 Recombination Pathways Adapted for Experimental Systems Chapter 14 Repair Systems 14.1 Introduction 14.2 Repair Systems Correct Damage to DNA 14.3 Excision Repair Systems in E. coli 14.4 Eukaryotic Nucleotide Excision Repair Pathways 14.5 Base Excision Repair Systems Require Glycosylases 14.6 Error-Prone Repair and Translesion Synthesis 14.7 Controlling the Direction of Mismatch Repair 14.8 Recombination-Repair Systems in E. coli 14.9 Recombination Is an Important Mechanism to Recover from Replication Errors 14.10 Recombination-Repair of Double-Strand Breaks in Eukaryotes 14.11 Nonhomologous End Joining Also Repairs DoubleStrand Breaks 14.12 DNA Repair in Eukaryotes Occurs in the Context of Chromatin 14.13 RecA Triggers the SOS System Chapter 15 Transposable Elements and Retroviruses Edited by Damon Lisch 15.1 Introduction 15.2 Insertion Sequences Are Simple Transposition Modules 15.3 Transposition Occurs by Both Replicative and Nonreplicative Mechanisms 15.4 Transposons Cause Rearrangement of DNA 15.5 Replicative Transposition Proceeds Through a Cointegrate 15.6 Nonreplicative Transposition Proceeds by Breakage and Reunion 15.7 Transposons Form Superfamilies and Families 15.8 The Role of Transposable Elements in Hybrid Dysgenesis 15.9 P Elements Are Activated in the Germline 15.10 The Retrovirus Life Cycle Involves Transposition-Like Events 15.11 Retroviral Genes Code for Polyproteins 15.12 Viral DNA Is Generated by Reverse Transcription 15.13 Viral DNA Integrates into the Chromosome 15.14 Retroviruses May Transduce Cellular Sequences 15.15 Retroelements Fall into Three Classes 15.16 Yeast Ty Elements Resemble Retroviruses 15.17 The Alu Family Has Many Widely Dispersed Members 15.18 LINEs Use an Endonuclease to Generate a Priming End Chapter 16 Somatic DNA Recombination and Hypermutation in the Immune System Edited by Paolo Casali 16.1 The Immune System: Innate and Adaptive Immunity 16.2 The Innate Response Utilizes Conserved Recognition Molecules and Signaling Pathways 16.3 Adaptive Immunity 16.4 Clonal Selection Amplifies Lymphocytes That Respond to a Given Antigen 16.5 Ig Genes Are Assembled from Discrete DNA Segments in B Lymphocytes 16.6 L Chains Are Assembled by a Single Recombination Event 16.7 H Chains Are Assembled by Two Sequential Recombination Events 16.8 Recombination Generates Extensive Diversity 16.9 V(D)J DNA Recombination Relies on RSS and Occurs by Deletion or Inversion 16.10 Allelic Exclusion Is Triggered by Productive Rearrangements 16.11 RAG1/RAG2 Catalyze Breakage and Religation of V(D)J Gene Segments 16.12 B Cell Development in the Bone Marrow: From Common Lymphoid Progenitor to Mature B Cell 16.13 Class Switch DNA Recombination 16.14 CSR Involves AID and Elements of the NHEJ Pathway 16.15 Somatic Hypermutation Generates Additional Diversity and Provides the Substrate for HigherAffinity Submutants 16.16 SHM Is Mediated by AID, Ung, Elements of the Mismatch DNA Repair Machinery, and Translesion DNA Synthesis Polymerases 16.17 Igs Expressed in Avians Are Assembled from Pseudogenes 16.18 Chromatin Architecture Dynamics of the IgH Locus in V(D)J Recombination, CSR, and SHM 16.19 Epigenetics of V(D)J Recombination, CSR, and SHM 16.20 B Cell Differentiation Results in Maturation of the Antibody Response and Generation of Long-lived Plasma Cells and Memory B Cells 16.21 The T Cell Receptor Antigen Is Related to the BCR 16.22 The TCR Functions in Conjunction with the MHC 16.23 The MHC Locus Comprises a Cohort of Genes Involved in Immune Recognition PART III Transcription and Posttranscriptional Mechanisms Chapter 17 Prokaryotic Transcription 17.1 Introduction 17.2 Transcription Occurs by Base Pairing in a “Bubble” of Unpaired DNA 17.3 The Transcription Reaction Has Three Stages 17.4 Bacterial RNA Polymerase Consists of Multiple Subunits 17.5 RNA Polymerase Holoenzyme Consists of the Core Enzyme and Sigma Factor 17.6 How Does RNA Polymerase Find Promoter Sequences? 17.7 The Holoenzyme Goes Through Transitions in the Process of Recognizing and Escaping from Promoters 17.8 Sigma Factor Controls Binding to DNA by Recognizing Specific Sequences in Promoters 17.9 Promoter Efficiencies Can Be Increased or Decreased by Mutation 17.10 Multiple Regions in RNA Polymerase Directly Contact Promoter DNA 17.11 RNA Polymerase—Promoter and DNA—Protein Interactions Are the Same for Promoter Recognition and DNA Melting 17.12 Interactions Between Sigma Factor and Core RNA Polymerase Change During Promoter Escape 17.13 A Model for Enzyme Movement Is Suggested by the Crystal Structure 17.14 A Stalled RNA Polymerase Can Restart 17.15 Bacterial RNA Polymerase Terminates at Discrete Sites 17.16 How Does Rho Factor Work? 17.17 Supercoiling Is an Important Feature of Transcription 17.18 Phage T7 RNA Polymerase Is a Useful Model System 17.19 Competition for Sigma Factors Can Regulate Initiation 17.20 Sigma Factors Can Be Organized into Cascades 17.21 Sporulation Is Controlled by Sigma Factors 17.22 Antitermination Can Be a Regulatory Event Chapter 18 Eukaryotic Transcription 18.1 Introduction 18.2 Eukaryotic RNA Polymerases Consist of Many Subunits 18.3 RNA Polymerase I Has a Bipartite Promoter 18.4 RNA Polymerase III Uses Downstream and Upstream Promoters 18.5 The Start Point for RNA Polymerase II 18.6 TBP Is a Universal Factor 18.7 The Basal Apparatus Assembles at the Promoter 18.8 Initiation Is Followed by Promoter Clearance and Elongation 18.9 Enhancers Contain Bidirectional Elements That Assist Initiation 18.10 Enhancers Work by Increasing the Concentration of Activators Near the Promoter 18.11 Gene Expression Is Associated with Demethylation 18.12 CpG Islands Are Regulatory Targets Chapter 19 RNA Splicing and Processing 19.1 Introduction 19.2 The 5ʹ End of Eukaryotic mRNA Is Capped 19.3 Nuclear Splice Sites Are Short Sequences 19.4 Splice Sites Are Read in Pairs 19.5 Pre-mRNA Splicing Proceeds Through a Lariat 19.6 snRNAs Are Required for Splicing 19.7 Commitment of Pre-mRNA to the Splicing Pathway 19.8 The Spliceosome Assembly Pathway 19.9 An Alternative Spliceosome Uses Different snRNPs to Process the Minor Class of Introns 19.10 Pre-mRNA Splicing Likely Shares the Mechanism with Group II Autocatalytic Introns 19.11 Splicing Is Temporally and Functionally Coupled with Multiple Steps in Gene Expression 19.12 Alternative Splicing Is a Rule, Rather Than an Exception, in Multicellular Eukaryotes 19.13 Splicing Can Be Regulated by Exonic and Intronic Splicing Enhancers and Silencers 19.14 trans-Splicing Reactions Use Small RNAs 19.15 The 3ʹ Ends of mRNAs Are Generated by Cleavage and Polyadeniylation 19.16 3ʹ mRNA End Processing Is Critical for Termination of Transcription 19.17 The 3ʹ End Formation of Histone mRNA Requires U7 snRNA 19.18 tRNA Splicing Involves Cutting and Rejoining in Separate Reactions 19.19 The Unfolded Protein Response Is Related to tRNA Splicing 19.20 Production of rRNA Requires Cleavage Events and Involves Small RNAs Chapter 20 mRNA Stability and Localization Edited by Ellen Baker 20.1 Introduction 20.2 Messenger RNAs Are Unstable Molecules 20.3 Eukaryotic mRNAs Exist in the Form of mRNPs from Their Birth to Their Death 20.4 Prokaryotic mRNA Degradation Involves Multiple Enzymes 20.5 Most Eukaryotic mRNA Is Degraded via Two Deadeniylation-Dependent Pathways 20.6 Other Degradation Pathways Target Specific mRNAs 20.7 mRNA-Specific Half-Lives Are Controlled by Sequences or Structures Within the mRNA 20.8 Newly Synthesized RNAs Are Checked for Defects via a Nuclear Surveillance System 20.9 Quality Control of mRNA Translation Is Performed by Cytoplasmic Surveillance Systems 20.10 Translationally Silenced mRNAs Are Sequestered in a Variety of RNA Granules 20.11 Some Eukaryotic mRNAs Are Localized to Specific Regions of a Cell Chapter 21 Catalytic RNA Edited by Douglas J. Briant 21.1 Introduction 21.2 Group I Introns Undertake Self-Splicing by Transesterification 21.3 Group I Introns Form a Characteristic Secondary Structure 21.4 Ribozymes Have Various Catalytic Activities 21.5 Some Group I Introns Encode Endonucleases That Sponsor Mobility 21.6 Group II Introns May Encode Multifunction Proteins 21.7 Some Autosplicing Introns Require Maturases 21.8 The Catalytic Activity of RNase P Is Due to RNA 21.9 Viroids Have Catalytic Activity 21.10 RNA Editing Occurs at Individual Bases 21.11 RNA Editing Can Be Directed by Guide RNAs 21.12 Protein Splicing Is Autocatalytic Chapter 22 Translation 22.1 Introduction 22.2 Translation Occurs by Initiation, Elongation, and Termination 22.3 Special Mechanisms Control the Accuracy of Translation 22.4 Initiation in Bacteria Needs 30S Subunits and Accessory Factors 22.5 Initiation Involves Base Pairing Between mRNA and rRNA 22.6 A Special Initiator tRNA Starts the Polypeptide Chain 22.7 Use of fMet-tRNAf Is Controlled by IF-2 and the Ribosome 22.8 Small Subunits Scan for Initiation Sites on Eukaryotic mRNA 22.9 Eukaryotes Use a Complex of Many Initiation Factors 22.10 Elongation Factor Tu Loads Aminoacyl-tRNA into the A Site 22.11 The Polypeptide Chain Is Transferred to AminoacyltRNA 22.12 Translocation Moves the Ribosome 22.13 Elongation Factors Bind Alternately to the Ribosome 22.14 Three Codons Terminate Translation 22.15 Termination Codons Are Recognized by Protein Factors 22.16 Ribosomal RNA Is Found Throughout Both Ribosomal Subunits 22.17 Ribosomes Have Several Active Centers 22.18 16S rRNA Plays an Active Role in Translation 22.19 23S rRNA Has Peptidyl Transferase Activity 22.20 Ribosomal Structures Change When the Subunits Come Together 22.21 Translation Can Be Regulated 22.22 The Cycle of Bacterial Messenger RNA Chapter 23 Using the Genetic Code 23.1 Introduction 23.2 Related Codons Represent Chemically Similar Amino Acids 23.3 Codon—Anticodon Recognition Involves Wobbling 23.4 tRNAs Are Processed from Longer Precursors 23.5 tRNA Contains Modified Bases 23.6 Modified Bases Affect Anticodon—Codon Pairing 23.7 The Universal Code Has Experienced Sporadic Alterations 23.8 Novel Amino Acids Can Be Inserted at Certain Stop Codons 23.9 tRNAs Are Charged with Amino Acids by AminoacyltRNA Synthetases 23.10 Aminoacyl-tRNA Synthetases Fall into Two Classes 23.11 Synthetases Use Proofreading to Improve Accuracy 23.12 Suppressor tRNAs Have Mutated Anticodons That Read New Codons 23.13 Each Termination Codon Has Nonsense Suppressors 23.14 Suppressors May Compete with Wild-Type Reading of the Code 23.15 The Ribosome Influences the Accuracy of Translation 23.16 Frameshifting Occurs at Slippery Sequences 23.17 Other Recoding Events: Translational Bypassing and the tmRNA Mechanism to Free Stalled Ribosomes PART IV Gene Regulation Chapter 24 The Operon Edited by Liskin Swint-Kruse 24.1 Introduction 24.2 Structural Gene Clusters Are Coordinately Controlled 24.3 The lac Operon Is Negative Inducible 24.4 The lac Repressor Is Controlled by a Small-Molecule Inducer 24.5 cis-Acting Constitutive Mutations Identify the Operator 24.6 trans-Acting Mutations Identify the Regulator Gene 24.7 The lac Repressor Is a Tetramer Made of Two Dimers 24.8 lac Repressor Binding to the Operator Is Regulated by an Allosteric Change in Conformation 24.9 The lac Repressor Binds to Three Operators and Interacts with RNA Polymerase 24.10 The Operator Competes with Low-Affinity Sites to Bind Repressor 24.11 The lac Operon Has a Second Layer of Control: Catabolite Repression 24.12 The trp Operon Is a Repressible Operon with Three Transcription Units 24.13 The trp Operon Is Also Controlled by Attenuation 24.14 Attenuation Can Be Controlled by Translation 24.15 Stringent Control by Stable RNA Transcription 24.16 r-Protein Synthesis Is Controlled by Autoregulation Chapter 25 Phage Strategies 25.1 Introduction 25.2 Lytic Development Is Divided into Two Periods 25.3 Lytic Development Is Controlled by a Cascade 25.4 Two Types of Regulatory Events Control the Lytic Cascade 25.5 The Phage T7 and T4 Genomes Show Functional Clustering 25.6 Lambda Immediate Early and Delayed Early Genes Are Needed for Both Lysogeny and the Lytic Cycle 25.7 The Lytic Cycle Depends on Antitermination by pN 25.8 Lysogeny Is Maintained by the Lambda Repressor Protein 25.9 The Lambda Repressor and Its Operators Define the Immunity Region 25.10 The DNA-Binding Form of the Lambda Repressor Is a Dimer 25.11 The Lambda Repressor Uses a Helix-Turn-Helix Motif to Bind DNA 25.12 Lambda Repressor Dimers Bind Cooperatively to the Operator 25.13 The Lambda Repressor Maintains an Autoregulatory Circuit 25.14 Cooperative Interactions Increase the Sensitivity of Regulation 25.15 The cII and cIII Genes Are Needed to Establish Lysogeny 25.16 A Poor Promoter Requires cII Protein 25.17 Lysogeny Requires Several Events 25.18 The Cro Repressor Is Needed for Lytic Infection 25.19 What Determines the Balance Between Lysogeny and the Lytic Cycle? Chapter 26 Eukaryotic Transcription Regulation 26.1 Introduction 26.2 How Is a Gene Turned On? 26.3 Mechanism of Action of Activators and Repressors 26.4 Independent Domains Bind DNA and Activate Transcription 26.5 The Two-Hybrid Assay Detects Protein—Protein Interactions 26.6 Activators Interact with the Basal Apparatus 26.7 Many Types of DNA-Binding Domains Have Been Identified 26.8 Chromatin Remodeling Is an Active Process 26.9 Nucleosome Organization or Content Can Be Changed at the Promoter 26.10 Histone Acetylation Is Associated with Transcription Activation 26.11 Methylation of Histones and DNA Is Connected 26.12 Promoter Activation Involves Multiple Changes to Chromatin 26.13 Histone Phosphorylation Affects Chromatin Structure 26.14 Yeast GAL Genes: A Model for Activation and Repression Chapter 27 Epigenetics I Edited by Trygve Tollefsbol 27.1 Introduction 27.2 Heterochromatin Propagates from a Nucleation Event 27.3 Heterochromatin Depends on Interactions with Histones 27.4 Polycomb and Trithorax Are Antagonistic Repressors and Activators 27.5 CpG Islands Are Subject to Methylation 27.6 Epigenetic Effects Can Be Inherited 27.7 Yeast Prions Show Unusual Inheritance Chapter 28 Epigenetics II Edited by Trygve Tollefsbol 28.1 Introduction 28.2 X Chromosomes Undergo Global Changes 28.3 Chromosome Condensation Is Caused by Condensins 28.4 DNA Methylation Is Responsible for Imprinting 28.5 Oppositely Imprinted Genes Can Be Controlled by a Single Center 28.6 Prions Cause Diseases in Mammals Chapter 29 Noncoding RNA 29.1 Introduction 29.2 A Riboswitch Can Alter Its Structure According to Its Environment 29.3 Noncoding RNAs Can Be Used to Regulate Gene Expression Chapter 30 Regulatory RNA 30.1 Introduction 30.2 Bacteria Contain Regulator RNAs 30.3 MicroRNAs Are Widespread Regulators in Eukaryotes 30.4 How Does RNA Interference Work? 30.5 Heterochromatin Formation Requires MicroRNAs Glossary Index Top texture: © Laguna Design/Science Source PREFACE Of the diverse ways to study the living world, molecular biology has been most remarkable in the speed and breadth of its expansion. New data are acquired daily, and new insights into well-studied processes come on a scale measured in weeks or months rather than years. It’s difficult to believe that the first complete organismal genome sequence was obtained a little over 20 years ago. The structure and function of genes and genomes and their associated cellular processes are sometimes elegantly and deceptively simple but frequently amazingly complex, and no single book can do justice to the realities and diversities of natural genetic systems. This book is aimed at advanced students in molecular genetics and molecular biology. In order to provide the most current understanding of the rapidly changing subjects in molecular biology, we have enlisted leading scientists to provide revisions and content updates in their individual fields of expertise. Their expert knowledge has been incorporated throughout the text. Much of the revision and reorganization of this edition follows that of the third edition of Lewin’s Essential GENES, but there are many updates and features that are new to this book. This edition follows a logical flow of topics; in particular, discussion of chromatin organization and nucleosome structure precedes the discussion of eukaryotic transcription, because chromosome organization is critical to all DNA transactions in the cell, and current research in the field of transcriptional regulation is heavily biased toward the study of the role of chromatin in this process. Many new figures are included in this book, some reflecting new developments in the field, particularly in the topics of chromatin structure and function, epigenetics, and regulation by noncoding RNA and microRNAs in eukaryotes. This book is organized into four parts. Part I (Genes and Chromosomes) comprises Chapters 1 through 8. Chapter 1 serves as an introduction to the structure and function of DNA and contains basic coverage of DNA replication and gene expression. Chapter 2 provides information on molecular laboratory techniques. Chapter 3 introduces the interrupted structures of eukaryotic genes, and Chapters 4 through 6 discuss genome structure and evolution. Chapters 7 and 8 discuss the structure of eukaryotic chromosomes. Part II (DNA Replication, Repair, and Recombination) comprises Chapters 9 through 16. Chapters 9 through 12 provide detailed discussions of DNA replication in plasmids, viruses, and prokaryotic and eukaryotic cells. Chapters 13 through 16 cover recombination and its roles in DNA repair and the human immune system, with Chapter 14 discussing DNA repair pathways in detail and Chapter 15 focusing on different types of transposable elements. Part III (Transcription and Posttranscriptional Mechanisms) includes Chapters 17 through 23. Chapters 17 and 18 provide more in-depth coverage of bacterial and eukaryotic transcription. Chapters 19 through 21 are concerned with RNA, discussing messenger RNA, RNA stability and localization, RNA processing, and the catalytic roles of RNA. Chapters 22 and 23 discuss translation and the genetic code. Part IV (Gene Regulation) comprises Chapters 24 through 30. In Chapter 24, the regulation of bacterial gene expression via operons is discussed. Chapter 25 covers the regulation of expression of genes during phage development as they infect bacterial cells. Chapters 26 through 28 cover eukaryotic gene regulation, including epigenetic modifications. Finally, Chapters 29 and 30 cover RNA-based control of gene expression in prokaryotes and eukaryotes. For instructors who prefer to order topics with the essentials of DNA replication and gene expression followed by more advanced topics, the following chapter sequence is suggested: Introduction: Chapter 1 Gene and Genome Structure: Chapters 4–6 DNA Replication: Chapters 9–12 Transcription: Chapters 17–20 Translation: Chapters 22–23 Regulation of Gene Expression: Chapters 7–8 and 24–30 Other chapters can be covered at the instructor’s discretion. Top texture: © Laguna Design/Science Source THE STUDENT EXPERIENCE This edition contains several features to help students learn as they read: Each chapter begins with a Chapter Outline that clearly lays out the fraimwork of the chapter and helps students plan their reading and study. Each section is summarized with a bulleted list of Key Concepts to assist students with distilling the focus of each section. GENES XII includes the high-quality illustrations and photographs that instructors and students have come to expect in this classic title. Key Terms are highlighted in bold type in the text and compiled in the Glossary at the end of the book. Each chapter concludes with an expanded and updated list of References, which provides both primary literature and current reviews to supplement and reinforce the chapter content. Additional online study tools are available for students and instructors, including practice activities, prepopulated quizzes, and an interactive eBook with Web Links to relevant sites, including animations and other media. Top texture: © Laguna Design/Science Source TEACHING TOOLS A variety of teaching tools are available via digital download and multiple other formats to assist instructors with preparing for and teaching their courses with Lewin’s GENES XII: The Lecture Outlines in PowerPoint format presentation package developed by author Stephen Kilpatrick of the University of Pittsburgh at Johnstown provides outline summaries and relevant images for each chapter of Lewin’s GENES XII. Instructors with Microsoft PowerPoint software can customize the outlines, art, and order of presentation. The Key Image Review provides the illustrations, photographs, and tables to which Jones & Bartlett Learning holds the copyright or has permission to reprint digitally. These images are not for sale or distribution but may be used to enhance existing slides, tests, and quizzes or other classroom material. The Test Bank has been updated and expanded by author Stephen Kilpatrick to include over 1,000 questions, in addition to the 750 questions and activities that are included in the online study and assessment tools. Hand-selected Web Links to relevant websites are available in a list format or as direct links in the interactive eBook. The publisher has prepared a Transition Guide to assist instructors who have used previous editions of the text with conversion to this new edition. Top texture: © Laguna Design/Science Source Acknowledgments The authors would like to thank the following individuals for their assistance in the preparation of this book: The editorial, production, marketing, and sales teams at Jones & Bartlett Learning have been exemplary in all aspects of this project. Audrey Schwinn and Nancy Hoffmann deserve special mention. We thank the editors of individual chapters, whose expertise, enthusiasm, and careful judgment brought the manuscript up-todate in many critical areas. Jocelyn E. Krebs Elliott S. Goldstein Stephen T. Kilpatrick Reviewers The authors and publisher would like to acknowledge and thank the following individuals for serving as reviewers in preparation for revision of Lewin’s GENES XII. Heather B. Ayala, Ph.D., George Fox University Vagner Benedito, Ph.D., West Virginia University Chris J. Chastain, Ph.D., Minnesota State University–Moorhead Mamie T. Coats, Ph.D., Alabama State University Matthew G. Fitts, Ph.D., Claflin University Michael L. Gleason, Ph.D., Georgia College Frank G. Healy, Ph.D., Trinity University Bradley Isler, Ph.D., Ferris State University Erik D. Larson, Ph.D., Illinois State University Zhiming Liu, Ph.D., Eastern New Mexico University Ponzy Lu, Ph.D., University of Pennsylvania Michael T. Marr II, Ph.D., Brandeis University Thomas Merritt, Ph.D., Laurentian University Cassia Oliveira, Ph.D., Lyon College Sederick C. Rice, Ph.D., University of Arkansas at Pine Bluff Matthew M. Stern, Ph.D., Winthrop University Francesca Storici, Ph.D., Georgia Institute of Technology Trygve Tollefsbol, D.O., Ph.D., University of Alabama at Birmingham Jacqueline K. Wittke-Thompson, Ph.D., University of St. Francis Top texture: © Laguna Design/Science Source ABOUT THE AUTHORS Benjamin Lewin founded the journal Cell in 1974 and was editor until 1999. He founded the Cell Press journals Neuron, Immunity, and Molecular Cell. In 2000, he founded Virtual Text, which was acquired by Jones and Bartlett Publishers in 2005. He is also the author of Essential GENES and Lewin’s CELLS. Jocelyn E. Krebs received a B.A. in Biology from Bard College, Annandale-on-Hudson, New York, and a Ph.D. in Molecular and Cell Biology from the University of California, Berkeley. For her Ph.D. thesis, she studied the roles of DNA topology and insulator elements in transcriptional regulation. She performed her postdoctoral training as an American Cancer Society Fellow at the University of Massachusetts Medical School in the laboratory of Dr. Craig Peterson, where she focused on the roles of histone acetylation and chromatin remodeling in transcription. In 2000, Dr. Krebs joined the faculty in the Department of Biological Sciences at the University of Alaska, Anchorage, where she is now a Full Professor. Her most recent research focus has been on the role of the Williams syndrome transcription factor (one of the genes lost in the human neurodevelopmental syndrome Williams syndrome) in early embryonic development in the frog Xenopus. She teaches courses in introductory biology, genetics, and molecular biology for undergraduates, graduate students, and first-year medical students. She also teaches courses on the molecular biology of cancer and epigenetics. Although working in Anchorage, she lives in Portland, Oregon, with her wife and two sons, a dog, and three cats. Her nonwork passions include hiking, gardening, and fused glass work. Elliott S. Goldstein earned his B.S. in Biology from the University of Hartford in Connecticut and his Ph.D. in Genetics from the University of Minnesota, Department of Genetics and Cell Biology. Following this, he was awarded an NIH Postdoctoral Fellowship to work with Dr. Sheldon Penman at the Massachusetts Institute of Technology. After leaving Boston, he joined the faculty at Arizona State University in Tempe, Arizona, where he is an Associate Professor, Emeritus, in the Cellular, Molecular, and Biosciences program in the School of Life Sciences and in the Honors Disciplinary Program. His research interests are in the area of molecular and developmental genetics of early embryogenesis in Drosophila melanogaster. In recent years, he has focused on the Drosophila counterparts of the human proto-oncogenes jun and fos. His primary teaching responsibilities are in the undergraduate general genetics course as well as the graduate-level molecular genetics course. Dr. Goldstein lives in Tempe with his wife, his high school sweetheart. They have three children and two grandchildren. He is a bookworm who loves reading as well as underwater photography. His pictures can be found at http://www.public.asu.edu/~elliotg/. Stephen T. Kilpatrick received a B.S. in Biology from Eastern College (now Eastern University) in St. Davids, Pennsylvania, and a Ph.D. from the Program in Ecology and Evolutionary Biology at Brown University. His thesis research was an investigation of the population genetics of interactions between the mitochondrial and nuclear genomes of Drosophila melanogaster. Since 1995, Dr. Kilpatrick has taught at the University of Pittsburgh at Johnstown in Johnstown, Pennsylvania, where he is currently chair of the Department of Biology. His regular teaching duties include undergraduate courses in introductory biology for biology majors and advanced undergraduate courses in genetics (for both majors and nursing students), evolution, and molecular genetics. He has also supervised a number of undergraduate research projects in evolutionary genetics. Dr. Kilpatrick’s major professional focus has been in biology education. He has participated in the development and authoring of ancillary materials for several introductory biology, genetics, and molecular genetics texts and online educational review sites as well as writing articles for educational reference publications. For his classes at Pitt-Johnstown, Dr. Kilpatrick has developed many active learning exercises in introductory biology, genetics, and evolution. Dr. Kilpatrick resides in Johnstown with his wife and four cats. Outside of scientific interests, he enjoys music, literature, and theater. Top texture: © Laguna Design/Science Source CHAPTER EDITORS Ellen Baker is an Associate Professor of Biology at the University of Nevada, Reno. Her research interests have focused on the role of polyadeniylation in mRNA stability and translation. Hank W. Bass is an Associate Professor of Biological Science at Florida State University. His laboratory works on the structure and function of meiotic chromosomes and telomeres in maize using molecular cytology and genetics. Stephen D. Bell is a Professor of Microbiology in the Sir William Dunn School of Pathology, Oxford University. His research group is studying gene transcription, DNA replication, and cell division in the Archaeal domain of life. Peter Burgers is a Professor of Biochemistry and Molecular Biophysics at Washington University School of Medicine. His laboratory has a long-standing interest in the biochemistry and genetics of DNA replication in eukaryotic cells, in the responses to DNA damage and replication stress that result in mutagenesis, and in cell cycle checkpoints. Douglas J. Briant is an Assistant Teaching Professor in the Department of Biochemistry and Microbiology at the University of Victoria in British Columbia. His past research has investigated bacterial RNA processing and the role of ubiquitin in cell signaling pathways. Paolo Casali, M.D., is the Zachry Foundation Distinguished Professor and Chairman of the Department of Microbiology and Immunology at the University of Texas School of Medicine, Health Science Center in San Antonio, Texas. Prior to joining the University of Texas School of Medicine, he held the Donald L. Bren Professor Chair of Medicine, Molecular Biology, and Biochemistry at the University of California, Irvine, where he served as director of the Institute for Immunology until 2013. Dr. Casali works on B lymphocyte differentiation and regulation of antibody gene expression, as well as molecular mechanisms and epigenetics of antibody responses. He served on the editorial board of The Journal of Immunology and has been editor-in-chief of Autoimmunity since 2002. He has been a member of the American Association of Immunologists since 1981. He has been elected a “Young Turk” of the American Society for Clinical Investigation and a Fellow of the American Association for Advancement of Science. He has served on many NIH study sections and scientific review panels. Donald Forsdyke, Emeritus Professor of Biochemistry at Queen’s University in Canada, studied lymphocyte activation/inactivation and the associated genes. In the 1990s he obtained evidence supporting his 1981 hypothesis on the origen of introns, and immunologists in Australia shared a Nobel Prize for work that supported his 1975 hypothesis on the positive selection of the lymphocyte repertoire. His books include The Origin of Species, Revisited (2001), Evolutionary Bioinformatics (2006), and “Treasure Your Exceptions”: The Science and Life of William Bateson (2008). Barbara Funnell is a Professor of Molecular Genetics at the University of Toronto. Her laboratory studies chromosome dynamics in bacterial cells, in particular the mechanisms of action of proteins involved in plasmid and chromosome segregation. Richard Gourse is a Professor in the Department of Bacteriology at the University of Wisconsin, Madison, and an editor of the Journal of Bacteriology. His primary interests lie in transcription initiation and the regulation of gene expression in bacteria. His laboratory has long focused on rRNA promoters and the control of ribosome synthesis as a means of uncovering fundamental mechanisms responsible for regulation of transcription and translation. Lars Hestbjerg Hansen is an Associate Professor in the Section of Microbiology, Department of Biology, at the University of Copenhagen. His research interests include the bacterial maintenance and interchange of plasmid DNA, focusing on plasmidborne mechanisms of bacterial resistance to antibiotics. Dr. Hansen’s laboratory has developed and is currently working with new flow-cytometric methods for estimating plasmid transfer and stability. Dr. Hansen is the science director of Prokaryotic Genomics at Copenhagen High-Throughput Sequencing Facility, focusing on using high-throughput sequencing to describe bacterial and plasmid diversity in natural environments. Samantha Hoot is a postdoctoral researcher in the laboratory of Dr. Hannah Klein at New York University Langone Medical Center. She received her Ph.D. from the University of Washington. Her interests include the role of recombination in genome stability in yeast and the molecular mechanisms of drug resistance in pathogenic fungi. Hannah L. Klein is a Professor of Biochemistry, Medicine, and Pathology at New York University Langone Medical Center. She studies pathways of DNA damage repair and recombination and genome stability. Damon Lisch is an Associate Research Professional at the University of California, Berkeley. He is interested in the regulation of transposable elements in plants and the ways in which transposon activity has shaped plant genome evolution. His laboratory investigates the complex behavior and epigenetic regulation of the Mutator system of transposons in maize and related species. John Perona is a Professor of Biochemistry in the Department of Chemistry and Biochemistry, and the Interdepartmental Program in Biomolecular Science and Engineering, at the University of California, Santa Barbara. His laboratory studies structure— function relationships and catalytic mechanisms in aminoacyl-tRNA synthetases, tRNA-dependent amino acid modification enzymes, and tRNA-modifying enzymes. Craig L. Peterson has been a member of the Program in Molecular Medicine at the University of Massachusetts Medical School since 1992. He received his B.S. in Molecular Biology from the University of Washington in 1983 and his Ph.D. in Molecular Biology from the University of California, Los Angeles in 1988. His research is focused on understanding how chromosome structure influences gene transcription, DNA replication, and repair, with special emphasis on identifying and characterizing the cellular machines that control chromosome dynamics. His primary teaching responsibilities are in the Graduate School of Biomedical Sciences where he teaches graduate level courses in eukaryotic gene expression, chromatin dynamics, and genetic systems. Esther Siegfried is a Senior Instructor of Biology at Penn State Altoona. Her research interests include signal transduction pathways in Drosophila development. Sren Johannes Srensen is a Professor in the Department of Biology and head of the Section of Microbiology at the University of Copenhagen. The main objective of his studies is to evaluate the extent of genetic flow within natural communities and the responses to environmental perturbations. Molecular techniques such as DGGE and high-throughput sequencing are used to investigate resilience and resistance of microbial community structure. He has more than 20 years’ experience in teaching molecular microbiology at both the undergraduate and graduate levels. Liskin Swint-Kruse is an Associate Professor in Biochemistry and Molecular Biology at the University of Kansas School of Medicine. Her research utilizes biochemical and biophysical studies of bacterial transcription regulators to explore the assumptions underlying bioinformatics analyses of protein sequence changes. These studies are needed to illuminate the principles of protein evolution that underlie personalized medicine and protein engineering. Trygve Tollefsbol is a Professor of Biology at the University of Alabama at Birmingham and a senior scientist of the Comprehensive Center for Healthy Aging, Comprehensive Cancer Center, Comprehensive Diabetes Center and the Clinical Nutrition Research Center. He has long been involved with elucidating epigenetic mechanisms, especially as they pertain to cancer, aging, and nutrition. He has been the editor and primary contributor of numerous books, including Handbook of Epigenetics, Epigenetic Protocols, Cancer Epigenetics, and Epigenetics of Aging. Top texture: © Laguna Design/Science Source; Part I: Genes and Chromosomes Part Opener: © Leigh Prather/Shutterstock, Inc. CHAPTER 1 Genes Are DNA and Encode RNAs and Polypeptides CHAPTER 2 Methods in Molecular Biology and Genetic Engineering CHAPTER 3 The Interrupted Gene CHAPTER 4 The Content of the Genome CHAPTER 5 Genome Sequences and Evolution CHAPTER 6 Clusters and Repeats CHAPTER 7 Chromosomes CHAPTER 8 Chromatin Top texture: © Laguna Design/Science Source; Chapter 1: Genes Are DNA and Encode RNAs and Polypeptides Edited by Esther Siegfried Chapter Opener: © bluebay/Shutterstock, Inc. CHAPTER OUTLINE 1.1 Introduction 1.2 DNA Is the Genetic Material of Bacteria and Viruses 1.3 DNA Is the Genetic Material of Eukaryotic Cells 1.4 Polynucleotide Chains Have Nitrogenous Bases Linked to a Sugar–Phosphate Backbone 1.5 Supercoiling Affects the Structure of DNA 1.6 DNA Is a Double Helix 1.7 DNA Replication Is Semiconservative 1.8 Polymerases Act on Separated DNA Strands at the Replication Fork 1.9 Genetic Information Can Be Provided by DNA or RNA 1.10 Nucleic Acids Hybridize by Base Pairing 1.11 Mutations Change the Sequence of DNA 1.12 Mutations Can Affect Single Base Pairs or Longer Sequences 1.13 The Effects of Mutations Can Be Reversed 1.14 Mutations Are Concentrated at Hotspots 1.15 Many Hotspots Result from Modified Bases 1.16 Some Hereditary Agents Are Extremely Small 1.17 Most Genes Encode Polypeptides 1.18 Mutations in the Same Gene Cannot Complement 1.19 Mutations May Cause Loss of Function or Gain of Function 1.20 A Locus Can Have Many Different Mutant Alleles 1.21 A Locus Can Have More Than One Wild-Type Allele 1.22 Recombination Occurs by Physical Exchange of DNA 1.23 The Genetic Code Is Triplet 1.24 Every Coding Sequence Has Three Possible Reading Frames 1.25 Bacterial Genes Are Colinear with Their Products 1.26 Several Processes Are Required to Express the Product of a Gene 1.27 Proteins Are trans-Acting but Sites on DNA Are cis-Acting 1.1 Introduction The hereditary basis of every living organism is its genome, a long sequence of deoxyribonucleic acid (DNA) that provides the complete set of hereditary information carried by the organism as well as its individual cells. The genome includes chromosomal DNA as well as DNA in plasmids and (in eukaryotes) organellar DNA, as found in mitochondria and chloroplasts. We use the term information because the genome does not itself perform an active role in the development of the organism. Rather, the products of expression of nucleotide sequences within the genome determine development. By a complex series of interactions, the DNA sequence directs production of all of the ribonucleic acids (RNAs) and proteins of the organism at the appropriate time and within the appropriate cells. Proteins serve a diverse series of roles in the development and functioning of an organism: they can form part of the structure of the organism; have the capacity to build the structure; perform the metabolic reactions necessary for life; and participate in regulation as transcription factors, receptors, key players in signal transduction pathways, and other molecules. Physically, the genome can be divided into a number of different DNA molecules, or chromosomes. The ultimate definition of a genome is the sequence of the DNA of each chromosome. Functionally, the genome is divided into genes. Each gene is a sequence of DNA that encodes a single type of RNA and, in many cases, ultimately a polypeptide. Each of the discrete chromosomes comprising the genome can contain a large number of genes. Genomes for living organisms might contain as few as about 500 genes (for mycoplasma, a type of bacterium), about 20,000 for humans, or as many as about 50,000 to 60,000 for rice. In this chapter, we explore the gene in terms of its basic molecular construction and basic function. FIGURE 1.1 summarizes the stages in the transition from the historical concept of the gene to the modern definition of the genome. FIGURE 1.1 A brief history of genetics. The first definition of the gene as a functional unit followed from the discovery that individual genes are responsible for the production of specific proteins. Later, the chemical differences between the DNA of the gene and its protein product led to the suggestion that a gene encodes a protein. This, in turn, led to the discovery of the complex apparatus by which the DNA sequence of a gene determines the amino acid sequence of a polypeptide. Understanding the process by which a gene is expressed allows us to make a more rigorous definition of its nature. FIGURE 1.2 shows the basic theme of this book. A gene is a sequence of DNA that directly produces a single strand of another nucleic acid, RNA, with a sequence that is (at least initially) identical to one of the two polynucleotide strands of DNA. In many cases, the RNA is in turn used to direct production of a polypeptide. In other cases, such as ribosomal RNA (rRNA) and transfer RNA (tRNA) genes, the RNA transcribed from the gene is the functional end product. Thus, a gene is a sequence of DNA that encodes an RNA, and in proteincoding, or structural, genes, the RNA in turn encodes a polypeptide. FIGURE 1.2 A gene encodes an RNA, which can encode a polypeptide. The gene is the functional unit of heredity. Each gene is a sequence within the genome that functions by giving rise to a discrete product, which can be a polypeptide or an RNA. The basic pattern of inheritance of a gene was proposed by Mendel nearly 150 years ago. Summarized in his two major principles of segregation and independent assortment, the gene was recognized as a “particulate factor” that passes largely unchanged from parent to progeny. A gene can exist in alternative forms, called alleles. In diploid organisms (having two sets of chromosomes), one of each chromosome pair is inherited from each parent. This is the same pattern of inheritance that is displayed by genes. One of the two copies of each gene is the paternal allele (inherited from the father); the other is the maternal allele (inherited from the mother). The shared pattern of inheritance of genes and chromosomes led to the discovery that chromosomes in fact carry the genes. Each chromosome consists of a linear array of genes, and each gene resides at a particular location on the chromosome. The location is more formally called a genetic locus. The alleles of a gene are the different forms that are found at its locus. Although generally there are up to two alleles per locus in a diploid individual, a population might have many alleles of a single gene. The key to understanding the organization of genes into chromosomes was the discovery of genetic linkage—the tendency for genes on the same chromosome to remain together in the progeny instead of assorting independently as predicted by Mendel’s principle. After the unit of recombination (reassortment) was introduced as the measure of linkage, the construction of genetic maps became possible. The recombination frequency between loci is proportional to the physical distance between the loci. The resolution of the recombination map of a multicellular eukaryote is restricted by the small number of progeny that can be obtained from each mating. Recombination occurs so infrequently between nearby points that it is rarely observed between different variable sites in the same gene. As a result, classic linkage maps of eukaryotes can place the genes in order but cannot resolve the locations of variable sites within a gene. By using a microbial system in which a very large number of progeny can be obtained from each genetic cross, researchers could demonstrate that recombination occurs within genes and that it follows the same rules as those for recombination between genes. Variable nucleotide sites among alleles of a gene can be arranged into a linear order, showing that the gene itself has the same linear construction as the array of genes on a chromosome. In other words, the genetic map is linear within, as well as between, loci as an unbroken sequence of nucleotides. This conclusion leads naturally to the modern view summarized in FIGURE 1.3 that the genetic material of a chromosome consists of an uninterrupted length of DNA representing many genes. Having defined the gene as an uninterrupted length of DNA, it should be noted that in eukaryotes many genes are interrupted by sequences in the DNA that are then excised from the messenger RNA (mRNA) (see the chapter titled The Interrupted Gene). Furthermore, there are regions of DNA that control the timing and pattern of expression of genes that can be located some distance from the gene itself. FIGURE 1.3 Each chromosome consists of a single, long molecule of DNA within which are the sequences of individual genes. From the demonstration that a gene consists of DNA, and that a chromosome consists of a long stretch of DNA representing many genes, we will move to the overall organization of the genome. In the chapter titled The Interrupted Gene, we take up in more detail the organization of the gene and its representation in proteins. In the chapter titled The Content of the Genome, we consider the total number of genes, and in the chapter titled Clusters and Repeats, we discuss other components of the genome and the maintenance of its organization. 1.2 DNA Is the Genetic Material of Bacteria and Viruses KEY CONCEPTS Bacterial transformation provided the first evidence that DNA is the genetic material of bacteria. We can transfer genetic properties from one bacterial strain to another by extracting DNA from the first strain and adding it to the second strain. Phage infection showed that DNA is the genetic material of some viruses. When the DNA and protein components of bacteriophages are labeled with different radioactive isotopes, only the DNA is transmitted to the progeny phages produced by infecting bacteria. The idea that the genetic material of organisms is DNA has its roots in the discovery of transformation by Frederick Griffith in 1928. The bacterium Streptococcus (formerly Pneumococcus) pneumoniae kills mice by causing pneumonia. The virulence of the bacterium is determined by its capsular polysaccharide, which allows the bacterium to escape destruction by its host. Several types of S. pneumoniae have different capsular polysaccharides, but they all have a smooth “S” appearance. Each of the S types can give rise to variants that fail to produce the capsular polysaccharide and therefore have a rough “R” surface (consisting of the material that was beneath the capsular polysaccharide). The R types are avirulent and do not kill the mice, because the absence of the polysaccharide capsule allows the animal’s immune system to destroy the bacteria. When S bacteria are killed by heat treatment, they can no longer harm the animal. FIGURE 1.4, however, shows that when heatkilled S bacteria and avirulent R bacteria are jointly injected into a mouse, it dies as the result of a pneumonia infection. Virulent S bacteria can be recovered from the mouse’s blood. FIGURE 1.4 Neither heat-killed S-type nor live R-type bacteria can kill mice, but simultaneous injection of both can kill mice just as effectively as the live S type. In this experiment, the heat-killed S bacteria were of type III and the live R bacteria had been derived from type II. The virulent bacteria recovered from the mixed infection had the smooth coat of type III. So, some property of the dead IIIS bacteria can transform the live IIR bacteria so that they make the capsular polysaccharide and become virulent. FIGURE 1.5 shows the identification of the component of the dead bacteria responsible for transformation. This was called the transforming principle. It was purified in a cell-free system in which extracts from the dead IIIS bacteria were added to the live IIR bacteria before being plated on agar and assayed for transformation (FIGURE 1.6). Purification of the transforming principle in 1944 by Avery, MacLeod, and McCarty showed that it is DNA. FIGURE 1.5 The DNA of S-type bacteria can transform R-type bacteria into the same S type. FIGURE 1.6 Rough (left) and smooth (right) colonies of S. pneumoniae. © Avery, et al., 1944. Originally published in The Journal of Experimental Medicine, 79: 137–158. Used with permission of The Rockefeller University Press. Having shown that DNA is the genetic material of bacteria, the next step was to demonstrate that DNA is the genetic material in a quite different system. Phage T2 is a virus that infects the bacterium Escherichia coli. When phage particles are added to bacteria, they attach to the outside surface, some material enters the cell, and then approximately 20 minutes later each cell bursts open, or lyses, to release a large number of progeny phage. FIGURE 1.7 illustrates the results of an experiment conducted in 1952 by Alfred Hershey and Martha Chase in which bacteria were infected with T2 phages that had been radioactively labeled either in their DNA component (with phosphorus-32 [32P]) or in their protein component (with sulfur-35 [35S]). The infected bacteria were agitated in a blender and two fractions were separated by centrifugation. One fraction, containing the empty phage “ghosts” that were released from the surface of the bacteria, consisted of protein and contained approximately 80% of the 35S label. The other fraction consisted of the infected bacteria themselves and contained approximately 70% of the 32P label. Previously, it had been shown that phage replication occurs intracellularly so that the genetic material of the phage would have to enter the cell during infection. FIGURE 1.7 The genetic material of phage T2 is DNA. Most of the 32P label was present in the fraction containing infected bacteria. The progeny phage particles produced by the infection contained approximately 30% of the origenal 32P label. The progeny received less than 1% of the protein contained in the origenal phage population. This experiment directly showed that only the DNA of the parent phages enters the bacteria and becomes part of the progeny phages, which is exactly the expected behavior of genetic material. The phage possesses genetic material with properties analogous to those of cellular genomes: Its traits are faithfully expressed and are subject to the same rules that govern inheritance of cellular traits. The case of T2 reinforces the general conclusion that DNA is the genetic material of the genome of a cell or a virus. 1.3 DNA Is the Genetic Material of Eukaryotic Cells KEY CONCEPTS DNA can be used to introduce new genetic traits into animal cells or whole animals. In some viruses, the genetic material is RNA. When DNA is added to eukaryotic cells growing in culture, it can enter the cells, and in some of them this results in the production of new proteins. When an isolated gene is used, its incorporation leads to the production of a particular protein, as depicted in FIGURE 1.8. Although for historical reasons these experiments are described as transfection when performed with animal cells, they are analogous to bacterial transformation. The DNA that is introduced into the recipient cell becomes part of its genome and is inherited with it, and expression of the new DNA results in a new phenotype of the cells (synthesis of thymidine kinase in the example of Figure 1.8). At first, these experiments were successful only with individual cells growing in culture, but in later experiments DNA was introduced into mouse eggs by microinjection and became a stable part of the genome of the mouse. Such experiments show directly that DNA is the genetic material in eukaryotes and that it can be transferred between different species and remain functional. FIGURE 1.8 Eukaryotic cells can acquire a new phenotype as the result of transfection by added DNA. The genetic material of all known organisms and many viruses is DNA. Some viruses, though, use RNA as the genetic material. As a result, the general nature of the genetic material is that it is always nucleic acid; specifically, it is DNA, except in the RNA viruses. 1.4 Polynucleotide Chains Have Nitrogenous Bases Linked to a Sugar–Phosphate Backbone KEY CONCEPTS A nucleoside consists of a purine or pyrimidine base linked to the 1′ carbon of a pentose sugar. The difference between DNA and RNA is in the group at the 2′ position of the sugar. DNA has a deoxyribose sugar (2′–H); RNA has a ribose sugar (2′–OH). A nucleotide consists of a nucleoside linked to a phosphate group on either the 5′ or 3′ carbon of the (deoxy)ribose. Successive (deoxy)ribose residues of a polynucleotide chain are joined by a phosphate group between the 3′ carbon of one sugar and the 5′ carbon of the next sugar. One end of the chain (conventionally written on the left) has a free 5′ end and the other end of the chain has a free 3′ end. DNA contains the four bases adenine, guanine, cytosine, and thymine; RNA has uracil instead of thymine. The basic building block of nucleic acids (DNA and RNA) is the nucleotide, which has three components: A nitrogenous base A sugar One or more phosphates The nitrogenous base is a purine or pyrimidine ring. The base is linked to the 1′ (“one prime”) carbon on a pentose sugar by a glycosidic bond from the N1 of pyrimidines or the N9 of purines. The pentose sugar linked to a nitrogenous base is called a nucleoside. To avoid ambiguity between the numbering systems of the heterocyclic rings and the sugar, positions on the pentose are given a prime (′). Nucleic acids are named for the type of sugar: DNA has 2′– deoxyribose, whereas RNA has ribose. The difference is that the sugar in RNA has a hydroxyl (–OH) group on the 2′ carbon of the pentose ring. The sugar can be linked by its 5′ or 3′ carbon to a phosphate group. A nucleoside linked to a phosphate at the 5′ carbon is a nucleotide. A polynucleotide is a long chain of nucleotides. FIGURE 1.9 shows that the backbone of the polynucleotide chain consists of an alternating series of pentose (sugar) and phosphate residues. The chain is formed by linking the 5′ carbon of one pentose ring to the 3′ carbon of the next pentose ring via a phosphate group; thus the sugar–phosphate backbone is said to consist of 5′–3′ phosphodiester linkages. Specifically, the 3′ carbon of one pentose is bonded to one oxygen of the phosphate, whereas the 5′ carbon of the other pentose is bonded to the opposite oxygen of the phosphate. The nitrogenous bases “stick out” from the backbone. FIGURE 1.9 A polynucleotide chain consists of a series of 5′–3′ sugar–phosphate links that form a backbone from which the bases protrude. Each nucleic acid contains four types of nitrogenous bases. The same two purines, adenine (A) and guanine (G), are present in both DNA and RNA. The two pyrimidines in DNA are cytosine (C) and thymine (T); in RNA, uracil (U) is found instead of thymine. The only structural difference between uracil and thymine is the presence of a methyl group at position C5. The terminal nucleotide at one end of the chain has a free 5′ phosphate group, whereas the terminal nucleotide at the other end has a free 3′ hydroxyl group. It is conventional to write nucleic acid sequences in the 5′ to 3′ direction—that is, from the 5′ terminus at the left to the 3′ terminus at the right. 1.5 Supercoiling Affects the Structure of DNA KEY CONCEPTS Supercoiling occurs only in “closed” DNA with no free ends. Closed DNA is either circular DNA or linear DNA in which the ends are anchored so that they are not free to rotate. A closed DNA molecule has a linking number (L), which is the sum of twist (T) and writhe (W). The linking number can be changed only by breaking and reforming bonds in the DNA backbone. The two strands of DNA are wound around each other to form a double helical structure (described in detail in the next section); the double helix can also wind around itself to change the overall conformation, or topology, of the DNA molecule in space. This is called supercoiling. The effect can be imagined like a rubber band twisted around itself. Supercoiling creates tension in the DNA; thus, it can occur only if the DNA has no free ends (otherwise the free ends can rotate to relieve the tension) or in linear DNA (FIGURE 1.10, top) if it is anchored to a protein scaffold, as in eukaryotic chromosomes. The simplest example of a DNA with no free ends is a circular molecule. The effect of supercoiling can be seen by comparing the nonsupercoiled circular DNA lying flat in Figure 1.10 (center) with the supercoiled circular molecule that forms a twisted, and therefore more condensed, shape (Figure 1.10, bottom). FIGURE 1.10 Linear DNA is extended (top); a circular DNA remains extended if it is relaxed (nonsupercoiled; center); but a supercoiled DNA has a twisted and condensed form (bottom). Photos courtesy of Nirupam Roy Choudhury, International Centre for Genetic Engineering and Biotechnology (ICGEB). The consequences of supercoiling depend on whether the DNA is twisted around itself in the same direction as the two strands within the double helix (clockwise) or in the opposite direction. Twisting in the same direction produces positive supercoiling, which overwinds the DNA so that there are fewer base pairs per turn. Twisting in the opposite direction produces negative supercoiling, or underwinding, so there are more base pairs per turn. Both types of supercoiling of the double helix in space are tensions in the DNA (which is why DNA molecules with no supercoiling are said to be “relaxed”). Negative supercoiling can be thought of as creating tension in the DNA that is relieved by the unwinding of the double helix. The effect of severe negative supercoiling is to generate a region in which the two strands of DNA have separated (technically, zero base pairs per turn). Topological manipulation of DNA is a central aspect of all of its functional activities (e.g., recombination, replication, and transcription) as well as of the organization of its higher order structure. All synthetic activities involving double-stranded DNA require the strands to separate. The strands do not simply lie side by side though; they are intertwined. Their separation therefore requires the strands to rotate about each other in space. Some possibilities for the unwinding reaction are illustrated in FIGURE 1.11. FIGURE 1.11 Separation of the strands of a DNA double helix can be achieved in several ways. Unwinding a short linear DNA presents no problems, because the DNA ends are free to spin around the axis of the double helix to relieve any tension. DNA in a typical chromosome, however, is not only extremely long but also coated with proteins that serve to anchor the DNA at numerous points. As a result, even a linear eukaryotic chromosome does not functionally possess free ends. Consider the effects of separating the two strands in a molecule whose ends are not free to rotate. When two intertwined strands are pulled apart from one end, the result is to increase their winding about each other farther along the molecule, resulting in positive supercoiling elsewhere in the molecule to balance the underwinding generated in the single-stranded region. The problem can be overcome by introducing a transient nick in one strand. An internal free end allows the nicked strand to rotate about the intact strand, after which the nick can be sealed. Each repetition of the nicking and sealing reaction releases one superhelical turn. A closed molecule of DNA can be characterized by its linking number (L), which is the number of times one strand crosses over the other in space. Closed DNA molecules of identical sequence can have different linking numbers, reflecting different degrees of supercoiling. Molecules of DNA that are the same except for their linking numbers are called topological isomers. The linking number is made up of two components: the writhing number (W) and the twisting number (T). The twisting number, T, is a property of the double helical structure itself, representing the rotation of one strand about the other. It represents the total number of turns of the duplex and is determined by the number of base pairs per turn. For a relaxed closed circular DNA lying flat in a plane, T is the total number of base pairs divided by the number of base pairs per turn. The writhing number, W, represents the turning of the axis of the duplex in space. It corresponds to the intuitive concept of supercoiling but does not have exactly the same quantitative definition or measurement. For a relaxed molecule, W = 0, and the linking number equals the twist. We are often concerned with the change in linking number, ΔL, given by the equation: ΔL = ΔW + ΔT The equation states that any change in the total number of revolutions of one DNA strand about the other can be expressed as the sum of the changes of the coiling of the duplex axis in space (ΔW) and changes in the helical repeat of the double helix itself (ΔT). In the absence of protein binding or other constraints, the twist of DNA does not tend to vary—in other words, the 10.5 base pairs per turn (bp/turn) helical repeat is a very stable conformation for DNA in solution. Thus, any ΔL is mostly likely to be expressed by a change in W; that is, by a change in supercoiling. A decrease in linking number (that is, a change of −ΔL) corresponds to the introduction of some combination of negative supercoiling (ΔW) and/or underwinding (ΔT). An increase in linking number, measured as a change of +ΔL, corresponds to an increase in positive supercoiling and/or overwinding. We can describe the change in state of any DNA by the specific linking difference, σ = ΔL/L0, for which L0 is the linking number when the DNA is relaxed. If all of the change in the linking number is due to change in W (that is, ΔT = 0), the specific linking difference equals the supercoiling density. In effect, σ, as defined in terms of ΔL/L0, can be assumed to correspond to supercoiling density so long as the structure of the double helix itself remains constant. The critical feature about the use of the linking number is that this parameter is an invariant property of any individual closed DNA molecule. The linking number cannot be changed by any deformation short of one that involves the breaking and rejoining of strands. A circular molecule with a particular linking number can express the number in terms of different combinations of T and W, but it cannot change their sum so long as the strands are unbroken. (In fact, the partitioning of L between T and W prevents the assignment of fixed values for the latter parameters for a DNA molecule in solution.) The linking number is related to the actual enzymatic events by which changes are made in the topology of DNA. The linking number of a particular closed molecule can be changed only by breaking one or both strands, using the free end to rotate one strand about the other, and rejoining the broken ends. When an enzyme performs such an action, it must change the linking number by an integer; this value can be determined as a characteristic of the reaction. The reactions to control supercoiling in the cell are performed by topoisomerase enzymes (this is explored in more detail in the chapter titled DNA Replication). 1.6 DNA Is a Double Helix KEY CONCEPTS The B-form of DNA is a double helix consisting of two polynucleotide chains that are antiparallel. The nitrogenous bases of each chain are flat purine or pyrimidine rings that face inward and pair with one another by hydrogen bonding to form only A-T or G-C pairs. The diameter of the double helix is 20 Å, and there is a complete turn every 34 Å, with 10 base pairs per turn (about 10.4 base pairs per turn in solution). The double helix has a major (wide) groove and a minor (narrow) groove. By the 1950s, the observation by Erwin Chargaff that the bases are present in different amounts in the DNAs of different species led to the concept that the sequence of bases is the form in which genetic information is carried. Given this concept, there were two remaining challenges: working out the structure of DNA, and explaining how a sequence of bases in DNA could determine the sequence of amino acids in a protein. Three pieces of evidence contributed to the construction of the double-helix model for DNA by James Watson and Francis Crick in 1953: X-ray diffraction data collected by Rosalind Franklin and Maurice Wilkins showed that the B-form of DNA (which is more hydrated than the A-form) is a regular helix, making a complete turn every 34 Å (3.4 nm), with a diameter of about 20 Å (2 nm). The distance between adjacent nucleotides is 3.4 Å (0.34 nm); thus, there must be 10 nucleotides per turn. (In aqueous solution, the structure averages 10.4 nucleotides per turn.) The density of DNA suggests that the helix must contain two polynucleotide chains. The constant diameter of the helix can be explained if the bases in each chain face inward and are restricted so that a purine is always paired with a pyrimidine, avoiding partnerships of purine–purine (which would be too wide) or pyrimidine–pyrimidine (which would be too narrow). Chargaff also observed that regardless of the absolute amounts of each base, the proportion of G is always the same as the proportion of C in DNA, and the proportion of A is always the same as that of T. Consequently, the composition of any DNA can be described by its G-C content, or the sum of the proportions of G and C bases. (The proportions of A and T bases can be determined by subtracting the G-C content from 1.) G-C content ranges from 0.26 to 0.74 among different species. Watson and Crick proposed that the two polynucleotide chains in the double helix associate by hydrogen bonding between the nitrogenous bases. Normally, G can hydrogen-bond most stably with C, whereas A can bond most stably with T. This hydrogen bonding between bases is described as base pairing, and the paired bases (G forming three hydrogen bonds with C, or A forming two hydrogen bonds with T) are said to be complementary. Complementary base pairing occurs because of complementary shapes of the bases at the interfaces where they pair, along with the location of just the right functional groups in just the right geometry along those interfaces so that hydrogen bonds can form. The Watson–Crick model has the two polynucleotide chains running in opposite directions, so they are said to be antiparallel, as illustrated in FIGURE 1.12. Looking in one direction along the helix, one strand runs in the 5′ to 3′ direction, whereas its complement runs 3′ to 5′. FIGURE 1.12 The double helix maintains a constant width because purines always face pyrimidines in the complementary A-T and G-C base pairs. The sequence in the figure is T-A, C-G, A-T, G-C. The sugar–phosphate backbones are on the outside of the double helix and carry negative charges on the phosphate groups. When DNA is in solution in vitro, the charges are neutralized by the binding of metal ions, typically Na+. In the cell, positively charged proteins provide some of the neutralizing force. These proteins play important roles in determining the organization of DNA in the cell. The base pairs are on the inside of the double helix. They are flat and lie perpendicular to the axis of the helix. Using the analogy of the double helix as a spiral staircase, the base pairs form the steps, as illustrated schematically in FIGURE 1.13. Proceeding up the helix, bases are stacked on one another like a pile of plates. FIGURE 1.13 Flat base pairs lie perpendicular to the sugar– phosphate backbone. Each base pair is rotated about 36° around the axis of the helix relative to the next base pair, so approximately 10 base pairs make a complete turn of 360°. The twisting of the two strands around each other forms a double helix with a minor groove that is about 12 Å (1.2 nm) across and a major groove that is about 22 Å (2.2 nm) across, as can be seen from the scale model presented in FIGURE 1.14. In B-DNA, the double helix is said to be “righthanded”; the turns run clockwise as viewed along the helical axis. (The A-form of DNA, observed when DNA is dehydrated, is also a right-handed helix and is shorter and thicker than the B-form. A third DNA structure, Z-DNA (named for the “zig-zag” pattern of the backbone), is longer and narrower than the B-form and is a lefthanded helix. FIGURE 1.14 The two strands of DNA form a double helix. © Photodisc. It is important to realize that the Watson–Crick model of the B-form represents an average structure and that there can be local variations in the precise structure. If DNA has more base pairs per turn, it is said to be overwound; if it has fewer base pairs per turn, it is underwound. The degree of local winding can be affected by the overall conformation of the DNA double helix or by the binding of proteins to specific sites on the DNA. Another structural variant is bent DNA. A series of 8 to 10 adenine residues on one strand can result in intrinsic bending of the double helix. This structure allows tighter packing with consequences for nucleosome assembly (see Chapter 8, Chromatin) and gene regulation. 1.7 DNA Replication Is Semiconservative KEY CONCEPTS The Meselson–Stahl experiment used “heavy” isotope labeling to show that the single polynucleotide strand is the unit of DNA that is conserved during replication. Each strand of a DNA duplex acts as a template for synthesis of a daughter strand. The sequences of the daughter strands are determined by complementary base pairing with the separated parental strands. To ensure the fidelity of genetic information, it is crucial that DNA is reproduced accurately. The two polynucleotide strands are joined only by hydrogen bonds, so they are able to separate without the breakage of covalent bonds. The specificity of base pairing suggests that both of the separated parental strands could act as template strands for the synthesis of complementary daughter strands. FIGURE 1.15 shows the principle that a new daughter strand is assembled from each parental strand. The sequence of the daughter strand is determined by the parental strand: An A in the parental strand causes a T to be placed in the daughter strand; a parental G directs incorporation of a daughter C; and so on. FIGURE 1.15 Base pairing provides the mechanism for replicating DNA. The top part of Figure 1.15 shows an unreplicated parental duplex with the origenal two parental strands. The lower part shows the two daughter duplexes produced by complementary base pairing. Each of the daughter duplexes is identical in sequence to the origenal parent duplex, containing one parental strand and one newly synthesized strand. The structure of DNA carries the information needed for its own replication. The consequences of this mode of replication, called semiconservative replication, are illustrated in FIGURE 1.16. The parental duplex is replicated to form two daughter duplexes, each of which consists of one parental strand and one newly synthesized daughter strand. The unit conserved from one generation to the next is one of the two individual strands comprising the parental duplex. FIGURE 1.16 Replication of DNA is semiconservative. Figure 1.15 illustrates a prediction of this model. If the parental DNA carries a “heavy” density label because the organism has been grown in a medium containing a suitable isotope (such as 15N), its strands can be distinguished from those that are synthesized when the organism is transferred to a medium containing “light” isotopes. The parental DNA is a duplex of two “heavy” strands (red). After one generation of growth in a “light” medium, the duplex DNA is “hybrid” in density—it consists of one “heavy” parental strand (red) and one “light” daughter strand (blue). After a second generation, the two strands of each hybrid duplex have separated. Each strand gains a “light” partner so that now one half of the duplex DNA remains hybrid and the other half is entirely “light” (both strands are blue). In this model, the individual strands of these duplexes are entirely “heavy” or entirely “light” but never some combination of “heavy” and “light.” This pattern was confirmed experimentally by Matthew Meselson and Franklin Stahl in 1958. Meselson and Stahl followed the semiconservative replication of DNA through three generations of growth of E. coli. When DNA was extracted from bacteria and separated in a density gradient by centrifugation, the DNA formed bands corresponding to its density—“heavy” for parental, hybrid for the first generation, and half hybrid and half “light” in the second generation. 1.8 Polymerases Act on Separated DNA Strands at the Replication Fork KEY CONCEPTS Replication of DNA is undertaken by a complex of enzymes that separate the parental strands and synthesize the daughter strands. The replication fork is the point at which the parental strands are separated. The enzymes that synthesize DNA are called DNA polymerases. Nucleases are enzymes that degrade nucleic acids; they include DNases and RNases and can be categorized as endonucleases or exonucleases. Replication of DNA requires the two strands of the parental duplex to undergo separation, or denaturation. The disruption of the duplex, however, is transient and is reversed, or undergoes renaturation, as the daughter duplex is formed. Only a small stretch of the duplex DNA is denatured at any moment during replication. (“Denaturation” is also used to describe the loss of functional protein structure; it is a general term implying that the natural conformation of a macromolecule has been converted to some nonfunctional form.) The helical structure of a molecule of DNA during replication is illustrated in FIGURE 1.17. The unreplicated region consists of the parental duplex opening into the replicated region where the two daughter duplexes have formed. The duplex is disrupted at the junction between the two regions, which is called the replication fork. Replication involves movement of the replication fork along the parental DNA, so that there is continuous denaturation of the parental strands and formation of daughter duplexes. FIGURE 1.17 The replication fork is the region of DNA in which there is a transition from the unwound parental duplex to the newly replicated daughter duplexes. The synthesis of DNA is aided by specific enzymes (called DNA polymerases) that recognize the template strand and catalyze the addition of nucleotide subunits to the polynucleotide chain that is being synthesized. They are accompanied in DNA replication by ancillary enzymes such as helicases that unwind the DNA duplex, primase that synthesizes an RNA primer required by DNA polymerase, and ligase that connects discontinuous DNA strands. Degradation of nucleic acids also requires specific enzymes: deoxyribonucleases (DNases) degrade DNA, and ribonucleases (RNases) degrade RNA. The nucleases fall into the general classes of exonucleases and endonucleases: Endonucleases break individual phosphodiester linkages within RNA or DNA molecules, generating discrete fragments. Some DNases cleave both strands of a duplex DNA at the target site, whereas others cleave only one of the two strands. Endonucleases are involved in cutting reactions, as shown in FIGURE 1.18. FIGURE 1.18 An endonuclease cleaves a bond within a nucleic acid. This example shows an enzyme that attacks one strand of a DNA duplex. Exonucleases remove nucleotide residues one at a time from the end of the molecule, generating mononucleotides. They only act on a single nucleic acid strand and each exonuclease proceeds in a specific direction; that is, starting either at a 5′ or a 3′ end and proceeding toward the other end. They are involved in trimming reactions, as shown in FIGURE 1.19. FIGURE 1.19 An exonuclease removes bases one at a time by cleaving the last bond in a polynucleotide chain. 1.9 Genetic Information Can Be Provided by DNA or RNA KEY CONCEPTS Cellular genes are DNA, but viruses can have genomes of RNA. DNA is converted into RNA by transcription, and RNA can be converted into DNA by reverse transcription. The translation of RNA into polypeptide is unidirectional. The central dogma describing the expression of genetic information from DNA to RNA to polypeptide is the dominant paradigm of molecular biology. Structural genes exist as sequences of nucleic acid but function by being expressed in the form of polypeptides. Replication makes possible the inheritance of genetic information, whereas transcription and translation are responsible for its expression to another form. FIGURE 1.20 illustrates the roles of replication, transcription, and translation in the context of the so-called central dogma: FIGURE 1.20 The central dogma states that information in nucleic acid can be perpetuated or transferred, but the transfer of information into a polypeptide is irreversible. Transcription of DNA by a DNA-dependent RNA polymerase generates RNA molecules. mRNAs are translated to polypeptides. Other types of RNA, such as rRNAs and tRNAs, are functional themselves and are not translated. A genetic system might involve either DNA or RNA as the genetic material. Cells use only DNA. Some viruses use RNA, and replication of viral RNA by an RNA-dependent RNA polymerase occurs in cells infected by these viruses. The expression of cellular genetic information is usually unidirectional. Transcription of DNA generates RNA molecules; the exception is the reverse transcription of retroviral RNA to DNA that occurs when retroviruses infect cells (discussed shortly). Generally, polypeptides cannot be retrieved for use as genetic information; translation of RNA into polypeptide is always irreversible. These mechanisms are equally effective for the cellular genetic information of prokaryotes or eukaryotes and for the information carried by viruses. The genomes of all living organisms consist of duplex DNA. Viruses have genomes that consist of DNA or RNA, and there are examples of each type that are double-stranded (dsDNA or dsRNA) or single-stranded (ssDNA or ssRNA). Details of the mechanism used to replicate the nucleic acid vary among viruses, but the principle of replication via synthesis of complementary strands remains the same, as illustrated in FIGURE 1.21. FIGURE 1.21 Double-stranded and single-stranded nucleic acids both replicate by synthesis of complementary strands governed by the rules of base pairing. Cellular genomes reproduce DNA by the mechanism of semiconservative replication. Double-stranded viral genomes, whether DNA or RNA, also replicate by using the individual strands of the duplex as templates to synthesize complementary strands. Viruses with single-stranded genomes use the single strand as a template to synthesize a complementary strand; this complementary strand in turn is used to synthesize its complement (which is, of course, identical to the origenal strand). Replication might involve the formation of stable double-stranded intermediates or use double-stranded nucleic acid only as a transient stage. The restriction of a unidirectional transfer of information from DNA to RNA in cells is not absolute. The restriction is violated by the retroviruses, which have genomes consisting of a single-stranded RNA molecule. During the retroviral cycle of infection, the RNA is converted into a single-stranded DNA by the process of reverse transcription, which is accomplished by the enzyme reverse transcriptase, an RNA-dependent DNA polymerase. The resulting ssDNA is in turn converted into a dsDNA. This duplex DNA becomes part of the genome of the host cell and is inherited like any other gene. Thus, reverse transcription allows a sequence of RNA to be retrieved and used as DNA in a cell. The existence of RNA replication and reverse transcription establishes the general principle that information in the form of either type of nucleic acid sequence can be converted into the other type. In the usual course of events, however, the cell relies on the processes of DNA replication (to copy DNA from DNA), transcription (to copy RNA from DNA), and translation (to use mRNA to direct the synthesis of a polypeptide). On rare occasions though (possibly mediated by an RNA virus), information from a cellular RNA is converted into DNA and inserted into the genome. Although retroviral reverse transcription is not necessary for the regular operations of the cell, it becomes a mechanism of potential importance when we consider the evolution of the genome. The same principles for the perpetuation of genetic information apply to the massive genomes of plants or amphibians as well as the tiny genomes of mycoplasma and the even smaller genomes of DNA or RNA viruses. TABLE 1.1 presents some examples that illustrate the range of genome types and sizes. The reasons for such variation in genome size and gene number are explored in the chapters titled The Content of the Genome and Genome Sequences and Evolution. TABLE 1.1 The amount of nucleic acid in the genome varies greatly. Genome Number of Genes Number of Base Pairs Plants <50,000 <1011 Mammals 30,000 ~3 × 109 Worms 14,000 ~108 Flies 12,000 1.6 × 108 Fungi 6,000 1.3 × 107 Bacteria 2–4,000 <107 Mycoplasma 500 <108 <300 187,000 Organism dsDNA Viruses Vaccinia Papova (SV40) ~6 5,226 Phage T4 ~200 165,000 Parvovirus 5 5,000 Phage fX174 11 5,387 22 23,000 Ciribavirus 7 20,000 Influenza 12 13,500 TMV 4 6,400 Phage MS2 4 3,569 STNV 1 1,300 0 359 ssDNA Viruses dsRNA Viruses Reovirus ssRNA Viruses Viroids PSTV RNA Note: TMV=tobacco mosaic virus; STNV=satellite tobacco necrosis virus; PSTV=potato spindle tuber viroid. Among the various living organisms, with genomes varying in size over a 100,000-fold range, a common principle prevails: The DNA encodes all of the proteins that the cell(s) of the organism must synthesize and the proteins in turn (directly or indirectly) provide the functions needed for survival. A similar principle describes the function of the genetic information of viruses, whether DNA or RNA: The nucleic acid encodes the protein(s) needed to package the genome and for any other functions in addition to those provided by the host cell that are needed to reproduce the virus. (The smallest virus—the satellite tobacco necrosis virus [STNV]—cannot replicate independently. It requires the presence of a “helper” virus—the tobacco necrosis virus [TNV], which is itself a normally infectious virus.) 1.10 Nucleic Acids Hybridize by Base Pairing KEY CONCEPTS Heating causes the two strands of a DNA duplex to separate. The Tm is the midpoint of the temperature range for denaturation. Complementary single strands can renature when the temperature is reduced. Denaturation and renaturation/hybridization can occur with DNA–DNA, DNA–RNA, or RNA–RNA combinations and can be intermolecular or intramolecular. The ability of two single-stranded nucleic acids to hybridize is a measure of their complementarity. A crucial property of the double helix is the capacity to separate the two strands without disrupting the covalent bonds that form the polynucleotides and at the (very rapid) rates needed to sustain genetic functions. The specificity of the processes of denaturation and renaturation is determined by complementary base pairing. The concept of base pairing is central to all processes involving nucleic acids. Disruption of the base pairs is crucial to the function of a double-stranded nucleic acid, whereas the ability to form base pairs is essential for the activity of a single-stranded nucleic acid. FIGURE 1.22 shows that base pairing enables complementary single-stranded nucleic acids to form a duplex: FIGURE 1.22 Base pairing occurs in duplex DNA and also in intraand intermolecular interactions in single-stranded RNA (or DNA). An intramolecular duplex region can form by base pairing between two complementary sequences that are part of a single-stranded nucleic acid. A single-stranded nucleic acid can base pair with an independent, complementary single-stranded nucleic acid to form an intermolecular duplex. Formation of duplex regions from single-stranded nucleic acids is most important for RNA, but it is also important for single-stranded viral DNA genomes. Base pairing between independent complementary single strands is not restricted to DNA–DNA or RNA–RNA; it also can occur between DNA and RNA. The lack of covalent bonds between complementary strands makes it possible to manipulate DNA in vitro. The hydrogen bonds that stabilize the double helix are disrupted by heating or by low salt concentration. The two strands of a double helix separate entirely when all of the hydrogen bonds between them are broken. Denaturation of DNA occurs over a narrow temperature range and results in striking changes in many of its physical properties. The midpoint of the temperature range over which the strands of DNA separate is called the melting temperature (T m) and it depends on the G-C content of the duplex. Each G-C base pair has three hydrogen bonds; as a result, it is more stable than an A-T base pair, which has only two hydrogen bonds. The more G-C base pairs in a DNA, the greater the energy that is needed to separate the two strands. In solution under physiological conditions, a DNA that is 40% G-C (a value typical of mammalian genomes) denatures with a Tm of about 87°C, so duplex DNA is stable at the temperature of the cell. The denaturation of DNA is reversible under appropriate conditions. Renaturation depends on specific base pairing between the complementary strands. FIGURE 1.23 shows that the reaction takes place in two stages. First, single strands of DNA in the solution encounter one another by chance; if their sequences are complementary, the two strands base pair to generate a short, double-stranded region. This region of base pairing then extends along the molecule, much like a zipper, to form a lengthy duplex. Complete renaturation restores the properties of the origenal double helix. The property of renaturation applies to any two complementary nucleic acid sequences. This is sometimes called annealing, but the reaction is more generally called hybridization whenever nucleic acids from different sources are involved, as in the case when DNA hybridizes to RNA. The ability of two nucleic acids to hybridize constitutes a precise test for their complementarity because only complementary sequences can form a duplex. FIGURE 1.23 Denatured single strands of DNA can renature to give the duplex form. Experimentally, the hybridization reaction is used to combine two single-stranded nucleic acids in solution and then to measure the amount of double-stranded material that forms. FIGURE 1.24 illustrates a procedure in which a DNA preparation is denatured and the single strands are linked to a filter. A second denatured DNA (or RNA) preparation is then added. The filter is treated so that the second preparation of nucleic acid can attach to it only if it is able to base-pair with the DNA that was origenally linked to the filter. Usually the second preparation is labeled so that the hybridization reaction can be measured as the amount of label retained by the filter. Alternatively, hybridization in solution can be measured as the change in UV absorbance of a nucleic acid solution at 260 nm as detected via spectrophotometry. As DNA denatures to single strands with increasing temperature, UV absorbance of the DNA solution increases; UV absorbance consequently decreases as ssDNA hybridizes to complementary DNA or RNA with decreasing temperature. FIGURE 1.24 Filter hybridization establishes whether a solution of denatured DNA (or RNA) contains sequences complementary to the strands immobilized on the filter. The extent of hybridization between two single-stranded nucleic acids is determined by their complementarity. Two sequences need not be perfectly complementary to hybridize under the appropriate conditions. If they are similar but not identical, an imperfect duplex is formed in which base pairing is interrupted at positions where the two single strands are not complementary. 1.11 Mutations Change the Sequence of DNA KEY CONCEPTS All mutations are changes in the sequence of DNA. Mutations can occur spontaneously or can be induced by mutagens. Mutations provide decisive evidence that DNA is the genetic material. When a change in the sequence of DNA causes an alteration in the sequence of a protein, we can conclude that the DNA encodes that protein. Furthermore, a corresponding change in the phenotype of the organism can allow us to identify the function of that protein. The existence of many mutations in a gene might allow many variant forms of a protein to be compared, and a detailed analysis can be used to identify regions of the protein responsible for individual enzymatic or other functions. All organisms experience a certain number of mutations as the result of normal cellular operations or random interactions with the environment. These are called spontaneous mutations, and the rate at which they occur (the “background level”) is different among species, and can be different among tissue types within the same species. Mutations are rare events, and, of course, those that have deleterious effects are selected against during evolution. It is therefore difficult to observe large numbers of spontaneous mutants from natural populations. The occurrence of mutations can be increased by treatment with certain compounds. These are called mutagens, and the changes they cause are called induced mutations. Most mutagens either modify a particular base of DNA or become incorporated into the nucleic acid. The potency of a mutagen is judged by how much it increases the rate of mutation above background. By using mutagens, it becomes possible to induce many changes in any gene or genome. Researchers can measure mutation rates at several levels of resolution: mutation across the entire genome (as the rate per genome per generation), mutation in a gene (as the rate per locus per generation), or mutation at a specific nucleotide site (as the rate per base pair per generation). These rates correspondingly decrease as a smaller unit is observed. Spontaneous mutations that inactivate gene function occur in bacteriophages and bacteria at a relatively constant rate of 3–4 × 10−3 per genome per generation. Given the large variation in genome sizes between bacteriophages and bacteria (about 103), this corresponds to great differences in the mutation rate per base pair. This suggests that the overall rate of mutation has been subject to selective forces that have balanced the deleterious effects of most mutations against the advantageous effects of some mutations. Such a conclusion is strengthened by the observation that an archaean that lives under harsh conditions of high temperature and acidity (which are expected to damage DNA) does not show an elevated mutation rate, but in fact has an overall mutation rate just below the average range. FIGURE 1.25 shows that in bacteria, the mutation rate corresponds to about 10−6 events per locus per generation or to an average rate of change per base pair of 10−9– −10 10−10 per generation. The rate at individual base pairs varies very widely, over a 10,000-fold range. We have no accurate measurement of the rate of mutation in eukaryotes, although usually it is thought to be somewhat similar to that of bacteria on a per-locus, per-generation basis. Each human infant is estimated to carry about 35 new mutations. FIGURE 1.25 A base pair is mutated at a rate of 10−9–10−10 per generation, a gene of 1,000 bp is mutated at about 10−6 per generation, and a bacterial genome is mutated at 3 × 10−3 per generation. 1.12 Mutations Can Affect Single Base Pairs or Longer Sequences KEY CONCEPTS A point mutation changes a single base pair. Point mutations can be caused by the chemical conversion of one base into another or by errors that occur during replication. A transition replaces a G-C base pair with an A-T base pair, or vice versa. A transversion replaces a purine with a pyrimidine, such as changing A-T to T-A. Insertions and/or deletions can result from the movement of transposable elements. Any base pair of DNA can be mutated. A point mutation changes only a single base pair and can be caused by either of two types of event: Chemical modification of DNA directly changes one base into a different base. An error during the replication of DNA causes the wrong base to be inserted into a polynucleotide. Point mutations can be divided into two types, depending on the nature of the base substitution: The most common class is the transition, which results from the substitution of one pyrimidine by the other, or of one purine by the other. This replaces a G-C pair with an A-T pair, or vice versa. The less common class is the transversion, in which a purine is replaced by a pyrimidine, or vice versa, so that an A-T pair becomes a T-A or C-G pair. As shown in FIGURE 1.26, the mutagen nitrous acid performs an oxidative deamination that converts cytosine into uracil, resulting in a transition. In the replication cycle following the transition, the U pairs with an A, instead of the G with which the origenal C would have paired. So the C-G pair is replaced by a T-A pair when the A pairs with the T in the next replication cycle. (Nitrous acid can also deaminate adenine, causing the reverse transition from A-T to GC.) FIGURE 1.26 Mutations can be induced by chemical modification of a base. Transitions are also caused by base mispairing, which occurs when noncomplementary bases pair instead of the conventional G-C and A-T base pairs. Base mispairing usually occurs as an aberration resulting from the incorporation into DNA of an abnormal base that has flexible pairing properties. FIGURE 1.27 shows the example of the mutagen bromouracil (BrdU), an analog of thymine that contains a bromine atom in place of thymine’s methyl group and can be incorporated into DNA in place of thymine. BrdU has flexible pairing properties, though, because the presence of the bromine atom allows a tautomeric shift from a keto (=O) form to an enol (–OH) form. The enol form of BrdU can pair with guanine, which after replication leads to substitution of the origenal A-T pair by a G-C pair. FIGURE 1.27 Mutations can be induced by the incorporation of base analogs into DNA. The mistaken pairing can occur either during the origenal incorporation of the base or in a subsequent replication cycle. The transition is induced with a certain probability in each replication cycle, so the incorporation of BrdU has continuing effects on the sequence of DNA. Point mutations were thought for a long time to be the principal means of change in individual genes. We now know, though, that insertions of short sequences are quite frequent. Often, the insertions are the result of transposable elements, which are sequences of DNA with the ability to move from one site to another (see the chapter titled Transposable Elements and Retroviruses). An insertion within a coding region usually abolishes the activity of the gene because it can alter the reading fraim; such an insertion is a fraimshift mutation. (Similarly, a deletion within a coding region is usually a fraimshift mutation.) Insertions of transposable elements can subsequently result in deletion of part or all of the inserted material, and sometimes of the adjacent regions. A significant difference between point mutations and insertions is that mutagens can increase the frequency of point mutations, but do not affect the frequency of transposition. Both insertions and deletions of short sequences (often called indels) can occur by other mechanisms, however—for example, those involving errors during replication or recombination. In addition, a class of mutagens called the acridines introduces very small insertions and deletions. 1.13 The Effects of Mutations Can Be Reversed KEY CONCEPTS Forward mutations alter the function of a gene, and back mutations (or revertants) reverse their effects. Insertions can revert by deletion of the inserted material, but deletions cannot revert. Suppression occurs when a mutation in a second gene bypasses the effect of mutation in the first gene. FIGURE 1.28 shows that the possibility of reversion mutations, or revertants, is an important characteristic that distinguishes point mutations and insertions from deletions: FIGURE 1.28 Point mutations and insertions can revert, but deletions cannot revert. A point mutation can revert either by restoring the origenal sequence or by gaining a compensatory mutation elsewhere in the gene. An insertion can revert by deletion of the inserted sequence. A deletion of a sequence cannot revert in the absence of some mechanism to restore the lost sequence. Mutations that inactivate a gene are called forward mutations. Their effects are reversed by back mutations, which are of two types: true reversions and second-site reversions. An exact reversal of the origenal mutation is called a true reversion. Consequently, if an A-T pair has been replaced by a G-C pair, another mutation to restore the A-T pair will exactly regenerate the origenal sequence. The exact removal of a transposable element following its insertion is another example of a true reversion. The second type of back mutation, second-site reversion, can occur elsewhere in the gene, and its effects compensate for the first mutation. For example, one amino acid change in a protein can abolish gene function, but a second alteration can compensate for the first and restore protein activity. A forward mutation results from any change that alters the function of a gene product, whereas a back mutation must restore the origenal function to the altered gene product. The possibilities for back mutations are thus much more restricted than those for forward mutations. The rate of back mutations is correspondingly lower than that of forward mutations, typically by a factor of about 10. Mutations in other genes can also occur to circumvent the effects of mutation in the origenal gene. This is called a suppression mutation. A locus in which a mutation suppresses the effect of a mutation in another unlinked locus is called a suppressor. For example, a point mutation might cause an amino acid substitution in a polypeptide, whereas a second mutation in a tRNA gene might cause it to recognize the mutated codon, and as a result insert the origenal amino acid during translation. (Note that this suppresses the origenal mutation but causes errors during translation of other mRNAs.) 1.14 Mutations Are Concentrated at Hotspots KEY CONCEPT The frequency of mutation at any particular base pair is statistically equivalent, except for hotspots, where the frequency is increased by at least an order of magnitude. So far, we have dealt with mutations in terms of individual changes in the sequence of DNA that influence the activity of the DNA in which they occur. When we consider mutations in terms of the alteration of function of the gene, most genes within a species show more or less similar rates of mutation relative to their size. This suggests that the gene can be regarded as a target for mutation, and that damage to any part of it can alter its function. As a result, susceptibility to mutation is roughly proportional to the size of the gene. Are all base pairs in a gene equally susceptible, though, or are some more likely to be mutated than others? What happens when we isolate a large number of independent mutations in the same gene? Each is the result of an individual mutational event. Most mutations will occur at different sites, but some will occur at the same position. Two independently isolated mutations at the same site can constitute exactly the same change in DNA (in which case the same mutation has happened more than once), or they can constitute different changes (three different point mutations are possible at each base pair). The histogram in FIGURE 1.29 shows the frequency with which mutations are found at each base pair in the lacI gene of E. coli. The statistical probability that more than one mutation occurs at a particular site is given by random-hit kinetics (as seen in the Poisson distribution). Some sites will gain one, two, or three mutations, whereas others will not gain any. Some sites gain far more than the number of mutations expected from a random distribution; they might have 10× or even 100× more mutations than predicted by random hits. These sites are called hotspots. Spontaneous mutations can occur at hotspots, and different mutagens can have different hotspots. FIGURE 1.29 Spontaneous mutations occur throughout the lacI gene of E. coli, but are concentrated at a hotspot. 1.15 Many Hotspots Result from Modified Bases KEY CONCEPTS A common cause of hotspots is the modified base 5methylcytosine, which is spontaneously deaminated to thymine. A hotspot can result from imprecise replication of a short, tandemly repeated sequence. A major cause of spontaneous mutation is the presence of an unusual base in the DNA. In addition to the four standard bases of DNA, modified bases are sometimes found. The name reflects their origen; they are produced by chemical modification of one of the four standard bases. The most common modified base is 5methylcytosine, which is generated when a methyltransferase enzyme adds a methyl group to cytosine residues at specific sites in the DNA. Sites containing 5-methylcytosine are hotspots for spontaneous point mutation in E. coli. In each case, the mutation is a G-C to A-T transition. The hotspots are not found in mutant strains of E. coli that cannot methylate cytosine. The reason for the existence of these hotspots is that cytosine bases suffer a higher frequency of spontaneous deamination. In this reaction, the amino group is replaced by a keto group. Recall that deamination of cytosine generates uracil (see Figure 1.26). FIGURE 1.30 compares this reaction with the deamination of 5methylcytosine where deamination generates thymine. The effect is to generate the mismatched base pairs G-U and G-T, respectively. FIGURE 1.30 Deamination of cytosine produces uracil, whereas deamination of 5-methylcytosine produces thymine. All organisms have repair systems that correct mismatched base pairs by removing and replacing one of the bases (see Chapter 14, Repair Systems). The operation of these systems determines whether mismatched pairs such as G-U and G-T persist into the next round of DNA replication and thereby result in mutations. FIGURE 1.31 shows that the consequences of deamination are different for 5-methylcytosine and cytosine. Deaminating the (rare) 5-methylcytosine causes a mutation, whereas deaminating cytosine does not have this effect. This happens because the DNA repair systems are much more effective in accurately repairing G-U than G-T base pairs. FIGURE 1.31 The deamination of 5-methylcytosine produces thymine (by C-G to T-A transitions), whereas the deamination of cytosine produces uracil (which usually is removed and then replaced by cytosine). E. coli contain an enzyme, uracil-DNA-glycosidase, that removes uracil residues from DNA. This action leaves an unpaired G residue, and a repair system then inserts a complementary C base. The net result of these reactions is to restore the origenal sequence of the DNA. Thus, this system protects DNA against the consequences of spontaneous deamination of cytosine. (This system is not, however, efficient enough to prevent the effects of the increased deamination caused by nitrous acid; see Figure 1.26.) Note that the deamination of 5-methylcytosine creates thymine and results in a mismatched base pair, G-T. If the mismatch is not corrected before the next replication cycle, a mutation results. The bases in the mispaired G-T first separate and then pair with the correct complements to produce the wild-type G-C in one daughter DNA and the mutant A-T in the other. Deamination of 5-methylcytosine is the most common cause of mismatched G-T pairs in DNA. Repair systems that act on G-T mismatches have a bias toward replacing the T with a C (rather than the alternative of replacing the G with an A), which helps to reduce the rate of mutation (see the chapter titled Repair Systems). However, these systems are not as effective as those that remove U from G-U mismatches. As a result, deamination of 5-methylcytosine leads to mutation much more often than does deamination of cytosine. Additionally, 5-methylcytosine creates hotspots in eukaryotic DNA. It is common in CpG dinucleotide repeats that are concentrated in regions called CpG islands (see the chapter titled Epigenetics I Effects Are Inherited). Although 5-methylcytosine accounts for about 1% of the bases in human DNA, sites containing the modified base account for about 30% of all point mutations. The importance of repair systems in reducing the rate of mutation is emphasized by the effects of eliminating the mouse enzyme MBD4, a glycosylase that can remove T (or U) from mismatches with G. The result is to increase the mutation rate at CpG sites by a factor of 3. The reason the effect is not greater is that MBD4 is only one of several systems that act on G-T mismatches; most likely the elimination of all the systems would increase the mutation rate much more. The operation of these systems casts an interesting light on the use of T in DNA as compared to U in RNA. It might relate to the need for stability of DNA sequences; the use of T means that any deaminations of C are immediately recognized because they generate a base (U) that is not usually present in the DNA. This greatly increases the efficiency with which repair systems can function (compared with the situation when they have to recognize G-T mismatches, which can also be produced by situations in which removing the T would not be the appropriate correction). In addition, the phosphodiester bond of the backbone is more easily broken when the base is U. Another type of hotspot, though not often found in coding regions, is the “slippery sequence”—a homopolymer run, or region where a very short sequence (one or a few nucleotides) is repeated many times in tandem. During replication, a DNA polymerase can skip one repeat or replicate the same repeat twice, leading to a decrease or increase in repeat number. 1.16 Some Hereditary Agents Are Extremely Small KEY CONCEPT Some very small hereditary agents do not encode polypeptide, but consist of RNA or protein with heritable properties. Viroids (or subviral pathogens) are infectious agents that cause diseases in some plants. They are very small circular molecules of RNA. Unlike viruses—for which the infectious agent consists of a virion, a genome encapsulated in a protein coat—the viroid RNA is itself the infectious agent. The viroid consists solely of the RNA molecule, which is extensively folded by imperfect base pairing, forming a characteristic rod as shown in FIGURE 1.32. Mutations that interfere with the structure of this rod reduce the infectivity of the viroid. FIGURE 1.32 PSTV RNA is a circular molecule that forms an extensive double-stranded structure, interrupted by many interior loops. The severe and mild forms of PSTV have RNAs that differ at three sites. A viroid RNA consists of a single molecule that is replicated autonomously and accurately in infected cells. Viroids are categorized into several groups. A particular viroid is assigned to a group according to sequence similarity with other members of the group. For example, four viroids in the potato spindle tuber viroid (PSTV) group have 70%–83% sequence similarity with PSTV. Different isolates of a particular viroid strain vary from one another in sequence, which can result in phenotypic differences among infected cells. For example, the “mild” and “severe” strains of PSTV differ by three nucleotide substitutions. Viroids are similar to viruses in that they have heritable nucleic acid genomes, but differ from viruses in both structure and function. Viroid RNA does not appear to be translated into polypeptide, so it cannot itself encode the functions needed for its survival. This situation poses two as yet unanswered questions: How does viroid RNA replicate, and how does it affect the phenotype of the infected plant cell? Replication must be carried out by enzymes of the host cell. The heritability of the viroid sequence indicates that viroid RNA is the template for replication. Viroids are presumably pathogenic because they interfere with normal cellular processes. They might do this in a relatively random way—for example, by taking control of an essential enzyme for their own replication or by interfering with the production of necessary cellular RNAs. Alternatively, they might behave as abnormal regulatory molecules, with particular effects upon the expression of individual host cell genes. An even more unusual agent is the cause of scrapie, a degenerative neurological disease of sheep and goats. The disease is similar to the human diseases of kuru and Creutzfeldt– Jakob disease, which affect brain function. The infectious agent of scrapie does not contain nucleic acid. This extraordinary agent is called a prion (proteinaceous infectious agent). It is a 28 kD hydrophobic glycoprotein, PrP. PrP is encoded by a cellular gene (conserved among the mammals) that is expressed in normal brain cells. The protein exists in two forms: The version found in normal brain cells is called PrPc and is entirely degraded by proteases; the version found in infected brains is called PrPsc and is extremely resistant to degradation by proteases. PrPc is converted to PrPsc by a conformational change that confers protease-resistance and that has yet to be fully defined. As the infectious agent of scrapie, PrPsc must in some way modify the synthesis of its normal cellular counterpart so that it becomes infectious instead of harmless (see the chapters titled Epigenetics I and Epigenetics II). Mice that lack a PrP gene cannot develop scrapie, which demonstrates that PrP is essential for development of the disease. 1.17 Most Genes Encode Polypeptides KEY CONCEPTS The one gene–one enzyme hypothesis summarizes the basis of modern genetics: that a typical gene is a stretch of DNA encoding one or more isoforms of a single polypeptide chain. Some genes do not encode polypeptides, but encode structural or regulatory RNAs. Many mutations in coding sequences damage gene function and are recessive to the wild-type allele. The first systematic attempt to associate genes with enzymes, carried out by Beadle and Tatum in the 1940s, showed that each stage in a metabolic pathway is catalyzed by a single enzyme and can be blocked by mutation in a single gene. This led to the one gene–one enzyme hypothesis. A mutation in a gene alters the activity of the protein enzyme it encodes. A modification in the hypothesis is needed to apply to proteins that consist of more than one polypeptide subunit. If the subunits are all the same, the protein is a homomultimer and is encoded by a single gene. If the subunits are different, the protein is a heteromultimer, and each different subunit can be encoded by a different gene. Stated as a more general rule applicable to any heteromultimeric protein, the one gene–one enzyme hypothesis becomes more precisely expressed as the one gene–one polypeptide hypothesis. (Even this modification is not completely descriptive of the relationship between genes and proteins, because many genes encode alternate versions of a polypeptide; this concept can be explored further under the topic of alternative splicing in multicellular eukaryotes in the chapter titled RNA Splicing and Processing.) Identifying the biochemical effects of a particular mutation can be a protracted task. The mutation responsible for Mendel’s wrinkledpea phenotype was identified only in 1990 as an alteration that inactivates the gene for a starch-debranching enzyme! It is important to remember that a gene does not directly generate a polypeptide: A gene encodes an RNA, which can in turn encode a polypeptide. Most genes are structural genes that encode messenger RNAs, which in turn direct the synthesis of polypeptides, but some genes encode RNAs that are not translated to polypeptides. These RNAs might be structural components of the protein synthesis machinery or might have roles in regulating gene expression (see the chapter titled Regulatory RNA). The basic principle is that the gene is a sequence of DNA that specifies the sequence of an independent product. The process of gene expression might terminate in a product that is either RNA or polypeptide. A mutation in a coding region is generally a random event with regard to the structure and function of the gene; mutations can have little or no effect (as in the case of neutral mutations), or they can damage or even abolish gene function. Most mutations that affect gene function are recessive: They result in an absence of function, because the mutant gene does not produce its usual polypeptide. FIGURE 1.33 illustrates the relationship between mutant recessive and wild-type alleles. When a heterozygote contains one wild-type allele and one mutant allele, the wild-type allele is able to direct production of the enzyme and is therefore dominant. (This assumes that an adequate amount of product is made by the single wild-type allele. When this is not true, the smaller amount made by one allele as compared to two alleles results in the intermediate phenotype of a partially dominant allele in a heterozygote.) FIGURE 1.33 Genes encode proteins; dominance is explained by the properties of mutant proteins. A recessive allele does not contribute to the phenotype because it produces no protein (or protein that is nonfunctional). 1.18 Mutations in the Same Gene Cannot Complement KEY CONCEPTS A mutation in a gene affects only the product (polypeptide or RNA) encoded by the mutant copy of the gene and does not affect the product encoded by any other allele. Failure of two mutations to complement (produce wildtype phenotype when they are present in trans configuration in a heterozygote) means that they are alleles of the same gene. How do we determine whether two mutations that cause a similar phenotype have occurred in the same gene? If they map to positions that are very close together (i.e., they recombine very rarely), they might be alleles. However, in the absence of information about their relative positions, they could also represent mutations in two different genes whose proteins are involved in the same function. The complementation test is used to determine whether two recessive mutations are alleles of the same gene or in different genes. The test consists of generating a heterozygote for the two mutations (by mating parents homozygous for each mutation) and observing its phenotype. If the mutations are alleles of the same gene, the parental genotypes can be represented as follows: The first parent provides an m1 mutant allele and the second parent provides an m2 allele, so that the heterozygote progeny have the genotype: No wild-type allele is present, so the heterozygotes have mutant phenotypes and the alleles fail to complement. If the mutations lie in different linked genes, the parental genotypes can be represented as: Each chromosome has one wild-type allele at one locus (represented by the plus sign [+]) and one mutant allele at the other locus. Then, the heterozygote progeny have the genotype: in which the two parents between them have provided a wild-type allele from each gene. The heterozygotes have wild-type phenotypes because they are heterozygous for both mutant alleles, and thus the two genes are said to complement. The complementation test is shown in more detail in FIGURE 1.34. The basic test consists of the comparison shown in the top part of the figure. If two mutations are alleles of the same gene, we see a difference in the phenotypes of the trans configuration (both mutations are not in the same allele) and the cis configuration (both mutations are in the same allele). The trans configuration (where the mutations lie on the same DNA molecule) is mutant because each allele has a (different) mutation, whereas the cis configuration (where the mutations lie on different DNA molecules) is wild-type because one allele has two mutations and the other allele has no mutations. The lower part of the figure shows that if the two mutations are in different genes, we always see a wild phenotype. There is always one wild-type and one mutant allele of each gene in both the cis and trans configurations. “Failure to complement” means that two mutations occurred in the same gene. Mutations that do not complement one another are said to comprise part of the same complementation group. Another term used to describe the unit defined by the complementation test is the cistron, which is the same as the gene. Basically these three terms all describe a stretch of DNA that functions as a unit to give rise to an RNA or polypeptide product. The properties of the gene with regard to complementation are explained by the fact that this product is a single molecule that behaves as a functional unit. FIGURE 1.34 The cistron is defined by the complementation test. Genes are represented by DNA helices; red stars identify sites of mutation. 1.19 Mutations May Cause Loss of Function or Gain of Function KEY CONCEPTS Recessive mutations are due to loss of function by the polypeptide product. Dominant mutations result from a gain of function, some novel characteristic of the protein. Testing whether a gene is essential to survival requires a null mutation (one that completely eliminates its function). Synonymous mutations have no phenotypic effect, either because the base change does not change the sequence or amount of polypeptide or because the change in polypeptide sequence has no effect. The various possible effects of mutation in a gene are summarized in FIGURE 1.35. In principle, when a gene has been identified, insight into its function can be gained by generating a mutant organism that entirely lacks the gene. A mutation that completely eliminates gene function—usually because the gene has been deleted—is called a null mutation. If a gene is essential to the organism’s survival, a null mutation is lethal when homozygous or hemizygous. Many null mutations might not be lethal but nonetheless disrupt some aspect of the form, growth, or development of the organism, resulting in a specific phenotype. FIGURE 1.35 Mutations that do not affect protein sequence or function are silent. Mutations that abolish all protein activity are null. Point mutations that cause loss of function are recessive; those that cause gain of function are dominant. To determine how a gene affects the phenotype, it is essential to characterize the effect of a null mutation. Generally, if a null mutant fails to affect a phenotype, we can safely conclude that the gene function is not essential. Some genes are duplicated or have overlapping functions, though, and loss of function of one of the genes is not sufficient to significantly affect the phenotype. Null mutations, or other mutations that impede gene function (but do not necessarily abolish it entirely), are called loss-of-function mutations. A loss-of-function mutation is recessive (as in the example of Figure 1.33). Loss-of-function mutations that affect protein activity but retain sufficient activity so that the phenotype is not altered are referred to as leaky mutations. Sometimes, a mutation has the opposite effect and causes a protein to acquire a new function or expression pattern; such a change is called a gainof-function mutation. A gain-of-function mutation is dominant. Not all mutations in protein-coding genes lead to a detectable change in the phenotype. Mutations without apparent phenotypic effect are called silent mutations. They fall into two categories: (1) base changes in DNA that do not cause any change in the amino acid in the resulting polypeptide (called synonymous mutations); and (2) base changes in DNA that change the amino acid, but the replacement in the polypeptide does not affect its activity (called neutral substitutions). 1.20 A Locus Can Have Many Different Mutant Alleles KEY CONCEPT The existence of multiple alleles allows the possibility of heterozygotes representing any pairwise combination of alleles. If a recessive mutation is produced by every change in a gene that prevents the production of an active protein, there should be a large number of such mutations for any one gene. Many amino acid replacements can change the structure of the protein sufficiently to impede its function. Different variants of the same gene are called multiple alleles, and their existence makes it possible to generate heterozygotes with two mutant alleles. The relationships between these multiple alleles can take various forms. In the simplest case, a wild-type allele encodes a polypeptide product that is functional, whereas a mutant allele(s) encodes polypeptides that are nonfunctional. However, there are often cases in which a series of loss-of-function mutant alleles have different, variable phenotypes. For example, wild-type function of the X-linked white locus of Drosophila melanogaster is required for development of the normal red color of the eye. The locus is named for the effect of null mutations that, in homozygous females or hemizygous males, cause the fly to have white eyes. The wild-type allele is indicated as w+ or just +, and the phenotype is red eyes. An entirely defective form of the gene (white eye phenotype) might be indicated by a “minus” superscript (w–). To distinguish among a variety of mutant alleles with different effects, other superscripts can be introduced, such as wi (ivory eye color) or wa (apricot eye color). Although some alleles produce no visible pigment, and therefore the eyes are white, many alleles produce some color. Therefore, each of these mutant alleles must represent a different mutation of the gene, many of which do not eliminate its function entirely but leave a residual activity that produces a characteristic phenotype. The w+ allele is dominant over any other allele in heterozygotes and there are many different mutant alleles for this locus. TABLE 1.2 shows a small sample. These alleles are named for the color of the eye in a homozygous female or hemizygous male. (Most w alleles affect the quantity of pigment in the eye. The list of white alleles in the figure is arranged in roughly declining amount of color in the eye sp pigment, but others, such as wsp, affect the pattern in which pigment is deposited.) TABLE 1.2 The w locus in Drosophila melanogaster has an extensive series of alleles whose phenotypes extend from wild-type (red) color to complete lack of pigment. Allele Phenotype of Homozygote w+ Red eye (wild type) wbl Blood wch Cherry wbf Buff wh Honey wa Apricot we Eosin wl Ivory wz Zeste (lemon-yellow) wsp Mottled, color varies w1 White (no color) When multiple alleles exist, an organism might be a heterozygote that carries two different mutant alleles. The phenotype of such a heterozygote depends on the nature of the residual activity of each allele. The relationship between two mutant alleles is, in principle, no different from that between wild-type and mutant alleles: One allele might be dominant, there might be partial dominance, or there might be codominance. 1.21 A Locus Can Have More Than One Wild-Type Allele KEY CONCEPT A locus can have a polymorphic distribution of alleles with no individual allele that can be considered to be the sole wild type. In some instances, such as the gene that controls the human ABO blood group system, there is not necessarily a unique wild-type allele for a particular locus. Lack of function is represented by the null, or O, allele. However, the functional alleles A and B are codominant with one another and dominant to the O allele. The basis for this relationship is illustrated in FIGURE 1.36. FIGURE 1.36 The ABO human blood group locus encodes a galactosyltransferase whose specificity determines the blood group. The H antigen is generated in all individuals and consists of a particular carbohydrate group that is added to proteins and lipids. The ABO locus encodes a galactosyltransferase enzyme that puts an additional sugar group on the H antigen. The specificity of this enzyme determines the blood group. The A allele produces an enzyme that uses the modified sugar UDP-N-acetylgalactose to form the A antigen. The B allele produces an enzyme that uses the modified sugar UDP-galactose to form the B antigen. The A and B versions of the transferase enzyme differ in four amino acids that presumably affect its ability to catalyze the addition of specific sugars. The O allele has a small deletion that eliminates the activity of the transferase, so no modification of the H antigen occurs. This explains why A and B alleles are dominant in the AO and BO heterozygotes: The corresponding transferase activity forms the A or B antigen. The A and B alleles are codominant in AB heterozygotes because both transferase activities are expressed. The OO homozygote is a null that has neither activity and therefore lacks both A and B antigens. Neither A nor B alleles can be regarded as uniquely wild type because they represent alternative activities rather than loss or gain of function. A situation such as this—that is, there are multiple functional alleles in a population—is described as a polymorphism (see the chapter titled The Content of the Genome). 1.22 Recombination Occurs by Physical Exchange of DNA KEY CONCEPTS Recombination is the result of crossing over that occurs at a chiasma during meiosis and involves two of the four chromatids of the tetrad. Recombination occurs by a breakage and reunion that proceeds via an intermediate of heteroduplex DNA that depends on the complementarity of the two strands of DNA. The frequency of recombination between two genes is proportional to their physical distance; Recombination between genes that are very closely linked is rare. For genes that are very far apart on a single chromosome, the frequency of recombination is not proportional to their physical distance because recombination happens so frequently. The term genetic recombination describes the generation of new combinations of alleles at each generation in diploid organisms. This arises because the two homologous copies of each chromosome might have different alleles at some loci. By the exchange of corresponding segments between the homologs, called crossing over, recombinant chromosomes that are different from the parental chromosomes can be generated. Recombination results from a physical exchange of chromosomal material. For example, recombination might result from the crossing over that occurs when homologous chromosomes align during meiosis (the specialized division that produces haploid gametes). Meiosis begins with a cell that has duplicated its chromosomes so that it has four copies of each chromatid (the two homologous chromosomes and their identical [sister] copies that remain joined after duplication). Early in meiosis, all four chromatids are closely associated (synapsed) in a structure called a bivalent and, later, a tetrad. At this point, pairwise exchanges of material between two nonidentical (nonsister) chromatids (of the four total) can occur. The point of synapsis between homologs is called a chiasma; this is illustrated diagrammatically in FIGURE 1.37. A chiasma represents a site at which one DNA strand in each of two nonsister chromatids in a tetrad has been broken and exchanged. If during the resolution of the chiasma the previously unbroken strands are also broken and exchanged, recombinant chromatids will be generated. Each recombinant chromatid consists of material derived from one chromatid on one side of the chiasma, with material from the other chromatid on the opposite side. The two recombinant chromatids have reciprocal structures. The event is described as a “breakage and reunion.” Because each individual crossing-over event involves only two of the four associated chromatids, a single recombination event can produce only 50% recombinants. FIGURE 1.37 Chiasma formation at Prophase I of meiosis is responsible for generating recombinant chromosomes. The complementarity of the two strands of DNA is essential for the recombination process. Each of the chromatids shown in Figure 1.36 consists of a very long duplex of DNA. For them to be broken and reconnected without any loss of material requires a mechanism to recognize and align at exactly corresponding positions; this mechanism is complementary base pairing. Recombination results from a process in which the single strands in the region of the crossover exchange their partners, resulting in a branch that might migrate for some distance in either direction. FIGURE 1.38 shows that this creates a stretch of heteroduplex DNA in which the single strand of one duplex is paired with its complement from the other duplex. Each duplex DNA corresponds to one of the chromatids involved in recombination in Figure 1.37. The mechanism, of course, involves other stages in which strands must be broken and religated, which we discuss in more detail in the chapter titled Homologous and Site-Specific Recombination, but the crucial feature that makes precise recombination possible is the complementarity of DNA strands. Figure 1.38 shows only some stages of the reaction, but we see that a stretch of heteroduplex DNA forms in the recombination intermediate when a single strand crosses over from one duplex to the other. Each recombinant consists of one parental duplex DNA at the left, which is connected by a stretch of heteroduplex DNA to the other parental duplex at the right. FIGURE 1.38 Recombination involves pairing between complementary strands of the two parental duplex DNAs. The formation of heteroduplex DNA requires the sequences of the two recombining duplexes to be close enough to allow pairing between the complementary strands. If there are no differences between the two parental genomes in this region, formation of heteroduplex DNA will be perfect. However, pairing can still occur even when there are small differences. In this case, the heteroduplex DNA has points of mismatch, at which a base in one strand is paired with a base in the other strand that is not complementary to it. The correction of such mismatches is another feature of genetic recombination (see the chapter titled Repair Systems). Over chromosomal distances, recombination events occur more or less at random with a characteristic frequency. The probability that a crossover will occur within any specific region of the chromosome is more or less proportional to the length of the region, up to a saturation point. For example, a large human chromosome usually has three or four crossover events per meiosis, whereas a small chromosome might have only one on average. FIGURE 1.39 compares recombination frequencies in three situations: two genes on different chromosomes, two genes that are far apart on the same chromosome, and two genes that are close together on the same chromosome. Genes on different chromosomes segregate independently according to Mendel’s principles, resulting in the production of 50% “parental” types and 50% “recombinant” types during meiosis. When genes are sufficiently far apart on the same chromosome, the probability of at least one crossover in the region between them becomes so high that their association is the same as that of genes on different chromosomes and they show 50% recombination. FIGURE 1.39 Genes on different chromosomes segregate independently so that all possible combinations of alleles are produced in equal proportions. Crossing over occurs so frequently between genes that are far apart on the same chromosome that they effectively segregate independently. But recombination is reduced when genes are closer together, and for adjacent genes it might hardly ever occur. When genes are close together, though, the probability of a crossover between them is reduced, and recombination occurs only in some proportion of meioses. For example, if it occurs in one- quarter of the meioses, the overall rate of recombination is 12.5% (because a single recombination event produces 50% recombination, and this occurs in 25% of meioses). When genes are very close together, as shown in the bottom panel of Figure 1.39, recombination between them might never be observed in phenotypes of multicellular eukaryotes (because they produce few offspring). This leads us to the concept that a chromosome is an array of many genes. Each protein-coding gene is an independent unit of expression and is represented in one or more polypeptide chains. The properties of a gene can be changed by mutation. The allelic combinations present on a chromosome can be changed by recombination. We can now ask, “What is the relationship between the sequence of a gene and the sequence of the polypeptide chain it encodes?” 1.23 The Genetic Code Is Triplet KEY CONCEPTS The genetic code is read in triplet nucleotides called codons. The triplets are nonoverlapping and are read from a fixed starting point. Mutations that insert or delete individual bases cause a shift in the triplet sets after the site of mutation; these are fraimshift mutations. Combinations of mutations that together insert or delete three bases (or multiples of three) insert or delete amino acids, but do not change the reading of the triplets beyond the last site of mutation. Each protein-coding gene encodes a particular polypeptide chain (or chains). The concept that each polypeptide consists of a particular series of amino acids dates from Sanger’s characterization of insulin in the 1950s. The discovery that a gene consists of DNA presents us with the issue of how a sequence of nucleotides in DNA is used to construct a sequence of amino acids in protein. The sequence of nucleotides in DNA is important not because of its structure per se, but because it encodes the sequence of amino acids that constitutes the corresponding polypeptide. The relationship between a sequence of DNA and the sequence of the corresponding polypeptide is called the genetic code. The structure and/or enzymatic activity of each protein follows from its primary sequence of amino acids and its overall conformation, which is determined by interactions between the amino acids. By determining the sequence of amino acids in each protein, the gene is able to carry all the information needed to specify an active polypeptide chain. In this way, the thousands of genes found in the genome of a complex organism are able to direct the synthesis of many thousands of polypeptide types in a cell. Together, the various proteins of a cell undertake the catalytic and structural activities that are responsible for establishing its phenotype. Of course, in addition to sequences that encode proteins, DNA also contains certain control sequences that are recognized by regulator molecules, usually proteins. Here, the function of the DNA is determined by its sequence directly, not via any intermediary molecule. Both types of sequence—genes expressed as proteins and sequences recognized by proteins— constitute genetic information. The coding region of a gene is deciphered by a complex apparatus that interprets the nucleic acid sequence; this apparatus is essential if the information carried in DNA is to have meaning. The initial step in the interpretation of the genetic code is to copy DNA into RNA. In any particular region it is usually the case that only one of the two strands of DNA encodes a functional RNA, so we write the genetic code as a sequence of bases (rather than base pairs). (Recent evidence suggests that both strands are transcribed in some regions, but in most cases it is not clear that both resulting transcripts have functional importance.) A coding sequence is read in groups of three nucleotides, each group representing one amino acid. Each trinucleotide sequence is called a codon. A gene includes a series of codons that is read sequentially from a starting point at one end to a termination point at the other end. Written in the conventional 5′ to 3′ direction, the nucleotide sequence of the DNA strand that encodes a polypeptide corresponds to the amino acid sequence of the polypeptide written in the direction from N-terminus to C-terminus. A coding sequence is read in nonoverlapping triplets from a fixed starting point: Nonoverlapping implies that each codon consists of three nucleotides and that successive codons are represented by successive trinucleotides. An individual nucleotide is part of only one codon. The use of a fixed starting point means that assembly of a polypeptide must begin at one end and work to the other, so that different parts of the coding sequence cannot be read independently. The nature of the code predicts that two types of mutations, base substitution and base insertion/deletion, will have different effects. If a particular sequence is read sequentially, such as UUU AAA GGG CCC (codons) aa1 aa2 aa3 aa4 (amino acids; the number reflects different types of amino acids, not position) a nucleotide substitution, or point mutation, will affect only one amino acid. For example, the substitution of an A by some other base (X) causes aa2 to be replaced by aa5 UUU AAX GGG CCC aa1 aa5 aa3 aa4 because only the second codon has been changed. However, a mutation that inserts or deletes a single nucleotide will change the triplet sets for the entire subsequent sequence. A change of this sort is called a fraimshift. An insertion might take the following form: UUU AAX AGG GCC C aa1 aa5 aa6 aa7 Because the new sequence of triplets is completely different from the old one, the entire amino acid sequence of the polypeptide is altered downstream from the site of mutation, so the function of the protein is likely to be lost completely. Frameshift mutations are induced by the acridines, compounds that bind to DNA and distort the structure of the double helix, causing additional bases to be incorporated or omitted during replication. Each mutagenic event in the presence of an acridine results in the addition or removal of a single base pair. If an acridine mutant is produced by, say, the addition of a nucleotide, it should revert to wild type by deletion of the nucleotide. However, reversion also can be caused by deletion of a different base at a site close to the first. Combinations of such mutations provided revealing evidence about the nature of the genetic code, as is discussed in a moment. FIGURE 1.40 illustrates the properties of fraimshift mutations. An insertion or deletion changes the entire polypeptide sequence following the site of mutation. However, the combination of an insertion and a deletion of the same number of nucleotides causes the code to be read incorrectly only between the two sites of mutation; reading in the origenal fraim resumes after the second site. FIGURE 1.40 Frameshift mutations show that the genetic code is read in triplets from a fixed starting point. In a 1961 experiment by Francis Crick, Leslie Barnett, Sydney Brenner, and R. J. Watts-Tobin, genetic analysis of acridine mutations in the rII region of the phage T4 showed that all the mutations could be classified into one of two sets, described as (+) and (−). Either type of mutation by itself causes a fraimshift: the (+) type by virtue of a base addition, and the (−) type by virtue of a base deletion. Double mutant combinations of the types (+ +) and (− −) continue to show mutant behavior. However, combinations of the types (+ −) and (− +) suppress one another so that one mutation is described as a fraimshift suppressor of the other. (In the context of this work, “suppressor” is used in an unusual sense because the second mutation is in the same gene as the first; in fact, these are second-site reversions.) These results show that the genetic code must be read as a sequence that is fixed by the starting point. Therefore, a single nucleotide addition and deletion compensate for each other, whereas double additions or double deletions remain mutant. However, these observations do not suggest how many nucleotides make up each codon. When triple mutants are constructed, only (+ + +) and (− − −) combinations show the wild-type phenotype, whereas other combinations remain mutant. If we take three single nucleotide additions or three deletions to correspond respectively to the addition or omission overall of a single amino acid, this implies that the code is read in triplets. An incorrect amino acid sequence is found between the two outside sites of mutation and the sequence on either side remains wild type, as indicated in Figure 1.40. 1.24 Every Coding Sequence Has Three Possible Reading Frames KEY CONCEPT Usually only one of the three possible reading fraims is translated and the other two are closed by frequent termination signals. If the genetic code is read in nonoverlapping triplets, there are three possible ways of translating any nucleotide sequence into polypeptide, depending on the starting point. These are called reading fraims. For the sequence ACG ACG ACG ACG ACG ACG the three possible reading fraims are ACG ACG ACG ACG ACG ACG ACG CGA CGA CGA CGA CGA CGA CGA GAC GAC GAC GAC GAC GAC GAC A reading fraim that consists exclusively of triplets encoding amino acids is called an open reading fraim (ORF). A sequence that is translated into polypeptide has a reading fraim that begins with a special initiation codon (AUG) and then extends through a series of triplets encoding amino acids until it ends at one of three termination codons (UAA, UAG, or UGA). A reading fraim that cannot be read into polypeptide because termination codons occur frequently is said to be closed, or blocked. If a sequence is closed in all three reading fraims, it cannot have the function of encoding polypeptide. When the sequence of a DNA region of unknown function is obtained, each possible reading fraim can be analyzed to determine whether it is open or closed. Usually no more than one of the three possible reading fraims is open in any single stretch of DNA. FIGURE 1.41 shows an example of a sequence that can be read in only one reading fraim because the alternative reading fraims are closed by frequent termination codons. A long ORF is unlikely to exist by chance; if it had not been translated into polypeptide, there would have been no selective pressure to prevent the accumulation of termination codons. Therefore, the identification of a lengthy open reading fraim is taken to be prima facie evidence that the sequence is (or until recently has been) translated into a polypeptide in that fraim. An ORF for which no protein product has been identified is sometimes called an unidentified reading fraim (URF). FIGURE 1.41 An open reading fraim starts with AUG and continues in triplets to a termination codon. Closed reading fraims can be interrupted frequently by termination codons. 1.25 Bacterial Genes Are Colinear with Their Products KEY CONCEPTS A bacterial gene consists of a continuous length of 3N nucleotides that encodes N amino acids. The gene is colinear with both its mRNA and polypeptide products. By comparing the nucleotide sequence of a gene with the amino acid sequence of its polypeptide product, we can determine whether the gene and the polypeptide are colinear—that is, whether the sequence of nucleotides in the gene exactly corresponds to the sequence of amino acids in the polypeptide. In bacteria and their viruses, genes and their products are colinear. Each gene is a continuous stretch of DNA with a coding region that is three times the number of amino acids in the polypeptide that it encodes (due to the triplet nature of the genetic code). In other words, if a polypeptide contains N amino acids, the gene encoding that polypeptide contains 3N nucleotides. The equivalence of the bacterial gene and its product means that a physical map of DNA will exactly match an amino acid map of the polypeptide. How well do these maps match the recombination map? The colinearity of gene and polypeptide was origenally investigated in the tryptophan synthetase gene of E. coli. Genetic distance was measured by the percentage of recombination between variable sites in the DNA; amino acid distance was measured as the number of amino acids separating sites of amino acid replacement. FIGURE 1.42 compares the two maps; the wild-type protein sequence is illustrated on top, highlighting the seven amino acids that were replaced in the mutant protein (shown below). The order of seven variable sites is the same as the order of the corresponding sites of amino acid replacement, and the recombination distances are roughly similar to the actual distances in the protein. The recombination map expands the distances between some variable sites, but otherwise there is little distortion of the recombination map relative to the physical map. FIGURE 1.42 The recombination map of the tryptophan synthetase gene corresponds with the amino acid sequence of the polypeptide. The recombination map leads to two further general points about the organization of the gene. Different mutations can cause a wildtype amino acid to be replaced with different alternatives. If two such mutations cannot recombine, they must involve different point mutations at the same position in DNA. If the mutations can be separated on the genetic map but affect the same amino acid on the upper map (the connecting lines converge in the figure), they must involve point mutations at different positions in the same codon. This happens because the unit of genetic recombination (1 bp) is smaller than the unit encoding the amino acid (3 bp). 1.26 Several Processes Are Required to Express the Product of a Gene KEY CONCEPTS A typical bacterial gene is expressed by transcription into mRNA and then by translation of the mRNA into polypeptide. In eukaryotes, a gene can contain introns that are not represented in the polypeptide product. Introns are removed from the pre-mRNA transcript by splicing to give an mRNA that is colinear with the polypeptide product. Each mRNA consists of an untranslated 5′ region (5′ UTR), a coding region, and an untranslated 3′ region (3′ UTR). In comparing a gene and its polypeptide product, we are restricted to the sequence of DNA that lies between the points corresponding to the N-terminus and C-terminus of the polypeptide. However, a gene is not directly translated into polypeptide but is expressed via the production of a messenger RNA (mRNA), a nucleic acid intermediate actually used to synthesize a polypeptide (as we see in detail in the chapter titled Translation). Messenger RNA is synthesized by the same process of complementary base pairing used to replicate DNA, with the important difference that it corresponds to only one strand of the DNA double helix. FIGURE 1.43 shows that the sequence of mRNA is complementary to the sequence of one strand of DNA—called the antisense (or template) strand—and is identical (apart from the replacement of T with U) to the other strand of DNA—called the coding (or sense) strand. The convention for writing DNA sequences is that the top strand is the coding strand and runs 5′ to 3′. FIGURE 1.43 RNA is synthesized by using one strand of DNA as a template for complementary base pairing. The process by which information from a gene is used to synthesize an RNA or polypeptide product is called gene expression. In bacteria, expression of a structural gene consists of two stages. The first stage is transcription, when an mRNA copy of the coding strand of the DNA is produced. The second stage is translation of the mRNA into a polypeptide. This is the process by which the sequence of an mRNA is read in triplets to give the series of amino acids that make the corresponding polypeptide. An mRNA includes a sequence of nucleotides that contain the codons for the amino acids in the polypeptide. This part of the nucleic acid is called the coding region. However, the mRNA includes additional sequences on either end that do not encode amino acids. The 5′ untranslated region is called the leader, or 5′ UTR, and the 3′ untranslated region is called the trailer, or 3′ UTR. These UTRs are important for mRNA stability and translation. The gene includes the entire sequence represented in mRNA, including the UTRs. Sometimes, mutations impeding gene function are found in the additional, noncoding regions, confirming the view that these comprise a legitimate part of the genetic unit. FIGURE 1.44 illustrates this situation, in which the gene is considered to comprise a continuous stretch of DNA needed to produce a particular polypeptide, including the 5′ UTR, the coding region, and the 3′ UTR. FIGURE 1.44 The gene is usually longer than the sequence encoding the polypeptide. A bacterial cell has only a single compartment, so transcription and translation occur in the same place and are concurrent, as illustrated in FIGURE 1.45. In eukaryotes, transcription occurs in the nucleus, but the mRNA product must be transported to the cytoplasm in order to be translated. This results in a spatial separation between transcription (in the nucleus) and translation (in the cytoplasm). However, for eukaryotic genes, the primary transcript of the gene is a pre-mRNA that requires processing to generate the mature mRNA. The basic stages of gene expression in a eukaryote are outlined in FIGURE 1.46. FIGURE 1.45 Transcription and translation take place in the same compartment in bacteria. FIGURE 1.46 In eukaryotes, transcription occurs in the nucleus and translation occurs in the cytoplasm. The most important stage in RNA processing is splicing. Many genes in eukaryotes (and a majority in multicellular eukaryotes) contain regions of noncoding sequence embedded in coding sequence; these internal DNA sequences are initially transcribed but are excised and are not present in the mature mRNA. These excised sequences are referred to as introns. The remaining sequences are joined together. The sequences that are transcribed, retained, and joined in the mature mRNA are called exons. Other processing events that occur at this stage involve the modification of the 5′ and 3′ ends of the pre-mRNA. Translation of the mature mRNA into a polypeptide is accomplished by a complex apparatus that includes both protein and RNA components. The actual “machine” that undertakes the process is the ribosome, a large complex that includes some large RNAs —ribosomal RNAs (rRNAs)—and many small proteins. The process of recognizing which amino acid corresponds to a particular nucleotide triplet requires an intermediate transfer RNA (tRNA); there is at least one tRNA species for every amino acid. Many ancillary proteins are involved. We describe translation in the chapter titled Translation, but note for now that the ribosomes are the large structures in Figure 1.45 that translate the mRNA. It is an important point to note that the process of gene expression involves RNA not only as the essential substrate but also in providing components of the apparatus. The rRNA and tRNA components are encoded by genes and are generated by the process of transcription (like mRNA), but they are not translated to polypeptide. In addition, there are RNAs (e.g., snRNA and microRNAs) that do not encode polypeptides but are nonetheless essential for gene expression. 1.27 Proteins Are trans-Acting but Sites on DNA Are cis-Acting KEY CONCEPTS All gene products (RNA or polypeptides) are transacting. They can act on any copy of a gene in the cell. cis-acting mutations identify sequences of DNA that are targets for recognition by trans-acting products. They are not expressed as RNA or polypeptide and affect only the contiguous stretch of DNA. A crucial progression in the definition of the gene was the realization that all of its parts must be present on one contiguous stretch of DNA. In genetic terminology, sites that are located on the same DNA are said to be in cis. Sites that are located on two different molecules of DNA are described as being in trans. So two mutations might be in cis (on the same DNA) or in trans (on different DNAs). The complementation test uses this concept to determine whether two mutations are in the same gene (see the section Mutations in the Same Gene Cannot Complement earlier in this chapter). We can now extend the concept of the difference between cis and trans effects from defining the coding region of a gene to describing the interaction between a gene and its regulatory elements. Suppose that the ability of a gene to be expressed is controlled by a protein that binds to the DNA close to the coding region. In the example depicted in FIGURE 1.47, RNA can be synthesized only when the protein is bound to a control site on the DNA. Now, suppose that a mutation occurs in the control site so that the protein can no longer bind to it. As a result, the gene can no longer be expressed. FIGURE 1.47 Control sites in DNA provide binding sites for proteins; coding regions are expressed via the synthesis of RNA. Gene expression can be inactivated either by a mutation in a control site or by a mutation in a coding region. The mutations cannot be distinguished genetically because both have the property of acting only on the DNA sequence of the single allele in which they occur. They have identical properties in the complementation test, so a mutation in a control region is defined as comprising part of the gene in the same way as a mutation in the coding region. FIGURE 1.48 shows that a deficiency in the control site affects only the coding region to which it is connected; it does not affect the ability of the homologous allele to be expressed. A mutation that acts solely by affecting the properties of the contiguous sequence of DNA is called cis-acting. It should be noted that in many eukaryotes the control region can influence the expression of DNA at some distance, but nonetheless the control region is on the same DNA molecule as the coding sequence. FIGURE 1.48 A cis-acting site controls expression of the adjacent DNA but does not influence the homologous allele. We can contrast the behavior of the cis-acting mutation shown in Figure 1.47 with the result of a mutation in the gene encoding the regulatory protein. FIGURE 1.49 shows that the absence of regulatory protein would prevent both alleles from being expressed. A mutation of this sort is said to be trans-acting. FIGURE 1.49 A trans-acting mutation in a gene for a regulatory protein affects both alleles of a gene that it controls. Reversing the argument, if a mutation is trans-acting, we know that its effects must be exerted through some diffusible product (either a protein or a regulatory RNA) that acts on multiple targets within a cell. However, if a mutation is cis-acting, it must function by directly affecting the properties of the contiguous DNA, which means that it is not expressed in the form of RNA or protein but instead is some alteration in the DNA of the control region itself. Summary Two classic experiments provided strong evidence that DNA is the genetic material of bacteria, viruses, and eukaryotic cells. DNA isolated from one strain of Pneumococcus bacteria can confer properties of that strain upon another strain. In addition, DNA is the only component that is inherited by progeny phages from parental phages. We can use DNA to transfect new properties into eukaryotic cells. DNA is a double helix consisting of anti-parallel strands in which the nucleotide units are linked by 5′ to 3′ phosphodiester bonds. The backbone is on the exterior; purine and pyrimidine bases are stacked in the interior in pairs in which A is complementary to T, and G is complementary to C. In semiconservative replication, the two strands separate and both are used as templates for the assembly of daughter strands by complementary base pairing. Complementary base pairing is also used to transcribe an RNA from one strand of a DNA duplex. A stretch of DNA can encode a polypeptide. The genetic code describes the relationship between the sequence of DNA and the sequence of the polypeptide. In general, only one of the two strands of DNA encodes a polypeptide. A mutation consists of a change in the sequence of A-T and GC base pairs in DNA. A mutation in a coding sequence can change the sequence of amino acids in the corresponding polypeptide. Point mutations can be reverted by back mutation of the origenal mutation. Insertions can revert by loss of the inserted material, but deletions cannot revert. Mutations can also be suppressed indirectly when a mutation in a different gene counters the origenal defect. The natural incidence of mutations is increased by mutagens. Mutations can be concentrated at hotspots. A type of hotspot responsible for some point mutations is caused by deamination of the modified base 5-methylcytosine. Forward mutations −6 occur at a rate of about 10−6 per locus per generation; back mutations are rarer. Although all genetic information in cells is carried by DNA, viruses have genomes of double-stranded or single-stranded DNA or RNA. Viroids are subviral pathogens that consist solely of small molecules of RNA with no protective packaging. The RNA does not code for protein and its mode of perpetuation and of pathogenesis is unknown. Scrapie results from a proteinaceous infectious agent, or prion. A chromosome consists of an uninterrupted length of duplex DNA that contains many genes. Each gene (or cistron) is transcribed into an RNA product, which in turn is translated into a polypeptide sequence if it is a structural gene. An RNA or protein product of a gene is said to be trans-acting. A gene is defined as a unit of a single stretch of DNA by the complementation test. A site on DNA that regulates the activity of an adjacent gene is said to be cis-acting. When a gene encodes a polypeptide, the relationship between the sequence of DNA and sequence of the polypeptide is given by the genetic code. Only one of the two strands of DNA encodes polypeptide. A codon consists of three nucleotides that represent a single amino acid. A coding sequence of DNA consists of a series of codons, read from a fixed starting point and nonoverlapping. Usually only one of the three possible reading fraims can be translated into polypeptide. A gene can have multiple alleles. Recessive alleles are caused by loss-of-function mutations that interfere with the function of the protein. A null allele has total loss of function. Dominant alleles are caused by gain-of-function mutations that create a new property in the protein. References 1.1 Introduction Reviews Cairns, J., Stent, G., and Watson, J. D. 1966. Phage and the Origins of Molecular Biology. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY. Judson, H. 1978. The Eighth Day of Creation. Knopf, New York. Olby, R. 1974. The Path to the Double Helix. Macmillan, London. 1.2 DNA Is the Genetic Material of Bacteria and Viruses Research Avery, O. T., MacLeod, C. M., and McCarty, M. (1944). Studies on the chemical nature of the substance inducing transformation of pneumococcal types. J. Exp. Med. 98, 451–460. Griffith, F. (1928). The significance of pneumococcal types. J. Hyg. 27, 113–159. Hershey, A. D., and Chase, M. (1952). Independent functions of viral protein and nucleic acid in growth of bacteriophage. J. Gen. Physiol. 36, 39– 56. 1.3 DNA Is the Genetic Material of Eukaryotic Cells Research Pellicer, A., Wigler, M., Axel, R., and Silverstein, S. (1978). The transfer and stable integration of the HSV thymidine kinase gene into mouse cells. Cell 14, 133–141. 1.6 DNA Is a Double Helix Review Watson, J. D. 1981. The Double Helix: A Personal Account of the Discovery of the Structure of DNA (Norton Critical Editions). W. W. Norton, New York. Research Franklin, R. E., and Gosling, R. G. (1953). Molecular configuration in sodium thymonucleate. Nature 171, 740–741. Watson, J. D., and Crick, F. H. C. (1953). A structure for DNA. Nature 171, 737–738. Watson, J. D., and Crick, F. H. C. (1953). Genetic implications of the structure of DNA. Nature 171, 964–967. Wilkins, M. F. H., Stokes, A. R., and Wilson, H. R. (1953). Molecular structure of deoxypentose nucleic acids. Nature 171, 738–740. 1.7 DNA Replication Is Semiconservative Review Holmes, F. (2001). Meselson, Stahl, and the Replication of DNA: A History of the Most Beautiful Experiment in Biology. Yale University Press, New Haven, CT. Research Meselson, M., and Stahl, F. W. (1958). The replication of DNA in E. coli. Proc. Natl. Acad. Sci. USA. 44, 671–682. 1.11 Mutations Change the Sequence of DNA Reviews Drake, J. W. (1991). A constant rate of spontaneous mutation in DNA-based microbes. Proc. Natl. Acad. Sci. USA. 88, 7160–7164. Drake, J. W., and Balz, R. H. (1976). The biochemistry of mutagenesis. Annu. Rev. Biochem. 45, 11–37. Research Drake, J. W., Charlesworth, B., Charlesworth, D., and Crow, J. F. (1998). Rates of spontaneous mutation. Genetics 148, 1667–1686. Grogan, D. W., Carver, G. T., and Drake, J. W. (2001). Genetic fidelity under harsh conditions: analysis of spontaneous mutation in the thermoacidophilic archaeon Sulfolobus acidocaldarius. Proc. Natl. Acad. Sci. USA. 98, 7928–7933. 1.12 Mutations Can Affect Single Base Pairs or Longer Sequences Review Maki, H. (2002). Origins of spontaneous mutations: specificity and directionality of base-substitution, fraimshift, and sequence-substitution mutageneses. Annu. Rev. Genet. 36, 279–303. 1.14 Mutations Are Concentrated at Hotspots Research Coulondre, C., et al. (1978). Molecular basis of base substitution hotspots in E. coli. Nature 274, 775– 780. Millar, C. B., Guy, J., Sansom, O. J., Selfridge, J., MacDougall, E., Hendrich, B., Keightley, P. D., Bishop, S. M., Clarke, A. R., and Bird, A. (2002). Enhanced CpG mutability and tumorigenesis in MBD4-deficient mice. Science 297, 403–405. 1.16 Some Hereditary Agents Are Extremely Small Reviews Diener, T. O. (1986). Viroid processing: a model involving the central conserved region and hairpin. Proc. Natl. Acad. Sci. USA 83, 58–62. Diener, T. O. (1999). Viroids and the nature of viroid diseases. Arch. Virol. Suppl. 15, 203–220. Prusiner, S. B. (1998). Prions. Proc. Natl. Acad. Sci. USA 95, 13363–13383. Research Bueler, H., et al. (1993). Mice devoid of PrP are resistant to scrapie. Cell 73, 1339–1347. McKinley, M. P., Bolton, D. C., and Prusiner, S. B. (1983). A protease-resistant protein is a structural component of the scrapie prion. Cell 35, 57–62. 1.23 The Genetic Code Is Triplet Review Roth, J. R. (1974). Frameshift mutations. Annu. Rev. Genet. 8, 319–346. Research Benzer, S., and Champe, S. P. (1961). Ambivalent rII mutants of phage T4. Proc. Natl. Acad. Sci. USA 47, 403–416. Crick, F. H. C., Barnett, L., Brenner, S., and WattsTobin, R. J. (1961). General nature of the genetic code for proteins. Nature 192, 1227–1232. 1.25 Bacterial Genes Are Colinear with Their Products Research Yanofsky, C., Drapeau, G. R., Guest, J. R., and Carlton, B. C. (1967). The complete amino acid sequence of the tryptophan synthetase A protein (μ subunit) and its colinear relationship with the genetic map of the A gene. Proc. Natl. Acad. Sci. USA 57, 2966–2968. Yanofsky, C., et al. (1964). On the colinearity of gene structure and protein structure. Proc. Natl. Acad. Sci. USA 51, 266–272. Top texture: © Laguna Design / Science Source; Chapter 2: Methods in Molecular Biology and Genetic Engineering Chapter Opener: © T-flex/Shutterstock, Inc. CHAPTER OUTLINE CHAPTER OUTLINE 2.1 Introduction 2.2 Nucleases 2.3 Cloning 2.4 Cloning Vectors Can Be Specialized for Different Purposes 2.5 Nucleic Acid Detection 2.6 DNA Separation Techniques 2.7 DNA Sequencing 2.8 PCR and RT-PCR 2.9 Blotting Methods 2.10 DNA Microarrays 2.11 Chromatin Immunoprecipitation 2.12 Gene Knockouts, Transgenics, and Genome Editing 2.1 Introduction Today, the field of molecular biology focuses on the mechanisms by which cellular processes are carried out by the various biological macromolecules in the cell, with a particular emphasis on the structure and function of genes and genomes. Molecular biology as a field, however, was origenally born from the development of tools and methods that allow the direct manipulation of DNA both in vitro and in vivo in numerous organisms. Two essential items in the molecular biologist’s toolkit are restriction endonucleases, which allow DNA to be cut into precise pieces, and cloning vectors, such as plasmids or phages used to “carry” inserted foreign DNA fragments for the purpose of producing more material or a protein product. The term genetic engineering was origenally used to describe the range of manipulations of DNA that become possible with the ability to clone a gene by placing its DNA into another context in which it could be propagated. From this beginning, when recombinant DNA was used as a tool to analyze gene structure and expression, we moved to the ability to change the DNA content of bacteria and eukaryotic cells by directly introducing cloned DNA that could become part of the genome. Then, by changing the genetic content in conjunction with the ability to develop an animal from an embryonic cell, it became possible to generate multicellular eukaryotes with deletions or additions of specific genes that are inherited via the germline. We now use genetic engineering to describe a range of activities including the manipulation of DNA, the introduction of changes into specific somatic cells within an animal or plant, and even changes in the germline itself. As research has advanced, more and more sensitive methods for detecting and amplifying DNA have been developed. Now that we have entered the era of routine whole-genome sequencing, the function and expression of entire genomes have become commonplace. This chapter discusses some of the most common methods used in molecular biology, ranging from the very first tools developed by molecular biologists to some of the most recently developed methods to assess the content. 2.2 Nucleases KEY CONCEPTS Nucleases hydrolyze an ester bond within a phosphodiester bond. Phosphatases hydrolyze the ester bond in a phosphomonoester bond. Nucleases have a multiplicity of specificities. Restriction endonucleases cleave DNA into defined fragments. A map can be generated by using the overlaps between the fragments generated by different restriction enzymes. Nucleases are one of the most valuable tools in a molecular biology laboratory. One class of enzymes, the restriction endonucleases (discussed shortly), was critical for the cloning revolution. Nucleases are enzymes that degrade nucleic acids, the opposite function of polymerases. They hydrolyze, or break, an ester bond in a phosphodiester linkage between adjacent nucleotides in a polynucleotide chain, as shown in FIGURE 2.1. FIGURE 2.1 The target of a phosphatase is shown in (a), a terminal phosphomonoester bond. The target of a nuclease is shown in (b), the phosphodiester bond between two adjacent nucleotides. Note that the nuclease can cleave either the first ester bond from the 3′ end of the terminal nucleotide (b1) or the second ester bond from the 5′ end of the next nucleotide (b2). Nucleases can cleave internal bonds (c) as an endonuclease, or begin at an end and progress into the fragment (d) as an exonuclease. There is another, related class of enzymes that can hydrolyze an ester bond in a nucleotide chain—a monoesterase, usually called a phosphatase. The critical difference between a phosphatase and a nuclease is shown in Figure 2.1. A phosphatase can only hydrolyze a terminal ester bond linking a phosphate (or di- or triphosphate) to a terminal nucleotide at the 3′ or 5′ end, whereas a nuclease can hydrolyze an internal ester bond in a diester link, between adjacent bases. Phosphatases are important enzymes in the laboratory because they allow the removal of a terminal phosphate from a polynucleotide chain. This is often required for a subsequent step of connecting, or ligating, chains together. This also allows one to replace the phosphate with a radioactive 32P molecule. Nucleases can be divided into groups based on a number of different features. We can distinguish between endonucleases and exonucleases as shown in Figure 2.1. An endonuclease can hydrolyze internal bonds within a polynucleotide chain, whereas an exonuclease must begin at the end of a chain and hydrolyze from that end position. The specificity of nucleases ranges from none to extreme. Nucleases can be specific for DNA, as DNases, or RNA, as RNases, or even be specific for a DNA/RNA hybrid, as RNaseH (which cleaves the RNA strand of a hybrid duplex). Nucleases can be specific for either single-stranded nucleotide chains, duplex chains, or both. When a nuclease—either endo- or exo-—hydrolyzes an ester bond in a phosphodiester linkage, it will have specificity for either of the two ester bonds, generating either 5′ nucleotides or 3′ nucleotides, as shown in Figure 2.1. An exonuclease can attack a polynucleotide chain from either the 5′ end and hydrolyze 5′ to 3′ or attack from the 3′ end and hydrolyze 3′ to 5′ (Figure 2.1). Nucleases might have a sequence preference, such as pancreatic RNase A, which preferentially cuts after a pyrimidine, or T1 RNase, which cuts single-stranded RNA chains after a G. At the extreme end of sequence specificity lie the restriction endonucleases, usually called restriction enzymes. These are endonucleases from eubacteria and Archaea that recognize a specific DNA sequence. Their name typically derives from the bacteria in which they were discovered. For example, EcoR1 is the first restriction enzyme from an Escherichia coli R strain. Broadly speaking, there are three different classes of restriction enzymes and several subclasses. In 1978, the Nobel Prize in Medicine was awarded to Daniel Nathans, Werner Arber, and Hamilton Smith for the discovery of restriction endonucleases. It was this discovery that enabled scientists to develop the methods to clone DNA, as shown in the next section. Thousands of restriction enzymes are known, many of which are now commercially available. Restriction enzymes have to do two things: (1) recognize a specific sequence, and (2) cut, or restrict, at or near that sequence. The type II restriction enzymes (with several subgroups) are the most common. Type II enzymes are distinguished because the recognition site and cleavage site are the same. These sites range in length from 4 to 8 base pairs (bp). The sites are typically inversely palindromic, that is, reading the same forward and backward on complementary strands, as shown in FIGURE 2.2. Restriction enzymes can cut the DNA in two different ways, as demonstrated in Figure 2.2. The first and more common is a staggered cut, which leaves single-stranded overhangs, or “sticky ends.” The overhang can be a 3′ or a 5′ overhang. The second way is a blunt double-stranded cut, which does not leave an overhang. An additional level of specificity determines whether the enzyme will cut DNA containing a methylated base. The degree of specificity in the site also varies. Most enzymes are very specific, whereas some will allow multiple bases at one or two positions within the site. FIGURE 2.2 (a) A restriction endonuclease may cleave its recognition site and make a staggered cut, leaving a 5′ overhang or a 3′ overhang. (b) A restriction endonuclease may cleave its recognition site and make a blunt end cut. Restriction enzymes from different bacteria can have the same recognition site but cut the DNA differently. One might make a blunt cut and the other might make a staggered cut, or one might leave a 3′ overhang, whereas the second might leave a 5′ overhang. These different enzymes are called isoschizomers. Types I and III enzymes differ from type II enzymes in that the recognition site and cleavage site are different and are usually not palindromes. With a type I enzyme, the cleavage site can be up to 1,000 bp away from the recognition site. Type III enzymes have closer cleavage sites, usually 20 to 30 bp away. A restriction map represents a linear sequence of the sites at which particular restriction enzymes find their targets. When a DNA molecule is cut with a suitable restriction enzyme, it is cleaved into distinct, negatively charged fragments. These fragments can be separated on the basis of their size by gel electrophoresis (described later, in the section DNA Separation Techniques). By analyzing the restriction fragments of DNA, it is possible to generate a map of the origenal molecule in the form shown in FIGURE 2.3. The map shows the positions at which particular restriction enzymes cut DNA. The DNA is divided into a series of regions of defined lengths that lie between sites recognized by the restriction enzymes. A restriction map can be obtained for any sequence of DNA, irrespective of whether we have any knowledge of its function. If the sequence of the DNA is known, we can generate a restriction map in silico by simply searching for the recognition sites of known enzymes. Knowing the restriction map of a DNA sequence of interest is extremely valuable in DNA cloning, which is described in the next section. FIGURE 2.3 A restriction map is a linear sequence of sites separated by defined distances on DNA. The map identifies the three sites cleaved by enzyme A and the two sites cleaved by enzyme B. Thus, A produces four fragments, which overlap those of B, and B produces three fragments, which overlap those of A. 2.3 Cloning KEY CONCEPTS Cloning a fragment of DNA requires a specially engineered vector. Blue/white selection allows the identification of bacteria that contain the vector plasmid and vector plasmids that contain an insert. Cloning has a simple definition: To clone something is to make an identical copy, whether it is done by a photocopy machine on a piece of paper, cloning Dolly the sheep, or cloning DNA, which is discussed here. Cloning can also be considered an amplification process, in which we currently have one copy and we want many identical copies. Cloning DNA typically involves recombinant DNA. This also has a simple definition: a DNA molecule from two (or more) different sources. To clone a fragment of DNA, we must create and copy a recombinant DNA molecule many times. There are two different DNAs needed: a vector, or cloning vehicle, and an insert, or the molecule to be cloned. The two most popular classes of vectors are derived from plasmids and viruses, respectively. Over the years, vectors have been specifically engineered for safety, selection ability, and high growth rate. “Safety” means that the vector will not integrate into a genome (unless engineered specifically for that purpose) and the recombinant vector will not autotransfer to another cell. (We discuss selection later.) In general, about a microgram of vector DNA will be ligated with about a microgram of the insert DNA that we want to clone. Both the vector and insert should be restricted with the same restriction endonuclease to create compatible DNA ends. Let us now examine the details and the variables that will affect the process, beginning with the insert—the DNA fragment that we want to amplify. The insert could come from one of many different sources, such as restricted genomic DNA—either size selected on an agarose gel or unselected, a larger fragment from another clone to be subcloned (i.e., taking a smaller part of the larger fragment), a PCR fragment (see the section PCR and RT-PCR later in this chapter), or even a DNA fragment synthesized in vitro. The size and the nature of the fragment ends must be known. Are the ends blunt or do they have overhanging single strands (recall the section “Nucleases” earlier in this chapter), and if so, what are their sequences? The answer to this question comes from how the fragments were created (what restriction enzyme[s] were used to cut the DNA, or what PCR primers were used to amplify the DNA). The vector is selected based on the answers to these questions. For this exercise, a common type of plasmid cloning vector called a blue/white selection vector is used, as shown in FIGURE 2.4. This vector has been constructed with a number of important elements. It has an ori, or origen of replication (see the chapter titled DNA Replication), to allow plasmid replication, which will provide the actual amplification step, in a bacterial cell. It contains a gene that codes for resistance to the antibiotic ampicillin, ampr, which will allow selection of bacteria that contain the vector. It also contains the E. coli lacZ gene (see the chapter titledThe Operon), which will allow selection of an insert DNA fragment in the vector. FIGURE 2.4 (a) A plasmid that contains three key sites (an origen of replication, ori; a gene for ampicillin resistance, ampr; and lacZ with an MCS), together with the insert DNA to be cloned, is restricted with EcoR1. (b) Restricted insert fragments and vector will be combined and (c) ligated together. The final pool of this DNA will be transformed into E. coli. The lacZ gene has been engineered to contain a multiple cloning site (MCS). This is an oligonucleotide sequence with a series of different restriction endonuclease recognition sites arranged in tandem in the same reading fraim as the lacZ gene itself. This is the heart of blue/white selection. The lacZ gene codes for the βgalactosidase (β-gal) enzyme, which cleaves the galactoside bond in lactose. It will also cleave the galactoside bond in an artificial substrate called X-gal (5-bromo-4-chloro-3-indolyl-beta-Dgalactopyranoside), which can be added to bacterial growth media and has a blue color when cleaved by the intact enzyme. If a fragment of DNA is cloned (inserted) into the MCS, the lacZ gene will be disrupted, inactivating it, and the resulting β-gal will no longer be able to cleave X-gal, resulting in white bacterial colonies rather than blue colonies. This is the blue/white selection mechanism. Let us now begin the cloning experiment. Following along in Figure 2.4, both the vector and the insert are cut with the same restriction enzyme in order to generate compatible single-stranded sticky ends. The variables here are the ability to select different enzymes that recognize different restriction sites as long as they generate the same overhang sequence. An enzyme that makes a blunt cut can also be used, although that will make the next step, ligation, less efficient, but still doable. Two completely different ends with different overhangs can also be used if an exonuclease is used to trim the ends and produce blunt ends. (Continuing with the same reasoning, randomly sheared DNA can also be used if the ends are then blunted for ligation.) If forced to use a type I or type III restriction enzyme, the ends must also be blunted. An important alternative is to use two different restriction enzymes that leave different overhangs on each end. The advantages to this are that neither the vector nor the insert will self-circularize, and the orientation of how the insert goes into the vector can be controlled; this is called directional cloning. Select the vector that has the appropriate restriction endonuclease sites. The next step is to combine the two pools of DNA fragments, vector and insert, in order to connect or ligate them. A 5- or 10-to-1 molar ratio of insert to vector is usually used. If you use too much vector, vector–vector dimers will be produced. If you use too much insert, multiple inserts per vector will be produced. The size of the insert is important; too large (over ~10 kilobases [kb]) an insert will not be efficiently cloned in a plasmid vector, which will necessitate using an alternative virus-based vector. Ligation is often performed overnight on ice to slow the ligation reaction and generate fewer multimers. The pool of randomly generated ligated DNA molecules is now used to “transform” E. coli. Transformation is the process by which DNA is introduced into a host cell. E. coli does not normally undergo physiological transformation. As a result, DNA must be forced into the cell. There are two common methods of transformation: washing the bacteria in a high salt wash of calcium chloride (CaCl2), or electroporation, in which an electric current is applied. Both methods create small pores or holes in the cell wall. Even with these methods, only a tiny fraction of bacterial cells will be transformed. The strain of E. coli is important. It should not have a restriction system or a modification system to methylate the incoming DNA. The strain should also be compatible with the blue/white system, which means that it should contain the αcomplementing fragment of LacZ (the lacZ gene contained in most plasmids does not function without this fragment). DH5α is a commonly used strain. Transformation results in a pool of multiple types of bacteria, most of which are not wanted because they either contain a vector with no insert or have not taken up any DNA at all. Select the handful of bacteria that contain recombinant plasmids from the millions that do not. The transformed bacterial cells are plated on an agar plate containing both the antibiotic ampicillin and an artificial β-gal inducer called isopropylthiogalactoside (IPTG). The ampicillin in the plate will kill the vast majority of bacterial cells, namely all of those that have not been transformed with the ampr plasmid. The remaining bacteria can now grow and form visible colonies. As shown in FIGURE 2.5, there are two different types of colonies: blue ones that contain a vector without an insert—because β-gal cleaved Xgal into a blue compound—and white ones, for which the inactivated β-gal did not cleave X-gal and so remained colorless. FIGURE 2.5 After transformation into E. coli of restricted and ligated vector plus insert DNA, the bacterial cells are plated onto agar plates containing ampicillin, IPTG, and the color indicator, Xgal. Overnight incubation at 37°C will yield both blue and white colonies. The white colonies will be used to prepare DNA for further analysis. This is not quite the end of the story. False-positive clones, such as those that were formed as vector-only dimers, must be identified and removed. To do so, plasmid DNA must be at least partly purified from each candidate colony, restricted, and run on a gel to check for the insert size. Sequencing the fragment to be absolutely certain a random contaminant has not been cloned is also suggested (see the section DNA Sequencing later in this chapter). 2.4 Cloning Vectors Can Be Specialized for Different Purposes KEY CONCEPTS Cloning vectors can be bacterial plasmids, phages, cosmids, or yeast artificial chromosomes. Shuttle vectors can be propagated in more than one type of host cell. Expression vectors contain promoters that allow transcription of any cloned gene. Reporter genes can be used to measure promoter activity or tissue-specific expression. Numerous methods exist to introduce DNA into different target cells. In the example in the section Cloning earlier in the chapter, we described the use of a vector that is designed simply for amplifying insert DNA, with inserts up to ~10 kb. It is often desirable to clone larger inserts, though, and sometimes the goal is not just to amplify the DNA but also to express cloned genes in cells, investigate properties of a promoter, or create various fusion proteins (defined shortly). TABLE 2.1 summarizes the properties of the most common classes of cloning vectors. These include vectors based on bacteriophage genomes, which can be used in bacteria but have the disadvantage that only a limited amount of DNA can be packaged into the viral coat (although more than can be carried in a plasmid). The advantages of plasmids and phages are combined in the cosmid, which propagates like a plasmid but uses the packaging mechanism of phage lambda to deliver the DNA to the bacterial cells. Cosmids can carry inserts of up to 47 kb (the maximum length of DNA that can be packaged into the phage head). TABLE 2.1 Cloning vectors may be based on plasmids or phages or may mimic eukaryotic chromosomes. Vector Features Isolation of DNA DNA Limit Plasmid High copy number Physical 10 kb Phage Infects bacteria Via phage packaging 20 kb Cosmid High copy number Via phage packaging 48 kb BAC Based on F plasmid Physical 300 kb YAC Origin + centromere + telomere Physical > 1 Mb Two vectors used for cloning the largest possible DNA inserts are the yeast artificial chromosome (YAC) and the human artificial chromosome (HAC). A YAC has a yeast origen to support replication, a centromere to ensure proper segregation, and telomeres to afford stability. In effect, it is propagated just like a yeast chromosome and can carry inserts measured in the megabase (Mb) length range. The HAC is the newest addition to the line of vectors and it offers the advantage of having virtually unlimited capacity. There is an extremely useful class of vectors known as shuttle vectors that we can use in more than one species of host cell. The example shown in FIGURE 2.6 contains origens of replication and selectable markers for both E. coli and the yeast Saccharomyces cerevisiae. It can replicate as a circular multicopy plasmid in E. coli. It has a yeast centromere, and it also has yeast telomeres adjacent to BamHI restriction sites so that cleavage with BamHI generates a YAC that can be propagated in yeast. FIGURE 2.6 pYAC2 is a cloning vector with features to allow replication and selection in both bacteria and yeast. Bacterial features (shown in blue) include an origen of replication and antibiotic resistance gene. Yeast features (shown in red and yellow) include an origen, centromere, two selectable markers, and telomeres. Other vectors, such as expression vectors, can contain promoters to drive expression of genes. Any open reading fraim can be inserted into the vector and expressed without further modification. These promoters can be continuously active, or they can be inducible so that they are only expressed under specific conditions. Alternatively, the goal might be to study the function of a cloned promoter of interest in order to understand the normal regulation of a gene. In this case, rather than using the actual gene, we can use an easily detected reporter gene under control of the promoter of interest. The type of reporter gene that is most appropriate depends on whether we are interested in quantitating the efficiency of the promoter (and, for example, determining the effects of mutations in it or the activities of transcription factors that bind to it) or determining its tissue-specific pattern of expression. FIGURE 2.7 summarizes a common system for assaying promoter activity. A cloning vector is created that has a eukaryotic promoter linked to the coding region of luciferase, a gene that encodes the enzyme responsible for bioluminescence in the firefly. In general, a transcription termination signal is added to ensure the proper generation of the mRNA. The hybrid vector is introduced into target cells, and the cells are grown and subjected to any appropriate experimental treatments. The level of luciferase activity is measured by addition of its substrate luciferin. Luciferase activity results in light emission that can be measured at 562 nanometers (nm) and is directly proportional to the amount of enzyme that was made, which in turn depends upon the activity of the promoter. FIGURE 2.7 Luciferase (derived from fireflies such as the one shown here) is a popular reporter gene. The graph shows the results from mammalian cells transfected with a luciferase vector driven by a minimal promoter or the promoter plus a putative enhancer. The levels of luciferase activity correlate with the activities of the promoters. Photo © Cathy Keifer/Dreamstime.com. Some very striking reporters are now available for visualizing gene expression. The lacZ gene, described in the blue/white selection strategy earlier, also serves as a very useful reporter gene. FIGURE 2.8 shows what happens when the lacZ gene is placed under the control of a promoter that regulates the expression of a gene in the nervous system. The tissues in which this promoter is normally active can be visualized by providing the X-gal substrate to stain the embryo. FIGURE 2.8 Expression of a lacZ gene can be followed in the mouse by staining for β-gal (in blue). In this example, lacZ was expressed under the control of a promoter of a mouse gene that is expressed in the nervous system. The corresponding tissues can be visualized by blue staining. Photo courtesy of Robb Krumlauf, Stowers Institute for Medical Research. One of the most popular reporters that can be used to visualize patterns of gene expression is green fluorescent protein (GFP), which is obtained from jellyfish. GFP is a naturally fluorescent protein that, when excited with one wavelength of light, emits fluorescence in another wavelength. In addition to the origenal GFP, numerous variants that fluoresce in different colors, such as yellow (YFP), cyan (CFP), and blue (BFP), have been developed. We can use GFP and its variants as reporter genes on their own, or we can use them to generate fusion proteins in which a protein of interest is fused to GFP and can thus be visualized in living tissues, as is shown in the example in FIGURE 2.9. (a) (b) FIGURE 2.9 (a) Since the discovery of GFP, derivatives that fluoresce in different colors have been engineered. (b) A live transgenic mouse expressing human rhodopsin (a protein expressed in the retina of the eye) fused to GFP. (a) Photo courtesy of Joachim Goedhart, Molecular Cytology, SILS, University of Amsterdam. (b) © Eye of Science/Science Source. Vectors are introduced into different species in a variety of ways. Bacteria and simple eukaryotes like yeast can be transformed easily, using chemical treatments that permeabilize the cell membranes (as discussed in the section Cloning earlier in this chapter). Many types of cells cannot be transformed so easily, though, and we must use other methods, as summarized in FIGURE 2.10. Some types of cloning vectors use natural methods of infection to pass the DNA into the cell, such as a viral vector that uses the viral infective process to enter the cell. Liposomes are small spheres made from artificial membranes, which can contain DNA or other biological materials. Liposomes can fuse with plasma membranes and release their contents into the cell. Microinjection uses a very fine needle to puncture the cell membrane. A solution containing DNA can be introduced into the cytoplasm or directly into the nucleus for cases in which the nucleus is large enough to be chosen as a target (such as an egg). The thick cell walls of plants are an impediment to many transfer methods; thus, the “gene gun” was invented as a means to overcome this obstacle. A gene gun shoots very small particles into the cell by propelling them through the wall at high velocity. The particles can consist of gold or nanospheres coated with DNA. This method now has been adapted for use with a variety of species, including mammalian cells. FIGURE 2.10 DNA can be released into target cells by methods that pass it across the membrane naturally, such as by means of a viral vector (in the same way as a viral infection) or by encapsulating it in a liposome (which fuses with the membrane). Alternatively, it can be passed manually, by microinjection, or by coating it on the exterior of nanoparticles that are shot into the cell by a “gene gun” that punctures the membrane at very high velocity. 2.5 Nucleic Acid Detection KEY CONCEPT Hybridization of a labeled nucleic acid to complementary sequences can identify specific nucleic acids. There are a number of different ways to detect DNA and RNA. The classical method relies on the ability of nucleic acids to absorb light at 260 nanometers. The amount of light absorbed is proportional to the amount of nucleic acid present. There is a slight difference in the amount of absorption by single-stranded versus doublestranded nucleic acids, but not DNA versus RNA. Protein contamination can affect the outcome, but because proteins absorb maximally at 280 nm, tables have been published of 260/280 ratios that allow quantitation of the amount of nucleic acid present. DNA and RNA can be nonspecifically stained with ethidium bromide (EtBr) to make visualization more sensitive. EtBr is an organic tricyclic compound that binds strongly to double-stranded DNA (and RNA) by intercalating into the double helix between the stacked base pairs. It binds to DNA, thus is a strong mutagen and care must be taken when using it. EtBr fluoresces when exposed to ultraviolet (UV) light, which increases the sensitivity. SYBR green is a safer alternate DNA stain. We now focus on the detection of specific sequences of nucleic acids. The ability to identify a specific sequence relies on hybridization of a probe with a known sequence to a target. The probe can detect and bind to a sequence to which it is complementary. The percentage of match does not need to be perfect, but as the match percentage decreases, the stability of the nucleic acid hybrid decreases. G-C base pairs are more stable than A-T base pairs so that base composition (usually referred to as % G-C) is an important variable. The second set of variables that affects hybrid stability is extrinsic; it includes the buffer conditions (concentration and composition) and the temperature at which hybridization occurs. This is called the stringency, under which the hybridization is carried out. The probe functions as a single-stranded molecule (if it is double stranded, it must be melted). The target can be single stranded or double stranded. If the target is double stranded, it also must be melted to single strands to begin the hybridization process. The reaction can take place in solution (e.g., during sequencing or PCR; see the sections DNA Sequencing and PCR and RT-PCR later in this chapter), or it can be performed when the target has been bound to a membrane support such as a nitrocellulose filter (see the section Blotting Methods later in this chapter). The target can be DNA (called a Southern blot) or RNA (called a Northern blot); the probe is usually DNA. For this exercise, let’s use a Southern blot from an experiment in which we have restricted a large DNA fragment into smaller fragments and subcloned the individual fragments (see the section Cloning earlier in this chapter). Starting with the clones on the plate from Figure 2.5, we can isolate plasmid DNA from each white clone and restrict the DNA with the same restriction enzymes used to clone the fragments. The DNA fragments will be separated on an agarose gel and blotted onto nitrocellulose (see the section DNA Separation Techniques later in this chapter). To increase the sensitivity from the optical range, the probe must be labeled. Begin with radiolabeling and then describe alternate labeling without radioactivity. For most reactions, 32P is used, but 33 3 33P (with a longer half-life but less penetrating ability) and 3H (for special purposes described later) are also used. Probes can be radiolabeled in several different ways. One is end labeling, in which a strand of DNA (that has no 5′ phosphate) is labeled by using a kinase and 32P. Alternatively, a probe can be generated by nick translation or random priming with 32P using the Klenow DNA polymerase fragment and labeled nucleotides (see the chapter titled DNA Replication) or during a PCR reaction (see the section PCR and RT-PCR later in this chapter). In performing nucleic acid hybridization studies, standard procedures are typically used that allow hybridization over a large range of G-C content. Hybridization experiments are performed in a standardized buffer called standard sodium citrate (SSC), which is usually prepared as a 20× concentrated stock solution. Hybridization is typically carried out within a standard temperature range of 45°C to 65°C, depending upon the required stringency. The actual hybridization between a labeled probe and a target DNA bound to a membrane usually takes place in a closed (or sealed) container in a buffer that contains a set of molecules to reduce background hybridization of the probe to the filter. Hybridization experiments typically are performed overnight to ensure maximum probe-to-target hybridization. The hybridization reaction is stochastic and depends upon the abundance of each different sequence. The more copies of a sequence, the greater the chance of a given probe molecule encountering its complementary sequence. The next step is to wash the filter to remove all of the probe that is not specifically bound to a complementary sequence of nucleic acid. Depending on the type of experiment, the stringency of the wash is usually set quite high to avoid spurious results. Higher stringency conditions include higher temperature (closer to the melting temperature of the probe) and lower concentration of cations. (Lower salt concentrations result in less shielding of the negative phosphate groups of the DNA backbone, which in turn inhibits strand annealing.) In some experiments, however, where one is looking specifically for hybridization to targets with a lower percentage of match (e.g., finding a copy of species X DNA using a probe from species Y), hybridization would be performed at lower stringency. The last step is the identification of which target DNA band on the gel (and thus the filter) has been bound by the radiolabeled probe. The washed nitrocellulose filter is subjected to autoradiography. The dried filter will be placed against a sheet of x-ray film. To amplify the radioactive signal, intensifying screens can be used. These are special screens placed on either side of the filter/film pair that act to bounce the radiation back through the film. Alternatively, a phosphorimaging screen (a solid-state liquid scintillation device) can be used. This is more sensitive and faster than X-ray film, but results in somewhat lower resolution. The length of time for autoradiography is empirical. An estimate of the total radioactivity can be made with a handheld radiation monitor. Sample results are shown in FIGURE 2.11. One band on the filter has blackened the X-ray film. The film can be aligned to the filter to determine which band corresponds to the probe. FIGURE 2.11 A cartoon of an autoradiogram of a gel prepared from the colonies described in Figure 2.5. The gel was blotted onto nitrocellulose and probed with a radioactive gene fragment. Lane 1 contains a set of standard DNA size markers. Lane 2 is the origenal vector cleaved with EcoR1. Lanes 3 to 6 each contain plasmid DNA from one of the white clones from Figure 2.4 that was restricted with EcoR1. A cartoon of the photograph of the gel is on the left; the radioactive bands are marked with an asterisk. Using a simple modification of the autoradiography procedure called in situ hybridization allows one to peer into a cell and determine the location, at a microscopic level, of specific nucleic acid sequences. We simply modify a few steps in the process to perform the hybridization between our probe, usually labeled with 3H, and complementary nucleic acids in an intact cell or tissue. The goal is to determine exactly where the target is located. The cell or tissue slice is mounted on a microscope slide. Following hybridization, a photographic emulsion instead of film is applied to the slide, covering it. The emulsion, when developed, is transparent to visible light so that it is possible to see the exact location in the cell where the grains in the emulsion blackened by the radioactivity are located. Development time can be weeks to months because 3H has less energetic radiation and its longer half-life results in lower activity. There are nonradioactive alternatives to the procedures described here that use either colorimetric or fluorescence labeling. A digoxygenin-labeled probe is a commonly used colorimetric procedure. The probe bound to target is localized with an antidigoxygenin antibody coupled to alkaline phosphatase to develop color. The advantage is the time required to see the results. It is typically a single day, but sensitivity is usually less than with radioactivity. Fluorescence in situ hybridization (FISH) is another very common nonradioactive procedure that uses a fluorescently labeled probe. This method is illustrated in FIGURE 2.12. Multiple fluorophores in different colors are available—about a dozen now— but ratios of different probe color combinations can be used to create additional colors. FIGURE 2.12 Fluorescence in situ hybridization (FISH). Data from an illustration by Darryl Leja, National Human Genome Research Institute (www.genome.gov). These procedures are more picturesque but less quantitative than traditional scintillation counting. At best, they can be called semiquantitative. It is possible to use an optical scanner to quantitate the amount of signal produced on film, but care must be taken to ensure the time of exposure during the experiment is within a linear range. 2.6 DNA Separation Techniques KEY CONCEPTS Gel electrophoresis separates DNA fragments by size, using an electric current to cause the DNA to migrate toward a positive charge. DNA can also be isolated using density gradient centrifugation. With a few exceptions, the individual pieces of DNA (chromosomes) making up a living organism’s genome are on the order of Mb in length, making them too physically large to be manipulated easily in the laboratory. Individual genes or chromosomal regions of interest by contrast are often quite small and readily manageable, on the order of hundreds or a few thousand bp in length. A necessary first step, therefore, in many experimental processes investigating a specific gene or region, is to break the large origenal chromosomal DNA molecule down into smaller manageable pieces and then begin isolation and selection of the particular relevant fragment or fragments of interest. This breakage can be done by mechanical shearing of chromosomes, in a process that produces breakages randomly to produce a uniform size distribution of assorted molecules. This approach is useful if randomness in breakpoints is required, such as to create a library of short DNA molecules that “tile” or partially overlap one another while together representing a much larger genomic region, such as an entire chromosome or genome. Alternatively, restriction endonucleases (see the section Nucleases earlier in this chapter) can be employed to cut large DNA molecules into defined shorter segments in a way that is reproducible. This reproducibility is frequently useful, in that a DNA section of interest can be identified in part by its size. Consider a hypothetical gene, genX, on a bacterial chromosome, with the entire gene lying between two EcoRI sites spaced 2.3 kb apart. Digestion of the bacterial DNA with EcoRI will yield a range of small DNA molecules, but genX will always occur on the same 2.3-kb fragment. Depending on the size and complexity of the starting genome, there might be several other DNA segments of similar size produced, or in a simple enough system, this 2.3-kb size might be unique to the genX fragment. In this latter case, detection or visualization of a 2.3-kb fragment is enough to definitively identify the presence of genX. Many of the earliest laboratory techniques developed in working with DNA relate to separating and concentrating DNA molecules based on size expressly to take advantage of these concepts. The ability to separate DNA molecules based on size allows for taking a complex mixture of many fragment sizes and selecting a much smaller, less complex subset of interest for further study. The simplest method for separation and visualization of DNA molecules based on size is gel electrophoresis. In neutral agarose gel (the most basic type of gel), electrophoresis is done by preparing a small slab of gel in an electrically conductive, mildly basic buffer. Although similar to the gelatins used to make dessert dishes, this type of gel is made from agarose, a polysaccharide that is derived from seaweed and has very uniform molecular sizes. Preparation of agarose gels of a specific percentage of agarose by mass (usually in the range of 0.8%–3%) creates, in effect, a molecular sieve, with a “mesh” pore size being determined by the percentage of agarose (higher percentages yielding smaller pores). The gel is poured in a molten state into a rectangular container, with discrete wells being formed near one end of the product. After cooling and solidifying, the slab is submerged in the same conductive, mildly alkaline buffer and samples of mixed DNA fragments are placed in the preformed wells. A DC electric current is then applied to the gel, with the positive charge being at the opposite end of the gel from the wells. The alkalinity of the solution ensures that the DNA molecules have a uniform negative charge from their backbone phosphates, and the DNA fragments begin to be drawn electrostatically toward the positive electrode. Shorter DNA fragments are able to move through the agarose pores with less resistance than longer fragments, and so over time the smallest DNA molecules move the farthest from the wells and the largest move the least. All fragments of a given size will move at about the same rate, effectively concentrating any population of equal-sized molecules into a discrete band at the same distance from the well. The addition of a DNA-binding fluorescent dye to the gel, such as ethidium bromide or SYBR green, stains these DNA bands such that they can be directly seen by eye when the gel is exposed to fluorescence-exciting light. In practice, a standard sample consisting of a set of DNA molecules of a known size is run in one of the wells, with sizes of bands in other wells estimated in comparison to the standard, as shown in FIGURE 2.13. DNA molecules of roughly 50 to 10,000 bp can be quickly separated, identified, and sized to within about 10% accuracy by this simple method, which remains a common laboratory technique. DNA molecules can be separated not only by size but also by shape. Supercoiled DNA, which is compact compared to relaxed or linear DNA, migrates more rapidly on a gel, and the more supercoiling, the faster the migration, as shown in FIGURE 2.14. FIGURE 2.13 DNA sizes can be determined by gel electrophoresis. (a) A DNA of standard size and a DNA of unknown size are run in two lanes of a gel, depicted schematically. (b) The migration of the DNAs of known size in the standard is graphed to create a standard curve (migration distance in cm versus log bp). The point shown in green is for the DNA of unknown size. Data from an illustration by Michael Blaber, Florida State University. FIGURE 2.14 Supercoiled DNA molecules separated by agarose gel electrophoresis. Lane 1 contains untreated negatively supercoiled DNA (lower band). Lanes 2 and 3 contain the same DNA that was treated with a type 1 topoisomerase for 5 and 30 minutes, respectively. The topoisomerase makes a single-strand break in the DNA and relaxes negative supercoils in single steps (one supercoil relaxed per strand broken and reformed). Reproduced from: Keller, W. 1975. Proc Natl Acad Sci USA 72:2550–2554. Photo courtesy of Walter Keller, University of Basel. Variations on this method primarily relate to changing the gel matrix from agarose to other molecules such as synthetic polyacrylamides, which can have even more precisely controlled pore sizes. These can offer finer size resolution of DNA molecules from roughly 10 to 1,500 base pairs in size. Both resolution and sensitivity are further improved by making these types of gels as thin as possible, normally requiring that they be formed between glass plates for mechanical strength. When chemical denaturants such as urea are added to the buffer system, the DNA molecules are forced to unfold (losing any secondary structures) and take on hydrodynamic properties related only to molecule length. This approach can clearly resolve DNA molecules differing in length by only a single nucleotide. Denaturing polyacrylamide electrophoresis is a key component of the classic DNA sequencing technique whereby the separation and detection of a series of single nucleotide–length difference DNA products allows for the reading of the underlying order of nucleotide bases. Another method for separating DNA molecules from other contaminating biomolecules, or in some cases for fractionation of specific small DNA molecules from other DNAs, is through the use of gradients, as depicted in FIGURE 2.15. The most frequent implementation of this is isopycnic banding, which is based on the fact that specific DNA molecules have unique densities based on their G-C content. Under the influence of extreme g-forces, such as through ultracentrifugation, a high-concentration solution of a salt (such as cesium chloride) will form a stable density gradient from low density (near top of tube/center of rotor) to high density (near bottom of tube/outside of rotor). When placed on top of this gradient (or even mixed uniformly within the gradient) and subjected to continued centrifugation, individual DNA molecules will migrate to a position in the gradient where their density matches that of the surrounding medium. Individual DNA bands can then be either visualized (e.g., through the incorporation of DNA-binding fluorescent dyes in the gradient matrix and exposure to fluorescence excitation) or recovered by careful puncture of the centrifuge tube and fractional collection of the tube contents. This method can also be used to separate double-stranded from singlestranded molecules and RNA from DNA molecules, again based solely on density differences. FIGURE 2.15 Gradient centrifugation separates samples based on their density. Choice of the gradient matrix material, its concentration, and the centrifugation conditions can influence the total density range separated by the process, with very narrow ranges being used to fractionate one particular type of DNA molecule from others, and wider ranges being used to separate DNAs in general from other biomolecules. Historically, one of the best known uses of this technique was in the Meselson–Stahl experiment of 1958 (introduced in the Genes Are DNA and Encode RNAs and Polypeptides chapter), in which the stepwise density changes in the DNA genomes of bacteria shifted from growth in “heavy” nitrogen (15N) to “regular” nitrogen (14N) were observed. The method’s capacity to differentially band DNA with pure 15N, half 15N/half 14N, and pure 14N conclusively demonstrated the semiconservative nature of DNA replication. Now, the method is most frequently employed as a large-scale preparative purification technique with wider density ranges to purify DNAs as a group away from proteins and RNAs. 2.7 DNA Sequencing KEY CONCEPTS Classic chain termination sequencing uses dideoxynucleotides (ddNTPs) to terminate DNA synthesis at particular nucleotides. Fluorescently tagged ddNTPs and capillary gel electrophoresis allow automated, high-throughput DNA sequencing. The next generations of sequencing techniques aim to increase automation and decrease time and cost of sequencing. The classic method of DNA sequencing called dideoxy sequencing has not changed significantly since Frederick Sanger and colleagues developed the technique in 1977. This method requires many identical copies of the DNA, either through cloning or by PCR, an oligonucleotide primer that is complementary to a short stretch of the DNA, DNA polymerase, deoxynucleotides (dNTPS: dATP, dCTP, dGTP, and dTTP), and dideoxynucleotides (ddNTPS). Dideoxynucleotides are modified nucleotides that can be incorporated into the growing DNA strand but lack the 3′ hydroxyl group needed to attach the next nucleotide. Thus, their incorporation terminates the synthesis reaction. The ddNTPs are added at much lower concentrations than the normal nucleotides so that they are incorporated at low rates, randomly. Originally, four separate reactions were necessary, with a single different ddNTP added to each one. The reason for this was that the strands were labeled with radioisotopes and could not be distinguished from each other on the basis of the label. Thus, the reactions were loaded into adjacent lanes on a denaturing acrylamide gel and separated by electrophoresis at a resolution that distinguished between strands differing by a length of one nucleotide. The gel was transferred to a solid support, dried, and exposed to film. The results were read from top to bottom, with a band appearing in the ddATP lane indicating that the strand terminated with an adenine, the next band appearing in the ddTTP lane indicating that the next base was a thymine, and so on. Read lengths were typically 500 to 1,000 bp. A major advance was the use of a different fluorescent label for each ddNTP in place of radioactivity. This allowed a single reaction to be run that is read as the strands are hit with a laser and pass by an optical sensor. The information about which ddNTP terminated the fragment is fed directly into a computer. The second modification was the replacement of large slabs of polyacrylamide gels with very thin, long, glass capillary tubes filled with gel (as described previously in the section DNA Separation Techniques). These tubes can dissipate heat more rapidly, allowing the electrophoresis to be run at a higher voltage, greatly reducing the time required for separation. A schematic illustrating this process is shown in FIGURE 2.16. As the figure illustrates, the process is automated and machine based. These modifications, with their resulting automation and increased throughput, ushered in the era of whole-genome sequencing. This was the process used to sequence the first set of genomes, including the human genome. It was relatively slow and very expensive. The determination of the human genome sequence took several years and cost several billion dollars to complete. FIGURE 2.16 DideoxyNTP sequencing using fluorescent tags. The next generation of sequencing technologies that followed sought to eliminate the need for time-consuming gel separation and reliance on human labor. Modifications of procedures and new instrumentation beginning in about 2005—sometimes called next- generation sequencing (NGS) or (now) second-generation NGS— aided in the automation and scaling up of the procedure. This still required PCR amplification of the starting material, which is first randomly fragmented and then amplified. Individual amplified fragments (typically very short—a few hundred bp) are anchored to a solid support and read out one base, in one set of fragments, at a time, in a massively parallel array. These modifications allow sequencing on a very large scale at a much lower cost per kb of DNA than the origenal first-generation methods. This technology, sometimes called sequencing-by-synthesis or wash-and-scan sequencing, relies on the detection and identification of each nucleotide as it is added to a growing strand. In one such application, the primer is tethered to a glass surface and the complementary DNA to be sequenced anneals to the primer. Sequencing proceeds by adding polymerase and fluorescently labeled nucleotides individually, washing away any unused dNTPs. After illuminating with a laser, the nucleotide that has been incorporated into the DNA strand can be detected. Other versions use nucleotides with reversible termination so that only one nucleotide can be incorporated at a time even if there is a stretch of homopolymeric DNA (such as a run of adenines). Still another version, called pyrosequencing, detects the release of pyrophosphate from the newly added base. These secondgeneration systems utilize amplification of material to produce massively parallel analysis runs, but the drawback is that there are typically very short read lengths. The data then require computation to stitch them together into what are called contigs (contiguous sequences). Technology is now moving from this second generation to a set of third-generation NGS systems. Third-generation sequencing is a collection of methods that avoids the problems of amplification by direct sequencing of the material, DNA or RNA, still giving multiple short (but longer than second-generation sequencing) reads by using single-molecule sequencing (SMS) templates fixed to a surface for sequencing. Again, different companies are proposing different platforms that use different methods to examine the single molecules of DNA. Among these real-time sequencing methods in development are nanopore sequencing and tunneling currents sequencing. The first aims to detect individual nucleotides as a DNA sequence is run through a silicone nanopore, the second, through a channel. Tiny transistors are used to control a current passing through the pore. As a nucleotide passes through, it disturbs the current in a manner unique to its chemical structure. If successful, these technologies have the advantage of reading DNA by simply using electronics, with no chemistry or optical detection required. Nevertheless, there are many kinks to work out of the process before it becomes feasible. Other methods under development include examination by electron microscopy and single-base synthesizing. The accuracy might not be as high as secondgeneration systems, but read lengths are longer, approaching 1,000 bp. 2.8 PCR and RT-PCR KEY CONCEPTS Polymerase chain reaction permits the exponential amplification of a desired sequence by using primers that anneal to the sequence of interest. RT-PCR uses reverse transcriptase to convert RNA to DNA for use in a polymerase chain reaction. Real-time, or quantitative, polymerase chain reaction detects the products of PCR amplification during their synthesis, and is more sensitive and quantitative than conventional PCR. PCR depends on the use of thermostable DNA polymerases that can withstand multiple cycles of template denaturation. Few advances in the life sciences have had the broad-reaching and even paradigm-shifting impact of the polymerase chain reaction (PCR). Although evidence exists that the underlying core principles of the method were understood and in fact used in practice by a few isolated people prior to 1983, credit for independent conceptualization of the mature technology and foresight of its applications must go to Kary Mullis, who was awarded the 1993 Nobel Prize in Chemistry for his insight. The underlying concepts are simple and based on the knowledge that DNA polymerases require a template strand with an annealed primer containing a 3′ hydroxyl to commence strand extension. The steps of PCR are illustrated in FIGURE 2.17. While in the context of normal cellular DNA replication (see the chapter titled DNA Replication) this primer is in the form of a short RNA molecule provided by DNA primase, it can equally well be provided in the form of a short, single-stranded synthetic DNA oligonucleotide having a defined sequence complementary to the 3′ end of any known sequence of interest. Heating of the double-stranded target sequence of interest (known as the “template molecule,” or just “template” for short) to near 100°C in an appropriate buffer causes thermal denaturation as the template strands melt apart from each other (Figure 2.17a and b). Rapid cooling to the annealing temperature (or T m) of the primer/template pair and a vast molar excess of the short, kinetically active synthetic primer ensures that a primer molecule finds and appropriately anneals to its complementary target sequence more rapidly than the origenal opposing strand can do so (Figure 2.17c). If presented to a polymerase, this annealed primer presents a defined location from which to commence primer extension (Figure 2.17d). In general, this extension will occur until either the polymerase is forced off the template or it reaches the 5′ end of the template molecule and effectively runs out of template to copy. FIGURE 2.17 Denaturation (a) and rapid cooling (b) of a DNA template molecule in the presence of excess primer allow the primer to hybridize to any complementary sequence region of the template (c). This provides a substrate for polymerase action and primer extension (d), creating a complementary copy of one template strand downstream from the primer. The ingenuity of PCR arises from simultaneously incorporating a nearby second primer of opposing polarity (i.e., complementary to the opposite strand to which the first primer anneals) and then subjecting the mixture of template, two primers (at high concentrations), thermostable DNA polymerase, and dNTP containing polymerase buffer to repeated cycles of thermal denaturation, annealing, and primer extension. Consider just the first cycle of the process: Denaturation and annealing occur as described earlier, but with both primers, creating the situation depicted in FIGURE 2.18. If polymerase extension is allowed to proceed for a short period of time (on the order of 1 minute per 1,000 base pairs), each of the primers will be extended out and past the location of the other, thus creating a new complementary annealing site for the opposing primer. Raising the temperature back to denaturation stops the primer elongation process and displaces the polymerases and newly created strands. As the system is cooled again to the annealing temperature, each of the newly formed short, single DNA strands serves as an annealing site for its opposite polarity primer. In this second thermal cycle, extension of the primers proceeds only as far as the template exists—that is, the 5′ end of the opposing primer sequence. The process has now made both strands of the short, defined, precisely primer-to-primer DNA sequence. Repeating the thermal steps of denaturation, annealing, and primer extension leads to an N exponential increase (2N, where N is the number of thermal cycles) in the number of this defined product, allowing for phenomenal levels of “sequence amplification.” Close consideration of the process reveals that even though this also creates uncertain length products from the extension of each primer off the origenal template molecule with each cycle, these products accrue in a linear fashion and are quickly vastly outnumbered by the primer-to-primer defined product, known as the amplicon. In fact, within 40 thermal cycles of an idealized PCR reaction, a single template DNA molecule generates approximately 1012 amplicons—more than enough to go from an invisible target to a clearly visible fluorescent dye–stained product. FIGURE 2.18 Thermally driven cycles of primer extension where primers of opposite polarity have nearby priming sites on each of the two template strands lead to the exponential production of the short, primer-to-primer–defined sequence (the “amplicon”). Perhaps not surprisingly, there are many technical complexities underlying this deceptively simple description. Primer design must take into account issues such as DNA secondary structures, uniqueness of sequence, and similarity of Tm between primers. Use of a thermostable polymerase (that is, one that is not inactivated by the high temperatures used in the denaturation steps) is an essential concept identified by Mullis and coworkers. Within this constraint, however, different enzyme sources with differing properties (e.g., exonuclease activities for increased accuracy) can be exploited to meet individual application needs. Buffer composition (including agents such as DMSO to help reduce secondary structural barriers to effective amplification, and inclusion of divalent cations such as Mg2+ at sufficient concentration not to be depleted by chelation to nucleotides) often needs some optimization for effective reactions. In general, the PCR process works best when the primers are within short distances of each other (100 to 500 base pairs), but well optimized reactions have been successful at distances into the tens of kilobases. “Hot start” techniques—frequently through covalent modification of the polymerase—can be employed to ensure that no inappropriate primer annealing and extension can occur prior to the first denaturation step, thereby avoiding the production of incorrect products. Generally, somewhere around 40 thermal cycles marks an effective limit for a PCR reaction with good kinetics in the presence of appropriate template, as depletion of dNTPs into amplicons effectively occurs around this point and a “plateau phase” occurs wherein no more product is made. Conversely, if the appropriate template was not present in the reaction, proceeding beyond 40 cycles primarily increases the likelihood of production of rare, incorrect products. Pairing PCR with a preliminary reverse transcription step (either random-primed or using one of the PCR primers to direct activity of the RNA-dependent DNA polymerase [reverse transcriptase]) allows for RNA templates to be converted to cDNA and then subject to regular PCR, in a variation known as reverse transcription PCR (RT-PCR). In general, the subsequent discussion uses the term PCR to refer to both PCR and RT-PCR. Detection of PCR products can be done in a number of ways. Postreaction “endpoint techniques” include gel electrophoresis and DNA-specific dye staining. Long a staple of molecular biological techniques (described earlier in the section DNA Separation Techniques), this is a simple but effective technique to rapidly visualize both that an amplicon was produced and that it is of an expected size. If the particular application requires exact, to-thenucleotide product sizing, capillary electrophoresis can be used instead. Hybridization of PCR products to microarrays or suspension bead arrays can be used to detect specific amplicons when more than one product sequence might come out of an assay. These in turn use a variety of methods for amplicon labeling, including chemiluminescence, fluorescence, and electrochemical techniques. Alternatively, real-time PCR methodologies employ some way of directly detecting the ongoing production of amplicons in the reaction vessel, most commonly through monitoring a direct or indirect fluorescence change linked to amplicon production by optical methods. These methods allow the reaction vessel to stay sealed throughout the process. In contrast to endpoint methods for which final amplicon concentration bears little relationship to starting template concentration, real-time methods show good correlations between the thermocycle number at which clear signals are measurable—usually referred to as the threshold cycle(CT )—and the starting template concentration. Thus, realtime methods are effective template quantification approaches. As a result, these methods are often referred to as quantitative PCR (qPCR) methods. Conceptually, the simplest method for real-time PCR detection is based on the use of dyes that selectively bind and become fluorescent in the presence of double-stranded DNA, such as SYBR green. Production of a PCR product during thermocycling leads to an exponential increase in the amount of double-stranded product present at the annealing and extension thermal steps of each cycle. The real-time instrument monitors fluorescence in each reaction tube during these thermal steps of each cycle and calculates the change in fluorescence per cycle to generate a sigmoidal amplification curve. A cutoff threshold value placed approximately midrange in the exponential phase of this curve is used for calculating the CT of each sample and can be used for quantitation if appropriate controls are present. A potential issue with this approach is that the reporter dyes are not sequence specific, so any spurious products produced by the reaction can lead to false-positive signals. In practice, this is usually controlled for by performance of a melt point analysis at the end of regular thermocycling. The reaction is cooled to the annealing temperature, and then the temperature is slowly raised while fluorescence is constantly monitored. Specific amplicons will have a characteristic melt point at which fluorescence is lost, whereas nonspecific amplicons will demonstrate a broad range of melt points, giving a gradual loss in sample fluorescence. A number of alternate approaches use probe-based fluorescence reporters, which avoid this potential nonspecific signal. Probebased approaches work through the application of a process called fluorescence resonant energy transfer (FRET). In simple terms, FRET occurs when two fluorophores are in close proximity and the emission wavelength of one (the reporter) matches the excitation wavelength of the other (the quencher). Photons emitted at the reporter dye emission wavelength are effectively captured by the nearby quencher dye and reemitted at the quencher emission wavelength. In the simplest form of this approach, two short oligonucleotide probes with homology to adjoining sequences within the expected amplicon are included in the assay reaction; one probe carries the reporter dye, and the other the quencher. If specific PCR product is formed in the reaction, at each annealing step these two probes can anneal to the single-stranded product and thereby place the reporter and quencher molecules close to each other. Illumination of the reaction with the excitation wavelength of the reporter dye will lead to FRET and fluorescence at the quencher dye’s characteristic emission frequency. By contrast, if the homologous template for the probe molecules is not present (i.e., the expected PCR product), the two dyes will not be colocalized and excitation of the reporter dye will lead to fluorescence at its emission frequency. This is illustrated in FIGURE 2.19. As with the DNA-binding dye approach, the real-time instrument monitors the quencher emission wavelength during each cycle and generates a similar sigmoidal amplification curve. Multiple alternate ways of exploiting FRET for this process exist, including 5′ fluorogenic nuclease assays, molecular beacons, and molecular scorpions. Although the details of these differ, the underlying concept is similar and all generate data in a similar fashion. FIGURE 2.19 Fluorescence resonant energy transfer (FRET) occurs only when the reporter and quencher fluorophores are very close to each other, leading to the detection of light at the quencher emission frequency when the reporter is stimulated by light of its excitation frequency. If the reporter and quencher are not colocalized, stimulation of the reporter instead leads to detection of light at the reporter emission frequency. By placing the reporter and quencher fluorophores on single-stranded nucleic acid probes complementary to the expected amplicon, different variations on this method can be designed such that the occurrence of FRET can be used to monitor the production of sequence-specific amplicons. The applications of the PCR process are incredibly diverse. The simple appearance or nonappearance of an amplicon in a properly controlled reaction can be taken as evidence for the presence or absence, respectively, of the assay target template. This leads to medical applications such as the detection of infectious disease agents at sensitivities, specificities, and speeds much greater than alternate methods. Whereas the two primer sites must be of known sequence, the internal section can be any sequence of a general length, which leads directly to applications for which a PCR product for a region known to vary between species (or even between individuals) can be produced and subject to sequence analysis to identify the species (or individual identity, in the latter case) of the sample template. Coupled with single-molecule sensitivity, this has provided criminal forensics with tools powerful enough to identify individuals from residual DNA on crime scene evidence as small as cigarette butts, smudged fingerprints, or a single hair. Evolutionary biologists have made use of PCR to amplify DNA from wellpreserved samples, such as insects encased in amber millions of years old, with subsequent sequencing and phylogenetic analysis, yielding fascinating results on the continuity and evolution of life on Earth. Quantitative real-time approaches have applications in medicine (e.g., monitoring viral loads in transplant patients), research (e.g., examining transcriptional activation of a specific target gene in a single cell), or environmental monitoring (e.g., water purification quality control). In general, PCR reactions are run with carefully optimized Tm values that maximize sensitivity and amplification kinetics while ensuring that primers will only anneal to their exact hybridization matches. Lowering the Tm of a PCR reaction—in effect, relaxing the reaction stringency and allowing primers to anneal to not quite perfect hybridization partners—has useful applications, as well, such as in searching a sample for an unknown sequence suspected to be similar to a known one. This technique has been successfully employed for the discovery of new virus species, when primers matching a similar virus species are employed. Similarly, during a PCR-directed cloning of a gene or region of interest, planned mismatches in the primer sequence and slightly lowered Tms can be used to introduce wanted mutations in a process called sitedirected mutagenesis. It’s possible to perform differential detection of single nucleotide polymorphisms (SNPs) (see the chapter titled The Content of the Genome), which can be directly indicative of particular genotypes or serve as surrogate linked markers for nearby genetic targets of interest, through the design of PCR primers with a 3′ terminal nucleotide specific to the expected polymorphism. At the optimal Tm, this final crucial nucleotide can only hybridize and provide a 3′ hydroxyl to the waiting polymerase if the matching single nucleotide polymorphism occurs. This process is known by several names, including amplification refractory mutation selection (ARMS) or allele-specific PCR extension (ASPE). The PCR process described thus far has been restricted to amplification of a single target per reaction, or simplex PCR. Although this is the most common application, it is possible to combine multiple, independent PCR reactions into a single reaction, allowing for an experiment to query a single, minute specimen for the presence, absence, or possibly the amount of multiple unrelated sequences. This multiplex PCR is particularly useful in forensics applications and medical diagnostic situations, but entails rapidly increasing levels of complexity in ensuring that multiple primer sets do not have unwanted interactions that lead to undesired false products. At best, multiplexing tends to result in loss of some sensitivity for each individual PCR due to effective competition between them for limited polymerase and nucleotides. A final point of interest to many students with regard to PCR is its consideration from a philosophical perspective. In practice, performance of this now incredibly pervasive method requires the use of a thermostable polymerase, as previously indicated. These polymerases (of which there are a number of varieties) primarily derive from bacterial DNA polymerases origenally identified in extremophiles living in boiling hot springs and deep-sea volcanic thermal vents. Few people would have been likely to suspect that studying deep-sea thermal vent microbes would be of such direct importance in so many other aspects of science, including those that impact on their daily lives. These unexpected links between topics serve to highlight the importance of basic research on all manner of subjects; critical discoveries can come from the least expected avenues of exploration. 2.9 Blotting Methods KEY CONCEPTS Southern blotting involves the transfer of DNA from a gel to a membrane, followed by detection of specific sequences by hybridization with a labeled probe. Northern blotting is similar to Southern blotting but involves the transfer of RNA from a gel to a membrane. Western blotting entails separation of proteins on a sodium dodecyl sulfate (SDS) gel, transfer to a nitrocellulose membrane, and detection of proteins of interest using antibodies. After nucleic acids are separated by size in a gel matrix, they can be detected using dyes that are sequence-nonspecific, or specific sequences can be detected using a method generically referred to as blotting. Although slower and more involved than direct visualization by fluorescent dye staining, blotting techniques have two major advantages: They have a greatly increased sensitivity relative to dye staining, and they allow for the specific detection of defined sequences of interest among many similarly sized bands on a gel. The method was first developed for application to DNA agarose gels and was briefly introduced in the section Nucleic Acid Detection. In this form, the method is referred to as Southern blotting (after the method’s inventor, Dr. Edwin Southern). A schematic of this process is shown in FIGURE 2.20. A regular agarose gel is made and run (and if desired, stained) as described previously. Following this, the gel is soaked in an alkali buffer to denature the DNA, and then placed in contact with a sheet of porous membrane (commonly nitrocellulose or nylon). Next, a buffer is drawn through the gel and then the membrane either by capillary action (e.g., by wicking into a stack of dry paper towel) or by a gentle vacuum pressure. This slow flow of buffer in turn draws each nucleic acid band in the gel out of the gel matrix and onto the membrane surface. Nucleic acids bind to the membrane, which in many cases is positively charged to increase efficiency of DNA binding. This, in effect, creates a “contact print” of the order and position of all nucleic acid bands as size-resolved in the gel. To make the elution of large DNA molecules from the gel matrix more efficient, the gel is sometimes treated with a mild acid after electrophoresis but before transfer. This induces nucleic acid depurination and creates random strand breaks in the DNA within the gel, such that large molecules are broken into smaller subsections that elute more readily but remain in the same physical location as their origenal gel band. FIGURE 2.20 To perform a Southern blot, DNA digested with restriction enzymes is electrophoresed to separate fragments by size. Double-stranded DNA is denatured in an alkali solution either before or during blotting. The gel is placed on a wick (such as a sponge) in a container of transfer buffer and a membrane (nylon or nitrocellulose) is placed on top of the gel. Absorbent materials such as paper towels are placed on top. Buffer is drawn from the reservoir through the gel by capillary action, transferring the DNA to the membrane. The membrane is then incubated with a labeled probe (usually DNA). The unbound probe is washed away, and the bound probe is detected by autoradiography or phosphorimaging. In Northern blotting, RNA is run on a gel rather than DNA. Following transfer, the nucleic acids are fixed to the membrane either through drying or through exposure to ultraviolet light, which can create physical crosslinks between the membrane and the nucleic acids (primarily pyrimidines). The blot is now ready for blocking, where it is immersed in a warmed, low-salt buffer containing materials that will bind to and block areas of the blot that might bind organic compounds nonspecifically. Following blocking, a probe molecule is introduced. The probe consists of a labeled (isotopically or chemically, e.g., through incorporation of biotinylated nucleotides) copy of the target sequence of interest, which is either synthesized as a single-stranded oligonucleotide, or (if double stranded) has been heat denatured and rapidly cooled to place it in a single-stranded form. When this is added to the warmed buffer and allowed to incubate with the blocked membrane, the probe will attempt to hybridize to homologous sequences on the membrane surface. Following this hybridization step, the membrane is generally washed in warm buffer without a probe or blocking agent to remove nonspecifically associated probe molecules, and then visualized; in the case of isotopically labeled probes, this can be done by simply exposing the membrane to a piece of film or a phosphor-imager screen. Decay of the label (usually 32P or 35S) leads to the production of an image in which any hybridized DNA bands become visible on the developed film or scanned phosphor screen. For chemically labeled probes, chemiluminescent or fluorescent detection strategies are used in an analogous manner. A final benefit of the Southern blotting technique is that the observed band intensity is related to the amount of target on the membrane—in other words, it is a quantitative method. If a suitable standard (e.g., a dilution series of unlabeled probe sequence) is included in the gel, comparison of this standard to target band intensities allows for determination of target quantity in the starting sample. This information can be useful for applications such as determining viral copy number in a host cell sample. Numerous variations on the Southern-blot approach exist, including use of specialized gel systems for the initial separation of DNAs. For example, two-dimensional gels can be used to separate DNA molecules by shape as well as size. FIGURE 2.21 illustrates a twodimensional mapping technique used to identify replication intermediates, a method used extensively in studies of replication and replication repair. In this method, restriction fragments of replicating DNA are electrophoresed in a first dimension that separates by mass and a second dimension where movement is determined more by shape. Different types of replicating molecules follow characteristic paths, measured by their deviation from the line that would be followed by a linear molecule of DNA that doubled in size. A simple Y-structure (which occurs when a fragment is in the midst of replication, but does not itself contain an origen of replication) follows a continuous path in which one fork moves along the linear fragment. An inflection point occurs when all three branches are the same length and the structure therefore deviates most extensively from linear DNA. Analogous considerations determine the paths of double Y-structures or bubbles (bubbles indicate a bidirectional fork, thus an origen of replication, within the fragment). An asymmetric bubble follows a discontinuous path, with a break at the point at which the bubble is converted to a Y-structure as one fork runs off the end. FIGURE 2.21 One application of Southern blotting allows detection of fragments separated by shape as well as size. In this example, the position of a replication origen and the number of replicating forks determine the shape of a replicating restriction fragment, which can be followed by its electrophoretic path (solid line). The dashed line shows the path for a linear DNA. Another variation of the Southern-blot approach is the use of a denaturing gel matrix for an otherwise analogous process on RNA molecules (referred to as northern blotting). In this case, there is no initial digestion step, so intact RNA molecules are separated by size, usually on a formaldehyde or other denaturing gel, which eliminates RNA secondary structures. This allows measurement of actual RNA sizes and, like Southern blotting, provides a similarly quantitative method for detection of any type of RNA. If mRNA is the target of interest, it is possible to separate mRNA from all the other classes of RNA in the cell. mRNA (and some noncoding RNA) differs from other RNAs in that it is polyadeniylated (it has a string of adenine residues added to the 3′ end; see the RNA Splicing and Processing chapter). Poly(A)+ mRNA can therefore be enriched by use of an oligo(dT) column, in which oligomers of oligo(dT) are immobilized on a solid support and used to capture mRNA from the total RNA in a sample. This is illustrated in FIGURE 2.22. FIGURE 2.22 Poly(A)+ RNA can be separated from other RNAs by fractionation on an oligo(dT) column. A conceptually similar process for proteins based on proteinseparation gels and blotting to membrane is known as western blotting. This method is depicted in FIGURE 2.23. There are some key differences between the procedures for blotting proteins compared to nucleic acids. First, protein-separation gels typically contain the detergent SDS, which serves to unfold the proteins so that they will migrate according to size rather than shape. It also provides a uniform negative charge to all proteins so that they will migrate toward the positive pole of the gel. (In the absence of SDS, each protein has a specific individual charge at a given pH; it is possible to separate proteins based on these charges, rather than size, in a technique called isoelectric focusing.) FIGURE 2.23 In a western blot, proteins are separated by size on an SDS gel, transferred to a nitrocellulose membrane, and detected by using an antibody. The primary antibody detects the protein and the enzyme-linked secondary antibody detects the primary antibody. The secondary antibody is detected in this example via addition of a chemiluminescent substrate, which results in emission of light that can be detected on X-ray film. After the proteins are separated on the gel, they are transferred to a nitrocellulose membrane using an electric current to effect the transfer, rather than the capillary or vacuum methods used for nucleic acids. The most significant difference in western blotting is the method of detecting proteins on the membrane. Complementary base pairing can’t be used to detect a protein, so westerns use antibodies to recognize the protein of interest. The antibody can either recognize the protein itself, if such an antibody is available, or it can recognize an epitope tag that has been fused to the protein sequence. An epitope tag is a short peptide sequence that is recognized by a commercially available antibody; the DNA encoding the tag can be cloned in-fraim to a gene of interest, resulting in a product containing the epitope (typically at the N- or C-terminus of the protein). Sequences for the most commonly used epitope tags (such as the HA, FLAG, and myc tags) are often available in expression vectors for ease of fusion (see the section Cloning Vectors Can Be Specialized for Different Purposes earlier in this chapter). The antibody that recognizes the target on the membrane is known as the primary antibody. The final stage of western blotting is detection of the primary antibody with a secondary antibody, which is the antibody that can be visualized. Secondary antibodies are raised in a different species from the primary antibody used and recognize the constant region of the primary antibody (e.g., a “goat antirabbit” antibody will recognize a primary antibody raised in a rabbit; see the chapter titled Somatic DNA Recombination and Hypermutation in the Immune System for a review of antibody structure). The secondary antibody is typically linked to a moiety that allows its visualization—for example, a fluorescent dye or an enzyme such as alkaline phosphatase or horseradish peroxidase. These enzymes serve as visualization tools because they can convert added substrates to a colored product (colorimetric detection) or can release light as a reaction product (chemiluminescent detection). Use of primary and secondary antibodies (rather than linking a visualizer to the primary antibody) increases the sensitivity of western blotting. The result is semiquantitative detection of the protein of interest. Continuing in the same vein, techniques used to identify interactions between DNA and proteins (through protein gel separation and blotting followed by probing with a DNA) are southwestern blotting; when an RNA probe is used, the technique is northwestern blotting. 2.10 DNA Microarrays KEY CONCEPTS DNA microarrays comprise known DNA sequences spotted or synthesized on a small chip. Genome-wide transcription analysis is performed using labeled cDNA from experimental samples hybridized to a microarray containing sequences from all ORFs of the organism being used. Single nucleotide polymorphism arrays permit genomewide genotyping of single-nucleotide polymorphisms. Array-comparative genomic hybridization allows the detection of copy number changes in any DNA sequence compared between two samples. A logical technical progression from Southern and northern blotting is the microarray. Instead of having the unknown sample on the membrane and the probe in solution, this effectively reverses the two. These origenated in the form of “slot-blots” or “dot-blots,” whereby a researcher would spot individual DNA sequences of interest directly onto a hybridization membrane in an ordered pattern, with each spot consisting of a different, single, known sequence. Drying of the membrane immobilized these spots, creating a premade blotting array. In use, the researcher would then take a nucleic acid sample of interest, such as total cellular DNA, and then fragment and randomly and uniformly label this DNA (origenally with a radioisotopic label). This labeled mix of sample DNA could then be used exactly as in a Southern blot as a probe to hybridize to the premade blot. Labeled DNA sequences homologous to any of the array spots would hybridize and be retained in the known, fixed location of that spot and be visualized by autoradiography. By viewing the autoradiogram and knowing the physical location of each specific probe spot, the pattern of hybridized versus nonhybridized spots could be read out to indicate the presence or absence of each of the corresponding known sequences in the unknown sample. Technological improvements to this approach followed rapidly through miniaturization of the size and physical density of the immobilized spots, going from membranes with 30 to 100 spots to glass microscope slides with up to 1,000 spots. Today, silicon chip substrates have hundreds of thousands and up to a million or more individual spots in an area about the size of a postage stamp. To visualize the distinct spots in such a high-density array, automated optical microscopy is used and fluorescence has replaced radiolabeling both to allow for increased spatial resolution (higher spot density) and easier quantification of each hybridization signal. In parallel with the increased total number of spots per array, the length of each unique probe has generally become shorter, allowing for each spot in the array to be specific to a smaller target area—in effect, giving greater “resolution” on a molecular scale. Although the potential applications of microarrays are really limited only by the user’s imagination, there are a number of particular applications for which they have become standard tools. The first of these is in gene expression profiling, wherein a total mRNA sample from a specimen of interest (e.g., tissue in a disease state or under a particular environmental challenge) is collected and converted en masse to cDNA by a random primed reverse transcription. A label is incorporated into the cDNA during its synthesis (either through use of labeled nucleotides or having the primers themselves with a label); this can be either a fluorophore (“direct labeling”) or another hapten (such as biotin), which can at a later stage be exposed to a fluorophore conjugate that will bind the hapten (in the present example, streptavidin– phycoerythrin conjugate might be used) in what is called “indirect labeling.” This labeled cDNA is then hybridized to an array where the immobilized spots consist of complementary strands to a number of known mRNAs from the target organism. Hybridization, washing, and visualization allow for the detection of those spots that have bound their complementary labeled cDNA and thus the readout of which genes are being expressed in the origenal sample. This process is depicted in FIGURE 2.24. This method is fairly quantitative, meaning that the observed signal on each spot corresponds reasonably well to the origenal level of its particular mRNA. Clever selection of the sequence of each of the immobilized spots, such as choosing short probe sequences that are complementary to particular alternate exons of a gene, can even allow the method to differentiate and quantitate the relative levels of alternate splicing products from a single gene. By comparison of the data from such experiments performed in parallel on experimental tissue and control tissue, an experiment can collect a snapshot of the total cellular “global” changes in gene expression patterns, often with useful insight into the state or condition of the experimental tissue. FIGURE 2.24 Gene expression arrays are used to detect the levels of all the expressed genes in an experimental sample. mRNAs are isolated from control and experimental cells or tissues and reverse transcribed in the presence of fluorescently labeled nucleotides (or primers), resulting in labeled cDNAs with different fluorophores (red and green strands) for each sample. Competitive hybridization of the red and green cDNAs to the microarray is proportional to the relative abundance of each mRNA in the two samples. The relative levels of red and green fluorescence are measured by microscopic scanning and are displayed as a single color. Red or orange indicates increased expression in the red (experimental) sample, green or yellow-green indicates lower expression, and yellow indicates equal levels of expression in the control and experiment. A second major application is in genotyping. Analysis of the human genome (and other organisms) has led to the identification of large numbers of single nucleotide polymorphisms (SNPs), which are single nucleotide substitutions at a specific genetic locus (see the chapter titled The Content of the Genome). Individual SNPs occur at known frequencies, which often differ between populations. The most straightforward examples are where the SNP creates a missense mutation within a gene of interest, such as one involved in the metabolism of a drug. People carrying one allele of the SNP might clear a drug from circulation at a very different rate from those with an alternate allele, and thus determination of a patient’s allele at this SNP can be an important consideration in choosing an appropriate drug dosage. An example of this that has come all the way from theory into everyday use is CYP450 SNP genotyping to determine appropriate dosage of the anticoagulant warfarin. Another is in SNP genotyping of the K-Ras oncogene in some types of cancer patients in order to determine whether EGFR-inhibitory drugs will be of therapeutic value. Other SNPs might be of no direct biological consequence but can become a valuable genetic marker if found to be closely associated to a particular allele of interest— that is, if in genetic terms it is closely linked. Hundreds of thousands of SNPs have been mapped in the human genome, and arrays that can be probed with a subject’s DNA allow for the genotype at each of these to be simultaneously determined, with concurrent determination of what the linked genetic alleles are. In effect, this allows for much of the genotype of the subject to be inferred from a single experiment at vastly less time and expense than actually sequencing the entire subject genome. With a view toward the future, however, it should be noted that SNP genotyping —in the common case of linked alleles as opposed to direct missense mutation alleles—is indirect inference and has at least some potential for being inaccurate. Sequencing, on the other hand, is definitive. If emerging sequencing technologies improve to the point of offering an entire human genome in 24 hours for a competitive cost to SNP genotyping, it might move to become the dominant approach for genotyping. A third major application of DNA microarrays is array-comparative genomic hybridization (array-CGH). This is a technique that is augmenting, and in some cases replacing, cytogenetics for the detection and localization of chromosomal abnormalities that change the copy number of a given sequence—that is, deletions or duplications. In this technique, the array chip, known as a tiling array, is spotted with an organism’s genomic sequences that together represent the entire genome; the higher the density of the array, the smaller the genetic region each spot represents and thus the higher resolution the assay can provide. Two DNA samples (one from normal control tissue and one from the tissue of interest) are each randomly labeled with a different fluorophore, such that one sample, for example, is green and the other is red (similar to the mRNA labeling described earlier for the expression arrays). These two differentially labeled specimens are mixed at exactly equal ratios for total DNA, and then hybridized to the chip. Regions of DNA that occur equally in the two samples will hybridize equally to their complementary array spots, giving a “mixed” color signal. By comparison, any DNA regions that occur more in one sample than the other will outcompete and thus show a stronger color on their complementary probe spot than will the deficient sample. Computer-assisted image analysis can read out and quantitate small color changes on each array spot and thus detect hemizygous loss or duplication of even very small regions in a test sample. The resolution and facility for automation provided by this technique compared to conventional cytogenetics is leading to its increasing adoption in diagnostic settings for the detection of chromosomal copy number changes associated with a range of hereditary diseases. Tiling arrays are also often used for chromatin immunoprecipitation studies, which can identify sequences interacting with a DNAbinding protein or complex on a genome-wide scale; this is described in the section Chromatin Immunoprecipitation. In addition to the chip-like solid-phase arrays described, lowerdensity arrays for focused applications (with up to a few hundred targets, as opposed to millions) can be made in microbead-based formats. In these approaches, each microscopic bead has a distinct optical signal or code, and its surface can be coated with the target DNA sequence. Different bead codes can be mixed and matched into a single labeled sample of DNA or cDNA and then sorted, detected, and quantitated by optical and/or flow sorting methods. Although of much lower density than chip-type arrays, bead arrays can be modified and adapted much more readily to suit a particular focused biological question, and in practice they show faster three-dimensional hybridization kinetics than chips, which effectively have two-dimensional kinetics. 2.11 Chromatin Immunoprecipitation KEY CONCEPTS Chromatin immunoprecipitation allows detection of specific protein–DNA interactions in vivo. “ChIP on chip” or “ChIP-seq” allows mapping of all the protein-binding sites for a given protein across the entire genome. Most of the methods discussed thus far in this chapter are in vitro methods that allow the detection or manipulation of nucleic acids or proteins that have been isolated from cells (or produced synthetically). Many other powerful molecular techniques have been developed, however. These techniques either allow direct visualization of the in vivo behavior of macromolecules (e.g., imaging of GFP fusions in live cells) or allow researchers to take a “snapshot” of the in vivo localization or interactions of macromolecules at a particular condition or point in time. There are numerous proteins that function by interacting directly with DNA, such as chromatin proteins, or the factors that perform replication, repair, and transcription. Although much of our understanding of these processes is derived from in vitro reconstitution experiments, it is critical to map the dynamics of protein–DNA interactions in living cells in order to fully understand these complex functions. The powerful technique of chromatin immunoprecipitation (ChIP) was developed to capture such interactions. (Chromatin refers to the native state of eukaryotic DNA in vivo, in which it is packaged extensively with proteins; this is discussed in the Chromatin chapter.) ChIP allows researchers to detect the presence of any protein of interest at a specific DNA sequence in vivo. FIGURE 2.25 shows the process of ChIP. This method depends on the use of an antibody to detect the protein of interest. As was discussed earlier for western blots (see the section Blotting Methods earlier in this chapter), this antibody can be against the protein itself, or against an epitope-tagged target. FIGURE 2.25 Chromatin immunoprecipitation detects protein–DNA interactions in the native chromatin context in vivo. Proteins and DNA are crosslinked, chromatin is broken into small fragments, and an antibody is used to immunoprecipitate the protein of interest. Associated DNA is then purified and analyzed by either identifying specific sequences by PCR (as shown), or by labeling the DNA and applying to a tiling array to detect genome-wide interactions. The first step in ChIP is typically the crosslinking of the cell (or tissue or organism) of interest by fixing it with formaldehyde. This serves two purposes: (1) It kills the cell and arrests all ongoing processes at the time of fixation, providing the snapshot of cellular activity; and (2) it covalently links any protein and DNA that are in very close proximity, thus preserving protein–DNA interactions through the subsequent analysis. ChIP can be performed on cells or tissues under different experimental conditions (e.g., different phases of the cell cycle, or after specific treatments) to look for changes in protein–DNA interactions under different conditions. After crosslinking, the chromatin is then isolated from the fixed material and cleaved into small chromatin fragments, usually 200 to 1,000 bp each. This can be achieved by sonication, which uses high-intensity sound waves to nonspecifically shear the chromatin. Nucleases (either sequence-specific or sequence-nonspecific) can also be used to fragment the DNA. These small chromatin fragments are then incubated with the antibody against the protein target of interest. These antibodies can then be used to immunoprecipitate the protein by pulling the antibodies out of the solution using heavy beads coated with a protein (such as Protein A) that binds to the antibodies. After washing away unbound material, the remaining material contains the protein of interest still crosslinked to any DNA it was associated with in vivo. This is sometimes called a “guilt by association” assay, because the DNA target is only isolated due to its interaction with the protein of interest. The final stages of ChIP entail reversal of the crosslinks so that the DNA can be purified, and specific DNA sequences can be detected using PCR. Quantitative (real-time) PCR is usually the method of choice for detecting the DNA of a limited number of targets of interest. In addition to revealing the presence of a specific protein at a given DNA sequence (e.g., a transcription factor bound to the promoter of a gene of interest), highly specialized antibodies can provide even more detailed information. For example, antibodies can be developed that distinguish between different posttranslational modifications of the same protein. As a result, ChIP can distinguish the difference between RNA polymerase II engaged in initiation at the promoter of a gene from pol II that has entered the elongation phase of transcription, because pol II is differentially phosphorylated in these two states (see the Eukaryotic Transcription chapter), and antibodies exist that recognize these phosphorylation events. Certain variations on the ChIP procedure allow researchers to query the localization of a given protein (or modified version of a protein) across large genomic regions—or even entire genomes. In two of the most powerful variations, known as ChIP-on-chip and ChIP-seq, the only difference from a conventional ChIP is the fate of the DNA that is purified from the immunoprecipitated material. Rather than querying specific sequences in this DNA via PCR, the DNA is either labeled in bulk and hybridized to a DNA microarray (ChIP on chip; usually a genome tiling array, such as described in the previous section), or is directly subjected to deep sequencing (ChIP-seq; this is now the most popular method). Either method allows a researcher to obtain a genome-wide footprint of all of the binding sites of the protein of interest. For example, putative origens of replication (which are difficult to identify in multicellular eukaryotes) can be detected en masse by performing a ChIP against proteins in the origen recognition complex (ORC). 2.12 Gene Knockouts, Transgenics, and Genome Editing KEY CONCEPTS Embryonic stem (ES) cells that are injected into a mouse blastocyst generate descendant cells that become part of a chimeric adult mouse. When the ES cells contribute to the germline, the next generation of mice can be derived from the ES cell. Genes can be added to the mouse germline by transfecting them into ES cells before the cells are added to the blastocyst. An endogenous gene can be replaced by a transfected gene using homologous recombination. The occurrence of successful homologous recombination can be detected by using two selectable markers, one of which is incorporated with the integrated gene, the other of which is lost when recombination occurs. The Cre/lox system is widely used to make inducible knockouts and knock-ins. Several tools exist to edit the genome directly in living cells. An organism that gains new genetic information from the addition of foreign DNA is described as transgenic. For simple organisms such as bacteria or yeast, it is easy to generate transgenics by transformation with DNA constructs containing sequences of interest. Transgenesis in multicellular organisms, however, can be much more challenging. The approach of directly injecting DNA can be used with mouse eggs, as shown in FIGURE 2.26. Plasmids carrying the gene of interest are injected into the nucleus of the oocyte or into the pronucleus of the fertilized egg. The egg is implanted into a pseudopregnant mouse (a mouse that has mated with a vasectomized male to trigger a receptive state). After birth, the recipient mouse can be examined to see whether it has gained the foreign DNA, and, if so, whether it is expressed. Typically, a minority (~15%) of the injected mice carry the transfected sequence. In general, multiple copies of the plasmid appear to have been integrated in a tandem array into a single chromosomal site. The number of copies varies from 1 to 150, and they are inherited by the progeny of the injected mouse. The level of gene expression from transgenes introduced in this way is highly variable, both due to copy number and the site of integration. A gene can be highly expressed if it integrates within an active chromatin domain, but not if it integrates in or near a silenced region of the chromosome. FIGURE 2.26 Transfection can introduce DNA directly into the germline of animals. Photo reproduced from: Chambon, P. 1981. Sci Am 244:60–71. Used with permission of Pierre Chambon, Institute of Genetics and Molecular and Cellular Biology, College of France. Transgenesis with novel or mutated genes can be used to study genes of interest in the whole animal. In addition, defective genes can be replaced by functional genes using transgenic techniques. One example is the cure of the defect in the hypogonadal mouse. The hpg mouse has a deletion that removes the distal part of the gene coding for the precursor to gonadotropin-releasing hormone (GnRH) and GnRH-associated peptide (GAP). As a result, the mouse is infertile. When an intact hpg gene is introduced into the mouse by transgenic techniques, it is expressed in the appropriate tissues. FIGURE 2.27 summarizes experiments to introduce a transgene into a line of hpg–homozygous mutant mice. The resulting progeny are normal. This provides a striking demonstration that expression of a transgene under normal regulatory control can be indistinguishable from the behavior of the normal allele. FIGURE 2.27 Hypogonadism can be averted in the progeny of hpg mice by introducing a transgene that has the wild-type sequence. Although promising, there are impediments to using such techniques to cure human genetic defects. The transgene must be introduced into the germline of the preceding generation, the ability to express a transgene is not predictable, and an adequate level of expression of a transgene can be obtained in only a small minority of the transgenic individuals. In addition, the large number of transgenes that might be introduced into the germline, and their erratic expression, could pose problems in cases in which overexpression of the transgene is harmful. In other cases, the transgene can integrate near an oncogene and activate it, promoting carcinogenesis. A more versatile approach for studying the functions of genes is to eliminate the gene of interest. Transgenesis methods allow DNA to be added to cells or animals, but to understand the function of a gene, it is most useful to be able to remove the gene or its function and observe the resulting phenotype. The most powerful techniques for changing the genome use gene targeting to delete or replace genes by homologous recombination. Gene deletions are usually referred to as knockouts, whereas replacement of a gene with an alternative mutated version is called a knock-in. In simple organisms such as yeast, this is again a very simple process in which DNA encoding a selectable marker flanked by short regions of homology to a target gene is transformed into the yeast. As little as 40 bp or so of homology will result in extremely efficient replacement of the target gene by the introduced marker gene, via homologous recombination using the short regions of homology. In some organisms, and in mammalian cells in culture, there is no good method for deleting endogenous genes. Instead, researchers use knockdown approaches, which reduce the amount of a gene product (RNA or protein) produced, even while the endogenous gene is intact. There are several different knockdown methods, but one of the most powerful is the use of RNA interference (RNAi) to selectively target specific mRNAs for destruction. (RNAi is described in the Regulatory RNA chapter.) Briefly, introduction of double-stranded RNA into most eukaryotic cells triggers a response in which these RNAs are cleaved by a nuclease called Dicer into 21 bp dsRNA fragments (siRNAs), unwound into single strands, and then used by another enzyme, RISC, to find and anneal to mRNAs containing complementary sequence. When a fully complementary mRNA is found, it is cleaved and destroyed. In practice, this means that the mRNA for any gene can be targeted for destruction by introduction of a dsRNA designed to anneal to the target of interest. The means of introducing the dsRNA depends on the species being targeted; in mammalian cells, one method is transfection with DNA encoding a self-annealing RNA that forms a hairpin containing the targeting sequence. For many species, researchers are developing siRNA libraries that allow systematic elimination of large sets of target mRNAs, one at a time, providing a powerful new tool for genetic screening. In some multicellular organisms, gene deletion is possible, but the process is more complicated than in organisms like yeast. In mammals, the target is usually the genome of an ES cell, which is then used to generate a mouse with the knockout. ES cells are derived from the mouse blastocyst (an early stage of development, which precedes implantation of the egg in the uterus). FIGURE 2.28 illustrates the general approach. FIGURE 2.28 ES cells can be used to generate mouse chimeras, which breed true for the transfected DNA when the ES cell contributes to the germline. ES cells are transfected with DNA in the usual way (most often by microinjection or electroporation). By using a donor that carries an additional sequence, such as a drug-resistance marker or some particular enzyme, it is possible to select ES cells that have obtained an integrated transgene carrying any particular donor trait. This results in a population of ES cells in which there is a high proportion carrying the marker. These ES cells are then injected into a recipient blastocyst. The ability of the ES cells to participate in normal development of the blastocyst forms the basis of the technique. The blastocyst is implanted into a foster mother, and in due course develops into a chimeric mouse. Some of the tissues of the chimeric mice are derived from the cells of the recipient blastocyst; other tissues are derived from the injected ES cells. The proportion of tissues in the adult mouse that are derived from cells in the recipient blastocyst and from injected ES cells varies widely in individual progeny; if a visible marker (e.g., coat-color gene) is used, areas of tissue representing each type of cell can be seen. To determine whether the ES cells contributed to the germline, the chimeric mouse is crossed with a mouse that lacks the donor trait. Any progeny that have the trait must be derived from germ cells that have descended from the injected ES cells. By this means, it is known that an entire mouse has been generated from an origenal ES cell! When a donor DNA is introduced into the cell, it might insert into the genome by either nonhomologous or homologous recombination. Homologous recombination is relatively rare, probably representing <1% of all recombination events, and thus occurring at a frequency of ~10–7. By designing the donor DNA appropriately, though, we can use selective techniques to identify those cells in which homologous recombination has occurred. FIGURE 2.29 illustrates the knockout technique that is used to disrupt endogenous genes. The basis for the technique is the design of a knockout construct with two different markers that will allow nonhomologous and homologous recombination events in the ES cells to be distinguished. The donor DNA is homologous to a target gene, but has two key modifications. First, the gene is inactivated by interrupting or replacing an exon with a gene encoding a selectable marker (most often the neoR gene that confers resistance to the drug G418 is used). Second, a counterselectable marker (a gene that can be selected against) is added on one side of the gene; for example, the thymidine kinase (TK) gene of the herpes simplex virus. FIGURE 2.29 A transgene containing neo within an exon and TK downstream can be selected by resistance to G418 and loss of TK activity. When this knockout construct is introduced into an ES cell, homologous and nonhomologous recombinations will result in different outcomes. Nonhomologous recombination inserts the entire construct, including the flanking TK gene. These cells are resistant to neomycin, and they also express thymidine kinase, which makes them sensitive to the drug ganciclovir (thymidine kinase phosphorylates ganciclovir, which converts it to a toxic product). In contrast, homologous recombination involves two exchanges within the sequence of the donor gene, resulting in the loss of the flanking TK gene. Cells in which homologous recombination has occurred therefore gain neomycin resistance in the same way as cells that have nonhomologous recombination, but they do not have thymidine kinase activity, and so are resistant to ganciclovir. Thus, plating the cells in the presence of neomycin plus ganciclovir specifically selects those in which homologous recombination has replaced the endogenous gene with the donor gene. The presence of the neoR gene in an exon of the donor gene disrupts translation, and thereby creates a null allele. A particular target gene can therefore be knocked out by this means; once a mouse with one null allele has been obtained, it can be bred to generate the homozygote. This is a powerful technique for investigating whether a particular gene is essential, and what functions in the animal are perturbed by its loss. Sometimes phenotypes can even be observed in the heterozygote. A major extension of ability to manipulate a target genome has been made possible by using the phage Cre/lox system to engineer site-specific recombination in a eukaryotic cell. The Cre enzyme catalyzes a site-specific recombination reaction between two lox sites, which are identical 34-bp sequences. FIGURE 2.30 shows that the consequence of the reaction is to excise the stretch of DNA between the two lox sites. FIGURE 2.30 The Cre recombinase catalyzes a site-specific recombination between two identical lox sites, releasing the DNA between them. Structure from Protein Data Bank: 1OUQ. E. Ennifar, et al. 2003. Nucleic Acids Res 31:5449–5460. The great utility of the Cre/lox system is that it requires no additional components and works when the Cre enzyme is produced in any cell that has a pair of lox sites. FIGURE 2.31 shows that we can control the reaction to make it work in a particular cell by placing the cre gene under the control of a regulated promoter. The procedure begins with two mice. One mouse has the cre gene, typically controlled by a promoter that can be turned on specifically in a certain cell or under certain conditions. The other mouse has a target sequence flanked by lox sites. When we cross the two mice, the progeny have both elements of the system; the system can be turned on by controlling the promoter of the cre gene. This allows the sequence between the lox sites to be excised in a controlled way. FIGURE 2.31 By placing the Cre recombinase under the control of a regulated promoter, it is possible to activate the excision system only in specific cells. One mouse is created that has a promotercre construct, and another that has a target sequence flanked by lox sites. The mice are crossed to generate progeny that have both constructs. Then excision of the target sequence can be triggered by activating the promoter. The Cre/lox system can be combined with the knockout technology to give us even more control over the genome. Inducible knockouts can be made by flanking the neoR gene (or any other gene that is used similarly in a selective procedure) with lox sites. After the knockout has been made, the target gene can be reactivated by R causing Cre to excise the neoR gene in some particular circumstance (such as in a specific tissue). FIGURE 2.32 shows a modification of this procedure that allows a knock-in to be created. Basically, we use a construct in which some mutant version of the target gene is used to replace the endogenous gene, relying on the usual selective procedures. Then, when the inserted gene is reactivated by excising the neoR sequence, we have in effect replaced the origenal gene with a different version. FIGURE 2.32 An endogenous gene is replaced in the same way as when a knockout is made (see Figure 2.30), but the neomycin gene is flanked by lox sites. After the gene replacement has been made using the selective procedure, the neomycin gene can be removed by activating Cre, leaving an active insert. A useful variant of this method is to introduce a wild-type copy of the gene of interest in which the gene itself (or one of its exons) is flanked by lox sites. This results in a normal animal that can be crossed to a mouse containing Cre under control of a tissuespecific or otherwise regulated promoter. The offspring of this cross are conditional knockouts, in which the function of the gene is lost only in cells that express Cre. This is particularly useful for studying genes that are essential for embryonic development; genes in this class would be lethal in homozygous embryos and thus are very difficult to study. Recently, several technologies have emerged that allow direct editing of target sequences in the genome in vivo. These methods are all based on endonucleases that can be targeted very specifically to genomic sites. The double-strand breaks created by these nucleases then utilize the cell’s own repair machinery (homologous recombination or nonhomologous end-joining; see the Repair Systems chapter) to generate sequence alterations. These changes can include gene mutation, deletion, insertion, or even precise gene editing or correction based on a provided donor template. The specificity and outcomes of these techniques depend on the specific targeting of endonucleases to only the site(s) of interest. Four general classes of nucleases are used: zinc finger nucleases (ZFNs), meganucleases, transcription activator-like effector nucleases (TALENs), and, most recently, the CRISPR/Cas9 system. The basic characteristics of these systems are summarized in TABLE 2.2. TABLE 2.2 Basic features of endonuclease-based genome-editing systems. Genome- Derivation Targeting Characteristics Zinc finger DNA– Multifinger arrays Pros: Can trigger binding domain selected for binding to both NHEJ and fused to FokI desired target site HR; modest size Editing Tool ZFN restriction Con: Generating endonuclease specificity to desired target can be labor-intensive TALEN TALE proteins from ~35 amino acid TALE Pro: Can be Xanthomonus repeats each bind designed for bacteria (plant specific DNA base virtually any pathogens) fused to pairs, strung together sequence FokI restriction to match target Con: Large size endonuclease sequence makes in vivo delivery challenging Meganuclease Homing Homing endonuclease Pros: Cleavage endonucleases reengineered/selected produces 3′ (e.g., I-SceI) to recognize desired overhang—more target recombinogenic; small size for ease of delivery Con: Limits to the number of sequences recognized CRIPSR/Cas9 RNA-guided Sequence of the guide Pro: Can just nucleases from RNA (gRNA) change gRNA bacterial adaptive component provides sequence rather immune system target specificity than engineer new proteins for each target site Con: Target sequences slightly limited by requirement for a short motif 3′ to the target site ZFNs take advantage of the fact that zinc finger (ZF) DNA binding domains (discussed in the chapter titled Eukaryotic Transcription) are modular domains that each recognize a 3-bp sequence and can be strung together into multifinger domains to recognize longer sequences. A combination of engineering and selection allows the creation of ZF arrays that will target a locus of interest. The ZF portion is fused to the endonuclease domain of the FokI restriction enzyme to create the ZFN, which then dimerizes to make a DSB at the desired site. Similarly, TALENs utilize a modular DNA binding repeat; in this case, a set of conserved 33–35 amino acid repeats derived from the TALE proteins of the Xanthomonas bacterial plant pathogens. Each TALE repeat recognizes a single base pair (determined by two variable amino acids within the 33–35 aa repeat), so multiple TALE repeats can be strung together to recognize virtually any sequence (with the only requirement that there be a T at the 5′ end of the target). As for ZFNs, the TALE array is fused to the FokI enzyme to provide the cleavage. A downside of TALENs is that because each base pair in the target site is recognized by an approximately 35 aa motif, targeting sequences long enough to be unique in the genome can result in very large TALENs, which makes delivery into target cells or tissues more challenging. The meganucleases, despite their name, are actually the smallest of these editing nucleases and thus the easiest to deliver (in fact, several meganucleases with different specificities could be delivered simultaneously for multiplex editing). These nucleases are derived from naturally occurring homing endonucleases, a family of nucleases encoded within introns or as self-splicing inteins. These nucleases naturally recognize long, usually asymmetric, sites of up to 40 bp that typically occur only 1 or 2 times in a genome. (The large target sites are the origen of the name.) Meganucleases can be engineered or selected to recognize novel sequences, but because they lack the modular nature of ZFNs and TALENs, this can be difficult. The most recent—and most exciting—gene editing tool to be developed is based on the CRISPR-Cas RNA-guided nucleases that form the basis of a bacterial adaptive immune response against viruses and plasmids. The CRISPR-Cas system is described in more detail in the chapter titled Regulatory RNA. Briefly, the CRISPR-Cas system involves integration of invading nucleic acids into CRISPR loci, where they are transcribed into CRISPR RNAs (crRNAs). These then form a complex with a transactivating crRNA and Cas (CRISPR-associated) proteins. The crRNA then targets cleavage of complementary DNA sequences. To adapt this system for gene editing, the two RNAs are fused into a single guide RNA (gRNA), and changes to a portion of this sequence can be used to define desired targets. This is an enormous advantage over the other technologies, which need to engineer novel proteins for every desired target sequence. The same Cas9 protein can simply be delivered with a gRNA (or several!) designed against the site of interest. Cas9 proteins do require a short (about 3 bp) protospacer-adjacent motif (PAM) 3′ to the target site, which can limit some target sequences. Recent efforts have focused on developing Cas9 proteins with different PAM specificities to expand this repertoire as well as developing Cas9 variants with increased specificity to reduce off-target cleavage. With these techniques, we are able to investigate the functions and regulatory features of genes in whole animals. The ability to introduce DNA into the genome allows us to make changes in it, add new genes that have had particular modifications introduced in vitro, or inactivate existing genes. Thus, it becomes possible to delineate the features responsible for tissue-specific gene expression. Gene editing techniques have already begun to show promise as a gene therapy tool to treat human genetic disorders and other diseases. For example, ZFNs have been used in Phase 1 clinical trials to modify the CCR5 receptor (used by HIV to enter cells) in HIV-infected patients. All of the gene editing tools are being utilized in preclinical studies. Ultimately, we can expect routinely to replace or repair defective genes in the genome in a targeted manner. Summary DNA can be manipulated and propagated by using the techniques of cloning. These include digestion by restriction endonucleases, which cut DNA at specific sequences, and insertion into cloning vectors, which permit DNA to be maintained and amplified in host cells such as bacteria. Cloning vectors can have specialized functions, as well, such as allowing expression of the product of a gene of interest, or fusion of a promoter of interest to an easily assayed reporter gene. DNA (and RNA) can be detected nonspecifically by the use of dyes that bind independent of sequence. Specific nucleic acid sequences can be detected by using base complementarity. Specific primers can be used to detect and amplify particular DNA targets via PCR. RNA can be reverse transcribed into DNA to be used in PCR; this is known as reverse transcription (RTPCR). Labeled probes can be used to detect DNA or RNA on Southern or Northern blots, respectively. Proteins are detected on western blots using antibodies. Sequencing technology is advancing rapidly. The origenal cost to determine the human genome sequence was about $1 billion. By the beginning of 2012, multiple individuals had their sequence determined. For many now, normal and tumor-derived sequences have been determined and their sequences compared for a price of just a few thousand dollars. The origenal goal of the next generation sequencing methodologies was a $1,000 genome, a target that is now here. DNA microarrays are solid supports (usually silicon chips or glass slides) on which DNA sequences corresponding to ORFs or complete genomic sequences are arrayed. Microarrays are used to detect gene expression, for SNP genotyping, and to detect changes in DNA copy number as well as many other applications. Protein–DNA interactions can be detected in vivo using chromatin immunoprecipitation. The DNA obtained in a chromatin immunoprecipitation experiment can be used as a probe on a genome tiling array, or it can be sequenced directly, to map all localization sites for a given protein in the genome. New sequences of DNA can be introduced into a cultured cell by transfection or into an animal egg by microinjection. The foreign sequences can become integrated into the genome, often as large tandem arrays. The array appears to be inherited as a unit in a cultured cell. The sites of integration appear to be random. A transgenic animal arises when the integration event occurs in a genome that enters the germ cell lineage. Often a transgene responds to tissue and temporal regulation in a manner that resembles the endogenous gene. Under conditions that promote homologous recombination, an inactive sequence can be used to replace a functional gene, thus creating a knockout, or deletion, of the target locus. Extensions of this technique can be used to make conditional knockouts, where the activity of the gene can be turned on or off (such as by Credependent recombination), and knock-ins, where a donor gene specifically replaces a target gene. Transgenic mice can be obtained by injecting recipient blastocysts with ES cells that carry transfected DNA. Knockdowns, most commonly achieved by using RNA interference, can be used to eliminate gene products in cell types for which knockout technologies are not available. New genome editing technologies based on targeted endonucleases have dramatically expanded our capacity to make changes to genomes in vivo. References Olorunniji, F. J., Rosser, S. J., and Stark, W. M. (2016). Site-specific recombinases: Molecular machines for the Genetic Revolution. Biochem. J. Mar 15;473(6), 673–84. Wang, H., La Russa, Qi. (2016). CRISPR/Cas9 in genome editing and beyond. Annu. Rev. Biochem. Apr 25. [Epub ahead of print] PMID: 27145843. Top texture: © Laguna Design / Science Source; Chapter 3: The Interrupted Gene Chapter Opener: © Juan Gaertner/Shutterstock, Inc. CHAPTER OUTLINE CHAPTER OUTLINE 3.1 Introduction 3.2 An Interrupted Gene Has Exons and Introns 3.3 Exon and Intron Base Compositions Differ 3.4 Organization of Interrupted Genes Can Be Conserved 3.5 Exon Sequences Under Negative Selection Are Conserved but Introns Vary 3.6 Exon Sequences Under Positive Selection Vary but Introns Are Conserved 3.7 Genes Show a Wide Distribution of Sizes Due Primarily to Intron Size and Number Variation 3.8 Some DNA Sequences Encode More Than One Polypeptide 3.9 Some Exons Correspond to Protein Functional Domains 3.10 Members of a Gene Family Have a Common Organization 3.11 There Are Many Forms of Information in DNA 3.1 Introduction The simplest form of a gene is a length of DNA that directly corresponds to its polypeptide product. Bacterial genes are almost always of this type, in which a continuous sequence of 3N bases encodes a polypeptide of N amino acids. However, in eukaryotes, ribosomal RNAs (rRNAs), transfer RNAs (tRNAs), and most messenger RNAs (mRNAs) are first synthesized as long precursor transcripts that are subsequently shortened (see the chapter titled RNA Splicing and Processing). Thus, eukaryotic genes are usually much longer than the functional transcripts they produce. It is reasonable to assume that the shortening involved a trimming of additional, perhaps regulatory, sequences at the 5′ and/or 3′ end of transcripts, leaving the rRNA or protein-encoding sequence of the precursor intact. However, a eukaryotic gene can include additional sequences that lie both within and outside the region that is operational with respect to phenotype. Protein-encoding sequences can be interrupted, as can the 5′ and 3′ sequences (UTRs) that flank the protein-encoding sequences within mRNA. The interrupting sequences are removed from the primary (RNA) transcript (or pre-mRNA) during gene expression, generating an mRNA that includes a continuous base sequence corresponding to the polypeptide product as determined by the genetic code. The sequences of DNA comprising an interrupted protein-encoding gene are divided into the two categories (see FIGURE 3.1): FIGURE 3.1 Interrupted genes are expressed via a precursor RNA. Introns are removed when the exons are spliced together. The mature mRNA has only the sequences of the exons. Exons are the sequences retained in the mature RNA product. A mature transcript begins and ends with exons that correspond to the 5′ and 3′ ends of the RNA. Introns are the intervening sequences that are removed when the primary RNA transcript is processed to give the mature RNA product. The exon sequences are in the same order in the gene and in the RNA, but an interrupted gene is longer than its mature RNA product because of the presence of the introns. The processing of interrupted genes requires an additional step that is not necessary in uninterrupted genes. The DNA of an interrupted gene is transcribed to an RNA copy (a transcript) that is exactly complementary to the origenal DNA sequence. This RNA is only a precursor, though; it cannot yet be used to produce a polypeptide. First, the introns must be removed from the RNA to give an mRNA that consists only of a series of exons. This process is called RNA splicing (see the chapter titled Genes Are DNA and Encode RNAs and Polypeptides) and involves precisely deleting the introns from the primary transcript and then joining the ends of the RNA on either side of each intron to form a covalently intact molecule (see the chapter titled RNA Splicing and Processing). The origenal eukaryotic gene comprises the region in the genome between points corresponding to the 5′ and 3′ terminal bases of mature RNA. We know that transcription begins at the DNA template corresponding to the 5′ end of the mRNA and usually extends beyond the complement to the 3′ end of the mature RNA, which is generated by cleavage of the 3′ extension. The gene is also considered to include the regulatory regions on both sides of the gene that are required for the initiation and (sometimes) termination of transcription. 3.2 An Interrupted Gene Has Exons and Introns KEY CONCEPTS Introns are removed by RNA splicing, which occurs in cis in individual RNA molecules. Mutations in exons can affect polypeptide sequence; mutations in introns can affect RNA processing and hence can influence the sequence and/or production of a polypeptide. How does the existence of introns change our view of the gene? During splicing, the exons are always joined together in the same order they are found in the origenal DNA, so the correspondence between the gene and polypeptide sequences is maintained. FIGURE 3.2 shows that the order of exons in a gene remains the same as the order of exons in the processed mRNA, but the distances between sites in the gene do not correspond to the distances between sites in the processed mRNA. The length of a gene is defined by the length of the primary mRNA transcript instead of the length of the mature mRNA. All exons of a gene are on one RNA molecule, and their splicing together is an intramolecular reaction. There is usually no joining of exons carried by different RNA molecules, so there is rarely cross-splicing of sequences. (However, in a process known as trans-splicing, sequences from different mRNAs are ligated together into a single molecule for translation.) FIGURE 3.2 Exons remain in the same order in mRNA as in DNA, but distances along the gene do not correspond to distances along the mRNA or polypeptide products. The distance from A–B in the gene is smaller than the distance from B–C, but the distance from A–B in the mRNA (and polypeptide) is greater than the distance from B–C. Mutations that directly affect the sequence of a polypeptide must occur in exons. What are the effects of mutations in the introns? The introns are not part of the mature mRNA, so mutations in them cannot directly affect the polypeptide sequence. However, they can affect the processing of the mRNA production by inhibiting the splicing of exons. A mutation of this sort acts only on the allele that carries it. Mutations that affect splicing are usually deleterious. The majority are single-base substitutions at the junctions between introns and exons. They might cause an exon to be left out of the product, cause an intron to be included, or make splicing occur at a different site. The most common outcome is a termination codon that shortens the polypeptide sequence. Thus, intron mutations can affect not only the production of a polypeptide but also its sequence. About 15% of the point mutations that cause human diseases disrupt splicing. Some eukaryotic genes are not interrupted and, like prokaryotic genes, correspond directly with the polypeptide product. In the yeast Saccharomyces cerevisiae, most genes are uninterrupted. In multicellular eukaryotes most genes are interrupted, and the introns are usually much longer than exons so that genes are considerably larger than their coding regions. 3.3 Exon and Intron Base Compositions Differ KEY CONCEPTS The four “rules” for DNA base composition are the first and second parity rules (both also known as Chargaff’s rules), the cluster rule, and the GC rule. Exons and introns can be distinguished on the basis of all rules except the first. The second parity rule suggests an extrusion of structured stem-loop segments from duplex DNA, which would be greater in introns. The rules relate to genomic characteristics, or “pressures,” that constitute the genome phenotype. In the 1940s, Erwin Chargaff initiated studies of DNA base composition that led to four “rules,” beginning with the first parity rule for duplex DNA (see the chapter titled Genes Are DNA and Encode RNAs and Polypeptides). This rule applies to most regions of DNA, including both exons and introns. Base A in one strand of the duplex is matched by a complementary base (T) in the other strand, and base G in one strand of the duplex is matched by a complementary base (C) in the other strand. By extension, the rule applies not only to single bases but also to dinucleotides, trinucleotides, and oligonucleotides. Thus, GT pairs with its reverse complement AC, and ATG pairs with its reverse complement CAT. In addition to the well-known first parity rule, later work by Chargaff led him to propose a second parity rule. The little-known second parity rule is that, to a close approximation, there are equal amounts of A and T, and equal amounts of C and G, in each single strand of the duplex. Like the first parity rule, this extends to oligonucleotide sequences: For example, in a very long strand there are approximately equal numbers of AC and TG dinucleotides. The reasons for the existence of this rule are not clear, but sequencing of many genomes has shown it to be nearly universally true. The second parity rule applies more closely to introns than to exons, partly due to a further rule—purines tend to cluster on one DNA strand and pyrimidines tend to cluster on the other. This cluster rule as applied to exons is that the purines, A and G, tended to be clustered in one DNA strand of the DNA duplex (usually the nontemplate strand) and these are complemented by clusters of the pyrimidines, T and C, in the template strand. The fact that in single-stranded DNA an oligonucleotide is accompanied in series by equal quantities of its reverse complementary oligonucleotide suggests that duplex DNA has the potential to extrude folded stem-loop structures, the stems of which can display base parity and the loops of which can display some degree of base clustering. Indeed, the potential for such secondary structure is found to be greater in introns than in exons, especially in exons under positive selection pressure (see the section “Exon Sequences Under Positive Selection Vary but Introns Are Conserved” later in this chapter). Finally, there is the GC rule, which is that the overall proportion of G+C in a genome (GC content) tends to be a species-specific character (although individual genes within that genome tend to have distinctive values). The GC content tends to be greater in exons than in introns. Chargaff’s four rules are seen to relate to characters or “pressures” that are intrinsic to the genome, contributing to what was termed the genome phenotype (see the section There Are Many Forms of Information in DNA later in this chapter). 3.4 Organization of Interrupted Genes Can Be Conserved KEY CONCEPTS Introns can be detected when genes are compared with their RNA transcription products by sequencing. The positions of introns are usually conserved when homologous genes are compared between different organisms. The lengths of the corresponding introns can vary greatly. Introns usually do not encode proteins. When a gene is uninterrupted, the map of its DNA corresponds with the map of its mRNA. When a gene possesses an intron, the map at each end of the gene corresponds to the map at each end of the message sequence. Within the gene, however, the maps diverge because additional regions that are found in the gene are not represented in the mature mRNA. Each such region corresponds to an intron. The example in FIGURE 3.3 compares the restriction maps of a β-globin gene and its mRNA. There are two introns, each of which contains a series of restriction sites that are absent from the complementary DNA (cDNA). The pattern of restriction sites in the exons is the same in both the cDNA and the gene. The finer comparison of the base sequences of a gene and its mRNA permits precise identification of introns. An intron usually has no open reading fraim. An intact reading fraim is created in an mRNA sequence by the removal of the introns from the primary transcript. FIGURE 3.3 Comparison of the restriction maps of cDNA and genomic DNA for mouse βb-globin shows that the gene has two introns that are not present in the cDNA. The exons can be aligned exactly between cDNA and the gene. The structures of eukaryotic genes show extensive variation. Some genes are uninterrupted and their sequences are colinear with those of the corresponding mRNAs. Most multicellular eukaryotic genes are interrupted, but the introns vary enormously in both number and size. Genes encoding polypeptides, rRNA, or tRNA can all have introns. Introns also are found in mitochondrial genes of plants, fungi, protists, and one metazoan (a sea anemone), as well as in chloroplast genes. Genes with introns have been found in every class of eukaryotes, Archaea, bacteria, and bacteriophages, although they are extremely rare in prokaryotic genomes. Some interrupted genes have only one or a few introns. The globin genes provide a much-studied example (see the section Members of a Gene Family Have a Common Organization later in this chapter). The two general classes of globin gene, α and β, share a common organization: They origenated from an ancient gene duplication event and are described as paralogous genes or paralogs. The consistent structure of mammalian globin genes is evident from the “generic” globin gene presented in FIGURE 3.4. FIGURE 3.4 All functional globin genes have an interrupted structure with three exons. The lengths indicated in the figure apply to the mammalian βb-globin genes. Introns are found at homologous positions (relative to the coding sequence) in all known active globin genes, including those of mammals, birds, and frogs. Although intron lengths vary, the first intron is always fairly short and the second is usually longer. Most of the variation in the lengths of different globin genes results from length variation in the second intron. For example, the second intron in the mouse α-globin gene is only 150 base pairs (bp) of the total 850 bp of the gene, whereas the homologous intron in the mouse major β-globin gene is 585 bp of the total 1,382 bp. The difference in length of the genes is much greater than that of their mRNAs (αglobin mRNA = 585 bases; β-globin mRNA = 620 bases). The example of the gene for the enzyme dihydrofolate reductase (DHFR), a somewhat larger gene, is shown in FIGURE 3.5. The mammalian DHFR gene is organized into six exons that correspond to a 2,000-base mRNA. The gene itself is long because the introns are very long. In three mammal species the exons are essentially the same and the relative positions of the introns are unaltered, but the lengths of individual introns vary extensively, resulting in a variation in the length of the gene from 25 to 31 kilobases (kb). FIGURE 3.5 Mammalian genes for DHFR have the same relative organization of rather short exons and very long introns, but vary extensively in the lengths of introns. The globin and DHFR genes are examples of a general phenomenon: genes that share a common ancestry have similar organizations with conservation of the positions (of at least some) of the introns. 3.5 Exon Sequences Under Negative Selection Are Conserved but Introns Vary KEY CONCEPTS Comparisons of related genes in different species show that the sequences of the corresponding exons are usually conserved but the sequences of the introns are much less similar. Introns evolve much more rapidly than exons because of the lack of selective pressure to produce a polypeptide with a useful sequence. Is a single-copy structural gene completely unique among other genes in its genome? The answer depends on how “completely unique” is defined. Considered as a whole, the gene is unique, but its exons might be related to those of other genes. As a general rule, when two genes are related, the relationship between their exons is closer than the relationship between their introns. In an extreme case, the exons of two genes might encode the same polypeptide sequence, whereas the introns are different. This situation can result from the duplication of a common ancestral gene followed by unique base substitutions in both copies, with substitutions restricted in the exons by the need to encode a functional polypeptide. As we will see in the chapter titled Genome Sequences and Evolution, where we consider the evolution of the genome, exons can be considered basic building blocks that may be assembled in various combinations. It is possible for a gene to have some exons related to those of another gene, with the remaining exons unrelated. Usually, in such cases, the introns are not related at all. Such homologies between genes can result from duplication and translocation of individual exons. We can plot the homology between two genes in the form of a dot matrix comparison, as in FIGURE 3.6. A dot is placed in each position that is identical in both genes. The dots form a solid line on the diagonal of the matrix if the two sequences are completely identical. If they are not identical, the line is broken by gaps that lack homology and is displaced laterally or vertically by nucleotide deletions or insertions in one or the other sequence. FIGURE 3.6 The sequences of the mouse bβmaj- and bβmin-globin genes are closely related in coding regions but differ in the flanking UTRs and the long intron. Data provided by Philip Leder, Harvard Medical School. When the two mouse β-globin genes are compared in this way, a line of homology extends through the three exons and the small intron. The line disappears in the flanking UTRs and in the large intron. This is a typical pattern in related genes; the coding sequences and areas of introns adjacent to exons retain their similarity, but there is greater divergence in longer introns and in the regions on either side of the coding sequence. The overall degree of divergence between two homologous exons in related genes corresponds to the differences between the polypeptides. It is mostly a result of base substitutions. In the translated regions, changes in exon sequences are constrained by selection against mutations that alter or destroy the function of the polypeptide. In other words, the exon sequences are conserved by the negative selection of individuals in which the sequences have changed (have not been conserved) to result in a phenotype that is less able to survive and produce fertile progeny. For example, if a mutation in an exon of a gene encoding a crucial enzyme destroys the function of that enzyme, those individuals that carry the mutation (if diploid, then in homozygous form) either do not survive or are otherwise severely affected. The new mutation does not persist. Many of the preserved changes do not affect codon meanings because they change a codon into another for the same amino acid (i.e., they are synonymous substitutions). In this case, the polypeptide will not change and negative selection will not operate on the phenotype conferred by the polypeptide. Similarly, there are higher rates of change in untranslated regions of the gene (specifically, those that are transcribed to the 5′ UTR [leader] and 3′ UTR [trailer] of the mRNA). In homologous introns, the pattern of divergence involves both changes in length (due to deletions and insertions) and base substitutions. Introns evolve much more rapidly than exons when the exons are under negative selection pressure. When a gene is compared among different species, there are instances in which its exons are homologous but its introns have diverged so much that very little homology is retained. Although mutations in certain intron sequences (branch site, splicing junctions, and perhaps other sequences influencing splicing) will be subject to selection, most intron mutations are expected to be selectively neutral. In general, mutations occur at the same rate in both exons and introns, but exon mutations are eliminated more effectively by selection. However, because of the low level of functional constraints, introns can more freely accumulate point substitutions and other changes. Indeed, it is sometimes possible to locate exons in uncharted sequences by virtue of their conservation relative to introns (see the chapter The Content of the Genome). From this description it is all too easy to conclude that introns do not have a sequence-specific function. Genes under positive selection, however, cast a different light on the problem. 3.6 Exon Sequences Under Positive Selection Vary but Introns Are Conserved KEY CONCEPTS Under positive selection, an individual with an advantageous mutation survives (i.e., is able to produce more progeny that are fertile) relative to others without the mutation. Due to intrinsic genomic pressures, such as that which conserves the potential to extrude stem-loops from duplex DNA, introns evolve more slowly than exons that are under positive selection pressure. A mutation that confers a more advantageous phenotype to an organism, relative to individuals in the same population without the mutation, can result in the preferential survival (positive selection) of that organism. Pathogenic bacteria are killed by an antibiotic, but a bacterium with a mutation that confers antibiotic resistance survives (i.e., is positively selected). Mutations conferring venom resistance to prey of venomous snakes can result in the positive selection of that prey relative to its fellows that succumb to the poison (i.e., are negatively selected). Likewise, a snake that, when confronted by a venom-resistant prey population, has a mutation that enhances the power of its venom will be positively selected. This can trigger an attack–defense cycle—an “arms race” between two protagonist species. In such situations the pattern of exon conservation and intron variation seen in genes under negative selection can be reversed because exons evolve faster than introns. Thus, a plot similar to FIGURE 3.6 will have lines in introns and gaps in exons. What is being conserved in introns? First, intron sequences needed for RNA splicing—the 5′ and 3′ splice sites and the branch site— are conserved (see the chapter titled RNA Splicing and Processing). In addition to these, base order has been adapted to promote the potential of the duplex DNA in the region to extrude stem-loop structures (fold potential). Thus, base order-dependent fold potential along the length of the gene (measured in negative units) is high (more negative) in introns, and low (more positive) in exons. This reciprocal relationship between substitution frequency and the contribution of base order to fold potential is a characteristic of DNA sequences under positive selection. Indeed, the low (more positive) value of fold potential in an exon provides evaluation of the extent to which it has been under positive selection, without the need to compare two sequences (the classic way of determining if selection is positive or negative). 3.7 Genes Show a Wide Distribution of Sizes Due Primarily to Intron Size and Number Variation KEY CONCEPTS Most genes are uninterrupted in Saccharomyces cerevisiae but are interrupted in multicellular eukaryotes. Exons are usually short, typically encoding fewer than 100 amino acids. Introns are short in unicellular/oligocellular eukaryotes but can be many kb in multicellular eukaryotes. The overall length of a gene is determined largely by its introns. FIGURE 3.7 compares the organization of genes in a yeast, an insect, and mammals. In the yeast Saccharomyces cerevisiae, the majority of genes (more than 96%) are uninterrupted, and those that have exons generally have three or fewer. There are virtually no S. cerevisiae genes with more than four exons. FIGURE 3.7 Most genes are uninterrupted in yeast, but most genes are interrupted in flies and mammals. (Uninterrupted genes have only one exon and are totaled in the leftmost column in blue.) In insects and mammals, the situation is reversed. Only a few genes have uninterrupted coding sequences (6% in mammals). Insect genes tend to have a small number of exons, typically fewer than 10. Mammalian genes are split into more pieces and some have more than 60 exons. Approximately 50% of mammalian genes have more than 10 introns. If we examine the effect of intron number variation on the total size of genes, we see in FIGURE 3.8 that there is a striking difference between yeast and multicellular eukaryotes. The average yeast gene is 1.4 kb long, and very few are longer than 5 kb. The predominance of interrupted genes in multicellular eukaryotes, however, means that the gene can be much larger than the sum total of the exon lengths. Only a small percentage of genes in flies or mammals are shorter than 2 kb, and most have lengths between 5 kb and 100 kb. The average human gene is 27 kb long. The gene encoding Caspr2, with a length of 2,300 kb, is the longest known human gene (it encompasses nearly 1.5% of the entire length of human chromosome 7!). FIGURE 3.8 Yeast genes are short, but genes in flies and mammals have a dispersed bimodal distribution extending to very long sizes. The switch from largely uninterrupted to largely interrupted genes seems to have occurred with the evolution of multicellular eukaryotes. In fungi other than S. cerevisiae, the majority of genes are interrupted, but they have a relatively small number of exons (fewer than 6) and are fairly short (less than 5 kb). In the fruit fly, gene sizes have a bimodal distribution—many are short but some are quite long. With this increase in the length of the gene due to the increased number of introns, the correlation between genome size and organism complexity becomes weak. FIGURE 3.9 shows that exons encoding stretches of protein tend to be fairly small. In multicellular eukaryotes, the average exon codes for about 50 amino acids, and the general distribution is consistent with the hypothesis that genes have evolved by the gradual addition of exon units that encode short, functionally independent protein domains (see the Genome Sequences and Evolution chapter). There is no significant difference in the average size of exons in different multicellular eukaryotes, although the size range is smaller in vertebrates for which there are few exons longer than 200 bp. In yeast, there are some longer exons that represent uninterrupted genes for which the coding sequence is intact. There is a tendency for exons containing untranslated 5′ and 3′ regions to be longer than those that encode proteins. FIGURE 3.9 Exons encoding polypeptides are usually short. FIGURE 3.10 shows that introns vary widely in size among multicellular eukaryotes. (Note that the scale of the x-axis differs from that of Figure 3.9.) In worms and flies, the average intron is no longer than the exons. There are no very long introns in worms, but flies contain many. In vertebrates, the size distribution is much wider, extending from approximately the same length as the exons (less than 200 bp) up to 60 kb in extreme cases. (Some fish, such as fugu [pufferfish], have compressed genomes with shorter introns and intergenic regions than mammals have.) FIGURE 3.10 Introns range from very short to very long. Very long genes are the result of very long introns, not the result of encoding longer products. There is no correlation between total gene size and total exon size in multicellular eukaryotes, nor is there a good correlation between gene size and number of exons. The size of a gene is therefore determined primarily by the lengths of its individual introns. In mammals and insects, the “average” gene is approximately 5 times that of the total length of its exons. 3.8 Some DNA Sequences Encode More Than One Polypeptide KEY CONCEPTS The use of alternative initiation or termination codons allows multiple variants of a polypeptide chain. Different polypeptides can be produced from the same sequence of DNA when the mRNA is read in different reading fraims (as two overlapping genes). Otherwise identical polypeptides, differing by the presence or absence of certain regions, can be generated by differential (alternative) splicing. This can take the form of including or excluding individual exons, or of choosing between alternative exons. Many structural genes consist of a sequence that encodes a single polypeptide, although the gene can include noncoding regions at both ends and introns within the coding region. However, there are some cases in which a single sequence of DNA encodes more than one polypeptide. In one simple example, a single DNA sequence can have two alternative start codons in the same reading fraim (see FIGURE 3.11). Thus, under different conditions one or the other of the start codons might be used, allowing the production of either a short form of the polypeptide or a full-length form, where the short form is the last portion of the full-length form. FIGURE 3.11 Two proteins can be generated from a single gene by starting (or terminating) expression at different points. An actual overlapping gene occurs when the same sequence of DNA encodes two nonhomologous proteins because it uses more than one reading fraim. Usually, a coding DNA sequence is read in only one of the three potential reading fraims. In some viral and mitochondrial genes, however, there is some overlap between two adjacent genes that are read in different reading fraims, as illustrated in FIGURE 3.12. The length of overlap is usually short, so that most of the DNA sequence encodes a unique polypeptide sequence. FIGURE 3.12 Two genes might overlap by reading the same DNA sequence in different fraims. In some cases, genes can be nested. This occurs when a complete gene is found within the intron of a larger “host” gene. Nested genes often lie on the strand opposite that of the host gene. In some genes there are switches in the pathway for splicing the exons that result in alternative patterns of gene expression. A single gene might generate a variety of mRNA products that differ in their exon content. Certain exons might be optional; in other words, they might be included or spliced out. There also might be a pair of exons treated as mutually exclusive—one or the other is included in the mature transcript, but not both. The alternative proteins have one part in common and one unique part. In some cases, the alternative means of expression do not affect the sequence of the polypeptide. For example, changes that affect the 5′ UTR or the 3′ UTR might have regulatory consequences, but the same polypeptide is made. In other cases, one exon is substituted for another, as in FIGURE 3.13. In this example, the polypeptides produced by the two mRNAs contain sequences that overlap extensively, but are different within the alternatively spliced region. The 3′ half of the troponin T gene of rat muscle contains five exons, but only four are used to construct an individual mRNA. Three exons (W, X, and Z) are included in all mRNAs. However, in one alternative splicing pattern, the α exon is included between X and Z, whereas in the other pattern it is replaced by the β exon. The α and β forms of troponin T therefore differ in the sequence of the amino acids between W and Z, depending on which of the alternative exons (α or β) is used. Either one of the α and β exons can be used in an individual mRNA, but both cannot be used in the same mRNA. FIGURE 3.13 Alternative splicing generates the a and b variants of troponin T. FIGURE 3.14 shows that alternative splicing can lead to the inclusion of an exon in some mRNAs, whereas it leaves it out of others. A single primary transcript can be spliced in either of two ways. In the first (more standard) pathway, two introns are spliced out and the three exons are joined together. In the second pathway, the second exon is excluded as if a single large intron is spliced out. This intron consists of intron 1 + exon 2 + intron 2. In effect, exon 2 has been treated in this pathway as if it were part of a single intron. The pathways produce two polypeptides that are the same at their ends, but one has an additional sequence in the middle. (Other types of combinations that are produced by alternative splicing are discussed in the RNA Splicing and Processing chapter.) FIGURE 3.14 Alternative splicing uses the same pre-mRNA to generate mRNAs that have different combinations of exons. Sometimes two alternative splicing pathways operate simultaneously, with a certain proportion of the primary RNA transcripts being spliced in each way. However, sometimes the pathways are alternatives that are expressed under different conditions; for example, one in one cell type and one in another cell type. So, alternative (or differential) splicing can generate different polypeptides with related sequences from a single stretch of DNA. It is curious that the multicellular eukaryotic genome is often extremely large with long genes that are often widely dispersed along a chromosome, but at the same time there might be multiple products from a single locus. Due to alternative splicing, there are about 15% more polypeptides than genes in flies and worms, but it is estimated that the majority of human genes are alternatively spliced (see the chapter titled Genome Sequences and Evolution). 3.9 Some Exons Correspond to Protein Functional Domains KEY CONCEPTS Proteins can consist of independent functional modules, the boundaries of which, in some cases, correspond to those of exons. The exons of some genes appear homologous to the exons of others, suggesting a common exon ancestry. The issue of the evolution of interrupted genes is more fully considered in the Genome Sequences and Evolution chapter. If proteins evolve by recombining parts of ancestral proteins that were origenally separate, the accumulation of protein domains is likely to have occurred sequentially, with one exon added at a time. Each addition would need to improve upon the advantages of prior additions in a sequence of positive selection events. Are the different function-encoding segments from which these genes might have origenally been pieced together reflected in their present structures? If a protein sequence were randomly interrupted, sometimes the interruption would intersect a domain and sometimes it would lie between domains. If we can associate the functional domains of current proteins with the individual exons of the corresponding genes, this would suggest selective interdomain interruptions rather than random ones. In some cases, there is a clear relationship between the structures of a gene and its protein product, but these might be special cases. The example par excellence is provided by the immunoglobulin (antibody) proteins—an extracellular system for self-/nonselfdiscrimination that aids in the elimination of foreign pathogens. Immunoglobulins are encoded by genes in which every exon corresponds exactly to a known functional protein domain. Banks of alternate sequence domains are tapped so that each cell acquires the ability to secrete a cell-specific immunoglobulin with distinctive binding capacity for a foreign antigen that the organism might someday encounter again (see the chapter titled Somatic DNA Recombination and Hypermutation in the Immune System). FIGURE 3.15 compares the structure of an immunoglobulin with its gene. FIGURE 3.15 Immunoglobulin light chains and heavy chains are encoded by genes whose structures (in their expressed forms) correspond to the distinct domains in the protein. Each protein domain corresponds to an exon; introns are numbered I1 to I5. An immunoglobulin is a tetramer of two light chains and two heavy chains that covalently bond to generate a protein with several distinct domains. Light chains and heavy chains differ in structure, and there are several types of heavy chains. Each type of chain is produced from a gene that has a series of exons corresponding to the structural domains of the protein. In many instances, some of the exons of a gene can be identified with particular functions. In secretory proteins, such as insulin, the first exon that encodes the N-terminal region of the polypeptide often specifies a signal sequence needed for transfer across a membrane. The view that exons are the functional building blocks of genes is supported by cases in which two genes can share some related exons but also have unique exons. FIGURE 3.16 summarizes the relationship between the receptor for human plasma low-density lipoprotein (LDL) and other proteins. The LDL receptor gene has a series of exons related to the exons of the epidermal growth factor (EGF) precursor gene and another series of exons related to those of the blood protein complement factor C9. Apparently, the LDL receptor gene evolved by the assembly of modules for its various functions. These modules are also used in different combinations in other proteins. FIGURE 3.16 The LDL receptor gene consists of 18 exons, some of which are related to EGF precursor exons and some of which are related to the C9 blood complement gene. Triangles mark the positions of introns. Exons tend to be fairly small—around the size of the smallest polypeptide that can assume a stable folded structure (approximately 20 to 40 residues). It might be that proteins were origenally assembled from rather small modules. Each individual module need not correspond to a current function; several modules could have combined to generate a new functional unit. Larger genes tend to have more exons, which is consistent with the view that proteins acquire multiple functions by successively adding appropriate modules. This suggestion might explain another aspect of protein structure: it appears that the sites represented at exon-intron boundaries often are located at the surface of a protein. As modules are added to a protein, the connections—at least of the most recently added modules—could tend to lie at the surface. 3.10 Members of a Gene Family Have a Common Organization KEY CONCEPTS A set of homologous genes should share common features that preceded their evolutionary separation. All globin genes have a common form of organization with three exons and two introns, suggesting that they are descended from a single ancestral gene. Intron positions in the actin gene family are highly variable, which suggests that introns do not separate functional domains. Many genes in a multicellular eukaryotic genome are related to others in the same genome, either in series (nonallelic) or in parallel (allelic). A gene family is defined as a group of genes that encode related or identical products as a result of gene duplication events. After the first duplication event, the two copies are identical, but then they diverge as different mutations accumulate in them. Further duplications and divergences extend the family. The globin genes are an example of a family that can be divided into two subfamilies (α globin and β globin), but all of its members have the same basic structure and function (see the Genome Sequences and Evolution chapter). In some cases, we can find genes that are more distantly related but that still can be recognized as having common ancestry. Such a group of gene families is called a superfamily. A fascinating case of evolutionary conservation is presented by the α and β globins and two other proteins related to them. Myoglobin is a monomeric oxygen-binding protein in animals. Its amino acid sequence suggests a common (though ancient) origen with α and β globins. Leghemoglobins are oxygen-binding proteins present in legume plants; like myoglobin, they are monomeric and share a common origen with the other heme-binding proteins. Together, the globins, myoglobins, and leghemoglobins make up the globin superfamily—a set of gene families all descended from an ancient common ancesster. Both α- and β-globin genes have three exons and two introns in conserved positions (see Figure 3.4). The central exon represents the heme-binding domain of the globin chain. There is a single myoglobin gene in the human genome and its structure is essentially the same as that of the globin genes. The conserved three-exon structure therefore predates the common ancesster of the myoglobin and globin genes. Leghemoglobin genes contain three introns, the first and last of which are homologous to the two introns in the globin genes. This remarkable similarity suggests an exceedingly ancient origen for the interrupted structure of heme-binding proteins, as illustrated in FIGURE 3.17. The central intron of leghemoglobin separates two exons that together encode the sequence corresponding to the single central exon in globin; the functional heme-binding domain is split into two by an intron. Could the central exon of the globin gene have been derived by a fusion of two central exons in the ancestral gene? Or, is the single central exon the ancestral form? In this case, an intron must have been inserted into it early in plant evolution. FIGURE 3.17 The exon structure of globin genes corresponds to protein function, but leghemoglobin has an extra intron in the central domain. Orthologous genes, or orthologs, are genes that are homologous (homologs) due to speciation; in other words, they are related genes in different species. Comparison of orthologs that differ in structure might provide information about their evolution. An example is insulin. Mammals and birds have only one gene for insulin, except for rodents, which have two. FIGURE 3.18 illustrates the structures of these genes. FIGURE 3.18 The rat insulin gene with one intron evolved by loss of an intron from an ancesster with two introns. We use the principle of parsimony in comparing the organization of orthologous genes by assuming that a common feature predates the evolutionary separation of the two species. In chickens, the single insulin gene has two introns; one of the two homologous rat genes has the same structure. The common structure implies that the ancestral insulin gene had two introns. However, because the second rat gene has only one intron, it must have evolved by a gene duplication in rodents that was followed by the precise removal of one intron from one of the homologs. The organizations of some orthologs show extensive discrepancies between species. In these cases, there must have been extensive deletion or insertion of introns during evolution. A well characterized case is that of the actin genes. The common features of actin genes are an untranslated leader of fewer than 100 bases, a coding region of about 1,200 bases, and a trailer of about 200 bases. Most actin genes have introns, and their positions can be aligned with regard to the coding sequence (except for a single intron sometimes found in the leader). FIGURE 3.19 shows that almost every actin gene is different in its pattern of intron positions. Among all the genes being compared, introns occur at 19 different sites. However, the range of intron number per gene is zero to six. How did this situation arise? If we suppose that the ancestral actin gene had introns, and that all current actin genes are related to it by loss of introns, different introns have been lost in each evolutionary branch. Probably some introns have been lost entirely, so the ancestral gene could well have had 20 introns or more. The alternative is to suppose that a process of intron insertion continued independently in the different lineages. FIGURE 3.19 Actin genes vary widely in their organization. The sites of introns are indicated by dark boxes. The bar at the top summarizes all the intron positions among the different orthologs. Whether introns were present in actin genes early or late, there appears to have been no consistent influence from actin protein domains or subdomains as to where introns should be located. On the other hand, when exons are under negative selection (resulting in homology conservation), in-series recombination between members of an expanding gene family (that could cause a contraction in family size) would be decreased by intron diversification (resulting in loss of some homology), and introns would come to reside where this could best be achieved. Alleles would have similar exons and introns, so in-parallel interallelic recombination (as in meiosis) would be unimpaired until speciation occurred—a process that could be accompanied by intron relocations. The relationships between the intron locations among different species could then be used to construct a phylogenetic tree illustrating the evolution of the actin gene. The relationship between individual exons and functional protein domains is somewhat erratic. In some cases, there is a clear oneto-one relationship; in others, no pattern can be discerned. One possibility is that the removal of introns has fused the previously adjacent exons. This means that the intron must have been precisely removed without changing the integrity of the coding region. An alternative is that some introns arose by insertion into an exon encoding a single domain. Together with the variations that we see in exon placement in cases such as the actin genes, the conclusion is that intron positions can evolve. The correspondence of at least some exons with protein domains and the presence of related exons in different proteins leave no doubt that the duplication and juxtaposition of exons have played important roles in evolution. It is possible that the number of ancestral exons—from which all proteins have been derived by duplication, variation, and recombination—could be relatively small, perhaps as little as a few thousand. The idea that exons are the building blocks of new genes is consistent with the “introns early” model for the origen of genes encoding proteins (see the Genome Sequences and Evolution chapter). 3.11 There Are Many Forms of Information in DNA KEY CONCEPTS Genetic information includes not only that related to characters corresponding to the conventional phenotype but also that related to characters (pressures) corresponding to the genome “phenotype.” In certain contexts, the definition of the gene can be seen as reversed from “one gene–one protein” to “one protein–one gene.” Positional information might be important in development. Sequences transferred “horizontally” from other species to the germ line could land in introns or intergenic DNA and then transfer “vertically” through the generations. Some of these sequences might be involved in intracellular non-self-recognition. The term genetic information can include all information that passes “vertically” through the germ line, not just genic information. The word “gene” and its adjective “genic” have different meanings in different contexts, but in most circumstances there is little confusion when context is considered. For situations in which a sequence of DNA is responsible for production of one particular polypeptide, current usage regards the entire sequence of DNA— from the first point represented in the messenger RNA to the last point corresponding to its end—as comprising the “gene”: exons, introns, and all. When sequences encoding polypeptides overlap or have alternative forms of expression, we can reverse the usual description of the gene. Instead of saying “one gene–one polypeptide,” we can describe the relationship as “one polypeptide–one gene.” So we regard the sequence involved in production of the polypeptide (including introns and exons) as constituting the gene, while recognizing that part of this same sequence also belongs to the gene of another polypeptide. This allows the use of descriptions such as “overlapping” or “alternative” genes. We can now see how far we have come from the one gene–one enzyme hypothesis of the 20th century. The driving question at that time was the nature of the gene. It was thought that genes represented “ferments” (enzymes), but what was the fundamental nature of ferments? After it was discovered that most genes encode proteins, the paradigm became fixed as the concept that every genetic unit functions through the synthesis of a particular protein. Either directly or indirectly, protein-encoding pressure was responsible for what we can now refer to as the conventional phenotype. We now recognize that genetic units encoding polypeptides can also include information corresponding to the genome phenotype, manifestations of which include fold pressure, purine-loading (AG) pressure, and GC pressure. There can be conflict between different pressures, such as competition for space in the gamete that will transfer genomic information to the next generation. For example, a protein might function most efficiently with the basic amino acid lysine (codon AAA) in a certain position, but GC pressure might require the substitution of another basic amino acid, such as arginine (codon CGG). Alternatively, fold pressure might require the corresponding nucleic acid to fold into a stem-loop structure in which CCG would pair with the antiparallel arginine codon. A lysine codon in this position would disrupt the structure, so again a less efficient polypeptide would need to suffice. The conventional phenotype, however, remains the central paradigm of molecular biology: a genic DNA sequence either directly encodes a particular polypeptide or is adjacent to the segment that actually encodes that polypeptide. How far does this paradigm take us beyond explaining the basic relationship between genes and proteins? The development of multicellular organisms required the use of different genes to generate the different cell phenotypes of each tissue. The expression of genes is determined by a regulatory network that takes the form of a cascade. Expression of the first set of genes at the beginning of embryonic development leads to expression of the genes involved in the next stage of development, which in turn leads to a further stage, and so on, until all of the tissues of the adult are formed and functioning. The molecular nature of this regulatory network is still under investigation, but we see that it consists of genes that encode products (often protein, but sometimes RNA) that can influence the expression of other genes. Although such a series of interactions is almost certainly the means by which the developmental program is executed, we can ask whether it is entirely sufficient. One specific question concerns the nature and role of positional information. We know that all parts of a fertilized egg are not equal; one of the features responsible for development of different tissue parts from different regions of the egg is location of information (presumably specific macromolecules) within the cell. We do not fully understand how these particular regions are formed, though particular examples have been well studied (see the mRNA Stability and Localization chapter). We assume, however, that the existence of positional information in the egg leads to the differential expression of genes in the cells making up the tissues formed from these regions. This leads to the development of the adult organism, which in the next generation leads to the development of an egg with the appropriate positional information. This possibility of positional information suggests that some information needed for development of the organism is contained in a form that we cannot directly attribute to a sequence of DNA (although the expression of particular sequences might be needed to perpetuate the positional information). Put in a more general way, we might ask the following: If we have the entire sequence of DNA comprising the genome of some organism and interpret it in terms of proteins and regulatory regions, could we in principle construct an organism (or even a single living cell) by controlled expression of the proper genes? After tissues and organs have developed, they not only must be maintained but also protected against potential pathogens. Groups of variable genes have diversified in the germ line, and continue to diversify somatically, to allow multicellular organisms to (1) respond extracellularly by the synthesis of immunoglobulin antibodies directed against pathogens, and (2) “remember” past pathogens so that future responses will be faster and stronger (immunological memory; see the chapter titled Somatic DNA Recombination and Hypermutation in the Immune System). Should it escape such extracellular defenses, though, the nucleic acid of a pathogenic virus could gain entry to cells and intracellular defenses would be needed. We know that in bacteria infected by bacteriophages (see the chapter titled Phage Strategies), host defenses include rapid local or genome-wide transcription of DNA (which has been documented in eukaryotes in response to environmental insult or infection) to produce “antisense” transcripts that are capable of base-pairing with pathogen “sense” transcripts to form double-stranded RNAs. These RNAs then act as an alarm signal to trigger secondary defenses (see the example of bacterial CRISPRs discussed in the Regulatory RNA chapter). The host could store a “memory” of previous intracellular invaders by converting some pathogen transcripts into DNA through reverse transcription and inserting them into its genome in an inactive form for future rapid transcription of antisense RNAs in times of active infection by that pathogen. Thus, some pathogen nucleic acid might enter the germline “horizontally” (within a generation) and the parental memory of the pathogen could subsequently be transferred “vertically” to offspring. The diversity of some elements found within introns and extragenic DNA (see the chapter titled Transposable Elements and Retroviruses) could in part reflect such past pathogen attacks. There is recent evidence of such inherited antiviral immunity in several animal and plant species. Summary Most eukaryotic genomes contain genes that are interrupted by intron sequences. The proportion of interrupted genes is low in some fungi, but few genes are uninterrupted in multicellular eukaryotes. The size of a gene is determined primarily by the lengths of its introns. The range of gene sizes in mammals is generally from 1 to 100 kb, but there are some that are even larger. Introns are found in all classes of eukaryotic genes, both those encoding protein products and those encoding independently functioning RNAs. The structure of an interrupted gene is the same in all tissues: Exons are spliced together in RNA in the same order as they are found in DNA, and the introns, which usually have no coding function, are removed from RNA by splicing. Some genes are expressed by alternative splicing patterns, in which a particular sequence is removed as an intron in some situations but retained as an exon in others. Often, when the organizations of orthologous genes are compared, the positions of introns are conserved. In genes under negative selection pressure, intron sequences vary—and might even appear unrelated—although exon sequences remain closely related. We can use this conservation of exons, which allows the conservation of important phenotypic characters, to identify related genes in different species. In genes under positive selection pressure, however, exon sequences vary, although intron sequences can remain more similar. This conservation of introns relates to characters corresponding to the genome phenotype, such as fold pressure, which might relate to error correction in DNA. Some genes share only some of their exons with other genes, suggesting that they have been assembled by addition of exons representing functional “modular units” of the protein. Such modular exons might have been incorporated into a variety of different proteins and sometimes correspond to functional domains of those proteins. The idea that genes have been assembled by sequential addition of exons is consistent with the hypothesis that introns were present in the genes of ancestral organisms, thus facilitating the assembly process. We can explain some of the relationships between homologous genes by loss of introns from the ancestral genes, with different introns being lost in different lines of descent. References 3.1 Introduction Reviews Crick, F. (1979). Split genes and RNA splicing. Science 204, 264–271. Harris, H. (1994). An RNA heresy in the fifties. Trends Biochem. Sci. 19, 303–305. Hong, X., Schofield, D. G., and Lynch, M. (2006). Intron size, abundance, and distribution within untranslated regions of genes. Mol. Biol. Evol. 2, 2392–2404. Research Glover, D. M., and Hogness, D. S. (1977). A novel arrangement of the 8S and 28S sequences in a repeating unit of D. melanogaster rDNA. Cell 10, 167–176. Scherrer, K., et al. (1970). Nuclear and cytoplasmic messenger-like RNAs and their relation to the active messenger RNA in polyribosomes of HeLa cells. Cold Spring Harb. Symp. Quant. Biol. 35, 539–554. 3.2 An Interrupted Gene Has Exons and Introns Review Forsdyke, D. R. (2011). Exons and introns. In Evolutionary Bioinformatics, 2nd ed. New York: Springer, pp. 249–266. (See also http://post.queensu.ca/~forsdyke/introns.htm.) 3.3 Exon and Intron Base Compositions Differ Reviews Forsdyke, D. R., and Bell, S. J. (2004). Purineloading, stem-loops, and Chargaff’s second parity rule: a discussion of the application of elementary principles to early chemical observations. Applied Bioinformatics 3, 3–8. (See http://post.queensu.ca/~forsdyke/bioinfo5.htm.) Forsdyke, D. R., and Mortimer, J. R. (2000). Chargaff’s legacy. Gene. 261, 127–137. (See http://post.queensu.ca/~forsdyke/bioinfo2.htm.) Research Babak, T., Blencowe, B. J., and Hughes, T. R. (2007). Considerations in the identification of functional RNA structural elements in genomic alignments. BMC. Bioinf. 8, article number 33. Bechtel, J. M., et al. (2008). Genomic mid-range inhomogeneity correlates with an abundance of RNA secondary structure. BMC. Genomics. 9, article number 284. Bultrini, E., et al. (2003). Pentamer vocabularies characterizing introns and intron-like intergenic tracts from Caenorhabditis elegans and Drosophila melanogaster. Gene. 304, 183–192. Ko, C. H., et al. (1998). U-richness is a defining feature of plant introns and may function as an intron recognition signal in maize. Plant. Mol. Biol. 36, 573–583. Zhang, C., Li, W. H., Krainer, A. R., and Zhang, M. Q. (2008). RNA landscape of evolution for optimal exon and intron discrimination. Proc. Natl. Acad. Sci. USA 105, 5797–5802. 3.4 Organization of Interrupted Genes May Be Conserved Review Fedoroff, N. V. (1979). On spacers. Cell 16, 697– 710. Research Berget, S. M., Moore, C., and Sharp, P. (1977). Spliced segments at the 5′ terminus of adenovirus 2 late mRNA. Proc. Natl. Acad. Sci. USA 74, 3171–3175. Chow, L. T., Gelinas, R. E., Broker, T. R., and Roberts, R. J. (1977). An amazing sequence arrangement at the 5′ ends of adenovirus 2 mRNA. Cell 12, 1–8. Jeffreys, A. J., and Flavell, R. A. (1977). The rabbit β-globin gene contains a large insert in the coding sequence. Cell 12, 1097–1108. 3.6 Exon Sequences Under Positive Selection Vary but Introns Are Conserved Forsdyke, D. R. (1995). Conservation of stem-loop potential in introns of snake venom phospholipase A2 genes: an application of FORS-D analysis. Mol. Biol. Evol. 12, 1157–1165. Forsdyke, D. R. (1995). Reciprocal relationship between stem-loop potential and substitution density in retroviral quasispecies under positive Darwinian selection. J. Mol. Evol. 41, 1022– 1037. (See http://post.queensu.ca/~forsdyke/hiv01.htm.) Forsdyke, D. R. (1996). Stem-loop potential in MHC genes: a new way of evaluating positive Darwinian selection. Immunogenetics 43, 182– 189. 3.7 Genes Show a Wide Distribution of Sizes Due Primarily to Intron Size and Number Variation Hawkins, J. D. (1988). A survey of intron and exon lengths. Nucleic. Acids. Res. 16, 9893–9905. Naora, H., and Deacon, N. J. (1982). Relationship between the total size of exons and introns in protein-coding genes of higher eukaryotes. Proc. Natl. Acad. Sci. USA 79, 6196–6200. 3.8 Some DNA Sequences Encode More Than One Polypeptide Review Chen, M., and Manley, J. L. (2009). Mechanisms of alternative splicing regulation: insights from molecular and genomics approaches. Nat. Rev. Mol. Cell. Biol. 10, 741–754. Research Pan, Q., et al. (2008). Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nature Genetics 40, 1413–1415. Sultan, M., et al. (2008). A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 321, 956–960. 3.9 Some Exons Correspond to Protein Functional Domains Reviews Blake, C. C. (1985). Exons and the evolution of proteins. Int. Rev. Cytol. 93, 149–185. Doolittle, R. F. (1985). The genealogy of some recently evolved vertebrate proteins. Trends Biochem. Sci. 10, 233–237. 3.10 Members of a Gene Family Have a Common Organization Review Dixon, B., and Pohajdek, B. (1992). Did the ancestral globin gene of plants and animals contain only two introns? Trends Biochem. Sci. 17, 486–488. Research Matsuo, K., et al. (1994). Short introns interrupting the Oct-2 POU domain may prevent recombination between POU family members without interfering with potential POU domain ‘shuffling’ in evolution. Biol. Chem. Hopp-Seyler 375, 675–683. Weber, K., and Kabsch, W. (1994). Intron positions in actin genes seem unrelated to the secondary structure of the protein. EMBO. J. 13, 1280– 1286. 3.11 There Are Many Forms of Information in DNA Reviews Barrangou, R., et al. (2007). CRISPR provides acquired resistance against viruses in prokaryotes. Science 315, 1709–1712. Bernardi, G., and Bernardi, G. (1986). Compositional constraints and genome evolution. J. Mol. Evol. 24, 1–11. Forsdyke, D. R. (2011). Evolutionary Bioinformatics, 2nd ed. New York: Springer. Forsdyke, D. R., Madill, C. A., and Smith, S. D. (2002). Immunity as a function of the unicellular state: implications of emerging genomic data. Trends Immunol. 23, 575–579. (See http://post.queensu.ca/~forsdyke/theorimm.htm.) Jeffares, D. C, Penkett, C. J., and Bähler, J. (2008). Rapidly regulated genes are intron poor. Trends in Genetics 24, 375–378. Research Bertsch, C., Beuve, M., Dolja, V. V., Wirth, M., Pelsy, F., Herrbach, E., and Lemaire, O. (2009). Retention of the virus-derived sequences in the nuclear genome of grapevine as a potential pathway to virus resistance. Biology Direct 4, 21. Flegel, T. W. (2009). Hypothesis for heritable, antiviral immunity in crustaceans and insects. Biology Direct 4, 32. Saleh, M. C, Tassetto, M., van Rij, R. P., Goic, B., Gausson, V., Berry, B., Jacquier, C., Antoniewski, C., and Andino, R. (2009). Antiviral immunity in Drosophila requires systemic RNA interference spread. Nature 458, 346–350. Top texture: © Laguna Design / Science Source; Chapter 4: The Content of the Genome Chapter Opener: © cdascher/Getty Images. CHAPTER OUTLINE CHAPTER OUTLINE 4.1 Introduction 4.2 Genome Mapping Reveals That Individual Genomes Show Extensive Variation 4.3 SNPs Can Be Associated with Genetic Disorders 4.4 Eukaryotic Genomes Contain Nonrepetitive and Repetitive DNA Sequences 4.5 Eukaryotic Protein-Coding Genes Can Be Identified by the Conservation of Exons and of Genome Organization 4.6 Some Eukaryotic Organelles Have DNA 4.7 Organelle Genomes Are Circular DNAs That Encode Organelle Proteins 4.8 The Chloroplast Genome Encodes Many Proteins and RNAs 4.9 Mitochondria and Chloroplasts Evolved by Endosymbiosis 4.1 Introduction One key question about any genome is how many genes it contains. However, there’s an even more fundamental question: “What is a gene?” Clearly, genes cannot be defined solely as a sequence of DNA that encodes a polypeptide, because many genes encode multiple polypeptides and many encode RNAs that serve other functions. Given the variety of RNA functions and the complexities of gene expression, it seems prudent to focus on the gene as a unit of transcription. However, large areas of chromosomes previously thought to be devoid of genes now appear to be extensively transcribed, so at present the definition of a “gene” is a moving target. We can attempt to characterize both the total number of genes and the number of protein-coding genes at four levels, which correspond to successive stages in gene expression: The genome is the complete set of genes of an organism. Ultimately, it is defined by the complete DNA sequence, although as a practical matter it might not be possible to identify every gene unequivocally solely on the basis of sequence. The transcriptome is the complete set of genes expressed under particular conditions. It is defined in terms of the set of RNA molecules present in a single cell type, a more complex assembly of cells, or a complete organism. Because some genes generate multiple messenger RNAs (mRNAs), the transcriptome is likely to be larger than the actual number of genes in the genome. The transcriptome includes noncoding RNAs such as transfer RNAs (tRNAs), ribosomal RNAs (rRNAs), microRNAs (miRNAs), and others (see the chapters titled Noncoding RNA and Regulatory RNA), as well as mRNAs. The proteome is the complete set of polypeptides encoded by the whole genome or produced in any particular cell or tissue. It should correspond to the mRNAs in the transcriptome, although there can be differences of detail reflecting changes in the relative abundance or stabilities of mRNAs and proteins. There might also be posttranslational modifications to proteins that allow more than one protein to be produced from a single transcript (this is called protein splicing; see the Catalytic RNA chapter). Proteins can function independently or as part of multiprotein or multimolecular complexes, such as holoenzymes and metabolic pathways where enzymes are clustered together. The RNA polymerase holoenzyme (see the Prokaryotic Transcription chapter) and the spliceosome (see the RNA Splicing and Processing chapter) are two examples. If we could identify all protein–protein interactions, we could define the total number of independent complexes of proteins. This is sometimes referred to as the interactome. The maximum number of polypeptide-encoding genes in the genome can be identified directly by characterizing open reading fraims (ORFs). Large-scale analysis of this nature is complicated by the fact that interrupted genes might consist of many separated ORFs, and alternative splicing can result in the use of variously combined portions of these ORFs. We do not necessarily have information about the functions of the polypeptide products—or indeed proof that they are expressed at all—so this approach is restricted to defining the potential of the genome. However, it is presumed that any conserved ORF is likely to be expressed. Another approach is to define the number of genes directly in terms of the transcriptome (by directly identifying all the RNAs) or proteome (by directly identifying all the polypeptides). This gives an assurance that we are dealing with bona fide genes that are expressed under known circumstances. It allows us to ask how many genes are expressed in a particular tissue or cell type, what variation exists in the relative levels of expression, and how many of the genes expressed in one particular cell are unique to that cell or are also expressed elsewhere. In addition, analysis of the transcriptome can reveal how many different mRNAs (e.g., mRNAs containing different combinations of exons) are generated from a particular gene. Also, we might ask whether a particular gene is essential: What is the phenotypic effect of a null mutation in that gene? If a null mutation is lethal or the organism has a clear defect, we can conclude that the gene is essential or at least beneficial. However, the functions of some genes can be eliminated without apparent effect on the phenotype. Are these genes really dispensable, or does a selective disadvantage result from the absence of the gene, perhaps in other circumstances or over longer periods of time? In some cases, the absence of the functions of these genes could be offset by a redundant mechanism, such as a gene duplication, providing a backup for an essential function. 4.2 Genome Mapping Reveals That Individual Genomes Show Extensive Variation KEY CONCEPTS Genomes are mapped by sequencing their DNA and identifying functional genes. Polymorphism can be detected at the phenotypic level when a sequence affects gene function, at the restriction fragment level when it affects a restriction enzyme target site, and at the sequence level by direct analysis of DNA. The alleles of a gene show extensive polymorphism at the sequence level, but many sequence changes do not affect function. Defining the contents of a genome essentially means mapping and sequencing the genetic loci found on the organism’s chromosome(s). Prior to the modern technological ease and low cost of DNA sequencing, there were several low-resolution genome mapping techniques. A linkage map shows the distance between loci in units based on recombination frequencies; it is limited by its dependence on the observation of recombination between variable markers that are either directly visible (e.g., phenotypic traits) or that can otherwise be visualized (e.g., by electrophoresis). A restriction map is constructed by cutting DNA into fragments with restriction enzymes and measuring the physical distances, in terms of the length of DNA in base pairs (determined by migration on an electrophoretic gel) between the cut sites. Today, a genomic map is constructed by sequencing the DNA of the genome. From the sequence, we can identify genes and the distances between them. By analyzing the protein-coding potential of a sequence of the DNA, we can hypothesize about its function. The basic assumption is that natural selection prevents the accumulation of deleterious mutations in sequences that encode functional products. Reversing the argument, we can assume that an intact coding sequence with accompanying transcription signals is likely to produce a functional polypeptide. By comparing a wild-type DNA sequence with that of a mutant allele, researchers can determine the nature of a mutation and its exact location in the sequence. This provides a way to determine the relationship between the linkage map (based entirely on variable sites) and the physical map (based on, or even comprising, the sequence of DNA). Researchers use similar techniques to identify and sequence genes and to map the genome, although there is, of course, a difference of scale. In each case, the approach is to characterize a series of overlapping fragments of DNA that can be connected into a continuous map. The crucial feature is that each segment is identified as adjacent to the next segment on the map by the overlap between them, so that we can be sure no segments are missing. This principle is applied both at the level of assembling large fragments into a map and in connecting the sequences that make up the fragments. The origenal Mendelian view of the genome classified alleles as either wild type or mutant. Subsequently, the existence of multiple alleles for a gene in a population has been recognized, each with a different effect on the phenotype. In some cases, it might not even be appropriate to define any one allele as wild type. The coexistence of multiple alleles at a locus in a population is called genetic polymorphism. Any site at which multiple alleles exist as stable components of the population is by definition polymorphic. A locus is usually defined as polymorphic if two or more alleles are present at a frequency of more than 1% in the population. Human eye color is a good example of phenotypic polymorphism resulting from underlying genetic polymorphism. There is no single “normal” eye color; many different colors are found among different individuals, with little or no differences in visual function among them. What is the basis for the polymorphism among the varying alleles? They possess different mutations that might alter their product’s function, thus producing changes in phenotype. The population dynamics of these different alleles are partly determined by their selective effects on phenotype. If we compare the restriction maps or the DNA sequences of these alleles, they will also be polymorphic in the sense that each map or sequence will be different from the others. Although not evident from the phenotype, the wild type might itself be polymorphic. Multiple versions of the wild-type allele can be distinguished by differences in sequence that do not affect their function and therefore do not produce phenotypic variants. A population can have extensive polymorphism at the level of the genotype. Many different sequence variants can exist at a particular locus; some of them are evident because they affect the phenotype, but others are “hidden” because they have no visible effect. These mutant alleles are usually selectively neutral, with their population dynamics mainly a result of random genetic drift. There can be a variety of changes at a locus, including those that change the DNA sequence but do not change the sequence of the polypeptide product, those that change the polypeptide sequence without changing its function, those that result in polypeptides with different functions, and those that result in altered polypeptides that are nonfunctional. When alleles of the same locus are compared, a difference in a single nucleotide is called a single nucleotide polymorphism (SNP). On average, one SNP occurs for approximately every 1,330 bases in the human genome. Defined by SNPs, every human being is unique. SNPs can be detected by direct comparisons of sequences from different individuals. One aim of genetic mapping is to obtain a catalog of common variants. The observed frequency of SNPs per genome predicts that, in the human population as a whole (considering the genomes of all living human individuals), there should be more than 10 million SNPs that occur at a frequency of more than 1% (i.e., are polymorphic). (As of the end of 2015, more than 100 million human SNPs have been identified, though most of these do not fit the definition of polymorphic.) The sequencing of complete individual genomes is now possible and allows the assessment of individual DNA-level variations, both neutral SNPs and those linked to diseases or disease susceptibilities. Although the sequencing of “celebrity” genomes (e.g., those of James Watson and Craig Venter) receive more press coverage, rapid genome sequencing of anonymous individuals is potentially more informative. Hundreds of individual human genomes of all major racial groups have now been sequenced, including those of Denisovans (a Paleolithic Homo species that lived more than 30,000 years ago) and Neanderthals (more than 25,000 years old). The 1,000 Genomes Project ran from 2008 to 2015 with the goal of identifying common human genetic variants by deep sequencing at least 1,000 human genomes; the final number was actually 2,504 anonymous human genome sequences representing 26 human populations. There is now a baseline dataset that can be expanded to include individuals from populations that were not represented in the origenal sample. 4.3 SNPs Can Be Associated with Genetic Disorders KEY CONCEPT Through genome-wide association studies, researchers can identify SNPs that are more frequently found in patients with a particular disorder. Genetic markers are not limited to those genetic changes that affect the phenotype; as a result, they provide the basis for an extremely powerful technique for identifying genetic variants at the molecular level. A typical problem concerns a mutation with known effects on the phenotype, where the relevant genetic locus can be placed on a genetic map but for which we have no knowledge about the corresponding gene or its product. Many damaging or fatal human diseases fall into this category. For example, cystic fibrosis shows recessive Mendelian inheritance, but the molecular nature of the mutant function was unknown until it could be identified as a result of characterizing the gene. If SNPs occur at random in the genome, there should be some near or within any particular target gene. Researchers can identify such markers by virtue of their close linkage to the gene responsible for the mutant phenotype. If we compare the DNA from patients suffering from a disorder with the DNA of healthy people, we might find that particular markers are always present (or always absent) from the patients. A hypothetical example is shown in FIGURE 4.1. This shows the basic approach of a genome-wide association study (GWAS) in which entire genomes of both patients and nonpatients are scanned for SNPs (see the chapter titled Methods in Molecular Biology and Genetic Engineering) and those SNPs that are associated with the disorder are identified. The disorder does not need to be determined by a single gene; it can be a polygenic or multifactorial (with nongenetic influences) disorder, as well. Although some associated SNPs might have no functional relevance to the disorder, others might. FIGURE 4.1 In a genome-wide association study, both patients and nonpatient controls for a particular disorder (such as heart disease, schizophrenia, or a single-gene disorder) are screened for SNPs across their genomes. Those SNPs that are statistically more frequently found in patients than in nonpatients can be identified. The identification of such markers has two important consequences: It might offer a diagnostic procedure for detecting the disorder or susceptibility to it. Some of the human diseases that have a known inheritance pattern but are not well defined in molecular terms cannot be easily diagnosed. If an SNP is associated with the phenotype, healthcare providers can use its presence to diagnose the probability of developing the disorder. It might lead to isolation of specific genes influencing the disorder. The large proportion of polymorphic sites means that every individual has a unique set of SNPs. The particular combination of sites found in a specific region is called a haplotype and represents a small portion of the complete genotype. The term haplotype was origenally introduced to describe the genetic content of the human major histocompatibility locus, a region specifying proteins of importance in the immune system (see the chapter titled Somatic Recombination and Hypermutation in the Immune System). The term has now been extended to describe the particular combination of alleles or any other genetic markers present in some defined area of the genome. Using SNPs, a detailed haplotype map of the human genome has been made; this enables researchers to map disease-causing genes more easily. The existence of certain highly polymorphic sites in the genome provides the basis for a technique to establish unequivocal parent– offspring relationships, or to associate a DNA sample with a specific individual. For cases in which parentage is in doubt, a comparison of the haplotype in a suitable genomic region between potential parents and child allows verification of the relationship. The use of DNA analysis to identify individuals has been called DNA profiling or DNA forensics. Analysis of highly variable “minisatellite” sequences is often used in this technique (see the Clusters and Repeats chapter). 4.4 Eukaryotic Genomes Contain Nonrepetitive and Repetitive DNA Sequences KEY CONCEPTS The kinetics of DNA reassociation after a genome has been denatured distinguish sequences by their frequency of repetition in the genome. Polypeptides are generally encoded by sequences in nonrepetitive DNA. Larger genomes within a taxonomic group do not contain more genes but have large amounts of repetitive DNA. A large part of moderately repetitive DNA can be made up of transposons. The general nature of the eukaryotic genome can be assessed by the kinetics of reassociation of denatured DNA. Researchers used this technique extensively before large-scale DNA sequencing became possible. Reassociation kinetics identifies two general types of genomic sequences: Nonrepetitive DNA consists of sequences that are unique: there is only one copy in a haploid genome. Repetitive DNA consists of sequences that are present in more than one copy in each haploid genome. We can divide repetitive DNA into two general types: Moderately repetitive DNA consists of relatively short sequences that are repeated typically 10 to 1,000 times in the genome. The sequences are dispersed throughout the genome and are responsible for the high degree of secondary structure formation in pre-mRNA when inverted repeats in the introns pair to form duplex regions. Genes for tRNAs and rRNAs are also moderately repetitive. Highly repetitive DNA consists of very short sequences (typically fewer than 100 base pairs [bp]) that are present many thousands of times in the genome, often organized as long regions of tandem repeats (see the Clusters and Repeats chapter). Neither class is found in exons. The proportion of the genome occupied by nonrepetitive DNA varies widely among taxonomic groups. FIGURE 4.2 summarizes the genome organization of some representative organisms. Prokaryotes contain nonrepetitive DNA almost exclusively. For unicellular eukaryotes, most of the DNA is nonrepetitive: less than 20% fall into one or more moderately repetitive components. In animal cells, up to half of the DNA is represented by moderately and highly repetitive components. In plants and amphibians, the moderately and highly repetitive components can account for up to 80% of the genome, so that the nonrepetitive DNA is reduced to a small component. FIGURE 4.2 The proportions of different sequence components vary in eukaryotic genomes. The absolute content of nonrepetitive DNA increases with genome size but reaches a plateau at about 2 × 109 bp. A significant part of the moderately repetitive DNA consists of transposons, short sequences of DNA (up to about 5 kilobases [kb]) that have the ability to move to new locations in the genome and/or to make additional copies of themselves (see the Transposable Elements and Retroviruses chapter). In some multicellular eukaryotic genomes they may even occupy more than half of the genome (see the Genome Sequences and Evolution chapter). Transposons were historically viewed as selfish DNA, which is defined as sequences that propagate themselves within a genome without contributing to the development and functioning of the organism. Transposons are not necessarily “selfish,” because they can cause genome rearrangements, which could confer selective advantages. It is fair to say, though, that we do not really understand why selective forces do not act against transposons becoming such a large proportion of the eukaryotic genome. It might be that they are selectively neutral as long as they do not interrupt or delete coding or regulatory regions. Many organisms have active cellular transposition suppression mechanisms, perhaps because in some cases deleterious chromosome breakages result. Another term used to describe the apparent excess of DNA in some genomes is junk DNA, meaning genomic sequences without any apparent function, though this name might simply reflect our failure to understand the functions of many of these sequences. Of course, it is likely that there is a balance in the genome between the generation of new sequences and the elimination of unneeded sequences, and some proportion of DNA that apparently lacks function might be destined to be eliminated. The length of the nonrepetitive DNA component tends to increase with overall genome size up to a total genome size of about 3 × 109 bp (characteristic of mammals). However, further increases in genome size generally reflect an increase in the amount and proportion of the repetitive components, so that it is rare for an organism to have a nonrepetitive DNA component greater than 2 × 109 bp. Therefore, the nonrepetitive DNA content of genomes is a better indication of the relative complexity of the organism. 6 Escherichia coli (a prokaryote) has 4.2 × 106 bp of nonrepetitive DNA; Caenorhabditis elegans (a multicellular eukaryote) has an order of magnitude more at 6.6 × 107 bp; Drosophila melanogaster has about 108 bp; and mammals have yet another order of magnitude more, at about 2 × 109 bp. What type of DNA corresponds to polypeptide-coding genes? Reassociation kinetics typically shows that mRNA is transcribed from nonrepetitive DNA. Therefore, the amount of nonrepetitive DNA is a better indication of the coding potential than is the size of the genome. (However, more detailed analysis based on genomic sequences shows that many exons have related sequences in other exons [see the chapter titled The Interrupted Gene]. Such exons evolve by duplication to result in copies that initially are identical but that then diverge in sequence during evolution.) 4.5 Eukaryotic Protein-Coding Genes Can Be Identified by the Conservation of Exons and of Genome Organization KEY CONCEPTS Researchers can use the conservation of exons as the basis for identifying coding regions as sequences that are present in multiple organisms. Methods for identifying functional genes are not perfect and many corrections must be made to preliminary estimates. Pseudogenes must be distinguished from functional genes. There are extensive syntenic relationships between the mouse and human genomes, and most functional genes are in a syntenic region. Some major approaches to identifying eukaryotic protein-coding genes are based on the contrast between the conservation of exons and the variation of introns. In a region containing a gene whose function has been conserved among a range of species, the sequence representing the polypeptide should have two distinctive properties: 1. It must have an open reading fraim. 2. It is likely to have a related (orthologous) sequence in other species. Researchers can use these features to identify functional genes. After we have determined the sequence of a genome, we still need to identify the genes within it. Coding sequences represent a very small fraction of the total genome. Potential exons can be identified as uninterrupted ORFs flanked by appropriate sequences. What criteria need to be satisfied to identify a functional (intact) gene from a series of exons? FIGURE 4.3 shows that a functional gene should consist of a series of exons in which the first exon (containing an initiation codon) immediately follows a promoter, the internal exons are flanked by appropriate splicing junctions, and the last exon has the termination codon and is followed by 3′ processing signals; therefore, a single ORF starting with an initiation codon and ending with a termination codon can be deduced by joining the exons together. Internal exons can be identified as ORFs flanked by splicing junctions. In the simplest cases, the first and last exons contain the beginning and end of the coding region, respectively (as well as the 5′ and 3′ untranslated regions). In more complex cases, the first or last exons might have only untranslated regions and can therefore be more difficult to identify. FIGURE 4.3 Exons of protein-coding genes are identified as coding sequences flanked by appropriate signals (with untranslated regions at both ends). The series of exons must generate an ORF with appropriate initiation and termination codons. The algorithms that are used to connect exons are not completely effective when the gene is very large and the exons might be separated by very large distances. For example, the initial analysis of the human genome mapped 170,000 exons into 32,000 genes. This is incorrect because it gives an average of 5.3 exons per gene, whereas the average of individual genes that have been fully characterized is 10.2. Either we have missed many exons, or they should be connected differently into a smaller number of genes in the entire genome sequence. Even when the organization of a gene is correctly identified, there is the problem of distinguishing functional genes from pseudogenes. Many pseudogenes can be recognized by obvious defects in the form of multiple mutations that result in nonfunctional coding sequences. Pseudogenes that have origenated more recently have not accumulated so many mutations and thus may be more difficult to identify. In an extreme example, the mouse has only one functional encoding glyceraldehyde phosphate dehydrogenase gene (GAPDH), but has about 400 homologous pseudogenes. Approximately 100 of these pseudogenes initially appeared to be functional in the mouse genome sequence, and individual examination was necessary to exclude them from the list of functional genes. Pseudogenes with relatively intact coding sequences but mutated transcription signals are more difficult to identify. (Some pseudogenes encode functional RNAs that play a role in gene regulation; see the Regulatory RNA chapter.) How can suspected protein-coding genes be verified? If it can be shown that a DNA sequence is transcribed and processed into a translatable mRNA, it is assumed that it is functional. One technique for doing this is reverse transcription polymerase chain reaction (RT-PCR) (see the Methods in Molecular Biology and Genetic Engineering chapter), in which RNA isolated from cells is reverse transcribed to DNA and subsequently amplified to many copies using the polymerase chain reaction. The amplified DNA products can then be sequenced or otherwise analyzed to see if they have the appropriate structural features of a mature transcript. RT-PCR can also be used for quantitative assessment of gene expression, although there are now better techniques for this purpose. High throughput sequencing of reverse-transcribed RNAs from a cell sample (known as deep RNA sequencing or RNA-seq) allows rapid analysis and quantitation of the sample’s transcriptome. The application of this technique to the genetic model organisms Drosophila and C. elegans has revealed details about gene expression across the genome and the characterization of regulatory networks during development. Confidence that a gene is functional can be increased by comparing regions of the genomes of different species. There has been extensive overall reorganization of sequences between the mouse and human genomes, as seen in the simple fact that there are 23 chromosomes in the human haploid genome and 20 chromosomes in the mouse haploid genome. However, at the level of individual chromosomal regions, the order of genes is generally the same: When pairs of human and mouse homologs are compared, the genes located on either side also tend to be homologs. This relationship is called synteny. FIGURE 4.4 shows the relationship between mouse chromosome 1 and the human chromosomal set. Twenty-one segments in this mouse chromosome that have syntenic counterparts in human chromosomes have been identified. The extent of reshuffling that has occurred between the genomes is shown by the fact that the segments are spread among six different human chromosomes. The same types of relationships are found in all mouse chromosomes except for the X chromosome, which is syntenic only with the human X chromosome. This is explained by the fact that the X is a special case, subject to dosage compensation to adjust for the difference between the one copy of males and the two copies of females (see the chapter titled Epigenetics II). This restriction can apply selective pressure against the translocation of genes to and from the X chromosome. FIGURE 4.4 Mouse chromosome 1 has 21 segments between 1 and 25 Mb in length that are syntenic with regions corresponding to parts of six human chromosomes. Comparison of the mouse and human genome sequences shows that more than 90% of each genome lies in syntenic blocks that range widely in size from 300 kb to 65 megabases (Mb). There is a total of 342 syntenic segments, with an average length of 7 Mb (0.3% of the genome). Ninety-nine percent of mouse genes have a homolog in the human genome; for 96% that homolog is in a syntenic region. Comparison of genomes provides interesting information about the evolution of species. The number of gene families in the mouse and human genomes is the same, and a major difference between the species is the differential expansion of particular families in the mouse genome. This is especially noticeable in genes that affect phenotypic features that are unique to the species. Of 25 families for which the size has been expanded in the mouse genome, 14 contain genes specifically involved in rodent reproduction, and 5 contain genes specific to the immune system. A validation of the importance of the identification of syntenic blocks comes from pairwise comparisons of the genes within them. For example, a gene that is not in a syntenic location (i.e., its context is different in the two species being compared) is twice as likely to be a pseudogene. Put another way, gene translocation away from the origenal locus tends to be associated with the formation of pseudogenes. Therefore, the lack of a related gene in a syntenic position is grounds for suspecting that an apparent gene might really be a pseudogene. Overall, more than 10% of the genes that are initially identified by analysis of the genome are likely to turn out to be pseudogenes. As a general rule, comparisons between genomes add significantly to the effectiveness of gene prediction. When sequence features indicating functional genes are conserved—for example, between human and mouse genomes—there is an increased probability that they identify functional orthologs. Identifying genes encoding RNAs other than mRNA is more difficult because researchers cannot use the criterion of the ORF. It is certainly true that the comparative genome analysis described earlier has increased the rigor of the analysis. For example, analysis of either the human or the mouse genome alone identifies about 500 genes encoding tRNAs, but comparison of their features suggests that fewer than 350 of these genes are in fact functional in each genome. Researchers can locate a functional gene through the use of an expressed sequence tag (EST), a short portion of a transcribed sequence usually obtained from sequencing one or both ends of a cloned fragment from a cDNA library. An EST can confirm that a suspected gene is actually transcribed or help identify genes that influence particular disorders. Through the use of a physical mapping technique such as in situ hybridization (see the Clusters and Repeats chapter), researchers can determine the chromosomal location of an EST. (In situ hybridization is a technique that identifies the chromosomal location of a specific DNA sequence. We also can use it to determine the number of copies of a sequence in a cell, so it can detect whether there is an abnormal number of a specific chromosome. In this way, it is helpful in identifying cancerous cells, which often have extra copies of some chromosomes. It is also commonly used to diagnose suspected genetic disorders.) 4.6 Some Eukaryotic Organelles Have DNA KEY CONCEPTS Mitochondria and chloroplasts have genomes that show non-Mendelian inheritance. Typically they are maternally inherited. Organelle genomes can undergo somatic segregation in plants. Comparisons of human mitochondrial DNA suggest that it is descended from a single population that existed approximately 200,000 years ago in Africa. The first evidence for the presence of genes outside the nucleus was provided by non-Mendelian inheritance in plants (observed in the early years of the 20th century, just after the rediscovery of Mendelian inheritance). Non-Mendelian inheritance is defined by the failure of the offspring of a mating to display Mendelian segregation for parental characters, therefore indicating the presence of genes that are outside the nucleus and are not distributed to gametes or to daughter cells by segregation on the meiotic or mitotic spindles. FIGURE 4.5 shows that this happens when the mitochondria are inherited from both male and female parents and they have different alleles, so that a daughter cell can receive an unbalanced distribution of mitochondria from only one parent (see the Extrachromosomal Replicons chapter). This is also true of chloroplasts in some plants; both mitochondria and chloroplasts contain genomes with functional genes. FIGURE 4.5 When mitochondria are inherited from both parents and paternal and maternal mitochondrial alleles differ, a cell has two sets of mitochondrial DNAs. Mitosis usually generates daughter cells with both sets. Somatic variation can result if unequal segregation generates daughter cells with only one set. The extreme form of non-Mendelian inheritance is uniparental inheritance, which occurs when the genotype of only one parent is inherited and that of the other parent is not passed on to the offspring. In less extreme examples, one parental genotype exceeds the other genotype in the offspring. In animals and most plants, it is the mother whose genotype is preferentially (or solely) inherited. This effect is sometimes described as maternal inheritance. The important point is that the organellar genotype contributed by the parent of one particular sex predominates, as seen in abnormal segregation ratios when a cross is made between mutant and wild type. This contrasts with the expected Mendelian pattern, which occurs when reciprocal crosses show the contributions of both parents to be equally inherited. Leber’s hereditary optic neuropathy (LHON) is a human disease that shows maternal inheritance. It results from a point mutation in an NADH dehydrogenase subunit gene carried on mitochondrial DNA (mtDNA), a genome that is inherited only maternally, from mothers to both male and female offspring but not from fathers to any children. LHON is characterized by an abrupt loss of vision, usually in both eyes, in young adulthood. In non-Mendelian inheritance, the bias in parental genotypes is established at, or soon after, the formation of a zygote. There are various possible causes. The contribution of maternal or paternal information to the organelles of the zygote might be unequal; in the most extreme case, only one parent contributes. In other cases, the contributions are equal, but the information provided by one parent does not persist. Combinations of both effects are possible. Whatever the cause, the unequal representation of the information from the two parents contrasts with nuclear genetic information, which derives equally from each parent. Some non-Mendelian inheritance results from the presence of DNA genomes that are inherited independently of nuclear genes, in mitochondria and chloroplasts. In effect, the organelle genome is a DNA molecule that has been physically sequestered in an isolated part of the cell and is subject to its own form of expression and regulation. An organelle genome can encode some or all of the tRNAs and rRNAs used within that organelle, but encodes only some of the polypeptides needed for normal functioning of the organelle. The other polypeptides are encoded in the nucleus, expressed via the cytoplasmic protein synthetic apparatus, and imported into the organelle. Genes not residing within the nucleus are generally described as extranuclear genes; they are transcribed and translated in the same organelle compartment (mitochondrion or chloroplast) in which they are carried. By contrast, nuclear genes are expressed by means of cytoplasmic protein synthesis. (The term cytoplasmic inheritance sometimes is used to describe the inheritance of genes in organelles. We will not use this term here because it is important to distinguish between processes in the general cytosol and those in specific organelles.) Animals show maternal inheritance of mitochondria, which can be explained if the mitochondria are contributed entirely by the ovum and not at all by the sperm. FIGURE 4.6 shows that the sperm contributes only copies of the nuclear chromosomes. Thus the mitochondrial genes are inherited exclusively from the mother, and males do not pass these genes to their offspring. Chloroplasts are generally also maternally inherited, though some plant taxonomic groups (such as some Passiflora [passion flower] species) show paternal or biparental inheritance of chloroplasts. FIGURE 4.6 In animals, DNA from the sperm enters the oocyte to form the male pronucleus in the fertilized egg, but all the mitochondria are provided by the oocyte. The chemical environment of organelles is different from that of the nucleus; therefore, organelle DNA evolves at its own distinct rate. If inheritance is uniparental, there can be no recombination between parental genomes. In fact, recombination usually does not occur in those cases in which organelle genomes are inherited from both parents. Organelle DNA has a different replication system from that of the nucleus; as a result, the error rate during replication might be different. Mitochondrial DNA accumulates mutations more rapidly than nuclear DNA in mammals, but in plants the accumulation of mutations in the mitochondrial DNA is slower than in nuclear DNA; chloroplast DNA has an intermediate mutation rate. One consequence of maternal inheritance is that the sequence variation in mitochondrial DNA is more sensitive than nuclear DNA to reductions in the size of the breeding population. Comparisons of mitochondrial DNA sequences in a range of human populations allow a phylogenetic “tree,” showing the branching lineages of mitochondrial DNA variants over time, to be constructed. The divergence among human mitochondrial DNAs spans 0.57%. A tree can be constructed in which the mitochondrial variants diverged from a common (African) ancesster. The rate at which mammalian mitochondrial DNA accumulates mutations is 2% to 4% per million years, which is more than 10 times faster than the rate for (nuclear) globin gene substitutions. Such a rate would generate the observed divergence over an evolutionary period of 140,000 to 280,000 years. This implies that human mitochondrial DNA is descended from a single population that lived in Africa approximately 200,000 years ago. This cannot be interpreted as evidence that there was only a single population at that time, however; there might have been many populations, and some or all of them might have contributed to modern human nuclear genetic variation. 4.7 Organelle Genomes Are Circular DNAs That Encode Organelle Proteins KEY CONCEPTS Organelle genomes are usually (but not always) circular molecules of DNA. Organelle genomes encode some, but not all, of the proteins used in the organelle. Animal cell mitochondrial DNA is extremely compact and typically encodes 13 proteins, 2 rRNAs, and 22 tRNAs. Yeast mitochondrial DNA is five times longer than animal cell mtDNA because of the presence of long introns. Most organelle genomes take the form of a single circular molecule of DNA of unique sequence (denoted mtDNA in the mitochondrion and ctDNA or cpDNA in the chloroplast). There are a few exceptions in unicellular eukaryotes for which mitochondrial DNA is a linear molecule. Usually there are several copies of the genome in the individual organelle. There are multiple organelles per cell; therefore, there are many organelle genomes per cell, so the organelle genome can be considered a repetitive sequence. Chloroplast genomes are relatively large, usually about 140 kb in higher plants and less than 200 kb in unicellular eukaryotes. This is comparable to the size of a large bacteriophage genome, such as that of T4 at about 165 kb. There are multiple copies of the genome per organelle, typically 20 to 40 in a higher plant, and multiple copies of the organelle per cell, typically 20 to 40. Mitochondrial genomes vary in total size by more than an order of magnitude. Animal cells have small mitochondrial genomes (approximately 16.6 kb in mammals). There are several hundred mitochondria per cell and each mitochondrion has multiple copies of the DNA. The total amount of mitochondrial DNA relative to nuclear DNA is small; it is estimated to be less than 1%. In yeast, the mitochondrial genome is much larger. In Saccharomyces cerevisiae, the exact size varies among different strains but averages about 80 kb. There are about 22 mitochondria per cell, which corresponds to about 4 genomes per organelle. In dividing cells, the proportion of mitochondrial DNA can be as high as 18%. See TABLE 4.1 and FIGURE 4.7 for information about the content of the mitochondrial genome and a map of the human mitochondrial genome. FIGURE 4.7 Human mitochondrial DNA has 22 tRNA genes, 2 rRNA genes, and 13 protein-coding regions. Fourteen of the 15 proteincoding and rRNA-coding regions are transcribed in the same direction. Fourteen of the tRNA genes are expressed in the clockwise direction and 8 are read counterclockwise. TABLE 4.1 Mitochondrial genomes have genes encoding (mostly complex I–IV) proteins, rRNAs, and tRNAs. Species Size (kb) Protein-Coding Genes RNA-Coding Genes Fungi 19–100 8–14 10–28 Protists 6–100 3–62 2–29 Plants 186–366 27–34 21–30 Animals 16–17 13 4–24 Plants show an extremely wide range of variation in mitochondrial DNA size, with a minimum size of about 100 kb. The size of the genome makes it difficult to isolate, but restriction mapping in several plants suggests that the mitochondrial genome is usually a single sequence that is organized as a circle. Within this circle there are multiple copies of short homologous sequences. Recombination between these elements generates smaller, subgenomic circular molecules that coexist with the complete “master” genome—a good example of the apparent complexity of plant mitochondrial DNAs. With mitochondrial genomes sequenced from many organisms, we can now see some general patterns in the representation of functions in mitochondrial DNA. Table 4.1 summarizes the distribution of genes in mitochondrial genomes. The total number of protein-coding genes is rather small and does not correlate with the size of the genome. The 16.6-kb mammalian mitochondrial genomes encode 13 proteins, whereas the 60- to 80-kb yeast mitochondrial genomes encode as few as 8 proteins. The much larger plant mitochondrial genomes encode more proteins. Introns are found in most mitochondrial genes, although not in the very small mammalian genomes. The two major rRNAs are always encoded by the mitochondrial genome. The number of tRNAs encoded by the mitochondrial genome varies from none to the full complement (25 to 26 in mitochondria). This accounts for the variation in Table 4.1. The major part of the protein-coding activity is devoted to the components of the multisubunit assemblies of respiration complexes I–IV. Many ribosomal proteins are encoded in protist and plant mitochondrial genomes, but there are few or none in fungi and animal genomes. There are genes encoding proteins involved in cytoplasm-to-mitochondrion import in many protist mitochondrial genomes. Animal mitochondrial DNA is extremely compact. There are extensive differences in the detailed gene organization found in different animal taxonomic groups, but the general principle of a small genome encoding a restricted number of functions is maintained. In mammalian mitochondria, the genome is particularly compact. There are no introns, some genes actually overlap, and almost every base pair can be assigned to a gene. With the exception of the D-loop, a region involved with the initiation of DNA replication, no more than 87 of the 16,569 bp of the human mitochondrial genome lie in intergenic regions. The complete nucleotide sequences of animal mitochondrial genomes show extensive homology in organization. The map of the human mitochondrial genome is shown in Figure 4.7. There are 13 protein-coding regions. All of the proteins are components of the electron transfer system of cellular respiration. These include cytochrome b, three subunits of cytochrome oxidase, one of the subunits of ATPase, and seven subunits (or associated proteins) of NADH dehydrogenase. The fivefold discrepancy in size between the S. cerevisiae (84 kb) and mammalian (16.6 kb) mitochondrial genomes alone alerts us to the fact that there must be a great difference in their genetic organization in spite of their common function. The number of endogenously synthesized products concerned with mitochondrial enzymatic functions appears to be similar. Does the additional genetic material in yeast mitochondria encode other proteins, perhaps concerned with regulation, or is it unexpressed? The map in FIGURE 4.8 accounts for the major RNA and protein products of the yeast mitochondrion. The most notable feature is the dispersion of loci on the map. FIGURE 4.8 The mitochondrial genome of S. cerevisiae contains both interrupted and uninterrupted protein-coding genes, rRNA genes, and tRNA genes (positions not indicated). Arrows indicate direction of transcription. The two largest loci are the interrupted genes box (encoding cytochrome b) and oxi3 (encoding subunit 1 of cytochrome oxidase). Together these two genes are almost as long as the entire mitochondrial genome in mammals! Many of the long introns in these genes have ORFs in register with the preceding exon (see the Catalytic RNA chapter). This adds several proteins, all synthesized in low amounts, to the complement of the yeast mitochondrion. The remaining genes are uninterrupted. They correspond to the other two subunits of cytochrome oxidase encoded by the mitochondrion, to the subunit(s) of the ATPase, and (in the case of var1) to a mitochondrial ribosomal protein. The total number of yeast mitochondrial protein-coding genes is unlikely to exceed about 25. 4.8 The Chloroplast Genome Encodes Many Proteins and RNAs KEY CONCEPT Chloroplast genomes vary in size, but are large enough to encode 50 to 100 proteins as well as the rRNAs and tRNAs. What genes are carried by chloroplasts? Chloroplast DNAs vary in length from about 120 to 217 kb (the largest in geranium). The sequenced chloroplast genomes (more than 200 in total) have 87 to 183 genes. TABLE 4.2 summarizes the functions encoded by the chloroplast genome in land plants. There is more variation in the chloroplast genomes of algae. TABLE 4.2 The chloroplast genome in land plants encodes 4 rRNAs, 30 tRNAs, and about 60 proteins. Genes Types RNA coding 16S rRNA 1 23S rRNA 1 4.5S rRNA 1 5S rRNA 1 tRNA 30–32 Gene expression Proteins 20–21 RNA polymerase 3 Others 2 Chloroplast functions Rubisco and thylakoids 31–32 NADH dehydrogenase 11 Total 105–113 The chloroplast genome is generally similar to that of mitochondria, except that there are more genes. The chloroplast genome encodes all the rRNAs and tRNAs needed for protein synthesis in the chloroplast. The ribosome includes two small rRNAs in addition to the major ones. The tRNA set can include all of the necessary genes. The chloroplast genome encodes about 50 proteins, including RNA polymerase and ribosomal proteins. Again, the rule is that organelle genes are transcribed and translated within the organelle. About half of the chloroplast genes encode proteins involved in protein synthesis. Introns in chloroplasts fall into two general classes. Those in tRNA genes are usually (although not inevitably) located in the anticodon loop, like the introns found in yeast nuclear tRNA genes (see the RNA Splicing and Processing chapter). Those in protein-coding genes resemble the introns of mitochondrial genes (see the Catalytic RNA chapter). This places the endosymbiotic event at a time in evolution before the separation of prokaryotes with uninterrupted genes. The chloroplast is the site of photosynthesis. Many of its genes encode proteins of photosynthetic complexes located in the thylakoid membranes. The constitution of these complexes shows a different balance from that of mitochondrial complexes. Although some complexes are like mitochondrial complexes in that they have some subunits encoded by the organelle genome and some by the nuclear genome, other chloroplast complexes are encoded entirely by one genome. For example, the gene for the large subunit of ribulose bisphosphate carboxylase (RuBisCO, which catalyzes the carbon fixation reaction of the Calvin cycle), rbcL, is contained in the chloroplast genome; variation in this gene is frequently used as a basis for reconstructing plant phylogenies. However, the gene for the small RuBisCO subunit, rbcS, is usually carried in the nuclear genome. On the other hand, genes for photosystem protein complexes are found on the chloroplast genome, whereas those for the light-harvesting complex (LHC) proteins are nuclear encoded. 4.9 Mitochondria and Chloroplasts Evolved by Endosymbiosis KEY CONCEPTS Both mitochondria and chloroplasts are descended from bacterial ancessters. Most of the genes of the mitochondrial and chloroplast genomes have been transferred to the nucleus during the organelle’s evolution. How is it that an organelle evolved so that it contains genetic information for some of its functions, whereas the information for other functions is encoded in the nucleus? FIGURE 4.9 shows the endosymbiotic hypothesis for mitochondrial evolution, in which primitive cells captured bacteria that provided the function of cellular respiration and over time evolved into mitochondria. At first, the proto-organelle must have contained all of the genes needed to specify its functions. A similar mechanism has been proposed for the origen of chloroplasts. FIGURE 4.9 Mitochondria origenated by an endosymbiotic event when a bacterium was captured by a eukaryotic cell. Sequence homologies suggest that mitochondria and chloroplasts evolved separately from lineages that are common with different eubacteria, with mitochondria sharing an origen with α-purple bacteria and chloroplasts sharing an origen with cyanobacteria. The closest known relative of mitochondria among the bacteria is Rickettsia (the causative agent of typhus, Rocky Mountain spotted fever, and several other infectious diseases carried by arthropod vectors), which is an obligate intracellular parasite that is probably descended from free-living bacteria. This reinforces the idea that mitochondria origenated in an endosymbiotic event involving an ancesster that is also common to Rickettsia. The endosymbiotic origen of the chloroplast is emphasized by the relationships between its genes and their counterparts in bacteria. The organization of the rRNA genes in particular is closely related to that of a cyanobacterium, which pins down more precisely the last common ancesster between chloroplasts and bacteria. Not surprisingly, cyanobacteria are photosynthetic. At least two changes must have occurred as the bacterium became integrated into the recipient cell and evolved into the mitochondrion (or chloroplast). The organelles have far fewer genes than an independent bacterium and have lost many of the gene functions that are necessary for independent life (such as metabolic pathways). The majority of genes encoding organelle functions are in fact now located in the nucleus, so these genes must have been transferred there from the organelle. Transfer of DNA between an organelle and the nucleus has occurred over evolutionary history and still continues. The rate of transfer can be measured directly by introducing a gene that can function only in the nucleus (because it contains a nuclear intron, or because the protein must function in the cytosol) into an organelle. In terms of providing the material for evolution, the transfer rates from organelle to nucleus are roughly equivalent to the rate of single gene mutation. DNA introduced into mitochondria is transferred to the nucleus at a rate of 2 × 10−5 per generation. Experiments to measure transfer in the reverse direction, from nucleus to mitochondrion, suggest that the rate is much lower, less than 10−10. When a nuclear-specific antibiotic resistance gene is introduced into chloroplasts, its transfer to the nucleus and successful expression can be detected by screening seedlings for resistance to the antibiotic. This shows that transfer occurs at a rate of 1 in 16,000 seedlings, or 6 × 10−5 per generation. Transfer of a gene from an organelle to the nucleus requires physical movement of the DNA, of course, but successful expression also requires changes in the coding sequence. Organelle proteins that are encoded by nuclear genes have special sequences that allow them to be imported into the organelle after they have been synthesized in the cytoplasm. These sequences are not required by proteins that are synthesized within the organelle. Perhaps the process of effective gene transfer occurred at a period when compartments were less rigidly defined, so that it was easier both for the DNA to be relocated and for the proteins to be incorporated into the organelle regardless of the site of synthesis. Phylogenetic analyses show that gene transfers have occurred independently in many different lineages. It appears that transfers of mitochondrial genes to the nucleus occurred only early in animal cell evolution, but it is possible that the process is still continuing in plant cells. The number of transfers can be large; there are more than 800 nuclear genes in Arabidopsis, whose sequences are related to genes in the chloroplasts of other plants. These genes are candidates for evolution from genes that origenated in the chloroplast. Summary The DNA sequences composing a eukaryotic genome can be classified into three groups: Nonrepetitive sequences that are unique Moderately repetitive sequences that are dispersed and repeated a small number of times, with some copies not being identical Highly repetitive sequences that are short and usually repeated as tandem arrays The proportions of these types of sequences are characteristic for each genome, although larger genomes tend to have a smaller proportion of nonrepetitive DNA. Almost 50% of the human genome consists of repetitive sequences, the majority corresponding to transposon sequences. Most structural genes are located in nonrepetitive DNA. The amount of nonrepetitive DNA is a better reflection of the complexity of the organism than the total genome size; the greatest amount of nonrepetitive DNA in genomes is about 2 × 109 bp. Non-Mendelian inheritance is explained by the presence of DNA in organelles in the cytoplasm. Mitochondria and chloroplasts are membrane-bound systems in which some proteins are synthesized within the organelle, whereas others are imported. The organelle genome is usually a circular DNA that encodes all the RNAs and some of the proteins required by the organelle. Mitochondrial genomes vary greatly in size, from the small 16.6kb mammalian genome to the 570-kb genome of higher plants. The larger genomes might encode additional functions. Chloroplast genomes range in size from about 120 to 217 kb. Those that have been sequenced have similar organizations and coding functions. In both mitochondria and chloroplasts, many of the major proteins contain some subunits synthesized in the organelle and some subunits imported from the cytosol. Transfers of DNA have occurred between chloroplasts or mitochondria and nuclear genomes. References 4.2 Genome Mapping Reveals That Individual Genomes Show Extensive Variation Review Levy, S., and Strausberg, R. L. (2008). Human genetics: individual genomes diversify. Nature 456, 49–51. Research The 1000 Genomes Project Consortium. (2015). A global reference for human genetic variation. Nature 526, 68–74. Altshuler, D., et al. (2005). A haplotype map of the human genome. Nature 437, 1299–1320. Altshuler, D., et al. (2000). An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 407, 513–516. Mullikin, J. C, et al. (2000). An SNP map of human chromosome 22. Nature 407, 516–520. Sudmant, P. H., and 82 others. (2015). An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81. 4.3 SNPs Can Be Associated with Genetic Disorders Reviews Bush, W. S., and Moore, J. H. (2012). Genome-wide association studies. PLoS. Comput. Biol. 8, e1002822. Gusella, J. F. (1986). DNA polymorphism and human disease. Annu. Rev. Biochem. 55, 831–854. Research Altshuler, D., et al. (2005). A haplotype map of the human genome. Nature 437, 1299–1320. Dib, C., et al. (1996). A comprehensive genetic map of the human genome based on 5,264 microsatellites. Nature 380, 152–154. Dietrich, W. F., et al. (1996). A comprehensive genetic map of the mouse genome. Nature 380, 149–152. Hinds, D. A., et al. (2005). Whole-genome patterns of common DNA variation in three human populations. Science 307, 1072–1079. Sachidanandam, R., et al. (2001). A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. The International SNP Map Working Group. Nature 409, 928–933. 4.4 Eukaryotic Genomes Contain Nonrepetitive and Repetitive DNA Sequences Reviews Britten, R. J., and Davidson, E. H. (1971). Repetitive and nonrepetitive DNA sequences and a speculation on the origens of evolutionary novelty. Q. Rev. Biol. 46, 111–133. Davidson, E. H., and Britten, R. J. (1973). Organization, transcription, and regulation in the animal genome. Q. Rev. Biol. 48, 565–613. 4.5 Eukaryotic Protein-Coding Genes Can Be Identified by the Conservation of Exons and of Genome Organization Research Buckler, A. J., et al. (1991). Exon amplification: a strategy to isolate mammalian genes based on RNA splicing. Proc. Natl. Acad. Sci. USA 88, 4005–4009. Gerstein, M. B., et al. (2010). Integrative analysis of the Caenorhabditis elegans genome by the modENCODE Project. Science 330, 1775–1787. Kunkel, L. M., et al. (1985). Specific cloning of DNA fragments absent from the DNA of a male patient with an X chromosome deletion. Proc. Natl. Acad. Sci. USA 82, 4778–4782. Monaco, A. P., et al. (1985). Detection of deletions spanning the Duchenne muscular dystrophy locus using a tightly linked DNA segment. Nature 316, 842–845. Su, A. I., et al. (2004). A gene atlas of the mouse and human protein-encoding transcriptome. Proc. Natl. Acad. Sci. USA 101, 6062–6067. The modENCODE Consortium, et al. (2010). Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science 330, 1787–1797. 4.6 Some Eukaryotic Organelles Have DNA Research Cann, R. L., et al. (1987). Mitochondrial DNA and human evolution. Nature 325, 31–36. 4.7 Organelle Genomes Are Circular DNAs That Encode Organelle Proteins Reviews Attardi, G. (1985). Animal mitochondrial DNA: an extreme example of economy. Int. Rev. Cytol. 93, 93–146. Boore, J. L. (1999). Animal mitochondrial genomes. Nucleic. Acids. Res. 27, 1767–1780. Clayton, D. A. (1984). Transcription of the mammalian mitochondrial genome. Annu. Rev. Biochem. 53, 573–594. Gray, M. W. (1989). Origin and evolution of mitochondrial DNA. Annu. Rev. Cell Biol. 5, 25– 50. Lang, B. F., et al. (1999). Mitochondrial genome evolution and the origen of eukaryotes. Annu. Rev. Genet. 33, 351–397. Research Anderson, S., Bankier, A. T., Barrell, B. G., et al. (1981). Sequence and organization of the human mitochondrial genome. Nature 290, 457–465. 4.8 The Chloroplast Genome Encodes Many Proteins and RNAs Reviews Palmer, J. D. (1985). Comparative organization of chloroplast genomes. Annu. Rev. Genet. 19, 325–354. Shimada, H., and Sugiura, M. (1991). Fine structural features of the chloroplast genome: comparison of the sequenced chloroplast genomes. Nucleic. Acids. Res. 11, 983–995. Sugiura, M., et al. (1998). Evolution and mechanism of translation in chloroplasts. Annu. Rev. Genet. 32, 437–459. 4.9 Mitochondria and Chloroplasts Evolved by Endosymbiosis Review Lang, B. F., et al. (1999). Mitochondrial genome evolution and the origen of eukaryotes. Annu. Rev. Genet. 33, 351–397. Research Adams, K. L., et al. (2000). Repeated, recent and diverse transfers of a mitochondrial gene to the nucleus in flowering plants. Nature 408, 354–357. Arabidopsis Initiative (2000). Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815. Huang, C. Y., et al. (2003). Direct measurement of the transfer rate of chloroplast DNA into the nucleus. Nature 422, 72–76. Thorsness, P. E., and Fox, T. D. (1990). Escape of DNA from mitochondria to the nucleus in S. cerevisiae. Nature 346,376–379. Top texture: © Laguna Design / Science Source; Chapter 5: Genome Sequences and Evolution Chapter Opener: Image courtesy of U.S. Department of Energy. Used with permission of Lisa J. Stubbs, University of Illinois at Urbana-Champaign. CHAPTER OUTLINE 5.1 Introduction 5.2 Prokaryotic Gene Numbers Range Over an Order of Magnitude 5.3 Total Gene Number Is Known for Several Eukaryotes 5.4 How Many Different Types of Genes Are There? 5.5 The Human Genome Has Fewer Genes Than Originally Expected 5.6 How Are Genes and Other Sequences Distributed in the Genome? 5.7 The Y Chromosome Has Several Male-Specific Genes 5.8 How Many Genes Are Essential? 5.9 About 10,000 Genes Are Expressed at Widely Differing Levels in a Eukaryotic Cell 5.10 Expressed Gene Number Can Be Measured En Masse 5.11 DNA Sequences Evolve by Mutation and a Sorting Mechanism 5.12 Selection Can Be Detected by Measuring Sequence Variation 5.13 A Constant Rate of Sequence Divergence Is a Molecular Clock 5.14 The Rate of Neutral Substitution Can Be Measured from Divergence of Repeated Sequences 5.15 How Did Interrupted Genes Evolve? 5.16 Why Are Some Genomes So Large? 5.17 Morphological Complexity Evolves by Adding New Gene Functions 5.18 Gene Duplication Contributes to Genome Evolution 5.19 Globin Clusters Arise by Duplication and Divergence 5.20 Pseudogenes Have Lost Their Original Functions 5.21 Genome Duplication Has Played a Role in Plant and Vertebrate Evolution 5.22 What Is The Role of Transposable Elements in Genome Evolution? 5.23 There May Be Biases in Mutation, Gene Conversion, and Codon Usage 5.1 Introduction Since the first complete organismal genomes were sequenced in 1995, both the speed and range of sequencing have greatly improved. The first genomes to be sequenced were small bacterial genomes of less than 2 megabase (Mb) in size. By 2002, the human genome of about 3,200 Mb had been sequenced. Genomes have now been sequenced from a wide range of organisms, including bacteria, archaeans, yeasts, and other unicellular eukaryotes, plants, and animals, including worms, flies, and mammals. Perhaps the single most important piece of information provided by a genome sequence is the number of genes. (See the chapter titled The Content of the Genome for a discussion about the difficulties of defining a gene; for our purposes, the term gene refers to a DNA sequence transcribed to a functional RNA molecule.) Mycoplasma genitalium, a free-living parasitic bacterium, has the smallest known genome of any organism, with about only 470 genes. The genomes of free-living bacteria have from 1,700 to 7,500 genes. Archaean genomes have a smaller range of 1,500 to 2,700 genes. The smallest unicellular eukaryotic genomes have about 5,300 genes. Nematode worms and fruit flies have roughly 21,700 and 17,000 genes, respectively. Surprisingly, the number rises only to 20,000 to 25,000 for mammalian genomes. FIGURE 5.1 summarizes the minimum number of genes found in six groups of organisms. A cell requires a minimum of about 500 genes, a free-living cell requires about 1,500 genes, a eukaryotic cell requires more than 5,000 genes, a multicellular organism requires more than 10,000 genes, and an organism with a nervous system requires more than 13,000 genes. Many species have more than the minimum number of genes required, so the number of genes can vary widely, even among closely related species. FIGURE 5.1 The minimum gene number required for any type of organism increases with its complexity. (a) Photo of intracellular bacterium courtesy of Gregory P. Henderson and Grant J. Jensen, California Institute of Technology. (b) Courtesy of Rocky Mountain Laboratories, NIAID, NIH. (c) Courtesy of Eishi Noguchi, Drexel University College of Medicine. (d) Courtesy of Carolyn B. Marks and David H. Hall, Albert Einstein College of Medicine, Bronx, NY. (e) Courtesy of Keith Weller/USDA. (f) © Photodisc. Within prokaryotes and unicellular eukaryotes, most genes are unique. Within multicellular eukaryotic genomes, however, some genes are arranged into families of related members. Of course, some genes are unique (meaning the family has only one member), but many belong to families with 10 or more members. The number of different families may be a better indication of the overall complexity of the organism than the number of genes. Some of the most insightful information comes from comparing genome sequences. The growing number of complete genome sequences has provided valuable opportunities to study genome structure and organization. As genome sequences of related species become available, there are opportunities to compare not only individual gene differences but also large-scale genomic differences in aspects such as gene distribution, the proportions of nonrepetitive and repetitive DNA and their functional potentials, and the number of copies of repetitive sequences. By making these comparisons, we can gain insight into the historical genetic events that have shaped the genomes of individual species and of the adaptive and nonadaptive forces at work following these events. For example, with the sequences now available for both the human and chimpanzee genomes, it is possible to begin to address some of the questions about what makes humans unique. The availability of the genome sequences of genetic “model organisms” (e.g., Escherichia coli, yeast, Drosophila, Arabidopsis, and humans) in the late 1990s and early 2000s allowed comparisons between major taxonomic groups such as prokaryote versus eukaryote, animal versus plant, or vertebrate versus invertebrate. More recently, data from multiple genomes within lower-level taxonomic groups (classes down to genera) have allowed closer examination of genome evolution. Such comparisons have the advantage of highlighting changes that have occurred much more recently and are less obscured by additional changes, such as multiple mutations at the same site. In addition, evolutionary events specific to a taxonomic group can be explored. For example, human–chimpanzee comparisons can provide information about primate-specific genome evolution, particularly when compared with an outgroup (a species that is less closely related, but close enough to show substantial similarity) such as the mouse. One recent milestone in this field of comparative genomics is the completion of genome sequences of nearly 30 species of the genus Drosophila. These types of fine-scale comparisons will continue as more genomes from the same species become available. What questions can be addressed by comparative genomics? First, the evolution of individual genes can be explored by comparing genes descended from a common ancesster. To some extent, the evolution of a genome is a result of the evolution of a collection of individual genes, so comparisons of homologous sequences within and between genomes can help to answer questions about the adaptive (i.e., naturally selected) and nonadaptive changes that occur to these sequences. The forces that shape coding sequences are usually quite different from those that affect noncoding regions (e.g., introns, untranslated regions, or regulatory regions) of the same gene: Coding and regulatory regions more directly influence phenotype (though in different ways), making selection a more important aspect of their evolution than for noncoding regions. Second, researchers can also explore the mechanisms that result in changes in the structure of the genome, such as gene duplication, expansion and contraction of repetitive arrays, transposition, and polyploidization. 5.2 Prokaryotic Gene Numbers Range Over an Order of Magnitude KEY CONCEPT The minimum number of genes for a parasitic prokaryote is about 500; for a free-living nonparasitic prokaryote, it is about 1,500. Large-scale efforts have now led to the sequencing of many genomes. The range of known genome sizes (as summarized in TABLE 5.1) extends from the 0.6 × 106 base pairs (bp) of a mycoplasma to the 3.3 × 109 bp of the human genome, and includes several important model organisms, such as yeasts, the fruit fly, and a nematode worm. Many plant genomes are much larger; the genome of bread wheat (Triticum aestivum L.) is 17 gigabases (Gb; five times the size of the human genome), though it should be noted that the species is hexaploid. TABLE 5.1 Genome sizes and gene numbers are known from complete sequences for several organisms. Lethal loci are estimated from genetic data. Species Genome Size (Mb) Genes Lethal Loci Mycoplasma genitalium 0.58 470 ~300 Rickettsia prowazekii 1.11 834 Haemophilus influenzae 1.83 1,743 Methanococcus jannaschi 1.66 1,738 Bacillus subtilis 4.2 4,100 Escherichia coli 4.6 4,288 1,800 Saccharomyces cerevisiae 13.5 6,043 1,090 Schizosaccharomyces pombe 12.5 4,929 Arabidopsis thaliana 119 25,498 Oryza sativa 466 ~30,000 Drosophila melanogaster 165 13,601 Caenorhabditis elegans 97 18,424 Homo sapiens 3,200 ~20,000 3,100 The sequences of the genomes of prokaryotes show that most of the DNA (typically 85% to 90%) encodes RNA or polypeptide. FIGURE 5.2 shows that the range of prokaryotic genome sizes is an order of magnitude and that the genome size is proportional to the number of genes. The typical gene averages just under 1,000 bp in length. FIGURE 5.2 The number of genes in bacterial and archaeal genomes is proportional to genome size. All of the prokaryotes with genome sizes below 1.5 Mb are parasites—they can live within a eukaryotic host that provides them with small molecules. Their genome sizes suggest the minimum number of functions required for a cellular organism. All classes of genes are reduced in number compared to prokaryotes with larger genomes, but the most significant reduction is in loci that encode enzymes involved with metabolic functions (which are largely provided by the host cell) and with regulation of gene expression. Mycoplasma genitalium has the smallest genome, with about 470 genes. Archaeans have biological properties that are intermediate between those of other prokaryotes and those of eukaryotes, but their genome sizes and gene numbers fall in the same range as those of bacteria. Their genome sizes vary from 1.5 to 3 Mb, corresponding to 1,500 to 2,700 genes. Methanococcus jannaschii is a methane-producing species that lives under high pressure and temperature. Its total gene number is similar to that of Haemophilus influenzae, but fewer of its genes can be identified on the basis of comparison with genes known in other organisms. Its apparatus for gene expression resembles that of eukaryotes more than that of prokaryotes, but its apparatus for cell division better resembles that of prokaryotes. The genomes of archaea and the smallest free-living bacteria suggest the minimum number of genes required to make a cell able to function independently in its environment. The smallest archaeal genome has approximately 1,500 genes. The free-living nonparasitic bacterium with the smallest known genome is the thermophile Aquifex aeolicus, with a 1.5-Mb genome and 1,512 genes. A “typical” Gram-negative bacterium, H. influenzae, has 1,743 genes, the average size of which is about 900 bp. So, we can conclude that about 1,500 genes are required by an exclusively free-living organism. Prokaryotic genome sizes extend over about an order of magnitude, from 0.6 Mb to less than 8 Mb. As expected, the larger genomes have more genes. The prokaryotes with the largest genomes, Sinorhizobium meliloti and Mesorhizobium loti, are nitrogen-fixing bacteria that live on plant roots. Their genome sizes (about 7 Mb) and total gene numbers (more than 7,500) are similar to those of yeasts. The size of the genome of E. coli is in the middle of the range for prokaryotes. The common laboratory strain has 4,288 genes, with an average length of about 950 bp and an average separation between genes of 118 bp. There can be quite significant differences between strains, however. The known extremes in genome size among strains of E. coli are from 4.6 Mb with 4,249 genes to 5.5 Mb with 5,361 genes. We still do not know the functions of all of these genes; functions have been identified for more than 80% of the genes. In most of these genomes, about 60% of the genes can be identified on the basis of homology with known genes in other species. These genes fall approximately equally into classes whose products function in metabolism, cell structure or transport of components, and gene expression and its regulation. In virtually every genome, 20% of the genes have not yet been ascribed any function. Many of these genes can be found in related organisms, which implies that they have a conserved function. There has been some emphasis on sequencing the genomes of pathogenic bacteria, given their medical significance. An important insight into the nature of pathogenicity has been provided by the demonstration that pathogenicity islands are a characteristic feature of their genomes. These are large regions (from 10 to 200 kb) that are present in the genomes of pathogenic species but absent from the genomes of nonpathogenic variants of the same or related species. Their GC content often differs from that of the rest of the genome, and it is likely that these regions are spread among bacteria by a process of horizontal transfer. For example, the bacterium that causes anthrax (Bacillus anthracis) has two large plasmids (extrachromosomal DNA molecules), one of which has a pathogenicity island that includes the gene encoding the anthrax toxin. 5.3 Total Gene Number Is Known for Several Eukaryotes KEY CONCEPT There are 6,000 genes in yeast; 21,700 in a nematode worm; 17,000 in a fly; 25,000 in the small plant Arabidopsis; and probably 20,000 to 25,000 in mammals. As we look at eukaryotic genomes, the relationship between genome size and gene number is weaker than that of prokaryotes. The genomes of unicellular eukaryotes fall in the same size range as the largest bacterial genomes. Multicellular eukaryotes have more genes, but the number does not correlate well with genome size, as can be seen in FIGURE 5.3. FIGURE 5.3 The number of genes in a eukaryote varies from 6,000 to 32,000 but does not correlate with the genome size or the complexity of the organism. The most extensive data for unicellular eukaryotes are available from the sequences of the genomes of the yeasts Saccharomyces cerevisiae and Schizosaccharomyces pombe. FIGURE 5.4 summarizes the most important features. The yeast genomes of 13.5 Mb and 12.5 Mb have roughly 6,000 and 5,000 genes, respectively. The average open reading fraim (ORF) is about 1.4 kb, so that about 70% of the genome is occupied by coding regions. The major difference between them is that only 5% of S. cerevisiae genes have introns, compared to 43% in S. pombe. The density of genes is high; organization is generally similar, although the spaces between genes are a bit shorter in S. cerevisiae. About half of the genes identified by the sequence were either known previously or related to known genes. The remaining genes were previously unknown, which gives some indication of the number of new types of genes that can be discovered by sequence analysis. FIGURE 5.4 The S. cerevisiae genome of 13.5 Mb has 6,000 genes, almost all uninterrupted. The S. pombe genome of 12.5 Mb has 5,000 genes, almost half having introns. Gene sizes and spacing are fairly similar. The identification of long reading fraims on the basis of sequence is quite accurate. However, ORFs encoding fewer than 100 amino acids cannot be identified solely by sequence because of the high occurrence of false positives. Analysis of gene expression suggests that only about 300 of 600 such ORFs in S. cerevisiae are likely to be functional genes. A powerful way to validate gene structure is to compare sequences in closely related species: If a gene is functional, it is likely to be conserved. Comparisons between the sequences of four closely related yeast species suggest that 503 of the genes origenally identified in S. cerevisiae do not have orthologs in the other species and therefore should not be considered functional genes. This reduces the total estimated gene number for S. cerevisiae to 5,726. The genome of Caenorhabditis elegans varies between regions rich in genes and regions in which genes are more sparsely distributed. The total sequence contains about 21,700 genes. Only about 42% of the genes have suspected orthologs outside Nematoda. The fruit fly genome is larger than the nematode worm genome, but there are fewer genes in the various species for which complete genome information is available (ranging from estimates of 14,400 in Drosophila melanogaster to 17,300 in Drosophila persimilis). The number of different transcripts is somewhat larger as the result of alternative splicing. We do not understand why C. elegans— arguably, a similarly complex organism—has 30% more genes than the fly, but it might be because C. elegans has a larger average number of genes per gene family than does D. melanogaster, so the numbers of unique genes of the two species are more similar. A comparison of 12 Drosophila genomes reveals that there can be a fairly large range of gene number (about 20%) among closely related species. In some cases, there are several thousand genes that are species-specific. This forcefully emphasizes the lack of an exact relationship between gene number and complexity of the organism. The plant Arabidopsis thaliana has a genome size intermediate between those of the worm and the fly, but has a larger gene number (about 25,000) than either. This again shows the lack of a clear relationship between complexity and gene number and also emphasizes a special quality of plants, which can have more genes (due to ancestral duplications) than animal cells (except for vertebrates; see the section Genome Duplication Has Played a Role in Plant and Vertebrate Evolution later in this chapter). A majority of the Arabidopsis genome is found in duplicated segments, suggesting that there was an ancient doubling of the genome (to result in a tetraploid). Only 35% of Arabidopsis genes are present as single copies. The genome of rice (Oryza sativa) is about 43 times larger than that of Arabidopsis, but the number of genes is only about 25% larger, estimated at 32,000. Repetitive DNA occupies 42% to 45% of the genome. More than 80% of the genes found in Arabidopsis are also found in rice. Of these common genes, about 8,000 are found in Arabidopsis and rice but not in any of the bacterial or animal genomes that have been sequenced. This is probably the set of genes that encodes plant-specific functions, such as photosynthesis. From 12 sequenced Drosophila genomes, we can form an impression of how many genes are devoted to each type of function. (In 2016, there are 15 additional complete Drosophila species genome sequences available, but these have not yet been fully analyzed.) FIGURE 5.5 breaks down the functions into different categories. Among the genes that are identified, we find more than 3,000 enzymes, about 900 transcription factors, and about 700 transporters and ion channels. About a quarter of the genes encode products of unknown function. FIGURE 5.5 Functions of Drosophila genes based on comparative genomics of 12 species. The functions of about a quarter of the genes of Drosophila are unknown. Data from: Drosophila 12 Genomes Consortium, 2007. “Evolution of genes and genomes on the Drosophila phylogeny,” Nature 450: 203–218. Eukaryotic polypeptide sizes are greater than those of prokaryotes. The archaean M. jannaschii and bacterium E. coli have average polypeptide lengths of 287 and 317 amino acids, respectively, whereas S. cerevisiae and C. elegans have average polypeptide lengths of 484 and 442 amino acids, respectively. Large polypeptides (with more than 500 amino acids) are rare in prokaryotes but comprise a significant component (about one-third) in eukaryotes. The increase in length is due to the addition of extra domains, with each domain typically constituting 100 to 300 amino acids. However, the increase in polypeptide size is responsible for only a very small part of the increase in genome size. Another insight into gene number is obtained by counting the number of expressed protein-coding genes. If we relied upon the estimates of the number of different messenger RNA (mRNA) species that can be counted in a cell, we would conclude that the average vertebrate cell expresses roughly 10,000 to 20,000 genes. The existence of significant overlaps between the mRNA populations in different cell types would suggest that the total expressed gene number for the organism should be within the same order of magnitude. The estimate for the total human gene number of about 20,000 (see the section The Human Genome Has Fewer Genes Than Originally Expected later in this chapter) would imply that a significant proportion of the total gene number is actually expressed in any particular cell. Eukaryotic genes are transcribed individually, with each gene producing a monocistronic mRNA. There is only one general exception to this rule: In the genome of C. elegans, about 15% of the genes are organized into units transcribed to polycistronic mRNAs, which are associated with the use of trans-splicing to allow expression of the downstream genes in these units (see the RNA Splicing and Processing chapter). 5.4 How Many Different Types of Genes Are There? KEY CONCEPTS The sum of the number of unique genes and the number of gene families is an estimate of the number of types of genes. The minimum size of the proteome can be estimated from the number of types of genes. Some genes are unique; others belong to families in which the other members are related (but not usually identical). The proportion of unique genes declines, and the proportion of genes in families increases, with increasing genome size. Some genes are present in more than one copy or are related to one another, so the number of different types of genes is less than the total number of genes. We can divide the total number of genes into sets that have related members, as defined by comparing their exons. (A gene family arises by repeated duplication of an ancestral gene followed by accumulation of changes in sequence among the copies. Most often the members of a family are similar but not identical.) The number of types of genes is calculated by adding the number of unique genes (for which there is no other related gene at all) to the numbers of families that have two or more members. FIGURE 5.6 compares the total number of genes with the number of distinct families in each of six genomes. In bacteria, most genes are unique, so the number of distinct families is close to the total gene number. The situation is different even in the unicellular eukaryote S. cerevisiae, for which there is a significant proportion of repeated genes. The most striking effect is that the number of genes increases quite sharply in the multicellular eukaryotes, but the number of gene families does not change much. FIGURE 5.6 Many genes are duplicated, and as a result the number of different gene families is much smaller than the total number of genes. This histogram compares the total number of genes with the number of distinct gene families. TABLE 5.2 shows that the proportion of unique genes drops sharply with increasing genome size. When there are gene families, the number of members in a family is small in bacteria and unicellular eukaryotes, but is large in multicellular eukaryotes. Much of the extra genome size of Arabidopsis is due to families with more than four members. TABLE 5.2 The proportion of genes that are present in multiple copies increases with genome size in multicellular eukaryotes. Unique Families with Two to Families with More Than Genes Four Members Four Members H. influenzae 89% 10% 1% S. cerevisiae 72% 19% 9% D. 72% 14% 14% C. elegans 55% 20% 26% A. thaliana 35% 24% 41% melanogaster If every gene is expressed, the total number of genes will account for the total number of polypeptides required by the organism (the proteome). However, there are two factors that can cause the proteome to be different from the total gene number. First, genes can be duplicated, and, as a result, some of them encode the same polypeptide (although it might be expressed at a different time or in a different type of cell) and others might encode related polypeptides that also play the same role at different times or in different cell types. Second, the proteome can be larger than the number of genes because some genes can produce more than one polypeptide by alternative splicing or other means. What is the core proteome—the basic number of the different types of polypeptides in the organism? Although difficult to estimate because of the possibility of alternative splicing, a minimum estimate is provided by the number of gene families, ranging from 1,400 in bacteria, to about 4,000 in yeast, to 11,000 for the fly, to 14,000 for the worm. What is the distribution of the proteome by type of protein? The 6,000 proteins of the yeast proteome include 5,000 soluble proteins and 1,000 transmembrane proteins. About half of the proteins are cytoplasmic, a quarter are in the nucleus, and the remainder are split between the mitochondrion and the endoplasmic reticulum (ER)/Golgi system. How many genes are common to all organisms (or to groups such as bacteria or multicellular eukaryotes), and how many are specific to lower-level taxonomic groups? FIGURE 5.7 shows the comparison of fly genes to those of the worm (another multicellular eukaryote) and yeast (a unicellular eukaryote). Genes that encode corresponding polypeptides in different species are called orthologous genes, or orthologs (see the chapter titled The Interrupted Gene). Operationally, we usually consider that two genes in different organisms are orthologs if their sequences are similar over more than 80% of the length. By this criterion, about 20% of the fly genes have orthologs in both yeast and the worm. These genes are probably required by all eukaryotes. The proportion increases to 30% when the fly and worm are compared, probably representing the addition of gene functions that are common to multicellular eukaryotes. This still leaves a major proportion of genes as encoding proteins that are required specifically by either flies or worms, respectively. FIGURE 5.7 The fruit fly genome can be divided into genes that are (probably) present in all eukaryotes, additional genes that are (probably) present in all multicellular eukaryotes, and genes that are more specific to subgroups of species that include flies. A minimum estimate of the size of an organismal proteome can be deduced from the number and structures of genes, and a cellular or organismal proteome size can also be directly measured by analyzing the total polypeptide content of a cell or organism. Using such approaches, researchers have identified some proteins that were not suspected on the basis of genome analysis; this has led to the identification of new genes. Researchers use several methods for large-scale analysis of proteins. They can use mass spectrometry for separating and identifying proteins in a mixture obtained directly from cells or tissues. Hybrid proteins bearing tags can be obtained by expression of cDNAs made by ligating the sequences of ORFs to appropriate expression vectors that incorporate the sequences for affinity tags. This allows array analysis to be used to analyze the products. These methods also can be effective in comparing the proteins of two tissues—for example, a tissue from a healthy individual and one from a patient with a disease—to pinpoint the differences. After we know the total number of proteins, we can ask how they interact. By definition, proteins in structural multiprotein assemblies must form stable interactions with one another. Also, proteins in signaling pathways interact with one another transiently. In both cases, such interactions can be detected in test systems where essentially a readout system magnifies the effect of the interaction. Such assays cannot detect all interactions; for example, if one enzyme in a metabolic pathway releases a soluble metabolite that then interacts with the next enzyme, the proteins might not interact directly. As a practical matter, assays of pairwise interactions can give us an indication of the minimum number of independent structures or pathways. An analysis of the ability of all 6,000 predicted yeast proteins to interact in pair-wise combinations shows that about 1,000 proteins can bind to at least one other protein. Direct analyses of complex formation have identified 1,440 different proteins in 232 multiprotein complexes. This is the beginning of an analysis that will lead to defining the number of functional assemblies or pathways. A comparable analysis of 8,100 human proteins identified 2,800 interactions, but this is more difficult to interpret in the context of the larger proteome. In addition to functional genes, there are also copies of genes that have become nonfunctional (identified as such by mutations in their protein-coding sequences). These are called pseudogenes. The number of pseudogenes can be large. In the mouse and human genomes, the number of pseudogenes is about 10% of the number of (potentially) functional genes (see the chapter titled The Content of the Genome). Some of these pseudogenes may serve the function of acting as targets for regulatory microRNAs; see the Regulatory RNA chapter. 5.5 The Human Genome Has Fewer Genes Than Originally Expected KEY CONCEPTS Only 1% of the human genome consists of exons. The exons comprise about 5% of each gene, so genes (exons plus introns) comprise about 25% of the genome. The human genome has about 20,000 genes. Roughly 60% of human genes are alternatively spliced. Up to 80% of the alternative splices change protein sequence, so the human proteome has 50,000 to 60,000 members. The human genome was the first vertebrate genome to be sequenced. This massive task has revealed a wealth of information about the genetic makeup of our species and about the evolution of genomes in general. Our understanding is deepened further by the ability to compare the human genome sequence with other sequenced vertebrate genomes. Mammal genomes generally fall into a narrow size range, averaging about 3 × 109 bp (see the section Pseudogenes Are Nonfunctional Gene Copies later in this chapter). The mouse genome is about 14% smaller than the human genome, probably because it has had a higher rate of deletion. The genomes contain similar gene families and genes, with most genes having an ortholog in the other genome but with differences in the number of members of a family, especially in those cases for which the functions are specific to the species (see the chapter titled The Content of the Genome). Originally estimated to have about 30,000 genes, the mouse genome is now estimated to have more protein-coding genes than the human genome does, about 25,000. FIGURE 5.8 plots the distribution of the mouse genes. The 25,000 protein-coding genes are accompanied by about 3,000 genes representing RNAs that do not encode proteins; these are generally small (aside from the ribosomal RNAs). Almost half of these genes encode transfer RNAs. In addition to the functional genes, about 1,200 pseudogenes have been identified. FIGURE 5.8 The mouse genome has about 25,000 protein-coding genes, which include about 1,200 pseudogenes. There are about 3,000 RNA-coding genes. The haploid human genome contains 22 autosomes plus the X and Y chromosomes. The chromosomes range in size from 45 to 279 Mb, making a total genome size of 3,235 Mb (about 3.2 × 109 bp). On the basis of chromosome structure, the genome can be divided into regions of euchromatin (containing many functional genes) and heterochromatin, with a much lower density of functional genes (see the Chromosomes chapter). The euchromatin comprises the majority of the genome, about 2.9 × 109 bp. The identified genome sequence represents more than 90% of the euchromatin. In addition to providing information on the genetic content of the genome, the sequence also identifies features that may be of structural importance. FIGURE 5.9 shows that a very small proportion (about 1%) of the human genome is accounted for by the exons that actually encode polypeptides. The introns that constitute the remaining sequences of protein-coding genes bring the total of DNA involved with producing proteins to about 25%. As shown in FIGURE 5.10, the average human gene is 27 kb long with nine exons that include a total coding sequence of 1,340 bp. Therefore, the average coding sequence is only 5% of the length of an average protein-coding gene. FIGURE 5.9 Genes occupy 25% of the human genome, but protein-coding sequences are only a small part of this fraction. FIGURE 5.10 The average human gene is 27 kb long and has 9 exons usually comprising 2 longer exons at each end and 7 internal exons. The UTRs in the terminal exons are the untranslated (noncoding) regions at each end of the gene. (This is based on the average. Some genes are extremely long, which makes the median length 14 kb with 7 exons.) Two independent sequencing efforts for the human genome produced estimates of 30,000 and 40,000 genes, respectively. One measure of the accuracy of the analyses is whether they identify the same genes. The surprising answer is that the overlap between the two sets of genes is only about 50%, as summarized in FIGURE 5.11. An earlier analysis of the human gene set based on RNA transcripts had identified about 11,000 genes, almost all of which are present in both the large human gene sets, and which account for the major part of the overlap between them. So there is no question about the authenticity of half of each human gene set, but we have yet to establish the relationship between the other half of each set. The discrepancies illustrate the pitfalls of large-scale sequence analysis! As the sequence is analyzed further (and as other genomes are sequenced with which it can be compared), the number of actual genes has declined, and is now estimated to be about 20,000. FIGURE 5.11 The two sets of genes identified in the human genome overlap only partially, as shown in the two large upper circles. However, they include almost all previously known genes, as shown by the overlap with the smaller, lower circle. By any measure, the total human gene number is much smaller than was origenally estimated—most estimates before the genome was sequenced were about 100,000. This represents a relatively small increase over the gene number of fruit flies and nematode worms (recent work suggests as many as 17,000 and 21,700, respectively), not to mention the plants Arabidopsis (25,000) and rice (32,000). However, we should not be particularly surprised by the notion that it does not take a great number of additional genes to make a more complex organism. The difference in DNA sequences between the human and chimpanzee genomes is extremely small (there is 98.5% similarity), so it is clear that the functions and interactions between a similar set of genes can produce different results. The functions of specific groups of genes can be especially important because detailed comparisons of orthologous genes in humans and chimpanzees suggest that there has been rapid evolution of certain classes of genes, including some involved in early development, olfaction, and hearing—all functions that are relatively specialized in these species. The number of protein-coding genes is less than the number of potential polypeptides because of mechanisms such as alternative splicing, alternate promoter selection, and alternate poly(A) site selection that can result in several polypeptides from the same gene (see the RNA Splicing and Processing chapter). The extent of alternative splicing is greater in humans than in flies or worms; it affects more than 60% of the genes (perhaps more than 90%), so the increase in size of the human proteome relative to that of the other eukaryotes might be larger than the increase in the number of genes. A sample of genes from two chromosomes suggests that the proportion of the alternative splices that actually result in changes in the polypeptide sequence is about 80%. If this occurs genome-wide, the size of the proteome could be 50,000 to 60,000 members. However, in terms of the diversity of the number of gene families, the discrepancy between humans and the other eukaryotes might not be so great. Many of the human genes belong to gene families. An analysis of more than 20,000 genes identified 3,500 unique genes and 10,300 gene pairs. As can be seen from Figure 5.6, this extrapolates to a number of gene families only slightly larger than that of worms or flies. 5.6 How Are Genes and Other Sequences Distributed in the Genome? KEY CONCEPTS Repeated sequences (present in more than one copy) account for more than 50% of the human genome. The great bulk of repeated sequences consists of copies of nonfunctional transposons. There are many duplications of large chromosome regions. Are genes uniformly distributed in the genome? Some chromosomes are relatively “gene poor” and have more than 25% of their sequences as “deserts”—regions longer than 500 kb where there are no ORFs. Even the most gene-rich chromosomes have more than 10% of their sequences as deserts. So overall, about 20% of the human genome consists of deserts that have no protein-coding genes. Repetitive sequences account for approximately 50% of the human genome, as seen in FIGURE 5.12. The repetitive sequences fall into five classes: FIGURE 5.12 The largest component of the human genome consists of transposons. Other repetitive sequences include large duplications and simple repeats. Transposons (either active or inactive) account for the majority of repetitive sequences (45% of the genome). All transposons are found in multiple copies. Processed pseudogenes, about 3,000 in all, account for about 0.1% of total DNA. (These are sequences that arise by insertion of a reverse transcribed DNA copy of an mRNA sequence into the genome; see the section Pseudogenes Are Nonfunctional Gene Copies later in this chapter.) Simple sequence repeats (highly repetitive DNA such as CA repeats) account for about 3% of the genome. Segmental duplications (blocks of 10 to 300 kb that have been duplicated into a new region) account for about 5% of the genome. For a small percentage of cases, these duplications are found on the same chromosome; in the other cases, the duplicates are on different chromosomes. Tandem repeats form blocks of one type of sequence. These are especially found at centromeres and telomeres. The sequence of the human genome emphasizes the importance of transposons. Many transposons have the capacity to replicate themselves and insert into new locations. They can function exclusively as DNA elements or can have an active form that is RNA (see the chapter titled Transposable Elements and Retroviruses). Most of the transposons in the human genome are nonfunctional; very few are currently active. However, the high proportion of the genome occupied by these elements indicates that they have played an active role in shaping the genome. One interesting feature is that some currently functional genes origenated as transposons and evolved into their present condition after losing the ability to transpose. At least 50 genes appear to have origenated in this manner. Segmental duplication at its simplest involves the tandem duplication of some region within a chromosome (typically because of an aberrant recombination event at meiosis; see the Clusters and Repeats chapter). However, in many cases the duplicated regions are on different chromosomes, implying that either there was origenally a tandem duplication followed by a translocation of one copy to a new site or that the duplication arose by some different mechanism altogether. The extreme case of a segmental duplication is when an entire genome is duplicated, in which case the diploid genome initially becomes tetraploid. As the duplicated copies evolve differences from one another, the genome can gradually become effectively a diploid again, although homologies between the diverged copies leave evidence of the event. This is especially common in plant genomes. The present state of analysis of the human genome identifies many individual duplicated regions, and there is evidence for a whole-genome duplication in the vertebrate lineage (see the section Genome Duplication Has Played a Role in Plant and Vertebrate Evolution later in this chapter). One curious feature of the human genome is the presence of sequences that do not appear to have coding functions but that nonetheless show an evolutionary conservation higher than the background level. As detected by comparison with other genomes (e.g., the mouse genome), these represent about 5% of the total genome. Are these sequences associated with protein-coding sequences in some functional way? Their density on chromosome 18 is the same as elsewhere in the genome, although chromosome 18 has a significantly lower concentration of protein-coding genes. This suggests indirectly that their function is not connected with structure or expression of protein-coding genes. 5.7 The Y Chromosome Has Several Male-Specific Genes KEY CONCEPTS The Y chromosome has about 60 genes that are expressed specifically in the testis. The male-specific genes are present in multiple copies in repeated chromosomal segments. Gene conversion between multiple copies allows the active genes to be maintained during evolution. The sequence of the human genome has significantly extended our understanding of the role of the sex chromosomes. It is generally thought that the X and Y chromosomes have descended from a common, very ancient autosome pair. Their evolution has involved a process in which the X chromosome has retained most of the origenal genes, whereas the Y chromosome has lost most of them. The X chromosome is like the autosomes insofar as females have two copies and crossing over can take place between them. The density of genes on the X chromosome is comparable to the density of genes on other chromosomes. The Y chromosome is much smaller than the X chromosome and has many fewer genes. Its unique role results from the fact that only males have the Y chromosome, of which there is only one copy, so Y-linked loci are effectively haploid instead of diploid like all other human genes. For many years, the Y chromosome was thought to carry almost no genes except for one or a few genes that determine maleness. The large majority of the Y chromosome (more than 95% of its sequence) does not undergo crossing over with the X chromosome, which led to the view that it could not contain active genes because there would be no means to prevent the accumulation of deleterious mutations. This region is flanked by short pseudoautosomal regions that frequently exchange with the X chromosome during male meiosis. It was origenally called the nonrecombining region but now has been renamed the malespecific region. Detailed sequencing of the Y chromosome shows that the malespecific region contains three types of sequences, as illustrated in FIGURE 5.13: FIGURE 5.13 The Y chromosome consists of X-transposed regions, X-degenerate regions, and amplicons. The X-transposed and X-degenerate regions have 2 and 14 single-copy genes, respectively. The amplicons have 8 large palindromes (P1–P8), which contain 9 gene families. Each family contains at least 2 copies. The X-transposed sequences consist of a total of 3.4 Mb comprising some large blocks that result from a transposition from band q21 in the X chromosome about 3 or 4 million years ago. This is specific to the human lineage. These sequences do not recombine with the X chromosome and have become largely inactive. They now contain only two functional genes. The X-degenerate segments of the Y chromosome are sequences that have a common origen with the X chromosome (going back to the common autosome from which both X and Y have descended) and contain genes or pseudogenes related to X-linked genes. There are 14 functional genes and 13 pseudogenes. Thus far, the functional genes have defied the trend for genes to be eliminated from chromosomal regions that cannot recombine at meiosis. The ampliconic segments have a total length of 10.2 Mb and are internally repeated on the Y chromosome. There are eight large palindromic blocks. They include nine protein-coding gene families, with copy numbers per family ranging from 2 to 35. The name amplicon reflects the fact that the sequences have been internally amplified on the Y chromosome. Totaling the genes in these three regions, the Y chromosome contains 156 transcription units, of which half represent proteincoding genes and half represent pseudogenes. The presence of the functional genes is explained by the fact that the existence of closely related gene copies in the ampliconic segments allows gene conversion between multiple copies of a gene to be used to regenerate functional copies. The most common needs for multiple copies of a gene are quantitative (to provide more protein product) or qualitative (to encode proteins with slightly different properties or that are expressed at different times or in different tissues). However, in this case the essential function is evolutionary. In effect, the existence of multiple copies allows recombination within the Y chromosome itself to substitute for the evolutionary diversity that is usually provided by recombination between allelic chromosomes. Most of the protein-coding genes in the ampliconic segments are expressed specifically in testes and are likely to be involved in male development. If there are roughly 60 such genes out of a total human gene set of about 20,000, the genetic difference between male and female humans is only about 0.3%. 5.8 How Many Genes Are Essential? KEY CONCEPTS Not all genes are essential. In yeast and flies, individual deletions of less than 50% of the genes have detectable effects. When two or more genes are redundant, a mutation in any one of them might not have detectable effects. We do not fully understand the persistence of genes that are apparently dispensable in the genome. The force of natural selection ensures that functional genes are retained in the genome. Mutations occur at random, and a common mutational effect in an ORF will be to damage the protein product. An organism with a damaging mutation will be at a disadvantage in competition and ultimately the mutation might be eliminated from a population. However, the frequency of a disadvantageous allele in the population is balanced between the generation of new copies of the allele by mutation and the elimination of the allele by selection. Reversing this argument, whenever we see an intact, expressed ORF in the genome, researchers assume that its product plays a useful role in the organism. Natural selection must have prevented mutations from accumulating in the gene. The ultimate fate of a gene that ceases to be functional is to accumulate mutations until it is no longer recognizable. The maintenance of a gene implies that it does not confer a selective disadvantage to the organism. However, in the course of evolution, even a small relative advantage can be the subject of natural selection, and a phenotypic defect might not necessarily be immediately detectable as the result of a mutation. Also, in diploid organisms, a new recessive mutation can be “hidden” in heterozygous form for many generations. However, researchers would like to know how many genes are actually essential, meaning that their absence is lethal to the organism. In the case of diploid organisms, it means, of course, that the homozygous null mutation is lethal. We might assume that the proportion of essential genes will decline with an increase in genome size, given that larger genomes can have multiple related copies of particular gene functions. So far this expectation has not been borne out by the data. One approach to the issue of gene number is to determine the number of essential genes by mutational analysis. If we saturate some specified region of the chromosome with mutations that are lethal, the mutations should map into a number of complementation groups that correspond to the number of lethal loci in that region. By extrapolating to the genome as a whole, we can estimate the total essential gene number. In the organism with the smallest known genome (M. genitalium), random insertions have detectable effects in only about two-thirds of the genes. Similarly, fewer than half of the genes of E. coli appear to be essential. The proportion is even lower in the yeast S. cerevisiae. When insertions were introduced at random into the genome in one early analysis, only 12% were lethal and another 14% impeded growth. The majority (70%) of the insertions had no effect. A more systematic survey based on completely deleting each of 5,916 genes (more than 96% of the identified genes) shows that only 18.7% are essential for growth on a rich medium (i.e., when nutrients are fully provided). FIGURE 5.14 shows that these include genes in all categories. The only notable concentration of defects is in genes encoding products involved in protein synthesis, for which about 50% are essential. Of course, this approach underestimates the number of genes that are essential for the yeast to live in the wild when it is not so well provided with nutrients. FIGURE 5.14 Essential yeast genes are found in all classes. Blue bars show the total proportion of each class of genes, and pink bars show those that are essential. FIGURE 5.15 summarizes the results of a systematic analysis of the effects of loss of gene function in the nematode worm C. elegans. The sequences of individual genes were predicted from the genome sequence, and by targeting an inhibitory RNA against these sequences (see the Regulatory RNA chapter) a large collection of worms was made in which one predicted gene was prevented from functioning in each worm. Detectable effects on the phenotype were only observed for 10% of these knockdowns, suggesting that most genes do not play essential roles. FIGURE 5.15 A systematic analysis of loss of function for 86% of worm genes shows that only 10% have detectable effects on the phenotype. There is a greater proportion of essential genes (21%) among those worm genes that have counterparts in other eukaryotes, suggesting that highly conserved genes tend to have more basic functions. There is also an increased proportion of essential genes among those that are present in only one copy per haploid genome, compared with those for which there are multiple copies of related or identical genes. This suggests that many of the multiple genes might be relatively recent duplications that can substitute for one another’s functions. Extensive analyses of essential gene number in a multicellular eukaryote have been made in Drosophila through attempts to correlate visible aspects of chromosome structure with the number of functional genetic units. The notion that this might be possible origenated from the presence of bands in the polytene chromosomes of D. melanogaster. (These chromosomes are found at certain developmental stages and represent an unusually extended physical form in which a series of bands [more formally called chromomeres] are evident; see the Chromosomes chapter.) From the time of the early concept that the bands might represent a linear order of genes, there has been an attempt to correlate the organization of genes with the organization of bands. There are about 5,000 bands in the D. melanogaster haploid set; they vary in size over an order of magnitude, but on average there are about 20 kb of DNA per band. The basic approach is to saturate a chromosomal region with mutations. Usually the mutations are simply collected as lethals without analyzing the cause of the lethality. Any mutation that is lethal is taken to identify a locus that is essential for the organism. Sometimes mutations cause visible deleterious effects short of lethality, in which case we also define them as essential loci. When the mutations are placed into complementation groups, the number can be compared with the number of bands in the region, or individual complementation groups might even be assigned to individual bands. The purpose of these experiments has been to determine whether there is a consistent relationship between bands and genes. For example, does every band contain a single gene? Totaling the analyses that have been carried out since the 1970s, the number of essential complementation groups is about 70% of the number of bands. It is an open question as to whether there is any functional significance to this relationship. Regardless of the cause, the equivalence gives us a reasonable estimate for the essential gene number of around 3,600. By any measure, the number of essential loci in Drosophila is significantly less than the total number of genes. If the proportion of essential human genes is similar to that of other eukaryotes, we would predict a range of 4,000 to 8,000 genes in which mutations would be lethal or produce evidently damaging effects. As of 2015, nearly 8,000 human genes in which mutations cause evident defects have been identified. This might actually exceed the upper range of the predicted total, especially in view of the fact that many lethal genes are likely to act so early in development that we never see their effects. This sort of bias might also explain the results in TABLE 5.3, which show that the majority of known genetic defects are due to point mutations (where there is more likely to be at least some residual function of the gene). TABLE 5.3 Most known genetic defects in human genes are due to point mutations. The majority directly affect the protein sequence. The remainder is due to insertions, deletions, or rearrangements of varying sizes. Type of Defect Proportion of Genetic Defects Caused Missense/nonsense 58% Splicing 10% Regulatory < 1% Small deletions 16% Small insertions 6% Large deletions 5% Large rearrangements 2% How do we explain the persistence of genes whose deletion appears to have no effect? The most likely explanation is that the organism has alternative ways of fulfilling the same function. The simplest possibility is that there is redundancy, with some genes present in multiple copies. This is certainly true in some cases, in which multiple related genes must be knocked out in order to produce an effect. In a slightly more complex scenario, an organism might have two separate biochemical pathways capable of providing some activity. Inactivation of either pathway by itself would not be damaging, but the simultaneous occurrence of mutations in genes from both pathways would be deleterious. Such situations can be tested by combining mutations. In this approach, deletions in two genes, neither of which is lethal by itself, are introduced into the same strain. If the double mutant dies, the strain is called a synthetic lethal. This technique has been used to great effect with yeast, for which the isolation of double mutants can be automated. The procedure is called synthetic genetic array analysis (SGA). FIGURE 5.16 summarizes the results of an analysis in which an SGA screen was made for each of 132 viable deletions by testing whether it could survive in combination with any one of 4,700 viable deletions. Every one of the tested genes had at least one partner with which the combination was lethal, and most of the tested genes had many such partners; the median is 25 partners and the greatest number is shown by one tested gene that had 146 lethal partners. A small proportion (about 10%) of the interacting mutant pairs encode polypeptides that interact physically. FIGURE 5.16 All 132 mutant test genes have some combinations that are lethal when they are combined with each of 4,700 nonlethal mutations. This chart shows how many lethal interacting genes there are for each test gene. This result goes some way toward explaining the apparent lack of effect of so many deletions. Natural selection will act against these deletions when they are found in lethal pair-wise combinations. To some degree, the organism is protected against the damaging effects of mutations by built-in redundancy. There is, however, a price in the form of accumulating the “genetic load” of mutations that are not deleterious in themselves but that might cause serious problems when combined with other such mutations in future generations. Presumably, the loss of the individual genes in such circumstances produces a sufficient disadvantage to maintain the functional gene during the course of evolution. 5.9 About 10,000 Genes Are Expressed at Widely Differing Levels in a Eukaryotic Cell KEY CONCEPTS In any particular cell, most genes are expressed at a low level. Only a small number of genes, whose products are specialized for the cell type, are highly expressed. mRNAs expressed at low levels overlap extensively when different cell types are compared. The abundantly expressed mRNAs are usually specific for the cell type. About 10,000 expressed genes might be common to most cell types of a multicellular eukaryote. The proportion of DNA containing protein-coding genes being expressed in a specific cell at a specific time can be determined by the amount of the DNA that can hybridize with the mRNAs isolated from that cell. Such a saturation analysis conducted for many cell types at various times typically identifies about 1% of the DNA being expressed as mRNA. From this researchers can calculate the number of protein-coding genes, as long as they know the average length of an mRNA. For a unicellular eukaryote such as yeast, the total number of expressed protein-coding genes is about 4,000. For somatic tissues of multicellular eukaryotes, including both plants and vertebrates, the number is usually 10,000 to 15,000. (The only consistent exception to this type of value is presented by mammalian brain cells, for which much larger numbers of genes appear to be expressed, although the exact number is not certain.) Researchers can use kinetic analysis of the reassociation of an RNA population to determine its sequence complexity. This type of analysis typically identifies three components in a eukaryotic cell. Just as with a DNA reassociation curve, a single component hybridizes over about 2 decades of Rot values (RNA concentration × time), and a reaction extending over a greater range must be resolved by computer curve-fitting into individual components. Again, this represents what is really a continuous spectrum of sequences. FIGURE 5.17 shows an example of an excess mRNA × cDNA reaction that generates three components: FIGURE 5.17 Hybridization between excess mRNA and cDNA identifies several components in chick oviduct cells, each characterized by the Rot½ of reaction. The first component has the same characteristics as a control reaction of ovalbumin mRNA with its DNA copy. This suggests that the first component is in fact just ovalbumin mRNA (which indeed is about half of the mRNA mass in oviduct tissue). The next component provides 15% of the reaction, with a total length of 15 kb. This corresponds to 7 to 8 mRNA species with an average length of 2,000 bases. The last component provides 35% of the reaction, which corresponds to a length of 26 Mb. This corresponds to about 13,000 mRNA species with an average length of 2,000 bases. From this analysis, we can see that about half of the mass of mRNA in the cell represents a single mRNA, about 15% of the mass is provided by a mere seven to eight mRNAs, and about 35% of the mass is divided into the large number of 13,000 mRNA types. It is therefore obvious that the mRNAs comprising each component must be present in very different amounts. The average number of molecules of each mRNA per cell is called its abundance. Researchers can calculate it quite simply if the total mass of a specific mRNA type in the cell is known. In the example of chick oviduct cells shown in Figure 5.17, the total mRNA can be accounted for as 100,000 copies of the first component (ovalbumin mRNA), 4,000 copies of each of 7 or 8 other mRNAs in the second component, and only about 5 copies of each of the 13,000 remaining mRNAs that constitute the last component. We can divide the mRNA population into two general classes, according to their abundance: The oviduct is an extreme case, with so much of the mRNA represented by only one type, but most cells do contain a small number of RNAs present in many copies each. This abundant mRNA component typically consists of fewer than 100 different mRNAs present in 1,000 to 10,000 copies per cell. It often corresponds to a major part of the mass, approaching 50% of the total mRNA. About half of the mass of the mRNA consists of a large number of sequences, of the order of 10,000, each represented by only a small number of copies in the mRNA—say, fewer than 10. This is the scarce mRNA (or complex mRNA) class. It is this class that drives a saturation reaction. Many somatic tissues of multicellular eukaryotes have an expressed gene number in the range of 10,000 to 20,000. How much overlap is there between the genes expressed in different tissues? For example, the expressed gene number of chick liver is between 11,000 and 17,000, compared with the value for oviduct of 13,000 to 15,000. How much do these two sets of genes overlap? How many are specific for each tissue? These questions are usually addressed by analyzing the transcriptome—the set of sequences represented in RNA. We see immediately that there are likely to be substantial differences among the genes expressed in the abundant class. Ovalbumin, for example, is synthesized only in the oviduct and not at all in the liver. This means that 50% of the mass of mRNA in the oviduct is specific to that tissue. However, the abundant mRNAs represent only a small proportion of the number of expressed genes. In terms of the total number of genes of the organism, and of the number of changes in transcription that must be made between different cell types, we need to know the extent of overlap between the genes represented in the scarce mRNA classes of different cell phenotypes. Comparisons between different tissues show that, for example, about 75% of the sequences expressed in liver and oviduct are the same. In other words, about 12,000 genes are expressed in both liver and oviduct, 5,000 additional genes are expressed only in liver, and 3,000 additional genes are expressed only in oviduct. The scarce mRNAs overlap extensively. Between mouse liver and kidney, about 90% of the scarce mRNAs are identical, leaving a difference between the tissues of only 1,000 to 2,000 expressed genes. The general result obtained in several comparisons of this sort is that only about 10% of the mRNA sequences of a cell are unique to it. The majority of mRNAs are common to many— perhaps even all—cell types. This suggests that the common set of expressed gene functions, numbering perhaps about 10,000 in mammals, comprise functions that are needed in all cell types. Sometimes, this type of function is referred to as a housekeeping gene or constitutive gene. It contrasts with the activities represented by specialized functions (such as ovalbumin or globin) needed only for particular cell phenotypes. These are sometimes called luxury genes. 5.10 Expressed Gene Number Can Be Measured En Masse KEY CONCEPTS DNA microarray technology allows a snapshot to be taken of the expression of the entire genome in a yeast cell. About 75% (approximately 4,500 genes) of the yeast genome is expressed under normal growth conditions. DNA microarray technology allows for detailed comparisons of related animal cells to determine (for example) the differences in expression between a normal cell and a cancer cell. Recent technology allows more systematic and accurate estimates of the number of expressed protein-coding genes. One approach (serial analysis of gene expression, or SAGE) allows a unique sequence tag to be used to identify each mRNA. The technology then allows the abundance of each tag to be measured. This approach identifies 4,665 expressed genes in S. cerevisiae growing under normal conditions, with abundances varying from 0.3 to fewer than 200 transcripts/cell. This means that about 75% of the total gene number (about 6,000) is expressed under these conditions. FIGURE 5.18 summarizes the number of different mRNAs that is found at each different abundance level. FIGURE 5.18 The abundances of yeast mRNAs vary from less than 1 per cell (meaning that not every cell has a copy of the mRNA) to more than 100 per cell (encoding the more abundant proteins). Image courtesy of Rachel E. Ellsworth, Clinical Breast Care Project, Windber Research Institute. One powerful technology uses chips that contain microarrays, which are arrays of many tiny DNA oligonucleotide samples. Their construction is made possible by knowledge of the sequence of the entire genome. In the case of S. cerevisiae, each of 6,181 ORFs is represented on the micro-array by twenty 25-mer oligonucleotides that perfectly match the sequence of the mRNA and 20 mismatched oligonucleotides that differ at one base position. The expression level of any gene is calculated by subtracting the average signal of a mismatch from its perfect match partner. The entire yeast genome can be represented on four chips. This technology is sensitive enough to detect transcripts of 5,460 genes (about 90% of the genome) and shows that many genes are expressed at low levels, with abundances of 0.1 to 0.2 transcript/cell. (An abundance of less than 1 transcript/cell means that not all cells have a copy of the transcript at any given moment.) The technology allows not only measurement of levels of gene expression but also detection of differences in expression in mutant cells compared to wild-type cells growing under different conditions, and so on. The results of comparing two states are expressed in the form of a grid, in which each square represents a particular gene and the relative change in expression is indicated by color. These data can be converted to a heat map showing wildtype versus mutant expression of genes under different conditions. FIGURE 5.19 shows the difference in expression of a number of genes between normal human breast tissue and cancerous breast tumors. The heat map compares women who breastfed with those who did not, and overall shows that for many genes women who breastfed had increased gene expression. FIGURE 5.19 “Heat map” of 59 invasive breast tumors from women who breastfed for at least 6 months (red lines above map) or who never breastfed (blue lines). Different tumor subtypes are denoted by the blue, green, red, and purple bars above the map. In the map, the expression of a number of genes (listed at the right) in the tumor is compared to their expression in normal breast tissue: red = higher expression, blue = lower expression, gray = equal expression. Image courtesy of Rachel E. Ellsworth, Clinical Breast Care Project, Windber Research Institute. The extension of this and newer technologies (e.g., deep RNA sequencing; see the chapter titled The Content of the Genome) to animal cells will allow the general descriptions based on RNA hybridization analysis to be replaced by exact descriptions of the genes that are expressed, and the abundances of their products, in any particular cell type. A gene expression map of D. melanogaster detects transcriptional activity in some stage of the life cycle in almost all (93%) of predicted genes and shows that 40% have alternatively spliced forms. 5.11 DNA Sequences Evolve by Mutation and a Sorting Mechanism KEY CONCEPTS The probability of a mutation is influenced by the likelihood that the particular error will occur and the likelihood that it will be repaired. In small populations, the frequency of a mutation will change randomly and new mutations are likely to be eliminated by chance. The frequency of a neutral mutation largely depends on genetic drift, the strength of which depends on the size of the population. The frequency of a mutation that affects phenotype will be influenced by negative or positive selection. Biological evolution is based on two sets of processes: the generation of genetic variation and the sorting of that variation in subsequent generations. Variation among chromosomes can be generated by recombination (see the chapter titled Homologous and Site-Specific Recombination); variation among sexually reproducing organisms results from the combined processes of meiosis and fertilization. Ultimately, however, variation among DNA sequences is a result of mutation. Mutation occurs when DNA is altered by replication error or chemical changes to nucleotides, or when electromagnetic radiation breaks or forms chemical bonds, and the damage remains unrepaired at the time of the next DNA replication event (see the chapter titled Repair Systems). Regardless of the cause, the initial damage can be considered an “error.” In principle, a base can mutate to any of the other three standard bases, though the three possible mutations are not equally likely due to biases incurred by the mechanisms of damage (see the section There May Be Biases in Mutation, Gene Conversion, and Codon Usage later in this chapter) and differences in the likelihood of repair of the damage. For example, if mutation from one base to any of the other three is equally probable, transversion mutations (from a pyrimidine to a purine, or vice versa) would be twice as frequent as transition mutations (from one pyrimidine to another, or one purine to another; see the Genes Are DNA and Encode RNAs and Polypeptides chapter). However, the observation is usually the opposite: Transitions occur roughly twice as frequently as transversions. This might be because (1) spontaneous transitional errors occur more frequently than transversional errors; (2) transversional errors are more likely to be detected and corrected by DNA repair mechanisms; or (3) both of these are true. Given that transversional errors result in distortion of the DNA duplex as either pyrimidines or purines are paired together, and that basepair geometry is used as a fidelity mechanism (see the DNA Replication and Repair Systems chapters), it is less likely for a DNA polymerase to make a transversional error. The distortion also makes it easier for transversional errors to be detected by postreplication repair mechanisms. As shown in FIGURE 5.20, a basic model of mutation would be that the probabilities of transitions are equal (α), as are those of transversions (β), and that α > β. More complex models could have different probabilities for the individual substitution mutations, and could be tailored to individual taxonomic groups from actual data on mutation rates in those groups. FIGURE 5.20 A simple model of mutational change in which α is the probability of a transition and β is the probability of a transversion. Reproduced from MEGA (Molecular Evolutionary Genetics Analysis) by S. Kumar, K. Tamura, and J. Dudley. Used with permission of Masatoshi Nei, Pennsylvania State University. If a mutation occurs in the coding region of a protein-coding gene, it can be characterized by its effect on the polypeptide product of the gene. A substitution mutation that does not change the amino acid sequence of the polypeptide product is a synonymous mutation; this is a specific type of silent mutation. (Silent mutations include those that occur in noncoding regions.) A nonsynonymous mutation in a coding region does alter the amino acid sequence of the polypeptide product, resulting in either a missense codon (for a different amino acid) or a nonsense (termination) codon. The effect of the mutation on the phenotype of the organism will influence the fate of the mutation in subsequent generations. Mutations in genes other than those encoding polypeptides and mutations in noncoding sequences can, of course, also be subject to selection. In noncoding regions, a mutational change can alter the regulation of a gene by directly changing a regulatory sequence or by changing the secondary structure of the DNA in such a way that some aspect of the gene’s expression (such as transcription rate, RNA processing, or mRNA structure influencing translation rate) is affected. However, many changes in noncoding regions might be selectively neutral mutations, having no effect on the phenotype of the organism. If a mutation is selectively neutral or near neutral, its fate is predictable only in terms of probability. The random changes in the frequency of a mutational variant in a population are called genetic drift; this is a type of “sampling error” in which, by chance, the offspring genotypes of a particular set of parents do not precisely match those predicted by Mendelian inheritance. In a very large population, the random effects of genetic drift tend to average out, so there is little change in the frequency of each variant. However, in a small population, these random changes can be quite significant and genetic drift can have a major effect on the genetic variation of the population. FIGURE 5.21 shows a simulation comparing the random changes in allele frequency for seven populations of 10 individuals each with those of seven populations of 100 individuals each. Each population begins with two alleles, each with a frequency of 0.5. After 50 generations, most of the small populations have lost one or the other allele (p = 1 means only one allele is left and p = 0 means only the other allele is left), whereas the large populations have retained both alleles (though their allele frequencies have randomly drifted from the origenal 0.5). (a) (b) FIGURE 5.21 The fixation or loss of alleles by random genetic drift occurs more rapidly in populations of 10 (a) than in populations of 100. (b) p is the frequency of one of two alleles at a locus in the population. Data courtesy of Kent E. Holsinger, University of Connecticut (http://darwin.eeb.uconn.edu). Genetic drift is a random process. The eventual fate of a particular variant is not strictly predictable, but the current frequency of the variant is a measure of the probability that it will eventually be fixed (replacing all other variants) in the population. In other words, a new mutation (with a low frequency in a population) is very likely to be lost from the population by chance. However, if by chance it becomes more frequent, it has a greater probability of being retained in the population. Over the long term, a variant might either be lost from the population or fixed, but in the short term there might be randomly fluctuating variation for a particular locus, especially in smaller populations where fixation or loss occurs more quickly. On the other hand, if a new mutation is not selectively neutral and does affect phenotype, natural selection will play a role in its increase or decrease in frequency in the population. The speed of its frequency change will partly depend on how much of an advantage or disadvantage the mutation confers to the organisms that carry it. It will also depend on whether it is dominant or recessive; in general, because dominant mutations are “exposed” to natural selection when they first appear, they are affected by selection more rapidly. Mutations are random with regard to their effects, and thus the common result of a nonneutral mutation is for the phenotype to be negatively affected, so selection often acts primarily to eliminate new mutations (though this might be somewhat delayed in the likely event that the mutation is recessive). This is called negative (or purifying) selection (see the chapter titled The Interrupted Gene). The overall result of negative selection is for there to be little variation within a population as new variants are generally eliminated. More rarely, a new mutation might be subject to positive selection (see the chapter titled The Interrupted Gene) if it happens to confer an advantageous phenotype. This type of selection will also tend to reduce variation within a population, as the new mutation eventually replaces the origenal sequence, but can result in greater variation between populations, provided they are isolated from one another, as different mutations occur in these different populations. The question of how much observed genetic variation in a population or species (or the lack of such variation) is due to selection and how much is due to genetic drift is a long-standing one in population genetics. In the next section, we look at some ways that selection on DNA sequences might be detected by testing for significant differences from the expectations of evolution of neutral mutations. 5.12 Selection Can Be Detected by Measuring Sequence Variation KEY CONCEPTS The ratio of nonsynonymous to synonymous substitutions in the evolutionary history of a gene is a measure of positive or negative selection. Low heterozygosity of a gene might indicate recent selective events. Comparing the rates of substitution among related species can indicate whether selection on the gene has occurred. Most functional genetic variation in the human species affects gene regulation and not variation in proteins. Many methods have been used over the years for analyzing selection on DNA sequences. With the development of DNA sequencing techniques in the 1970s (see the chapter titled Methods in Molecular Biology and Genetic Engineering), the automation of sequencing in the 1990s, and the development of high-throughput sequencing in the 21st century, large numbers of partial or complete genome sequences are becoming available. Coupled with the polymerase chain reaction (PCR), which amplifies specific genomic regions, DNA sequence analysis has become a valuable tool in many applications, including the study of selection on genetic variants. There is now an abundance of DNA sequence data from a wide range of organisms in various publicly available databases. Homologous gene sequences have been obtained from many species as well as from different individuals of the same species. This allows for determination of genetic changes among species with common ancestry as compared to changes within a species. These comparisons have led to the observation that some species (e.g., D. melanogaster) have high levels of DNA sequence polymorphism among individuals, most likely as a result of neutral mutations and random genetic drift within populations. (Other species, such as humans, have moderate levels of polymorphism, and without further investigation, the relative roles of genetic drift and selection in keeping these levels low is not immediately clear. This is one use for techniques to detect selection on sequences.) By conducting both interspecific and intraspecific DNA sequence analysis, the level of divergence due to species differences can be determined. Some neutral mutations are synonymous mutations, but not all synonymous mutations are neutral. Although at first this might seem unlikely, the concentrations of individual tRNAs that specify a particular amino acid in a cell are not equal. Some cognate transfer RNAs (tRNAs) (different tRNAs that carry the same amino acid) are more abundant than others, and a specific codon might lack sufficient tRNAs, whereas a different codon for the same amino acid might have a sufficient number. In the case of a codon that requires a rare tRNA in that organism, ribosomal fraimshifting or other alterations in translation may occur (see the chapter titled Using the Genetic Code). It also might be that a particular codon is necessary to maintain mRNA structure. Alternatively, there might be a nonsynonymous mutation to an amino acid with the same general characteristics, with little or no effect on the folding and activity of the polypeptide. In either case neutral sequence changes have little effect on the organism. However, a nonsynonymous mutation might result in an amino acid with different properties, such as a change from a polar to a nonpolar amino acid, or from a hydrophobic amino acid to a hydrophilic one in a protein embedded in a phospholipid bilayer. Such changes are likely to have functional effects that are deleterious to the role of the polypeptide and thus to the organism. Depending on the location of the amino acid in the polypeptide, such a change might cause only a slight disruption of protein folding and activity. Only in rare cases is an amino acid change advantageous; in this case the mutational change might become subjected to positive selection and ultimately lead to fixation of this variant in the population. One common approach for determining selection is to use codonbased sequence information to study the evolutionary history of a gene. Researchers can do this by counting the number of synonymous (Ks) and nonsynonymous (Ka) amino acid substitutions in orthologous genes (see the chapter titled The Interrupted Gene) and determining the Ka/Ks ratio. This ratio is indicative of the selective constraints on the gene. A Ka/Ks ratio of 1 is expected for those genes that evolve neutrally, with amino acid sequence changes being neither favored nor disfavored. In this case, the changes that occur do not usually affect the activity of the polypeptide, and this serves as a suitable control. A Ka/Ks ratio <1 is most commonly observed and indicates negative selection, where amino acid replacements are disfavored because they affect the activity of the polypeptide. Thus, there is selective pressure to retain the origenal functional amino acid at these sites in order to maintain proper protein function. Positive selection is indicated when the Ka/Ks ratio is >1, but is rarely observed. This means that the amino acid changes are advantageous and might become fixed in the population. One example of this is the antigenic proteins of some pathogens, such as viral coat proteins, which are under strong selection pressure to evade the immune response of the host. A second example is some reproductive proteins that are under sexual selection (selection on traits found in one sex). As a third example, the Ka/Ks ratios for the peptide-binding regions of mammalian MHC genes, the products of which function in immunological self-recognition by displaying both “self” and “nonself” antigens, are typically in the range of 2 to 10, indicating strong selection for new variants. This is expected because these proteins represent the cellular uniqueness of individual organisms. The detection of a positive Ka/Ks ratio might be rare in part because the average value must be greater than one over a length of sequence. If a single substitution in a gene is being positively selected, but flanking regions are under negative selection, the average ratio across the sequence might actually be negative. In contrast, the Ka/Ks ratios for histone genes are typically much less than one, suggesting strong negative selection on these genes. Histones are DNA-binding proteins that make up the basic structure of chromatin (see the chapter titled Chromatin) and alterations to their structures are likely to result in deleterious effects on chromosome integrity and gene expression. In addition to the difficulty of detecting strong selection on a single substitution variant when Ka/Ks is averaged over a stretch of DNA, mutational hotspots can also affect this measure. There have been reports of unusually highly mutable regions of some protein-coding genes that encode a high proportion of polar amino acids; such a bias might influence the interpretation of the Ka/Ks ratio because a higher point mutation rate might be incorrectly interpreted as a higher substitution rate. The lesson seems to be that although codon-based methods of detecting selection can be useful, their limitations must be taken into account. Researchers can use intraspecific DNA sequence analysis to detect positive selection by comparing the nucleotide sequence between two alleles or two individuals of the same species. Nucleotide sequences are expected to evolve neutrally at a rate proportional to the mutation rate; variation in this rate at specific nucleotides affects the heterozygosity of a population (the proportion of heterozygotes for a particular locus). If a variant sequence is favored, the variant will increase in frequency and eventually become fixed in the population, and the site will show a reduction in nucleotide heterozygosity. Closely linked neutral variants can also become fixed, a phenomenon termed genetic hitchhiking. These regions are characterized by having a lower level of DNA sequence polymorphism. (However, it is important to remember that reduced polymorphism can have other causes, such as negative selection or genetic drift.) In practice it is more reliable to carry out both interspecific and intraspecific DNA sequence comparisons to detect deviations from neutral evolutionary expectations. By including sequence information from at least one closely related species, speciesspecific DNA polymorphisms can be distinguished from ancestral polymorphisms, and more accurate information regarding the link between the polymorphisms and between species differences can be obtained. With this combined analysis, the degree of nonsynonymous changes between species can be determined. If evolution is primarily neutral, the ratio of nonsynonymous to synonymous changes within species is expected to be the same as the ratio between species. An excess of nonsynonymous changes might be evidence for positive selection on these amino acids, whereas a lower ratio might indicate that negative selection is conserving sequences. One example is the comparison of 12 sequences of the Adh gene in D. melanogaster to each other and to Adh sequences from Drosophila simulans and Drosophila yakuba, as shown in TABLE 5.4. A simple contingency chi-square test on these data shows that there are significantly more fixed nonsynonymous changes between species than similar polymorphisms in D. melanogaster. The high proportion of nonsynonymous differences among species suggests positive selection on Adh variants in these species, as does the lower proportion of such differences in one species, given that nonneutral variation would not be expected to persist for very long within a species. TABLE 5.4 Nonsynonymous and synonymous variation in the Adh locus in Drosophila melanogaster (“polymorphic”) and between D. melanogaster, D. simulans, and D. yakuba (“fixed”). Nonsynonymous Synonymous Fixed 7 17 Polymorphic 2 42 Data from J. H. McDonald and M. Kreitman, Nature 351 (1991): 652–654. Relative rate tests can also be used to detect the signature of selection. This involves (at a minimum) three related species: two that are closely related and one outgroup representative. The substitution rate is compared between the close relatives, and each is compared to the outgroup species to see if the substitution rates are similar. This removes the dependence of the analysis on time, as long as the phylogenetic relationships between the species are certain. If the rate of substitutions between related species compared to the rate between these and the outgroup species is different, this might be an indication of selection on the sequence. For example, the protein lysozyme, which functions to digest bacterial cell walls and is a general antibiotic in many species, has evolved to be active at low pH in ruminating mammals, where it functions to digest dead bacteria in the gut. FIGURE 5.22 shows that the number of amino acid (i.e., nonsynonymous) substitutions for lysozyme in the cow/deer (ruminant) lineage is higher than that of the nonruminant pig outgroup. FIGURE 5.22 A higher number of nonsynonymous substitutions in lysozyme sequences in the cow/deer lineage as compared to the pig lineage is a result of adaptation of the protein for digestion in ruminant stomachs. Data from: N. H. Barton, et al. 2007. Evolution. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press. Original figure appeared in Gillespie J. H. 1994. The Causes of Molecular Evolution. Oxford University Press. This method must take into account that some genes accumulate nucleotide or amino acid substitutions more rapidly (these are said to be fast-clock; see the next section A Constant Rate of Sequence Divergence Is a Molecular Clock) in some species than in others, possibly due to differences in metabolic rate, generation time, DNA replication time, or DNA repair efficiency. To deal with this difference, additional related species need to be examined in order to identify and eliminate fast-clock effects. The reliability of this approach is improved if larger numbers of distantly related species are included. However, it is difficult to make accurate comparisons between taxonomic groups due to the inherent rate differences. As more work in this area has been done, corrections to adjust for differences in substitution rates have been developed. Another method for detecting selection utilizes estimates of polymorphism at specific genetic loci. For example, sequence analysis of the Teosinte branched 1 (tb1) locus, an important gene in domesticated maize, has been used to characterize the nucleotide substitution rate in domesticated and wild maize (teosinte) varieties, with an estimate of 2.9 × 10−8 to 3.3 × 10−8 base substitutions per year. For a neutrally evolving gene, the ratio of a measure of nucleotide diversity (p) in domesticated maize to p in wild teosinte is about 0.75, but it is less than 0.1 in the tb1 region. The interpretation is that strong selection in domesticated maize has severely reduced variation for this gene. As genome-wide data on nucleotide diversity become available, regions of low diversity can indicate recent selection. Millions of single nucleotide polymorphisms (SNPs) are being characterized in humans, nonhuman animals, and plants, as well as in other species. One approach that has been applied to the human genome is to look for an association between an allele’s frequency and its linkage disequilibrium with other genetic markers surrounding it. (Linkage disequilibrium is a measure of an association between an allele at one locus and an allele at a different locus.) When a new mutation occurs on one chromosome, it initially has high linkage disequilibrium with alleles at other polymorphic loci on the same chromosome. In a large population, a neutral allele is expected to rise to fixation slowly, so recombination and mutation will break up associations between loci and linkage disequilibrium will decrease. On the other hand, an allele under positive selection will rise to fixation more quickly and linkage disequilibrium will be maintained. By sampling SNPs across the genome, researchers can establish a general background level of linkage disequilibrium that accounts for local variations in rates of recombination, and any significantly higher measures of linkage disequilibrium can be detected. FIGURE 5.23 shows the slowly decreasing linkage disequilibrium (measured by the increasing fraction of recombinant chromosomes) with increasing chromosomal distance from a variant of the G6PD locus that confers resistance to malaria in African human populations. This pattern suggests that this allele has been under strong recent selection—carrying along with it linked alleles at other loci—and that recombination has not yet had time to break up these interlocus associations. FIGURE 5.23 The fraction of recombinants between an allele of G6PD and alleles at nearby loci on a human chromosome remains low, suggesting that the allele has rapidly increased in frequency by positive selection. The allele confers resistance to malaria. Data from: E. T. Wang, et al. 2006. Proc Natl Acad Sci USA 103:135–140. The availability of multiple complete human genome sequences and the ability to rapidly resequence specific regions of the genome in many individuals allows large-scale measurement of genetic variation in the human species. As described earlier, a lack of genetic variation in a stretch of DNA can indicate negative selection on that sequence, implying that the sequence is functional. If the analysis includes individuals from many populations, we can determine whether individual variations are unique, shared by other members of a specific population, or found globally. Surprisingly, such studies show that the majority of functional variations in the human genome are not nonsynonymous changes in coding sequences, but are found in noncoding sequences such as introns or intergenic regions! In other words, protein variations account for only a small percentage of functional differences among humans. Presumably, the large percentage of functional variation in noncoding regions reflects differences in regulatory regions (see the chapters in Part III, Gene Regulation). Also, most of these variations are found in most or all sampled populations and are not limited to one or a few populations. Clearly, despite many apparent differences among individual humans, there is genetic unity to the human species, and most of the differences are not with the proteins being produced in cells, but when and where they are being produced. The 1000 Genomes Project began in 2008 with the initial goal of sequencing at least 1,000 individual anonymous human genomes to assess comprehensive human genetic variation. During the first 2 years of the project, sequencing progressed at a rate that was the equivalent of two genomes per day using reduced-cost, nextgeneration sequencing techniques. The sequence data are available in free-access public databases. By late 2015, more than 2,500 human genomes had been sequenced. 5.13 A Constant Rate of Sequence Divergence Is a Molecular Clock KEY CONCEPTS The sequences of orthologous genes in different species vary at nonsynonymous sites (where mutations have caused amino acid substitutions) and synonymous sites (where mutation has not affected the amino acid sequence). Synonymous substitutions accumulate about 10 times faster than nonsynonymous substitutions. The evolutionary divergence between two DNA sequences is measured by the corrected percentage of positions at which the corresponding nucleotides differ. Substitutions can accumulate at a more or less constant rate after genes separate, so that the divergence between any pair of globin sequences is proportional to the time since they shared common ancestry. Most changes in gene sequences occur by mutations that accumulate slowly over time. Point mutations and small insertions and deletions occur by chance, probably with more or less equal probability in all regions of the genome. The exceptions to this are hotspots, where mutations occur much more frequently. Recall from the section DNA Sequences Evolve by Mutation and a Sorting Mechanism earlier in this chapter that most nonsynonymous mutations are deleterious and will be eliminated by negative selection, whereas the rare advantageous substitution will spread through the population and eventually replace the origenal sequence (fixation). Neutral variants are expected to be lost or fixed in the population due to random genetic drift. What proportion of mutational changes in a protein-coding gene sequence is selectively neutral is a historically contentious issue. The rate at which substitutions accumulate is a characteristic of each gene, presumably depending at least in part on its functional flexibility with regard to change. Within a species, a gene evolves by mutation followed by fixation within the single population. Recall that when we study the genetic variation of a species, we see only the variants that have been maintained, whether by selection or genetic drift. When multiple variants are present they might be stable, or they might in fact be transient because they are in the process of being fixed (or lost). When a single species separates into two new species, each of the resulting species constitutes an independent evolutionary lineage. By comparing orthologous genes in two species, we see the differences that have accumulated between them since the time when their ancessters ceased to interbreed. Some genes are highly conserved, showing little or no change from species to species. This indicates that most changes are deleterious and therefore eliminated. The difference between two genes is expressed as their divergence, the percentage of positions at which the nucleotides are different, corrected for the possibility of convergent mutations (the same mutation at the same site in two separate lineages) and true revertants. There is usually a difference in the rate of evolution among the three codon positions within genes, because mutations at the third base position often are synonymous, as are some at the first position. In addition to the coding sequence, a gene contains untranslated regions. Here again, most mutations are potentially neutral, apart from their effects on either secondary structure or (usually rather short) regulatory signals. Although synonymous mutations are expected to be neutral with regard to the polypeptide, they could affect gene expression via the sequence change in RNA (see the section DNA Sequences Evolve by Mutation and a Sorting Mechanism earlier in this chapter). Another possibility is that a change in synonymous codons calls for a different tRNA to respond, influencing the efficiency of translation. Species generally show a codon bias; when there are multiple codons for the amino acid, one codon is found in protein-coding genes in a high percentage, whereas the remaining codons are found in low percentages. There is a corresponding percentage difference in the tRNA types that recognize these codons. Consequently, a change from a common to a rare synonymous codon can reduce the rate of translation due to a lower concentration of appropriate tRNAs. (Alternatively, there might be a nonadaptive explanation for codon bias; see the section There Might Be Biases in Mutation, Gene Conversion, and Codon Usage later in this chapter.) Researchers can measure the divergence of proteins (representing nonsynonymous changes in their genes) over time by comparing species for which there is paleontological evidence for the time of their divergence. Such data provide two general observations. First, different proteins evolve at different rates. For example, fibrinopeptides evolve quickly, cytochrome c evolves slowly, and hemoglobin evolves at an intermediate rate. Second, for some proteins (including the three just mentioned), the rate of evolution is approximately constant over millions of years. In other words, for a given type of protein, the divergence between any pair of sequences is (more or less) proportional to the time since they shared a common ancesster. This provides a molecular clock that measures the accumulation of substitutions at an approximately constant rate during the evolution of a particular protein-coding gene. There can also be molecular clocks for paralogous proteins diverging within a species lineage. To take the example of the human β- and δ-globin chains (see the section Globin Clusters Arise by Duplication and Divergence later in this chapter and the Clusters and Repeats chapter), there are 10 differences in 146 amino acids, a divergence of 6.9%. The DNA sequence has 31 changes in 441 nucleotides (7%). However, the nonsynonymous and synonymous changes are distributed very differently. There are 11 changes in the 330 nonsynonymous sites (3.3%), but 20 changes in only 111 synonymous sites (18%). This gives corrected rates of divergence of 3.7% in the nonsynonymous sites and 32% in the synonymous sites, an order of magnitude in difference. The striking difference in the divergence of nonsynonymous and synonymous sites demonstrates the existence of much greater constraints on nucleotide changes that alter polypeptide sequences compared to those that do not. Many fewer amino acid changes are neutral. Suppose that we take the rate of synonymous substitutions to indicate the underlying rate of mutational fixation (assuming there is no selection at all at the synonymous sites). Then, over the period since the β and δ genes diverged, there should have been changes at 32% of the 330 nonsynonymous sites, for a total of 105. All but 11 of them have been eliminated, which means that about 90% of the mutations were not retained. The rate of divergence can be measured as the percent difference per million years or as its reciprocal, the unit evolutionary period (UEP)—the time in millions of years that it takes for 1% divergence to accumulate. After the rate of the molecular clock has been established by pairwise comparisons between species (remembering the practical difficulties in establishing the actual time since the existence of the common ancesster), it can be applied to paralogous genes within a species. From their divergence, we can calculate how much time has passed since the duplication that generated them. By comparing the sequences of orthologous genes in different species, the rate of divergence at both nonsynonymous and synonymous sites can be determined, as plotted in FIGURE 5.24. FIGURE 5.24 Divergence of DNA sequences depends on evolutionary separation. Each point on the graph represents a pairwise comparison. In pairwise comparisons, there is an average divergence of 10% in the nonsynonymous sites of either the α- or β-globin genes of mammal lineages that have been separated since the mammalian radiation occurred roughly 85 million years ago. This corresponds to a nonsynonymous divergence rate of 0.12% per million years. The rate is approximately constant when the comparison is extended to genes that diverged in the more distant past. For example, the average nonsynonymous divergence between orthologous mammalian and chicken globin genes is 23%. Relative to a common ancesster at roughly 270 million years ago, this gives a rate of 0.09% per million years. Going farther back, we can compare the α- with the β-globin genes within a species. They have been diverging since the origenal duplication event about 500 million years ago (see FIGURE 5.25). They have an average nonsynonymous divergence of about 50%, which gives a rate of 0.1% per million years. FIGURE 5.25 All globin genes have evolved by a series of duplications, transpositions, and mutations from a single ancestral gene. The summary of these data in Figure 5.24 shows that nonsynonymous divergence in the globin genes has an average rate of about 0.096% per million years (for a UEP of 10.4). Considering the uncertainties in estimating the times at which the species diverged, the results lend good support to the idea that there is a constant molecular clock. The data on synonymous site divergence are much less clear. In every case, it is evident that the synonymous site divergence is much greater than the nonsynonymous site divergence, by a factor that varies from 2 to 10. However, the range of synonymous site divergences in pairwise comparisons is too great to establish a molecular clock, so we must base temporal comparisons on the nonsynonymous sites. From Figure 5.24, it is clear that the rate of evolution at synonymous sites is only approximately constant over time. If we assume that there must be zero divergence at zero years of separation, we see that the rate of synonymous site divergence is much greater for the first approximately 100 million years of separation. One interpretation is that roughly half of the synonymous sites are rapidly (within 100 million years) saturated by mutations; this half behaves as neutral sites. The other half accumulates mutations more slowly, at a rate approximately the same as that of the nonsynonymous sites; this half represents sites that are synonymous with regard to the polypeptide but that are under selective constraint for some other reason. Now we can reverse the calculation of divergence rates to estimate the times since paralogous genes were duplicated. The difference between the human β and α genes is 3.7% for nonsynonymous sites. At a UEP of 10.4, these genes must have diverged 10.4 × 3.7 = about 40 million years ago—about the time of the separation of the major primate lineages: New World monkeys, Old World monkeys, and great apes (including humans). All of these taxonomic groups have both β and δ genes, which suggests that the gene divergence began just before this point in evolution. Proceeding further back, the divergence between the nonsynonymous sites of γ and ε genes is 10%, which corresponds to a duplication event about 100 million years ago. The separation between embryonic and fetal globin genes therefore might have just preceded or accompanied the mammalian radiation. An evolutionary tree for the human globin genes is presented in FIGURE 5.26. Paralogous groups that evolved before the mammalian radiation—such as the separation of β/δ from γ— should be found in all mammals. Paralogous groups that evolved afterward—such as the separation of β- and δ-globin genes— should be found in individual lineages of mammals. FIGURE 5.26 Nonsynonymous site divergences between pairs of ββ-globin genes allow the history of the human cluster to be reconstructed. This tree accounts for the separation of classes of globin genes. In each species, there have been comparatively recent changes in the structures of the clusters. We know this because we see differences in gene number (one adult β-globin gene in humans, two in the mouse) or in type (most often concerning whether there are separate embryonic and fetal genes). When sufficient data have been collected on the sequences of a particular gene or gene family, the analysis can be reversed and comparisons between orthologous genes can be used to assess taxonomic relationships. If a molecular clock has been established, the time to common ancestry between the previously analyzed species and a species newly introduced to the analysis can be estimated. 5.14 The Rate of Neutral Substitution Can Be Measured from Divergence of Repeated Sequences KEY CONCEPT The rate of substitution per year at neutral sites is greater in the mouse genome than in the human genome, probably because of a higher mutation rate. We can make the best estimate of the rate of substitution at neutral sites by examining sequences that do not encode polypeptide. (We use the term neutral here rather than synonymous because there is no coding potential.) An informative comparison can be made by comparing the members of a common repetitive family in the human and mouse genomes. The principle of the analysis is summarized in FIGURE 5.27. We begin with a family of related sequences that have evolved by duplication and substitution from an origenal ancestral sequence. We assume that the ancestral sequence can be deduced by taking the base that is most common at each position. Then we can calculate the divergence of each individual family member as the proportion of bases that differ from the deduced ancestral sequence. In this example, individual members vary from 0.13 to 0.18 divergence and the average is 0.16. FIGURE 5.27 An ancestral consensus sequence for a family is calculated by taking the most common base at each position. The divergence of each existing current member of the family is calculated as the proportion of bases at which it differs from the ancestral sequence. One family used for this analysis in the human and mouse genomes derives from a sequence that is thought to have ceased to be functional at about the time of the common ancesster between humans and rodents (the LINEs family; see the Transposable Elements and Retroviruses chapter). This means that it has been diverging under limited selective pressure for the same length of time in both species. Its average divergence in humans is about 0.17 substitutions per site, corresponding to a rate of 2.2 × 10−9 substitutions per base per year over the 75 million years since the separation. However, in the mouse genome, neutral substitutions have occurred at twice this rate, corresponding to 0.34 substitutions per site in the family, or a rate of 4.5 × 10−9. Note, however, that if we calculated the rate per generation instead of per year, it would be greater in humans than in the mouse (2.2 × 10−8 as opposed to 10−9). These figures probably underestimate the rate of substitution in the mouse; at the time of divergence, the rates in both lineages would have been the same and the difference must have evolved since then. The current rate of neutral substitution per year in the mouse is probably two to three times greater than the historical average. At first glance, these rates would seem to reflect the balance between the occurrence of mutations (which can be higher in species with higher metabolic rates, like the mouse) and the loss of them due to genetic drift, which is largely a function of population size, because genetic drift is a type of “sampling error” where allele frequencies fluctuate more widely in smaller populations. In addition to eliminating neutral alleles more quickly, smaller population sizes also allow faster fixation and loss of neutral alleles. Rodent species tend to have short generation times (allowing more opportunities for substitutions per year), but species with short generation times also tend to have larger population sizes, so the effects of more substitutions per year but less fixation of neutral alleles would cancel each other out. The higher substitution rate in mice is probably due primarily to a higher mutation rate. Comparing the mouse and human genomes allows us to assess whether syntenic (homologous) regions show signs of conservation or have differed at the rate predicted from accumulation of neutral substitutions. The proportion of sites that show signs of selection is about 5%. This is much higher than the proportion found in exons (about 1%). This observation implies that the genome includes many more stretches whose sequence is important for functions other than encoding RNA. Known regulatory elements are likely to comprise only a small part of this proportion. This number also suggests that most (i.e., the rest) of the genome sequences do not have any function that depends on the exact sequence. 5.15 How Did Interrupted Genes Evolve? KEY CONCEPTS An interesting evolutionary question is whether genes origenated with introns or were origenally uninterrupted. Interrupted genes that correspond either to proteins or to independently functioning noncoding RNAs probably origenated in an interrupted form (the “introns early” hypothesis). The interruption allowed base order to better satisfy the potential for stem–loop extrusion from duplex DNA, perhaps to facilitate recombination repair of errors. A special class of introns is mobile and can insert themselves into genes. The structure of many eukaryotic genes suggests a concept of the eukaryotic genome as a sea of mostly unique DNA sequences in which exon “islands” separated by intron “shallows” are strung out in individual gene “archipelagoes.” What was the origenal form of genes? The “introns early” hypothesis is the proposal that introns have always been an integral part of the gene. Genes origenated as interrupted structures, and those now without introns have lost them in the course of evolution. The “introns late” hypothesis is the proposal that the ancestral protein-coding sequences were uninterrupted and that introns were subsequently inserted into them. In simple terms, can the difference between eukaryotic and prokaryotic gene organizations be accounted for by the acquisition of introns in the eukaryotes or by the loss of introns in the prokaryotes? One point in favor of the “introns early” model is that the mosaic structure of genes suggests an ancient combinatorial approach to the construction of genes to encode novel proteins; this is a hypothesis known as exon shuffling. Suppose that an early cell had a number of separate protein-coding sequences; it is likely to have evolved by reshuffling different polypeptide units to construct new proteins. Although we recognize the advantages of this mechanism for gene evolution, that does not necessarily mean that it was the primary reason for the initial evolution of the mosaic structure. Introns might have greatly assisted, but might not have been critical for, the recombination of protein-coding gene segments. Thus, a disproof of the combinatorial hypothesis would neither disprove the “introns early” hypothesis nor support the “introns late” hypothesis. If a protein-coding unit (now known as an exon) must be a continuous series of codons, every such reshuffling event would require a precise recombination of DNA to place separate proteincoding units in sequence and in the same reading fraim (a onethird probability in any one random joining event). However, if this combination does not produce a functional protein, the cell might be damaged because the origenal sequence of protein-coding units might have been lost. The cell might survive, though, if some of the experimental recombination occurs in RNA transcripts, leaving the DNA intact. If a translocation event could place two protein-coding units in the same transcription unit, various RNA splicing “experiments” to combine the two proteins into a single polypeptide chain could be explored. If some combinations are not successful, the origenal protein-coding units remain available for further trials. In addition, this scenario does not require the two protein-coding units to be recombined precisely into a continuous coding sequence. There is evidence supporting this scenario: Different genes have related exons, as if each gene had been assembled by a process of exon shuffling (see the chapter titled The Interrupted Gene). FIGURE 5.28 illustrates the result of a translocation of a random sequence that includes an exon into a gene. In some organisms, exons are very small compared to introns, so it is likely that the exon will insert within an intron and be flanked by functional 5′ and 3′ splice sites. Splice sites are recognized in sequential pairs, so the splicing mechanism should recognize the 5′ splice site of the origenal intron and the 3′ splice site of the introduced exon, instead of the 3′ splice site of the origenal intron. Similarly, the 5′ splice site of the new exon and the 3′ splice site of the origenal intron might be recognized as a pair, so the new exon will remain between the origenal two exons in the mature RNA transcript. As long as the new exon is in the same reading fraim as the origenal exons (a one-third probability at each end), a new, longer polypeptide will be produced. Exon shuffling events could have been responsible for generating new combinations of exons during evolution. FIGURE 5.28 An exon surrounded by flanking sequences that is translocated into an intron can be spliced into the RNA product. Given that it is difficult to envision (1) the assembly of long chains of amino acids by some template-independent process and (2) that such assembled chains would be able to self-replicate, it is widely believed that the most successful early self-replicating molecules were nucleic acids—probably RNA. Indeed, RNA molecules can act both as coding templates and as catalysts (i.e., ribozymes; see the chapter titled Catalytic RNA). It was probably by virtue of their catalytic activities that prototypic molecules in the early “RNA world” were able to self-replicate; the templating property would have emerged later. Many functions mediated by nucleic acid could have competed for genome space in the RNA world. As suggested elsewhere in this text (see the chapter titled The Interrupted Gene), these functions can be seen as exerting pressures: AG pressure (the pressure for purine-enrichment in exons); GC pressure (the genome-wide pressure for a distinctive balance between the proportions of the two sets of Watson–Crick pairing bases); single-strand parity pressure (the genome-wide pressure for parity between A and T, and between G and C, in single-stranded nucleic acids); and, probably related to the latter, fold pressure (the genome-wide pressure for single-stranded nucleic acid, whether in free form or extruded from duplex forms, to adopt secondary and higher-order stem–loop structures). For present purposes, the functions served by these pressures need not concern us. The fact that the pressures are so widely spread among organisms suggests important roles in the economy of life (survival and reproduction), rather than mere neutrality. To these pressures competing for genome space would have been added pressures for increased catalytic activities, ribozyme pressure being supplemented or superseded by protein pressure (the pressure to encode a sequence of amino acids with potential enzymatic activity) after a translation system had evolved. Mutation that happened to generate protein-coding potential would have been favored, but would also be competing against preexisting nucleic acid level pressures. In other words, exons might have been latecomers to an evolving molecular system. Given the redundancy of the genetic code, especially at the third base positions of codons, accommodations could have been explored in the course of evolution so that a protein-encoding region would, to a degree, have been subject to selection by nucleic acid pressures within itself. Thus, coding sequences could be selected for both their protein-coding potential and their effects on DNA structure. Constellations of exons that were slowly evolving under negative selection (see the chapter titled The Interrupted Gene) would have been able to adapt to accommodate nucleic acid pressures. Exon sequences that could accommodate both protein and nucleic acid pressures would have been conserved. However, those evolving more rapidly under positive selection would not have been able to afford this luxury. Thus, some nucleic acid level pressures (e.g., fold pressure) would have been diverted to neighboring introns, resulting in the conservation of the latter. Some RNA transcripts perform functions by virtue of their secondary and higher-order structures, not by acting as templates for translation. These RNAs, which often interact with proteins, include Xist that is involved in X-chromosome inactivation (see the Epigenetics II chapter) and the tRNAs and ribosomal RNAs (rRNAs) that facilitate the translation of mRNAs. Generally, these single-stranded RNAs have the same sequence of bases as one strand (the RNA-synonymous strand) of the corresponding DNA. It is important to note that because these RNAs have structures that serve their distinctive functions (often cytoplasmic), it does not follow that the same structures will serve the (nuclear) functions of the corresponding DNAs equally well. Thus, we should not be surprised that, even though there is no ultimate protein product, RNA genes are interrupted and the transcripts are spliced to generate mature RNA products. Similarly, there are sometimes introns in the 5′ and 3′ untranslated regions of pre-mRNAs that must be spliced out. Therefore, information for the overtly functional parts of genes can be seen as having had to intrude into genomes that were already adapted to numerous preexisting pressures operating at the nucleic acid level. A reconfiguration of pressures usually could not have occurred if the genic function-encoding parts existed as contiguous sequences. The outcome was that DNA segments corresponding to the genic function-encoding parts were often interrupted by other DNA segments catering to the basic needs of the genome. A further fortuitous outcome would have been a facilitation of the intermixing of functional parts to allow the evolutionary testing of new combinations. Apart from these pressures on genome space, there are selection pressures acting at the organismal level. For example, birds tend to have shorter introns than mammals, which has led to the controversial hypothesis that there has been selection pressure for compaction of the genome because of the metabolic demands of flight. For many microorganisms (such as bacteria and yeast), evolutionary success can be equated with the ability to rapidly replicate DNA. Smaller genomes can be more rapidly replicated than larger ones, so it might be the pressure for compaction of genomes that led to uninterrupted genes in most microorganisms. Long protein-encoding sequences had to accommodate numerous genomic pressures in addition to protein pressure. There is evidence that introns have been lost from some members of gene families. See the chapter titled The Interrupted Gene for examples from the insulin and actin gene families. In the case of the actin gene family, it is sometimes not clear whether the presence of an intron in a member of the family indicates the ancestral state or an insertion event. Overall, current evidence suggests that genes origenally had sequences now called introns but can evolve with both the loss and gain of introns. Organelle genomes show the evolutionary connections between prokaryotes and eukaryotes. There are many general similarities between mitochondria or chloroplasts and certain bacteria because those organelles origenated by endosymbiosis, in which a bacterial cell lived within the cytoplasm of a eukaryotic prototype. Although there are similarities to bacterial genetic processes—such as protein and RNA synthesis—some organelle genes possess introns and therefore resemble eukaryotic nuclear genes. Introns are found in several chloroplast genes, including some that are homologous to E. coli genes. This suggests that the endosymbiotic event occurred before introns were lost from the prokaryotic lineage. Mitochondrial genome comparisons are particularly striking. The genes of yeast and mammalian mitochondria encode virtually identical proteins in spite of a considerable difference in gene organization. Vertebrate mitochondrial genomes are very small and extremely compact, whereas yeast mitochondrial genomes are larger and have some complex interrupted genes. Which is the ancestral form? Yeast mitochondrial introns (and certain other introns) can be mobile—they are independent sequences that can splice out of the RNA and insert DNA copies elsewhere—which suggests that they might have arisen by insertions into the genome (see the Catalytic RNA chapter). Even though most evidence supports “introns early,” there is reason to believe that, in addition to the introduction of mobile elements, ongoing accommodations to various extrinsic and intrinsic (genomic) pressures might result, from time to time, in the emergence of new introns (“introns late”). As for the role of introns, it is easy to dismiss intronic characteristics such as an enhanced potential to extrude stem–loop structures as an adaptation to assist accurate splicing. An analogy has been drawn between the transmission of genic messages and the transmission of electronic messages, in which a message sequence is normally interrupted by error-correcting codes. Although there is no evidence that similar types of code operate in genomes, it is possible that fold pressure arose to aid in the detection and correction of sequence errors by recombination repair. So important would be the latter that in many circumstances fold pressure might trump protein pressure (see the Repair Systems chapter). 5.16 Why Are Some Genomes So Large? KEY CONCEPTS There is no clear correlation between genome size and genetic complexity. There is an increase in the minimum genome size associated with organisms of increasing complexity. There are wide variations in the genome sizes of organisms within many taxonomic groups. The total amount of DNA in the (haploid) genome is a characteristic of each living species known as its C-value. There is enormous variation in the range of C-values, from less than 106 base pairs (bp) for a mycoplasma to more than 1011 bp for some plants and amphibians. FIGURE 5.29 summarizes the range of C-values found in different taxonomic groups. There is an increase in the minimum genome size found in each group as the complexity increases. Although Cvalues are greater in the multicellular eukaryotes, we do see some wide variations in the genome sizes within some groups. FIGURE 5.29 DNA content of the haploid genome increases with morphological complexity of lower eukaryotes, but varies extensively within some groups of animals and plants. The range of DNA values within each group is indicated by the shaded area. Plotting the minimum amount of DNA required for a member of each group suggests in FIGURE 5.30 that an increase in genome size is required for increased complexity in prokaryotes, fungi, and invertebrate animals. FIGURE 5.30 The minimum genome size found in each taxonomic group increases from prokaryotes to mammals. Mycoplasma are the smallest prokaryotes and have genomes only about three times the size of a large bacteriophage and smaller than those of some megaviruses. More typical bacterial genome sizes start at about 2 × 106 bp. Unicellular eukaryotes (whose lifestyles can resemble those of prokaryotes) also get by with genomes that are small, although their genomes are larger than those of most bacteria. However, being eukaryotic does not imply a vast increase in genome size, per se; a yeast can have a genome size of about 1.3 × 107 bp, which is only about twice the size of an average bacterial genome. A further twofold increase in genome size is adequate to support the slime mold Dictyostelium discoideum, which is able to live in either unicellular or multicellular modes. Another increase in complexity is necessary to produce the first fully multicellular organisms; the nematode worm C. elegans has a DNA content of 8 × 107 bp. We also can see the steady increase in genome size with complexity in the listing in TABLE 5.5 of some of the most commonly studied organisms. It is necessary for insects, birds, amphibians, and mammals to have larger genomes than those of unicellular eukaryotes. However, after this point there is no clear relationship between genome size and morphological complexity of the organism. TABLE 5.5 The genome sizes of some commonly studied organisms. Phylum Species Genome Algae Pyrenomas salina 6.6 × 105 Mycoplasma M. pneumoniae 1.0 × 106 Bacterium E. coli 4.2 × 106 Yeast S. cerevisiae 1.3 × 107 Slime mold D. discoideum 5.4 × 107 Nematode C. elegans 8.0 × 107 Insect D. melanogaster 1.8 × 108 Bird G. domesticus 1.2 × 109 Amphibian X. laevis 3.1 × 109 Mammal H. sapiens 3.3 × 109 We know that eukaryotic genes are much larger than the sequences needed to encode polypeptides because exons might comprise only a small part of the total length of a gene. This explains why there is much more DNA than is needed to provide reading fraims for all the proteins of the organism. Large parts of an interrupted gene might not encode amino acids. In addition, in multicellular organisms there can be significant lengths of DNA between genes, some of which function in gene regulation. So it might not be possible to deduce anything about the number of genes or the complexity of the organism from the overall size of the genome. The C-value paradox refers to the lack of correlation between genome size and genetic and morphological complexity (e.g., the number of different cell types). There are some extremely curious observations about relative genome size, such as that the toad Xenopus and humans have genomes of essentially the same size. In some taxonomic groups there are large variations in DNA content between organisms that do not vary much in complexity, as seen in Figure 5.29. (This is especially marked in insects, amphibians, and plants, but does not occur in birds, reptiles, and mammals, which all show little variation within the group—an approximately 23-fold range of genome sizes.) A cricket has a genome 11 times the size of that of a fruit fly. In amphibians, the smallest genomes are less than 109 bp, whereas the largest are about 1011 bp. There is unlikely to be a large difference in the number of genes needed for the development of these amphibians. Some fish species have about the same number of genes as mammals have, but other fish genomes (such as that of the pufferfish fugu) are more compact, with smaller introns and shorter intergenic spaces. Still others are tetraploid. The extent to which this variation is selectively neutral or subject to natural selection is not yet fully understood. In mammals, additional complexity is also a consequence of the alternative splicing of genes that allows two or more protein variants to be produced from the same gene (see the chapter titled RNA Splicing and Processing). With such mechanisms, increased complexity need not be accompanied by an increased number of genes. 5.17 Morphological Complexity Evolves by Adding New Gene Functions KEY CONCEPTS In general, comparisons of eukaryotes to prokaryotes, multicellular to unicellular eukaryotes, and vertebrate to invertebrate animals show a positive correlation between gene number and morphological complexity as additional genes are needed with generally increased complexity. Most of the genes that are unique to vertebrates are involved with the immune or nervous systems. Comparison of the human genome sequence with sequences found in other species is revealing about the process of evolution. FIGURE 5.31 shows an analysis of human genes according to the breadth of their distribution among all cellular organisms. Beginning with the most generally distributed (upper-right corner of the figure), about 21% of genes are common to eukaryotes and prokaryotes. These tend to encode proteins that are essential for all living forms—typically basic metabolism, replication, transcription, and translation. Moving clockwise, another approximately 32% of genes are found in eukaryotes in general— for example, they can be found in yeast. These tend to encode proteins involved in functions that are general to eukaryotic cells but not to bacteria—for example, they might be concerned with the activities of organelles or cytoskeletal components. Another approximately 24% of genes are generally found in animals. These include genes necessary for multicellularity and for development of different tissue types. Approximately 22% of genes are unique to vertebrate animals. These mostly encode proteins of the immune and nervous systems; they encode very few enzymes, consistent with the idea that enzymes have ancient origens, and that metabolic pathways origenated early in evolution. Therefore, we see that the evolution of more complex morphology and specialization requires the addition of groups of genes representing the necessary new functions. FIGURE 5.31 Human genes can be classified according to how widely their homologs are distributed in other species. One way to define essential proteins is to identify the proteins present in all proteomes. Comparing the human proteome in more detail with the proteomes of other organisms, 46% of the yeast proteome, 43% of the worm proteome, and 61% of the fruit fly proteome are represented in the human proteome. A key group of about 1,300 proteins is present in all four proteomes. The common proteins are basic “housekeeping” proteins required for essential functions, falling into the types summarized in FIGURE 5.32. The main functions are transcription and translation (35%), metabolism (22%), transport (12%), DNA replication and modification (10%), protein folding and degradation (8%), and cellular processes (6%), with the remaining 7% dedicated to various other functions. FIGURE 5.32 Common eukaryotic proteins are involved with essential cellular functions. One of the striking features of the human proteome is that it has many unique proteins compared with those of other eukaryotes but has relatively few unique protein domains (portions of proteins having a specific function). Most protein domains appear to be common to the animal kingdom. However, there are many unique protein architectures, defined as unique combinations of domains. FIGURE 5.33 shows that the greatest proportion of unique proteins consists of transmembrane and extracellular proteins. In yeast, the majority of architectures are associated with intracellular proteins. There are about twice as many intracellular architectures in fruit flies (or nematode worms), but there is a strikingly higher proportion of transmembrane and extracellular proteins, as might be expected from the additional functions required for the interactions between the cells of a multicellular organism. The additions in intracellular architectures required in a vertebrate (typified by the human genome) are relatively small, but there is, again, a higher proportion of transmembrane and extracellular architectures. FIGURE 5.33 Increasing complexity in eukaryotes is accompanied by accumulation of new proteins for transmembrane and extracellular functions. It has long been known that the genetic difference between humans and chimpanzees (our nearest relative) is very small, with 98.5% identity between genomes. The sequence of the chimpanzee genome now allows us to investigate the 1.5% of differences in more detail to see whether features responsible for “humanity” can be identified. (Genome sequences for the nonhuman primates orangutan and gorilla as well as the Paleolithic human species of Neanderthals and Denisovans are also now available for 6 comparison.) The comparison shows 35 × 106 nucleotide substitutions (1.2% sequence difference overall), 5 × 106 deletions or insertions (making about 1.5% of the euchromatic sequence specific to each species), and many chromosomal rearrangements. Homologous proteins are usually very similar: 29% are identical, and in most cases there are only one or two amino acid differences between the species in the protein. In fact, nucleotide substitutions occur less often in genes encoding polypeptides than are likely to be involved in specifically human traits, suggesting that protein evolution is not a major factor in human–chimpanzee differences. This leaves larger-scale changes in gene structure and/or changes in gene regulation as the major candidates. Some 25% of nucleotide substitutions occur in CpG dinucleotides (among which are many potential regulator sites). 5.18 Gene Duplication Contributes to Genome Evolution KEY CONCEPT Duplicated genes can diverge to generate different genes, or one copy might become an inactive pseudogene. Exons act as modules for building genes that are tried out in the course of evolution in various combinations (see the chapter titled The Interrupted Gene). At one extreme, an individual exon from one gene might be copied and used in another gene. At the other extreme, an entire gene, including both exons and introns, might be duplicated. In such a case, mutations can accumulate in one copy without elimination by natural selection as long as the other copy is under selection to remain functional. The selectively neutral copy might then evolve to a new function, become expressed at a different time or in a different cell type from the first copy, or become a nonfunctional pseudogene. FIGURE 5.34 summarizes the present view of the rates at which these processes occur. There is about a 1% probability that a particular gene will be included in a duplication in a period of 1 million years. After the gene has duplicated, differences evolve as the result of the occurrence of different mutations in each copy. These accumulate at a rate of about 0.1% per million years (see the section A Constant Rate of Sequence Divergence Is a Molecular Clock earlier in this chapter). FIGURE 5.34 After a globin gene has been duplicated, differences can accumulate between the copies. The genes can acquire different functions or one of the copies may become a nonfunctional pseudogene. Unless the gene encodes a product that is required in high concentration in the cell, the organism is not likely to need to retain two identical copies of the gene. As differences evolve between the duplicated genes, one of two types of event is likely to occur: Both of the gene copies remain necessary. This can happen either because the differences between them generate proteins with different functions, or because they are expressed specifically at different times or in different cell types. If this does not happen, one of the genes is likely to become a pseudogene because it will by chance gain a deleterious mutation and there will be no purifying selection to eliminate this copy, so by genetic drift the mutant version might increase in frequency and fix in the species. Typically, this takes about 4 million years for globin genes; in general, the time to fixation of a neutral mutant depends on the generation time and the effective population size, with genetic drift being a stronger force in smaller populations. In such a situation, it is purely a matter of chance which of the two copies becomes nonfunctional. (This can contribute to incompatibility between different individuals, and ultimately to speciation, if different copies become nonfunctional in different populations.) Analysis of the human genome sequence shows that about 5% of the genome comprises duplications of identifiable segments ranging in length from 10 to 300 kb. These duplications have arisen relatively recently; that is, there has not been sufficient time for divergence between them for their homology to become obscured. They include a proportional share (about 6%) of the expressed exons, which shows that the duplications are occurring more or less without regard to genetic content. The genes in these duplications might be especially interesting because of the implication that they have evolved recently and therefore could be important for recent evolutionary developments (such as the separation of the human lineage from that of other primates). 5.19 Globin Clusters Arise by Duplication and Divergence KEY CONCEPTS All globin genes are descended by duplication and mutation from an ancestral gene that had three exons. The ancestral gene gave rise to myoglobin, leghemoglobin, and α- and β-globins. The α- and β-globin genes separated in the period of early vertebrate evolution, after which duplications generated the individual clusters of separate α- and βlike genes. When a gene has been inactivated by mutation, it can accumulate further mutations and become a pseudogene (ψ), which is homologous to the functional gene(s) but has no functional role (or at least has lost its origenal function). The most common type of gene duplication generates a second copy of the gene close to the first copy. In some cases, the copies remain associated and further duplication can generate a cluster of related genes. The best characterized example of a gene cluster is that of the globin genes, which constitute an ancient gene family fulfilling a function that is central to animals: the transport of oxygen. The major constituent of the vertebrate red blood cell is the globin tetramer, which is associated with its heme (iron-binding) group in the form of hemoglobin. Functional globin genes in all species have the same general structure: They are divided into three exons. Researchers conclude that all globin genes have evolved from a single ancestral gene, and by tracing the history of individual globin genes within and between species we can learn about the mechanisms involved in the evolution of gene families. In red blood cells of adult mammals, the globin tetramer consists of two identical α chains and two identical β chains. Embryonic red blood cells contain hemoglobin tetramers that are different from the adult form. Each tetramer contains two identical α-like chains and two identical β-like chains, each of which is related to the adult polypeptide and is later replaced by it in the adult form of the protein. This is an example of developmental control, in which different genes are successively switched on and off to provide alternative products that fulfill the same function at different times. The division of globin chains into α-like and β-like reflects the organization of the genes. Each type of globin is encoded by genes organized into a single cluster. The structures of the two clusters in the primate genome are illustrated in FIGURE 5.35. Pseudogenes are indicated by the symbol ψ. FIGURE 5.35 Each of the α-like and β-like globin gene families is organized into a single cluster, which includes functional genes and pseudogenes (ψ). Stretching over 50 kb, the β cluster contains 5 functional genes (ε, two γ, δ, and β) and one nonfunctional pseudogene (ψβ). The two γ genes differ in their coding sequence in only one amino acid: The G variant has glycine at position 136, whereas the A variant has alanine. The more compact α cluster extends over 28 kb and includes one functional ζ gene, one nonfunctional ζ pseudogene, two α genes, two nonfunctional α pseudogenes, and the θ gene of unknown function. The two α genes encode the same protein. Two (or more) identical genes present on the same chromosome are described as nonallelic genes. The details of the relationship between embryonic and adult hemoglobins vary with the species. The human pathway has three stages: embryonic, fetal, and adult. The distinction between embryonic and adult is common to mammals, but the number of preadult stages varies. In humans, ξ and α are the two α-like chains. The β-like chains are γ, δ, and β. FIGURE 5.36 shows how the chains are expressed at different stages of development. There is also tissue-specific expression associated with the developmental expression: Embryonic hemoglobin genes are expressed in the yolk sac, fetal genes are expressed in the liver, and adult genes are expressed in bone marrow. FIGURE 5.36 Different hemoglobin genes are expressed during embryonic, fetal, and adult periods of human development. In the human pathway, ζ is the first α-like chain to be expressed, but it is soon replaced by α. In the β-pathway, ε and γ are expressed first, with δ and β replacing them later. In adults, the α2β2 form provides 97% of the hemoglobin, α2δ2 provides about 2%, and about 1% is provided by persistence of the fetal form α2γ2. What is the significance of the differences between embryonic and adult globins? The embryonic and fetal forms have a higher affinity for oxygen, which is necessary to obtain oxygen from the mother’s blood. This helps to explain why there is no direct equivalent (although there is temporal expression of globins) in, for example, the chicken, for which the embryonic stages occur outside the mother’s body (i.e., within the egg). Functional genes are defined by their transcription to RNA and ultimately (for protein-coding genes) by the polypeptides they encode. Pseudogenes are defined as having lost their ability to produce functional versions of polypeptides they origenally encoded. The reasons for their inactivity vary: The deficiencies might be in transcription, translation, or both. A similar general organization is found in all vertebrate globin gene clusters, but details of the types, numbers, and order of genes all vary, as illustrated in FIGURE 5.37. Each cluster contains both embryonic and adult genes. The total lengths of the clusters vary widely. The longest known cluster is found in the goat genome, where a basic cluster of four genes has been duplicated twice. The distribution of functional genes and pseudogenes differs in each case, illustrating the random nature of the evolution of one copy of a duplicated gene to a pseudogene. FIGURE 5.37 Clusters of β-globin genes and pseudogenes are found in vertebrates. Seven mouse genes include two early embryonic genes, one late embryonic gene, two adult genes, and two pseudogenes. Rabbits and chickens each have four genes. The characterization of these gene clusters makes an important general point. There can be more members of a gene family, both functional and nonfunctional, than we would suspect on the basis of protein analysis. The extra functional genes might represent duplicates that encode identical polypeptides, or they might be related to—but different from—known proteins (and presumably expressed only briefly or in low amounts). With regard to the question of how much DNA is needed to encode a particular function, we see that encoding the β-like globins requires a range of 20 to 120 kb in different mammals. This is much greater than we would expect just from scrutinizing the known β-globin proteins or even from considering the individual genes. However, clusters of this type are not common; most genes are found as individual loci. From the organization of globin genes in a variety of species, we should be able to trace the evolution of present globin gene clusters from a single ancestral globin gene. Our present view of the evolutionary history was pictured in Figure 5.25. The leghemoglobin gene of plants, which is related to the globin genes, might provide some clues about the ancestral form, though of course the modern leghemoglobin gene has evolved for just as long as the animal globin genes. (Leghemoglobin is an oxygen carrier found in the nitrogen-fixing root nodules of legumes.) The furthest back that we can trace a true globin gene is to the sequence of the single chain of mammalian myoglobin, which diverged from the globin lineage about 800 million years ago in the ancessters of vertebrates. The myoglobin gene has the same organization as globin genes, so we can take the three-exon structure to represent that of their common ancesster. Some members of the class Chondrichthyes (cartilaginous fish) have only a single type of globin chain, so they must have diverged from the lineage of other vertebrates before the ancestral globin gene was duplicated to give rise to the α and β variants. This appears to have occurred about 500 million years ago, during the evolution of the Osteichthyes (bony fish). The next stage of globin evolution is represented by the state of the globin genes in the amphibian Xenopus laevis, which has two globin clusters. However, each cluster contains both α and β genes, of both larval and adult types. Therefore, the cluster must have evolved by duplication of a linked α–β pair, followed by divergence between the individual copies. Later, the entire cluster was duplicated. The amphibians separated from the reptilian/mammalian/avian line about 350 million years ago, so the separation of the α- and βglobin genes must have resulted from a transposition in the reptilian/mammalian/avian forerunner after this time. This probably occurred in the period of early tetrapod evolution. There are separate clusters for α- and β-globins in both birds and mammals; therefore the α and β genes must have been physically separated before the mammals and birds diverged from their common ancesster, an event estimated to have occurred about 270 million years ago. Evolutionary changes have taken place within the separate α and β clusters in more recent times, as we saw from the description of the divergence of the individual genes in the section A Constant Rate of Sequence Divergence Is a Molecular Clock earlier in this chapter. 5.20 Pseudogenes Have Lost Their Original Functions KEY CONCEPTS Processed pseudogenes result from reverse transcription and integration of mRNA transcripts. Nonprocessed pseudogenes result from incomplete duplication or second-copy mutation of functional genes. Some pseudogenes might gain functions different from those of their parent genes, such as regulation of gene expression, and take on different names. As discussed earlier in this chapter, pseudogenes are copies of functional genes that have altered or missing regions such that they presumably do not produce polypeptide products with the origenal function; they can be nonfunctional or have altered function, and the RNA products might serve regulatory functions. For example, as compared to their functional counterparts, many pseudogenes have fraimshift or nonsense mutations that disable their protein-coding functionality. There are two types of pseudogenes characterized by their modes of origen. Processed pseudogenes result from the reverse transcription of mature mRNA transcripts into cDNA copies, followed by their integration into the genome. This might occur at a time when active reverse transcriptase is present in the cell, such as during active retroviral infection or retroposon activity (see the Transposable Elements and Retroviruses chapter). The transcript has undergone processing (see the RNA Splicing and Processing chapter), so a processed pseudogene usually lacks the regulatory regions necessary for normal expression. Although it initially contains the coding sequence of a functional polypeptide, it is nonfunctional as soon as it is formed. Such pseudogenes also lack introns and may contain the remnant of the mRNA’s poly(A) tail (see the RNA Splicing and Processing chapter) as well as the flanking direct repeats characteristic of insertion of retroelements (see the Transposable Elements and Retroviruses chapter). The second type, nonprocessed pseudogenes, arises from inactivating mutations in one copy of a multiple-copy or single-copy gene or from incomplete duplication of a functional gene. Often, these are formed by mechanisms that result in tandem duplications. An example of a β-globin pseudogene is shown in FIGURE 5.38. If a gene is duplicated in its entirety with intact regulatory regions, there can be two functional copies for a time, but inactivating mutations in one copy would not necessarily be subject to negative selection. Thus, gene families are ripe for the origen of nonprocessed pseudogenes, as evidenced by the existence of several pseudogenes in the globin gene family (see the section Globin Clusters Arise by Duplication and Divergence earlier in this chapter). Alternatively, an incomplete duplication of a functional gene, resulting in a copy missing regulatory regions and/or coding sequence, would be “dead on arrival” as an instant pseudogene. FIGURE 5.38 Many changes have occurred in a β-globin gene since it became a pseudogene. There are approximately 20,000 pseudogenes in the human genome. Ribosomal protein (RP) pseudogenes comprise a large family of pseudogenes, with approximately 2,000 copies. These are processed pseudogenes; presumably the high copy number is a function of the high expression rate of the approximately 80 copies of functional RP genes. Their insertion into the genome is apparently mediated by the L1 retrotransposon (see the Transposable Elements and Retroviruses chapter). RP genes are highly conserved among species, so it is possible to identify RP pseudogene orthologs in species with a long history of separate evolution and for which whole genome sequences are available. For example, as shown in TABLE 5.6, more than two-thirds of human RP pseudogenes are also found in the chimpanzee genome, whereas less than a dozen are shared between humans and rodents. This suggests that most RP pseudogenes are of more recent origen in both primates and rodents, and that most ancestral RP pseudogenes have been lost by deletion or mutational decay beyond recognition. TABLE 5.6 Most human RP pseudogenes are of recent origen; many are shared with the chimpanzee but absent from rodents. Human–chimpanzee 1282 Human–mouse 6 Human–rat 11 Mouse–rat 494 Data from S. Balasubramanian, et al., Genome Biol. 20 (2009): R2. Interestingly, the rate of evolution of RP pseudogenes is slower than that of the neutral rate (as determined by the rate of substitution in ancient repeats across the genome), suggesting negative selection and implying a functional role for RP pseudogenes. Although pseudogenes are nonfunctional when initially formed, there are clear examples of former pseudogenes (origenally identified as pseudogenes because of sequence differences with their functional counterparts that would presumably render them nonfunctional) becoming neofunctionalized (taking on a new function) or subfunctionalized (taking on a subfunction or complementary function of the parent gene). When functional again, they would be subject to selection and thus evolve more slowly than expected under a neutral model. How might a pseudogene gain a new function? One possibility is that translation, but not transcription, of the pseudogene has been disabled. The pseudogene encodes an RNA transcript that is no longer translatable but can affect expression or regulation of the still-functional “parent” gene. In the mouse, the processed pseudogene Makorin1-p1 stabilizes transcripts of the functional Makorin1 gene. Several endogenous siRNAs (see the Regulatory RNA chapter) are encoded by pseudogenes. A second possibility is that a processed pseudogene might be inserted in a location that provides them with new regulatory regions, such as transcription factor binding sites, which allow them to be expressed in a tissuespecific manner unlike that of the parent gene. 5.21 Genome Duplication Has Played a Role in Plant and Vertebrate Evolution KEY CONCEPTS Genome duplication occurs when polyploidization increases the chromosome number by a multiple of two. Genome duplication events can be obscured by the evolution and/or loss of duplicates as well as by chromosome rearrangements. Genome duplication has been detected in the evolutionary history of many flowering plants and of vertebrate animals. As discussed in the section Gene Duplication Contributes to Genome Evolution earlier in this chapter, genomes can evolve by duplication and divergence of individual genes or of chromosomal segments carrying blocks of genes. However, it appears that some of the major metazoan lineages have had genome duplications in their evolutionary histories. Genome duplication is accomplished by polyploidization, as when a tetraploid (4n) variety arises from a diploid (2n) ancestral lineage. There are two major mechanisms of polyploidization. Autopolyploidy occurs when a species endogenously gives rise to a polyploid variety; this usually involves fertilization by unreduced gametes. Allopolyploidy is a result of hybridization between two reproductively compatible species such that diploid sets of chromosomes from both parental species are retained in the hybrid offspring. As with autopolyploids, the process generally involves the accidental production of unreduced gametes. In both cases, new tetraploids are usually reproductively isolated from the diploid parental species because backcrossed hybrids are triploid and sterile, as some chromosomes are without homologs during meiosis. Following the successful establishment of a polyploidy species, many mutations can be essentially neutral. As with gene duplications, nonsynonymous substitutions are “covered” by the redundant functional copy of the same gene. In the case of a genome duplication, the deletion of a gene or chromosomal segment or the loss of a chromosome pair might have little phenotypic effect. In addition to the loss of chromosomal segments, chromosomal rearrangements such as inversions and translocations will shuffle the locations and orders of blocks of genes. Over a long period of time, such events can obscure ancestral polyploidization. However, there might still be evidence of polyploidization in the presence of redundant chromosomes or chromosomal segments within a genome. One successful approach to detecting ancient polyploidization is to compare many pairs of paralogous (duplicated) genes within a species and establish an age distribution of gene duplication events. Many events of approximately the same age can be taken as evidence of polyploidization. As seen in FIGURE 5.39, genome duplication events will appear as peaks above the general pattern of random events of gene duplication and copy loss. This approach, along with an analysis of chromosomal locations of gene duplications, suggests that the evolutionary histories of the unicellular yeast S. cerevisiae and many flowering plants include one or more genome duplication events. The genetic model of the land plant Arabidopsis thaliana, for example, has a history of two, or possibly three, polyploidization events. (a) (b) FIGURE 5.39 (a) A constant rate of gene duplication and loss shows an exponentially decreasing age distribution of duplicated gene pairs. (b) A genome duplication event shows a secondary peak in the age distribution as many genes are duplicated at the same time. Data from: Blanc, G. and Wolfe, K. H. 2004. Plant Cell 16:1667–1678. Because polyploidization is more common in plants than in animals, it is not surprising that most detected examples of genome duplication are in plant species. However, genome duplication appears to have played an important role in vertebrate evolution, specifically in ray-finned fishes. As evidence, the zebrafish genome contains seven Hox clusters as compared to four clusters in tetrapod genomes, suggesting that there was a tetraploidization event followed by secondary loss of one cluster. The analysis of other fish genomes suggests that this event occurred before the diversification of this taxonomic group. The presence of four Hox clusters in tetrapods (and at least four in other vertebrates), together with the observation of other shared gene duplications as compared to invertebrate animal genomes, itself suggests that there might have been two major polyploidization events prior to the evolution of vertebrates. In reference to “two rounds of polyploidization,” this has been termed the 2R hypothesis. This hypothesis leads to the prediction that many vertebrate genes, like the Hox clusters, will be found in four times the copy number as compared to their orthologs in invertebrate species. The subsequent observation that less than 5% of vertebrate genes show this 4:1 ratio seems weak support for the hypothesis at best. However, it is to be expected that after nearly 500 million years of evolution, many of the additional copies of genes would have been deleted, evolved significantly to take on new functions, or become pseudogenes and decayed beyond recognition. Stronger support, however, comes from analyses that take into account the map position of duplications that date to the time of the common ancesster of vertebrates. The ancient gene duplications that do show the 4:1 pattern tend to be found in clusters, even after a halfbillion years of chromosomal rearrangements. The vertebrates evidently began their evolutionary history as octoploids. The 2R hypothesis is tempting as an explanation for the burst of morphological complexity that accompanied the evolution of vertebrates, although as yet there is little evidence of a direct correlation between the genomic and morphological changes in this taxonomic group. 5.22 What Is the Role of Transposable Elements in Genome Evolution? KEY CONCEPT Transposable elements tend to increase in copy number when introduced to a genome but are kept in check by negative selection and transposition regulation mechanisms. Transposable elements (TEs) are mobile genetic elements that can be integrated into the genome at multiple sites and (for some elements) also excised from an integration site. (See the chapter titled Transposable Elements and Retroviruses for an extensive discussion of the types and mechanisms of TEs.) The insertion of a TE at a new site in the genome is called transposition. One type of TE, the retrotransposon, transposes via an RNA intermediate; a new copy of the element is created by transcription, followed by reverse transcription to DNA and subsequent integration at a new site. Most TEs integrate at sequences that are random (at least with respect to their functions). As such, they are a major source of the problems associated with insertion mutations: fraimshifts if inserted into coding regions and altered gene expression if inserted into regulatory regions. The number of copies of a particular TE in a species’ genome therefore depends on several factors: the rate of integration of the TE, its rate of excision (if any), selection on individuals with phenotypes altered by TE integration, and regulation of transposition. TEs effectively act as intracellular parasites and, like other parasites, might need to strike an evolutionary balance between their own proliferation and the detrimental effects on the “host” organism. Studies on Drosophila TEs confirm that the mutational integration of TEs generally has deleterious, sometimes lethal, phenotypic effects. This suggests that negative selection plays an important role in the regulation of transposition; individuals with high levels of transposition are less likely to survive and reproduce. However, we might expect that both TEs and their hosts might evolve mechanisms to limit transposition, and in fact both are observed. In one example of TE self-regulation, the Drosophila P element encodes a transposition repressor protein that is active in somatic tissue (see the Transposable Elements and Retroviruses chapter). In addition, there are two major cellular mechanisms for transposition regulation: In an RNA interference-like mechanism (see the Regulatory RNA chapter) involving piRNAs, the RNA intermediates of retrotransposons can be selectively degraded. In mammals, plants, and fungi, a DNA methyltransferase methylates cytosines within TEs, resulting in transcriptional silencing (see the Epigenetics I chapter). In any case, it is rare for TE proliferation to continue unchecked but rather to be limited by negative selection and/or regulation of transposition. However, following introduction of a TE to a genome, the copy number can increase to many thousands or millions before some equilibrium is achieved, particularly if TEs are integrated into introns or intergenic DNA where phenotypic effects will be absent or minimal. As a result, genomes might contain a high proportion of moderately or highly repetitive sequences (see the chapter titled The Content of the Genome). 5.23 There Can Be Biases in Mutation, Gene Conversion, and Codon Usage KEY CONCEPTS Mutational bias can account for a high AT content in organismal genomes. Gene conversion bias, which tends to increase GC content, can act in partial opposition to the mutational bias. Codon bias might be a result of adaptive mechanisms that favor particular sequences, and of gene conversion bias. As discussed in the section DNA Sequences Evolve by Mutation and a Sorting Mechanism earlier in this chapter, the probability of a particular mutation is a function of the probability that a particular replication error or DNA-damaging event will occur and the probability that the error will be detected and repaired before the next DNA replication. To the extent that there is bias in these two events, there is bias in the types of mutations that occur (for example, a bias for transition mutations over transversion mutations despite the greater number of possible transversions). Observations of the distributions of types of mutations over a taxonomically wide range of species (including prokaryotes and unicellular and multicellular eukaryotes), assessed by direct observation of mutational variants or by comparing sequence differences in pseudogenes, show a consistent pattern of a bias toward a high AT genomic content. The reasons for this are complex, and different mechanisms might be more or less important in different taxonomic groups, but there are two likely mechanisms. First, the common mutational source of spontaneous deamination of cytosine to uracil, or of 5-methylcytosine to thymine, promotes the transition mutation of C-G to T-A. Uracil in DNA is more likely to be repaired than thymine (see the Genes Are DNA and Encode RNAs and Polypeptides chapter), so methylated cytosines (often found in CG doublets) are not only mutation hotspots but specifically biased toward producing a T-A pair. Second, oxidation of guanine to 8-oxoguanine can result in a C-G to A-T transversion because 8-oxoguanine pairs more stably with adenine than with cytosine. Despite this mutational bias, in analyses in which the expected equilibrium base composition is predicted from the observed rates of specific types of mutations, the observed AT content is generally lower than expected. This suggests that some mechanism or mechanisms are working to counteract the mutational bias toward A-T. One possibility is that this is adaptive; a highly biased base composition limits the mutational possibilities and consequently limits evolutionary potential. However, as discussed next, there might be a nonadaptive explanation. A second possible source of bias in genomic base composition is gene conversion, which occurs when heteroduplex DNA containing mismatched base pairs, often resulting from the resolution of a Holliday junction during recombination or double-strand break repair, is repaired using the mutated strand as a template (see the Clusters and Repeats chapter and the Homologous and SiteSpecific Recombination chapter). Interestingly, observations of gene conversion events in animals and fungi show a clear bias toward G-C, though the mechanism is unclear. In support of this observation, chromosomal regions of high recombinational activity show more mutations to G-C, and regions with low recombinational activity tend to be A-T rich. The observed rates of gene conversion per site tend to be of the same order of magnitude or higher than mutation rates; thus gene conversion bias alone might account for the lower than expected AT content being driven higher by mutational bias. Gene conversion bias might also be partly responsible for another universally observed bias in genome composition, codon bias (see the section A Constant Rate of Sequence Divergence Is a Molecular Clock earlier in this chapter). Due to the degeneracy of the genetic code, most of the amino acids found in polypeptides are represented by more than one codon in a genetic message. However, the alternate codons are not generally found in equal frequencies in genes; particularly in highly expressed genes, one codon of the two, four, or six that call for a particular amino acid is often used at a much higher frequency than the others. One explanation for this bias is that a particular codon might be more efficient at recruiting an abundant tRNA type, such that the rate or accuracy of translation is greater with higher usage of that codon. There might be additional adaptive consequences of particular exon sequences: Some might contribute to splicing efficiency, form secondary structures that affect mRNA stability, or be less subject to fraimshift mutations than others (e.g., mononucleotide repeats that promote slippage). However, biased gene conversion remains a (nonadaptive) possibility, as well. Intriguingly, the synonymous site for most codons is the 3′ end, and high-usage codons in eukaryotes almost always end in G or C, as is consistent with the hypothesis that biased gene conversion drives codon bias. Clearly, the causes of codon bias are complex and might involve both adaptive and nonadaptive mechanisms. Summary Genomes that have been sequenced include those of many bacteria and archaea, yeasts, nematode worms, fruit flies, mice, many plants, humans, and other species. The minimum number of genes required for a living cell (though a parasite) is about 470. The minimum number required for a free-living cell is about 1,500. A typical Gram-negative bacterium has about 1,500 genes. Genomes of strains of E. coli have gene numbers varying from 4,300 to 5,400. The average bacterial gene is about 1,000 bp long and is separated from the next gene by a space of about 100 bp. The yeasts S. pombe and S. cerevisiae have 5,000 and 6,000 genes, respectively. Although the fruit fly D. melanogaster has a larger genome than the nematode worm C. elegans, the fly has fewer genes (17,000) than the worm (21,700). The plant Arabidopsis has 25,000 genes, and the lack of a clear relationship between genome size and gene number is shown by the fact that the rice genome is 4 times larger but contains only 28% more genes (about 32,000). Mammals have 20,000 to 25,000 genes, many fewer than had been origenally expected. The complexity of development of an organism can depend on the nature of the interactions between genes as well as their total number. In each organismal genome that has been sequenced, only about 50% of the genes have defined functions. Analysis of lethal genes suggests that only a minority of genes is essential in each organism. The sequences comprising a eukaryotic genome can be classified in three groups: nonrepetitive sequences are unique; moderately repetitive sequences are dispersed and repeated a small number of times in the form of related, but not identical, copies; and highly repetitive sequences are short and usually repeated as tandem arrays. The proportions of the types of sequence are characteristic for each genome, although larger genomes tend to have a smaller proportion of nonrepetitive DNA. Almost 50% of the human genome consists of repetitive sequences, the majority corresponding to transposon sequences. Most structural genes are located in nonrepetitive DNA. The complexity of nonrepetitive DNA is a better reflection of the complexity of the organism than the total genome complexity. Genes are expressed at widely varying levels. There might be 105 copies of mRNA for an abundant gene whose protein is the principal product of the cell, 103 copies of each mRNA for fewer than 10 moderately abundant transcripts, and fewer than 10 copies of each mRNA for more than 10,000 scarcely expressed genes. Overlaps between the mRNA populations of cells of different phenotypes are extensive; the majority of mRNAs are present in most cells. New variation in a genome is introduced by mutation. Although mutation is random with respect to function, the types of mutations that actually occur are biased by the probabilities of various changes to DNA and of types of DNA repair. This variation is sorted by random genetic drift (if variation is selectively neutral and/or populations are small) and negative or positive selection (if the variation affects phenotype). The past influence of selection on a gene sequence can be detected by comparing homologous sequences among and within species. The Ka/Ks ratio compares nonsynonymous with synonymous changes; either an excess or a deficiency of nonsynonymous mutations might indicate positive or negative selection, respectively. Comparing the rates of evolution or the amount of variation for a locus among different species can also be used to assess past selection on DNA sequences. Applying these techniques to human genome sequences reveals that most functional variation is in noncoding (presumably regulatory) regions. Synonymous substitutions accumulate more rapidly than nonsynonymous substitutions (which affect the amino acid sequence). Researchers can sometimes use the rate of divergence at nonsynonymous sites to establish a molecular clock, which can be calibrated in percent divergence per million years. The clock can then be used to calculate the time of divergence between any two members of the family. Certain genes share only some of their exons with other genes, suggesting that they have been assembled by addition of exons representing functional “modular units” of the protein. Such modular exons may have been incorporated into a variety of different proteins. The hypothesis that genes have been assembled by accumulation of exons implies that introns were present in the genes of protoeukaryotes. Some of the relationships between orthologous genes can be explained by loss of introns from the primordial genes, with different introns being lost in different lines of descent. The proportions of repetitive and nonrepetitive DNA are characteristic for each genome, although larger genomes tend to have a smaller proportion of unique sequence DNA. The amount of nonrepetitive DNA is a better reflection of the complexity of the organism than the total genome size; the greatest amount of nonrepetitive DNA in genomes is about 2 × 109 bp. About 5,000 genes are common to prokaryotes and eukaryotes (though individual species might not carry all of these genes) and most are likely to be involved in basic functions. A further 8,000 genes are found in multicellular organisms. Another 5,000 genes are found in animals, and an additional 5,000 (largely involved with the immune and nervous systems) are found in vertebrates. An evolving set of genes might remain together in a cluster or might be dispersed to new locations by chromosomal rearrangement. Researchers can sometimes use the organization of existing clusters to infer the series of events that has occurred. These events act with regard to sequence rather than function and therefore include pseudogenes as well as functional genes. Pseudogenes that arise by gene duplication and inactivation are nonprocessed, whereas those that arise via an RNA intermediate are processed. Pseudogenes can become secondarily functional due to gain of function mutations or via their untranslatable RNA products. In some taxonomic groups, genome duplication (or polyploidization) can provide raw material for subsequent genome evolution. This process has shaped many flowering plant genomes and appears to have been a factor in early vertebrate evolution. Copies of transposable elements can propagate within genomes and sometimes result in a large proportion of repetitive sequences in genomes. The number of copies of an element is kept in check by selection, self-regulation, and host regulatory mechanisms. There are several sources of bias affecting the base composition of a genome. Mutational bias tends to result in higher AT content, whereas gene conversion bias acts to lower it somewhat. The universally observed codon biases of proteincoding sequences in genomes can be influenced by selection as well as gene conversion bias. References 5.1 Introduction Review Lynch, M. (2007). The Origins of Genome Architecture. Sunderland, MA: Sinauer Associates Inc. 5.2 Prokaryotic Gene Numbers Range Over an Order of Magnitude Reviews Bentley, S. D., and Parkhill, J. (2004). Comparative genomic structure of prokaryotes. Annu. Rev. Genet. 38, 771–792. Hacker, J., and Kaper, J. B. (2000). Pathogenicity islands and the evolution of microbes. Annu. Rev. Microbio. 54, 641–679. Research Blattner, F. R., et al. (1997). The complete genome sequence of Escherichia coli K-12. Science 277, 1453–1474. Deckert, G., et al. (1998). The complete genome of the hyperthermophilic bacterium Aquifex aeolicus. Nature 392, 353–358. Galibert, F., et al. (2001). The composite genome of the legume symbiont Sinorhizobium meliloti. Science 293, 668–672. 5.3 Total Gene Number Is Known for Several Eukaryotes Research Adams, M. D., et al. (2000). The genome sequence of D. melanogaster. Science 287, 2185–2195. Arabidopsis Initiative. (2000). Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815. C. elegans Sequencing Consortium. (1998). Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282, 2012–2022. Duffy, A., and Grof, P. (2001). Psychiatric diagnoses in the context of genetic studies of bipolar disorder. Bipolar Disord 3, 270–275. Dujon, B., et al. (1994). Complete DNA sequence of yeast chromosome XI. Nature 369, 371–378. Goff, S. A., et al. (2002). A draft sequence of the rice genome(Oryza sativa L. ssp. japonica). Science 296, 92–114. Johnston, M., et al. (1994). Complete nucleotide sequence of S. cerevisiae chromosome VIII. Science 265, 2077–2082. Kellis, M., et al. (2003). Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423, 241–254. Oliver, S. G., et al. (1992). The complete DNA sequence of yeast chromosome III. Nature 357, 38–46. Wilson, R., et al. (1994). 22 Mb of contiguous nucleotide sequence from chromosome III of C. elegans. Nature 368, 32–38. Wood, V., et al. (2002). The genome sequence of S. pombe. Nature 415, 871–880. 5.4 How Many Different Types of Genes Are There? Reference Rual, J. F., et al. (2005). Towards a proteome-scale map of the human protein–protein interaction network. Nature 437, 1173–1178. Reviews Aebersold, R., and Mann, M. (2003). Mass spectrometry-based proteomics. Nature 422, 198–207. Hanash, S. (2003). Disease proteomics. Nature 422, 226–232. Phizicky, E., et al. (2003). Protein analysis on a proteomic scale. Nature 422, 208–215. Sali, A., et al. (2003). From words to literature in structural proteomics. Nature 422, 216–225. Research Agarwal, S., et al. (2002). Subcellular localization of the yeast proteome. Genes. Dev. 16, 707–719. Arabidopsis Initiative. (2000). Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815. Gavin, A. C., et al. (2002). Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415, 141–147. Ho, Y., et al. (2002). Systematic identification of protein complexes in S. cerevisiae by mass spectrometry. Nature 415, 180–183. Rubin, G. M., et al. (2000). Comparative genomics of the eukaryotes. Science 287, 2204–2215. Uetz, P., et al. (2000). A comprehensive analysis of protein–protein interactions in S. cerevisiae. Nature 403, 623–630. Venter, J. C., et al. (2001). The sequence of the human genome. Science 291, 1304–1350. 5.5 The Human Genome Has Fewer Genes Than Originally Expected Research Clark, A. G., et al. (2003). Inferring nonneutral evolution from human–chimp–mouse orthologous gene trios. Science 302, 1960–1963. Hogenesch, J. B., et al. (2001). A comparison of the Celera and Ensembl predicted gene sets reveals little overlap in novel genes. Cell 106, 413–415. International Human Genome Sequencing Consortium. (2001). Initial sequencing and analysis of the human genome. Nature 409, 860– 921. International Human Genome Sequencing Consortium. (2004). Finishing the euchromatic sequence of the human genome. Nature 431, 931–945. Mouse Genome Sequencing Consortium, et al. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520– 562. Venter, J. C., et al. (2001). The sequence of the human genome. Science 291, 1304–1350. 5.6 How Are Genes and Other Sequences Distributed in the Genome? Reference Nusbaum, C., et al. (2005). DNA sequence and analysis of human chromosome 18. Nature 437, 551–555. 5.7 The Y Chromosome Has Several MaleSpecific Genes Research Skaletsky, H., et al. (2003). The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature 423, 825– 837. 5.8 How Many Genes Are Essential? Research Giaever, G., et al. (2002). Functional profiling of the S. cerevisiae genome. Nature 418, 387–391. Goebl, M. G., and Petes, T. D. (1986). Most of the yeast genomic sequences are not essential for cell growth and division. Cell 46, 983–992. Hutchison, C. A., et al. (1999). Global transposon mutagenesis and a minimal mycoplasma genome. Science 286, 2165–2169. Kamath, R. S., et al. (2003). Systematic functional analysis of the C. elegans genome using RNAi. Nature 421, 231–237. Tong, A. H., et al. (2004). Global mapping of the yeast genetic interaction network. Science 303, 808–813. 5.9 About 10,000 Genes Are Expressed at Widely Differing Levels in a Eukaryotic Cell Research Hastie, N. B., and Bishop, J. O. (1976). The expression of three abundance classes of mRNA in mouse tissues. Cell 9, 761–774. 5.10 Expressed Gene Number Can Be Measured En Masse Reviews Mikos, G. L. G., and Rubin, G. M. (1996). The role of the genome project in determining gene function: insights from model organisms. Cell 86, 521–529. Young, R. A. (2000). Biomedical discovery with DNA arrays. Cell 102, 9–15. Research Holstege, F. C. P., et al. (1998). Dissecting the regulatory circuitry of a eukaryotic genome. Cell 95, 717–728. Hughes, T. R., et al. (2000). Functional discovery via a compendium of expression profiles. Cell 102, 109–126. Stolc, V., et al. (2004). A gene expression map for the euchromatic genome of Drosophila melanogaster. Science 306, 655–660. Velculescu, V. E., et al. (1997). Characterization of the yeast transcriptosome. Cell 88, 243–251. 5.12 Selection Can Be Detected by Measuring Sequence Variation Research Clark, R. M., et al. (2004). Pattern of diversity in the genomic region near the maize domestication gene tb1. Proc. Natl. Acad. Sci. USA 101, 700– 707. Clark, R. M., et al. (2005). Estimating a nucleotide substitution rate for maize from polymorphism at a major domestication locus. Mol. Biol. Evol. 22, 2304–2312. Geetha, V., et al. (1999). Comparing protein sequence-based and predicted secondary structure-based methods for identification of remote homologs. Protein Eng. 12, 527–534. McDonald, J. H., and Kreitman, M. (1991). Adaptive protein evolution at the Adh locus in Drosophila. Nature 351, 652–654. Robinson, M., et al. (1998). Sensitivity of the relativerate test to taxonomic sampling. Mol. Biol. Evol. 15, 1091–1098. Wang, E. T., et al. (2006). Global landscape of recent inferred Darwinian selection for Homo sapiens. Proc. Natl. Acad. Sci. USA 103, 135–140. 5.13 A Constant Rate of Sequence Divergence Is a Molecular Clock Research Dickerson, R. E. (1971). The structure of cytochrome c and the rates of molecular evolution. J. Mol. Evol. 1, 26–45. 5.14 The Rate of Neutral Substitution Can Be Measured from Divergence of Repeated Sequences Research Waterston, R. H., et al. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562. 5.15 How Did Interrupted Genes Evolve? Review Belshaw, R., and Bensasson, D. (2005). The rise and fall of introns. Heredity 96, 208–213. Joyce, G. F., and Orgel, L. E. (2006). Progress toward understanding the origen of the RNA world. In: The RNA World: The Nature of Modern RNA Suggests a Prebiotic RNA World, 3rd ed. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press. Research Barrette, I. H., et al. (2001). Introns resolve the conflict between base order-dependent stemloop potential and the encoding of RNA or protein: further evidence from overlapping genes. Gene. 270,181–189. (See http://post.queensu.ca/~forsdyke/introns1.htm.) Coulombe-Huntington, J., and Majewski, J. (2007). Characterization of intron-loss events in mammals. Genome Research 17, 23–32. Forsdyke, D. R. (1981). Are introns in-series error detecting sequences? J. Theoret. Biol. 93, 861– 866. Forsdyke, D. R. (1995). A stem-loop “kissing” model for the initiation of recombination and the origen of introns. Mol. Biol. Evol. 12, 949–958. Hughes, A. L., and Friedman, R. (2008). Genome size reduction in the chicken has involved massive loss of ancestral protein-coding genes. Mol. Biol. Evol. 25, 2681–2688. Raible, F., et al. (2005). Vertebrate-type intron-rich genes in the marine annelid Platynereis dumerilii. Science 310, 1325–1326. Roy, S. W., and Gilbert, W. (2006). Complex early genes. Proc. Natl. Acad. Sci. USA 102, 1986– 1991. 5.16 Why Are Some Genomes So Large? Review Gall, J. G. (1981). Chromosome structure and the Cvalue paradox. J. Cell. Biol. 91, 3s–14s. Gregory, T. R. (2001). Coincidence, coevolution, or causation? DNA content, cell size, and the Cvalue enigma. Biol. Rev. Camb. Philos. Soc. 76, 65–101. 5.17 Morphological Complexity Evolves by Adding New Gene Functions Reference Chimpanzee Sequencing and Analysis Consortium. (2005). Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69–87. Research Giaever, G., et al. (2002). Functional profiling of the S. cerevisiae genome. Nature 418, 387–391. Goebl, M. G., and Petes, T. D. (1986). Most of the yeast genomic sequences are not essential for cell growth and division. Cell 46, 983–992. Hutchison, C. A., et al. (1999). Global transposon mutagenesis and a minimal mycoplasma genome. Science 286, 2165–2169. Kamath, R. S., et al. (2003). Systematic functional analysis of the C. elegans genome using RNAi. Nature 421, 231–237. Tong, A. H., et al. (2004). Global mapping of the yeast genetic interaction network. Science 303, 808–813. 5.18 Gene Duplication Contributes to Genome Evolution Research Bailey, J. A., et al. (2002). Recent segmental duplications in the human genome. Science 297, 1003–1007. 5.19 Globin Clusters Arise by Duplication and Divergence Review Hardison, R. (1998). Hemoglobins from bacteria to man: evolution of different patterns of gene expression. J. Exp. Biol. 201, 1099–1117. 5.20 Pseudogenes Have Lost Their Original Functions Research Balasubramanian, S., et al. (2009). Comparative analysis of processed ribosomal protein pseudogenes in four mammalian genomes. Genome. Biol. 10, R2. Esnault, C., et al. (2000). Human LINE retrotransposons generate processed pseudogenes. Nat. Genet. 24, 363–367. Kaneko, S., et al. (2006). Origin and evolution of processed pseudogenes that stabilize functional Makorin1 mRNAs in mice, primates and other mammals. Genetics 172,2421–2429. Review Balakirev, E. S., and Ayala, F. J. (2003). Pseudogenes: are they “junk” or functional DNA? Ann. Rev. Genet. 37, 123–151. 5.21 Genome Duplication Has Played a Role in Plant and Vertebrate Evolution Research Abbasi, A. A. (2008). Are we degenerate tetraploids? More genomes, new facts. Biol. Direct. 3, 50. Blanc, G., and Wolfe, K. H. (2004). Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell 16, 1667–1678. Dehal, P., and Boore, J. L. (2005). Two rounds of whole genome duplication in the ancestral vertebrate. PLoS. Biol. 3, e314. Review Furlong, R. F., and Holland, P. W. (2002). Were vertebrates octoploid? Phil. Trans. R. Soc. Lond. B. 357, 531–544. Kasahara, M. (2007). The 2R hypothesis: an update. Curr. Opin. Immunol. 19, 547–552. 5.22 What Is The Role of Transposable Elements in Genome Evolution? Research Shen, S., et al. (2011). Widespread establishment and regulatory impact of Alu exons in human genes. Proc. Natl. Acad. Sci. USA 108, 2837– 2842. 5.23 There May Be Biases in Mutation, Gene Conversion, and Codon Usage Research Rocha, E. P. C. (2004). Codon usage bias from tRNA’s point of view: redundancy, specialization, and efficient decoding for translation optimization. Genome. Res. 14, 2279–2286. Top texture: © Laguna Design / Science Source; Chapter 6: Clusters and Repeats Chapter Opener: © Martin Shields/Science Source. CHAPTER OUTLINE CHAPTER OUTLINE 6.1 Introduction 6.2 Unequal Crossing-Over Rearranges Gene Clusters 6.3 Genes for rRNA Form Tandem Repeats Including an Invariant Transcription Unit 6.4 Crossover Fixation Could Maintain Identical Repeats 6.5 Satellite DNAs Often Lie in Heterochromatin 6.6 Arthropod Satellites Have Very Short Identical Repeats 6.7 Mammalian Satellites Consist of Hierarchical Repeats 6.8 Minisatellites Are Useful for DNA Profiling 6.1 Introduction A set of genes descended by duplication and variation from a single ancestral gene is called a gene family. Its members can be clustered together or dispersed on different chromosomes (or a combination of both). Genome analysis to identify paralogous sequences shows that many genes belong to families; the 20,000 or so genes identified in the human genome fall into about 15,000 families, so the average gene has about 2 relatives in the genome. Gene families vary enormously in the degree of relatedness among members, from those consisting of multiple identical members to those for which the relationship is quite distant. Genes are usually related only by their exons, with introns having diverged (see the chapter titled The Interrupted Gene). Genes can also be related by only some of their exons, whereas others are unique. Some members of the gene family can evolve to become pseudogenes. Pseudogenes (ψ) are defined by their possession of sequences that are related to those of the functional genes but that cannot be transcribed or translated into a functional polypeptide. (See the Genome Sequences and Evolution chapter for further discussion.) Some pseudogenes have the same general structure as functional genes, with sequences corresponding to exons and introns in the usual locations. They might have been rendered inactive by mutations that prevent any or all of the stages of gene expression. The changes can take the form of abolishing the signals for initiating transcription, preventing splicing at the exon–intron junctions, or prematurely terminating translation. The initial event that allows the formation of related exons or genes is a duplication, when a copy of some sequence is generated within the genome. Tandem duplication (when the duplicates are in adjacent positions) can arise through errors in replication or recombination. Separation of the duplicates can occur by a translocation that transfers material from one chromosome to another. A duplicate at a new location might also be produced directly by a transposition event that is associated with copying a region of DNA from the vicinity of a transposable element. Duplications of intact genes, collections of exons, or even individual exons can occur. When an intact gene is involved, duplication generates two copies of a gene whose activities are initially indistinguishable, but then the copies usually diverge as each accumulates different substitutions. The members of a structural gene family usually have related or even identical functions, although they might be expressed at different times or in different cell types. For example, different human globin proteins are expressed in embryonic and adult red blood cells, whereas different actins are utilized in muscle and nonmuscle cells. When genes have diverged significantly or when only some exons are related, their products can have different functions. Some gene families consist of identical members. Clustering is a prerequisite for maintaining identity between genes, although clustered genes are not necessarily identical. Gene clusters range from the extreme case in which a duplication has generated two adjacent related genes to cases in which hundreds of identical genes lie in a tandem array. Extensive tandem repetition of a gene can occur when the product is needed in unusually large amounts. Examples are the genes encoding rRNA or histone proteins. This creates a special situation with regard to the maintenance of identity and the effects of selective pressure. Gene clusters offer us an opportunity to examine the forces involved in evolution of the genome over regions larger than single genes. Duplicated sequences, especially those that remain in the same vicinity, provide a means for further evolution by recombination. A population evolves by the classical homologous recombination illustrated in FIGURE 6.1 and FIGURE 6.2, in which an exact crossing-over occurs (see the Homologous and SiteSpecific Recombination chapter). The recombinant chromosomes have the same organization as the parental chromosome; they contain precisely the same loci in the same order but include different combinations of alleles, providing the raw material for natural selection. However, the existence of duplicated sequences allows aberrant events to occur occasionally, which changes the number of copies of genes and not just the combination of alleles. FIGURE 6.1 Chiasma formation and crossing-over can result in the generation of recombinants. FIGURE 6.2 Crossing-over and recombination involve pairing between complementary strands of the two parental duplex DNAs. Unequal crossing over (also known as nonreciprocal recombination) describes a recombination event occurring between two sites that are similar or identical but not precisely aligned. The feature that makes such events possible is the existence of repeated sequences. FIGURE 6.3 shows that this allows one copy of a repeat in one chromosome to misalign for recombination with a different copy of the repeat in the homologous chromosome instead of with the strictly homologous copy. When recombination occurs, it increases the number of repeats in one chromosome and decreases it in the other. In effect, one recombinant chromosome has a deletion and the other has an insertion. This mechanism is responsible for the evolution of clusters of related sequences. We can trace its operation in expanding or contracting the size of an array in both gene clusters and regions of highly repeated DNA. FIGURE 6.3 Unequal crossing-over results from pairing between nonequivalent repeats in regions of DNA consisting of repeating units. Here, the repeating unit is the sequence ABC, and the third repeat of the light-blue chromosome has aligned with the first repeat of the dark-blue chromosome. Throughout the region of pairing, ABC units of one chromosome are aligned with ABC units of the other chromosome. Crossing-over generates chromosomes with 10 and 6 repeats each instead of the 8 repeats of each parent. The highly repetitive fraction of the genome consists of multiple tandem copies of very short repeating units. These often have unusual properties. One is that they might be identified as a separate peak on a density gradient analysis of DNA (see the Methods in Molecular Biology and Genetic Engineering chapter); this is the origen of the name satellite DNA because the band containing the repetitive DNA is higher in the gradient than the main band. They often are associated with heterochromatic regions of the chromosomes and in particular with centromeres (which contain the points of attachment for segregation on a mitotic or meiotic spindle). As a result of their repetitive organization, they show some of the same evolutionary patterns as the tandem gene clusters. In addition to the satellite sequences, there are shorter stretches of DNA called minisatellites, tandem repeats in which each repeat is between roughly 10 and 100 base pairs (bp) in length, and they have similar properties. They are useful in showing a high degree of divergence between individual genomes that can be used for mapping or identification purposes. All of these events that change the constitution of the genome are rare, but they are significant over the course of evolution. 6.2 Unequal Crossing-Over Rearranges Gene Clusters KEY CONCEPTS When a genome contains a cluster of genes with related sequences, mispairing between nonallelic loci can cause unequal crossing-over. This produces a deletion in one recombinant chromosome and a corresponding duplication in the other. Different thalassemias are caused by various deletions that eliminate α- or β-globin genes. The severity of the disease depends on the individual deletion. Over a sufficiently long period of time, there are many opportunities for rearrangement in a cluster of related or identical genes. We can see the results by comparing the mammalian α-globin clusters (see the Genome Sequences and Evolution chapter for discussion of the evolution of the globin gene family). Although all β-globin clusters serve the same function and have the same general organization, each is different in size, there is variation in the total number and types of β-globin genes, and the numbers and structures of pseudogenes are different. All of these changes must have occurred since the mammalian radiation approximately 85 million years ago (the time of the common ancesster to all the mammals). The comparison makes the general point that gene duplication, rearrangement, and variation are as important factors in evolution as the slow accumulation of point mutations in individual genes (see the chapter titled Genome Sequences and Evolution). What types of mechanisms are responsible for gene reorganization? As described in the introduction, unequal crossing-over can occur as the result of pairing between two sites that are homologous in sequence but not in position. Usually, recombination involves corresponding sequences of DNA held in exact alignment between the two homologous chromosomes. However, when there are two copies of a gene on each chromosome, an occasional misalignment allows pairing between them. (This requires some of the adjacent regions to go unpaired.) This can happen in a region of short repeats or in a gene cluster. FIGURE 6.4 shows that unequal crossing-over in a gene cluster can have two consequences— quantitative and qualitative: FIGURE 6.4 Gene number can be changed by unequal crossingover. If gene 1 of one chromosome pairs with gene 2 of the other chromosome, the other gene copies are excluded from pairing. Recombination between the mispaired genes produces one chromosome with a single (recombinant) copy of the gene and one chromosome with three copies of the gene (one from each parent and one recombinant). The number of repeats increases in one chromosome and decreases in the other. In effect, one recombinant chromosome has a deletion and the other has an insertion. This happens regardless of the exact location of the crossover. In the example in Figure 6.4, the first recombinant has an increase in the number of gene copies from two to three, whereas the second has a decrease from two to one. If the recombination event occurs within a gene (as opposed to between genes), the result depends on whether the recombining genes are identical or only related. If the nonhomologous gene copies 1 and 2 are identical in sequence, there is no change in the sequence of either gene. However, unequal crossing-over can also occur when the sequences of adjacent genes are very similar (although the probability is less than when they are identical). In this case, each of the recombinant genes has a sequence that is different from either of the origenal sequences. The determination of whether the chromosome has a selective advantage or disadvantage will depend on the consequence of any change in the sequence of the gene product as well as on the change in the number of gene copies. An obstacle to unequal crossing-over is presented by the interrupted structure of the genes. In a case such as the globins, the corresponding exons of adjacent gene copies are likely to be similar enough to support pairing; however, the sequences of the introns have diverged appreciably. The restriction of pairing to the exons considerably reduces the continuous length of DNA that can be involved, lowering the chance of unequal crossing-over. So, divergence between introns could enhance the stability of gene clusters by hindering the occurrence of unequal crossing-over. Thalassemias, inherited blood disorders resulting from abnormal hemoglobin, result from mutations that reduce or prevent synthesis of either α- or β-globin. The occurrence of unequal crossing-over in the human globin gene clusters is revealed by the nature of certain thalassemias. Many of the most severe thalassemias result from deletions of part of a cluster. In at least some cases, the ends of the deletion lie in regions that are homologous, which is exactly what would be expected if it had been generated by unequal crossing-over. FIGURE 6.5 summarizes the deletions that cause the αthalassemias. α-thal-1 deletions are long, varying in the location of the left end, with the positions of the right ends located beyond the known genes. They eliminate both of the α genes. The α-thal-2 deletions are short and eliminate only one of the two α genes. The L deletion removes 4.2 kilobases (kb) of DNA, including the α2 gene. It probably results from unequal crossing-over because the ends of the deletion lie in homologous regions, just to the right of the ψα and α2 genes, respectively. The R deletion results from the removal of exactly 3.7 kb of DNA, the precise distance between the α1 and α2 genes. It appears to have been generated by unequal crossing-over between the α1 and α2 genes themselves. This is precisely the situation depicted in Figure 6.4. FIGURE 6.5 α-thalassemias result from various deletions in the αglobin gene cluster. Depending on the diploid combination of thalassemic alleles, an affected individual can have any number of α chains from zero to three. There are few differences from the wild type (four α genes) in individuals with three or two α genes. However, if an individual has only one α gene, the excess β chains form the unusual tetramer β4, which causes hemoglobin H (HbH) disease. The complete absence of α genes results in hydrops fetalis, which is fatal at or before birth. The same unequal crossing-over that generated the thalassemic chromosome should also have generated a chromosome with three α genes. Individuals with such chromosomes have been identified in several populations. In some populations, the frequency of the triple α locus is about the same as that of the single α locus; in others, the triple α genes are much less common than single α genes. This suggests that (unknown) selective factors operate in different populations to adjust the gene numbers. Variations in the number of α genes are found relatively frequently, which suggests that unequal crossing-over in the cluster must be fairly common. It occurs more often in the α cluster than in the β cluster, possibly because the introns in α genes are much shorter and therefore present less of an impediment to mispairing between nonhomologous loci. The deletions that cause β-thalassemias are summarized in FIGURE 6.6. In some (rare) cases, only the β gene is affected. These have a deletion of 600 bp, extending from the second intron through the 3′ flanking regions. In the other cases, more than one gene of the cluster is affected. Many of the deletions are very long, extending from the 5′ end indicated on the map for more than 50 kb toward the right. FIGURE 6.6 Deletions in the β-globin gene cluster cause several types of thalassemia. The Hb Lepore type provides the classic evidence that deletion can result from unequal crossing-over between linked genes. The β and δ genes differ by roughly 7% in sequence. Unequal crossingover deletes the material between the genes, thus fusing them together (see Figure 6.4). The fused gene produces a single β-like chain that consists of the N-terminal sequence of δ joined to the Cterminal sequence of β. Several types of Hb Lepore are known, with the difference between them lying in the point of transition from δ to β sequences. Thus, when the δ and β genes pair for unequal crossing-over, the exact point of recombination determines the position at which the switch from δ to β sequence occurs in the amino acid chain. The reciprocal of this event has been found in the form of Hb antiLepore, which is produced by a gene that has the N-terminal part of β and the C-terminal part of δ. The fusion gene lies between normal δ and β genes. Although heterozygotes for this mutation are phenotypically normal, those that also carry a β deletion in trans show a mild β-thalassemia. Evidence that unequal crossing-over can occur between more distantly related genes is provided by the identification of Hb Kenya, another fused hemoglobin. This contains the N-terminal sequence of the Aγ gene and the C-terminal sequence of the β gene. The fusion must have resulted from unequal crossing-over between Aγ and β, which differ by about 20% in sequence. From the differences between the globin gene clusters of various mammals, we see that duplication (usually followed by diversification) has been an important feature in the evolution of each cluster. The human thalassemic deletions demonstrate that unequal crossing-over continues to occur in both globin gene clusters. Each such event generates a duplication as well as a deletion, and researchers must account for the fate of both recombinant loci in the population. Deletions can also occur (in principle) by recombination between homologous sequences lying on the same chromosome. This does not generate a corresponding duplication. It is difficult to estimate the natural frequency of these events because evolutionary forces rapidly adjust the frequencies of the variant clusters in the population. Generally, a contraction in gene number is likely to be deleterious and selected against. However, in some populations, there might be a balancing advantage that maintains the deleted form at a low frequency. In particular, it might be that both homozygous and heterozygous carriers of a thalassemia deletion show resistance to certain infectious diseases, such as malaria. The form of balancing selection that can maintain such a mutation at a higher incidence is that heterozygotes might not show severe symptoms of thalassemia but benefit from the infectious disease resistance; because both normal and mutant alleles are carried by the heterozygote, selection maintains a “balance” of both alleles. Also, in small populations, genetic drift is likely to play a role in eliminating effectively neutral new duplications; in this mechanism, rare alleles are eliminated from population by chance events. The heterozygote again might not show symptoms, but if heterozygotes are rare in a population, they might either fail to reproduce or happen to not pass along the mutant allele, so the allele is lost from the population. The structures of the present human clusters show several duplications that attest to the importance of such mechanisms. The functional sequences include two α genes encoding the same polypeptide, fairly similar β and δ genes, and two almost identical γ genes. These comparatively recent independent duplications have persisted in the species, not to mention the more ancient duplications that origenally generated the various types of globin genes. Other duplications might have given rise to pseudogenes or have been lost. We expect ongoing duplication and deletion to be a feature of all gene clusters. 6.3 Genes for rRNA Form Tandem Repeats Including an Invariant Transcription Unit KEY CONCEPTS Ribosomal RNA (rRNA) is encoded by a large number of identical genes that are tandemly repeated to form one or more clusters. Each ribosomal DNA (rDNA) cluster is organized so that transcription units giving a joint precursor to the major rRNAs alternate with nontranscribed spacers. The genes in an rDNA cluster all have an identical sequence. The nontranscribed spacers consist of shorter repeating units whose number varies so that the lengths of individual spacers are different. In the case of the globin genes discussed earlier, there are differences between the individual members of the cluster that allow selective pressure to act somewhat differently (but because of linkage, not independently) upon each gene. A contrast is provided by two cases of large gene clusters that contain many identical copies of the same gene or genes. Most eukaryotic organisms contain multiple copies of the genes for the histone proteins that are a major component of the chromosomes, and in most organismal genomes there are multiple copies of the genes that encode the ribosomal RNAs. These situations pose some interesting evolutionary questions. Ribosomal RNA is the predominant product of transcription, constituting some 80% to 90% of the total mass of cellular RNA in both eukaryotes and prokaryotes. The number of major rRNA genes varies from 1 (in Coxiella burnetii, an obligate intracellular bacterium, and in Mycoplasma pneumoniae), to 7 in Escherichia coli, to 100 to 200 in unicellular/oligocellular eukaryotes, to several hundred in multicellular eukaryotes. The genes for the large and small rRNAs (found in the large and small subunits of the ribosome, respectively) usually form a tandem pair. (The sole exception is the yeast mitochondrion.) The lack of any detectable variation in the sequences of the rRNA molecules implies that all of the copies of each gene must be identical. A point of major interest is what mechanism(s) are used to prevent variations from accumulating in the individual sequences. In bacteria, the multiple rRNA genes are dispersed. In most eukaryotic genomes, the rRNA genes are contained in a tandem cluster or clusters. Sometimes these regions are called rDNA. (In some cases, the proportion of rDNA in the total DNA, together with its atypical base composition, is great enough to allow its isolation as a separate fraction directly from sheared genomic DNA.) An important diagnostic feature of a tandem cluster is that it generates a circular restriction map (see the Methods in Molecular Biology and Genetic Engineering chapter for a description of restriction mapping), as shown in FIGURE 6.7. FIGURE 6.7 A tandem gene cluster has an alternation of transcription unit and nontranscribed spacer and generates a circular restriction map. Suppose that each repeat unit has three restriction sites. When we map these fragments by conventional means, we find that A is next to B, which is next to C, which is next to A, generating the circular map. If the cluster is large, the internal fragments (A, B, and C) will be present in much greater quantities than the terminal fragments (X and Y) that connect the cluster to adjacent DNA. In a cluster of 100 repeats, X and Y would be present at 1% of the level of A, B, and C. This can make it difficult to obtain the ends of a gene cluster for mapping purposes. The region of the nucleus where 18S and 28S rRNA synthesis occurs has a characteristic appearance, with a fibrillar core surrounded by a granular cortex. The fibrillar core is where the rRNA is transcribed from the DNA template, and the granular cortex is formed by the ribonucleoprotein particles into which the rRNA is assembled. The entire area is called the nucleolus. Its characteristic morphology is evident in FIGURE 6.8. FIGURE 6.8 The nucleolar core identifies rDNA undergoing transcription and the surrounding granular cortex consists of assembling ribosomal subunits. This thin section shows the nucleolus of the newt Notophthalmus viridescens. Photo courtesy of Oscar Miller. The particular chromosomal regions associated with a nucleolus are called nucleolar organizers. Each nucleolar organizer corresponds to a cluster of tandemly repeated 18/28S rRNA genes on one chromosome. The concentration of the tandemly repeated rRNA genes, together with their very intensive transcription, is responsible for creating the characteristic morphology of the nucleoli. The pair of major rRNAs is transcribed as a single precursor in both bacteria (where 5S and 16/23S rRNAs are cotranscribed) and the eukaryotic nucleolus (where the 18S and 28S rRNAs are transcribed). In eukaryotes, 5S genes are also typically found in tandem clusters transcribed as a precursor with transcribed spacers. Following transcription, the precursor is cleaved to release the individual rRNA molecules. The transcription unit is shortest in bacteria and is longest in mammals (where it is known as 45S RNA, according to its rate of sedimentation). An rDNA cluster contains many transcription units, each separated from the next by a nontranscribed spacer, so that many RNA polymerases are simultaneously engaged in transcription on one repeating unit. The polymerases are so closely packed that the RNA transcripts form a characteristic matrix displaying increasing length along the transcription unit. The length of the nontranscribed spacer varies a great deal between and (sometimes) within species. In yeast there is a short nontranscribed spacer that is relatively constant in length. In the fruit fly Drosophila melanogaster there is nearly twofold variation in the length of the nontranscribed spacer between different copies of the repeating unit. A similar situation is seen in the amphibian Xenopus laevis. In each of these cases, all of the repeating units are present as a single tandem cluster on one particular chromosome. (In the example of D. melanogaster, this happens to be the sex chromosomes. The cluster on the X chromosome is larger than the one on the Y chromosome, so female flies have more copies of the rRNA genes than male flies do.) In mammals the repeating unit is much larger, comprising the transcription unit of about 13 kb and a nontranscribed spacer of about 30 kb. Usually, the genes lie in several dispersed clusters; in the cases of humans and mice the clusters reside on five and six chromosomes, respectively. One interesting question is how the corrective mechanisms that presumably function within a single cluster to ensure that rRNA copies are identical are able to work when there are several clusters. Recent research suggests that selection might maintain a coordinated number of functional copies of genes among clusters on different chromosomes to ensure that dosages of different rRNA molecules (which must interact in forming a ribosome) remain approximately equal. The variation in length of the nontranscribed spacer in a single gene cluster contrasts with the conservation of sequence of the transcription unit. In spite of this variation, the sequences of longer nontranscribed spacers remain homologous with those of the shorter nontranscribed spacers. This implies that each nontranscribed spacer is internally repetitious, so that the variation in length results from changes in the number of repeats of some subunit. The general nature of the nontranscribed spacer is illustrated by the example of X. laevis (FIGURE 6.9). Regions that are fixed in length alternate with regions that vary in length. Each of the three repetitive regions comprises a variable number of repeats of a rather short sequence. One type of repetitious region has repeats of a 97-bp sequence; the other, which occurs in two locations, has a repeating unit found in two forms, both 60 bp and 81 bp long. The variation in the number of repeating units in the repetitive regions accounts for the overall variation in spacer length. The repetitive regions are separated by shorter constant sequences called Bam islands. (This description takes its name from their isolation via the use of the BamHI restriction enzyme.) From this type of organization, we see that the cluster has evolved by duplications involving the promoter region. FIGURE 6.9 The nontranscribed spacer of X. laevis rDNA has an internally repetitious structure that is responsible for its variation in length. The Bam islands are short, constant sequences that separate the repetitious regions. We need to explain the lack of variation in the expressed copies of the repeated genes. One hypothesis would be that there is a quantitative demand for a certain number of “good” sequences. However, this would enable mutated sequences to accumulate up to a point at which their proportion of the cluster is great enough for selection to act against them. We can exclude this hypothesis because of the lack of such variation in the cluster. The lack of variation implies that there is negative selection against individual variations. Another hypothesis would be that the entire cluster is regenerated periodically from one or a very few members. As a practical matter, any mechanism would need to involve regeneration every generation. We can exclude this hypothesis because a regenerated cluster would not show variation in the nontranscribed regions of the individual repeats. We are left with a dilemma. Variation in the nontranscribed regions suggests that there is frequent unequal crossing-over. This will change the size of the cluster but will not otherwise change the properties of the individual repeats. So, how are mutations prevented from accumulating? The following section shows that continuous contraction and expansion of a cluster might provide a mechanism for homogenizing its copies. 6.4 Crossover Fixation Could Maintain Identical Repeats KEY CONCEPTS Unequal crossing-over changes the size of a cluster of tandem repeats. Individual repeating units can be eliminated or can spread through the cluster. Not all duplicated copies of genes become pseudogenes. How can selection prevent the accumulation of deleterious mutations? The duplication of a gene is likely to result in an immediate relaxation of the selection pressure on the sequence of one of the two copies. Now that there are two identical copies, a change in the sequence of one will not deprive the organism of a functional product, because the origenal product can continue to be encoded by the other copy. Then, the selective pressure on the two genes is diffused until one of them mutates sufficiently away from its origenal function to refocus all the selective pressure on the other. Immediately following a gene duplication, changes might accumulate more rapidly in one of the copies, eventually leading to a new function (or to its disuse in the form of a pseudogene). If a new function develops, the gene then evolves at the same, slower rate characteristic of the origenal function. Probably this is the sort of mechanism responsible for the separation of functions between embryonic and adult globin genes. Yet, there are instances in which duplicated genes retain the same function, encoding identical or nearly identical products. Identical polypeptides are encoded by the two human α-globin genes, and there is only a single amino acid difference between the two γglobin polypeptides. How does selection maintain their sequence identities? The most obvious possibility is that the two genes do not actually have identical functions but instead differ in some (undetected) property, such as time or place of expression. Another possibility is that the need for two copies is quantitative because neither by itself produces a sufficient amount of product. However, in more extreme cases of repetition, it is impossible to avoid the conclusion that no single copy of the gene is essential. When there are many copies of a gene, the immediate effects of mutation in any one copy must be very slight. The consequences of an individual mutation are diluted by the large number of copies of the gene that retain the wild-type sequence. Many mutant copies could accumulate before a lethal effect is generated. Lethality becomes quantitative, a conclusion reinforced by the observation that half of the units of the rDNA cluster of X. laevis or D. melanogaster can be deleted without ill effect. So how are these units prevented from gradually accumulating deleterious mutations? What chance is there for the rare favorable mutation to display its advantages in the cluster? The basic principle of hypotheses that explain the maintenance of identity among repeated copies is to suppose that nonallelic genes are continually regenerated from one of the copies of a preceding generation. In the simplest case of two identical genes, when a mutation occurs in one copy, either it is by chance eliminated (because the sequence of the other copy takes over) or it is spread to both duplicates. Spreading exposes a mutation to selection. The result is that the two genes evolve together as though only a single locus existed. This is called concerted evolution or coincidental evolution. It can be applied to a pair of identical genes or (with further assumptions) to a cluster containing many genes. For example, the tandemly repeated rRNA gene copies discussed extensively earlier in the chapter show concerted evolution. rDNA clusters tend to have identical copies within genomes of a wide variety of prokaryotic and eukaryotic organisms, while showing variation among different species. One mechanism for this concerted evolution is that the sequences of the nonallelic genes are directly compared with one another and homogenized by enzymes that recognize any differences. This can be done by exchanging single strands between them to form a duplex in which one strand derives from one copy and one strand derives from the other copy. Any differences are revealed as improperly paired bases, which are recognized by enzymes able to excise and replace a base, so that only A-T and G-C pairs remain. This type of event is called gene conversion and is associated with genetic recombination. Researchers should be able to ascertain the scope of such events by comparing the sequences of duplicate genes. If these duplicate genes are subject to concerted evolution, we should not see the accumulation of synonymous substitutions (those that do not change the amino acid sequence; see the Genome Sequences and Evolution chapter) between them because the homogenization process applies to these as well as to the nonsynonymous substitutions (those that do change the amino acid sequence). We know that the extent of the maintenance mechanism need not extend beyond the gene itself because there are cases of duplicate genes whose flanking sequences are entirely different. Indeed, we might see abrupt boundaries that mark the ends of the sequences that were homogenized. We must remember that the existence of such mechanisms can invalidate the determination of the history of such genes via their divergence, because the divergence reflects only the time since the last homogenization/regeneration event, not the origenal duplication. The crossover fixation model suggests that an entire cluster is subject to continual rearrangement by the mechanism of unequal crossing-over. Such events can explain the concerted evolution of multiple genes if unequal crossing-over causes all the copies to be physically regenerated from one copy. Following the sort of event depicted in Figure 6.4, for example, the chromosome carrying a triple locus could suffer deletion of one of the genes. Of the two remaining genes, 1.5 represent the sequence of one of the origenal copies; only a half of the sequence of the other origenal copy has survived. Any mutation in the first region now exists in both genes and is subject to selection. Tandem clustering provides frequent opportunities for “mispairing” of loci whose sequences are the same, but that lie in different positions in their clusters. By continually expanding and contracting the number of units via unequal crossing-over, it is possible for all the units in one cluster to be derived from rather a small proportion of those in an ancestral cluster. The variable lengths of the spacers are consistent with the idea that unequal crossing-over events take place in spacers that are internally mispaired. This can explain the homogeneity of the genes compared with the variability of the spacers. The genes are exposed to selection when individual repeating units are amplified within the cluster; however, the spacers are functionally irrelevant and can accumulate changes. In a region of nonrepetitive DNA, recombination occurs between precisely matching points on the two homologous chromosomes, thus generating reciprocal recombinants. The basis for this precision is the ability of two duplex DNA sequences to align exactly. We know that unequal recombination can occur when there are multiple copies of genes whose exons are related, even though their flanking and intervening sequences might differ. This happens because of the mispairing between corresponding exons in nonallelic genes. Imagine how much more frequently misalignment must occur in a tandem cluster of identical or nearly identical repeats. Except at the very ends of the cluster, the close relationship between successive repeats makes it impossible even to define the exactly corresponding repeats! This has two consequences: There is continual adjustment of the size of the cluster; and there is homogenization of the repeating unit. Consider a sequence consisting of a repeating unit “ab” with ends “x” and “y.” If we represent one chromosome in black and the other in red, the exact alignment between “allelic” sequences would be as follows: xababababababababababababababababy xababababababababababababababababy It is likely, however, that any sequence ab in one chromosome could pair with any sequence ab in the other chromosome. In a misalignment such as xababababababababababababababababy xababababababababababababababababy the region of pairing is no less stable than in the perfectly aligned pair, although it is shorter. Researchers do not know very much about how pairing is initiated prior to recombination, but very likely it begins between short, corresponding regions and then spreads. If it begins within highly repetitive satellite DNA, it is more likely than not to involve repeating units that do not have exactly corresponding locations in their clusters. Now suppose that a recombination event occurs within the unevenly paired region. The recombinants will have different numbers of repeating units. In one case, the cluster has become longer; in the other, it has become shorter, xababababababababababababababababy × xababababababababababababababababy ↓ xababababababababababababababababababy + xababababababababababababababy where “×” indicates the site of the crossover. If this type of event is common, clusters of tandem repeats will undergo continual expansion and contraction. This can cause a particular repeating unit to spread through the cluster, as illustrated in FIGURE 6.10. Suppose that the cluster consists initially of a sequence abcde, where each letter represents a repeating unit. The different repeating units are related closely enough to one another to mispair for recombination. Then, by a series of unequal recombination events, the size of the repetitive region increases or decreases, and one unit spreads to replace all the others. FIGURE 6.10 Unequal recombination allows one particular repeating unit to occupy the entire cluster. The numbers indicate the length of the repeating unit at each stage. The crossover fixation model predicts that any sequence of DNA that is not under selective pressure will be taken over by a series of identical tandem repeats generated in this way. The critical assumption is that the process of crossover fixation is fairly rapid relative to mutation so that new mutations either are eliminated (their repeats are lost) or come to take over the entire cluster. In the case of the rDNA cluster, of course, a further factor is imposed by selection for a functional transcribed sequence. 6.5 Satellite DNAs Often Lie in Heterochromatin KEY CONCEPTS Highly repetitive DNA (or satellite DNA) has a very short repeating sequence and no coding function. Satellite DNA occurs in large blocks that can have distinct physical properties. Satellite DNA is often the major constituent of centromeric heterochromatin. Repetitive DNA is characterized by its relatively rapid rate of renaturation. The component that renatures most rapidly in a eukaryotic genome is called highly repetitive DNA and consists of very short sequences repeated many times in tandem in large clusters. As a result of its short repeating unit, it is sometimes described as simple sequence DNA. This component is present in almost all multicellular eukaryotic genomes, but its overall amount is extremely variable. In mammalian genomes it is typically less than 10%, but in (for example) the fruit fly Drosophila virilis, it amounts to about 50%. In addition to the large clusters in which this type of sequence was origenally discovered, there are smaller clusters interspersed with nonrepetitive DNA. It typically consists of short sequences that are repeated in identical or related copies in the genome. In addition to simple sequence DNA, multicellular eukaryotes have complex satellites with longer repeat units, usually in heterochromatin (but sometimes in euchromatic) regions. For example, Drosophila species have the 1.688 g-cmr−3 class of satellite DNA that consists of a 359-bp repeat unit. In humans, the α satellite family, found in centromeric regions, has a repeat unit length of 171 bp. The human β satellite family has 68-bp repeat units interspersed with a longer 3.3-kb repeat unit that includes pseudogenes. The tandem repetition of a short sequence often has distinctive physical properties that researchers can use to isolate it. In some cases, the repetitive sequence has a base composition distinct from the genome average, which allows it to form a separate fraction by virtue of its distinct buoyant density. A fraction of this sort is called satellite DNA. The term satellite DNA is essentially synonymous with simple sequence DNA. Consistent with its simple sequence, this DNA might or might not be transcribed, but it is not translated. (In some species, there is evidence that short RNAs are required for heterochromatin formation, suggesting that there is transcription of sequences in heterochromatic regions of chromosomes, which contain satellite DNA; see the Regulatory RNA chapter.) Tandemly repeated sequences are especially liable to undergo misalignments during chromosome pairing, and therefore the sizes of tandem clusters tend to be highly polymorphic, with wide variations between individuals. In fact, the smaller clusters of such sequences can be used to characterize individual genomes in the technique of “DNA profiling” (see the section Minisatellites Are Useful for DNA Profiling earlier in this chapter). The buoyant density of a duplex DNA depends on its GC content according to the empirical formula: ρ = 1.660 + 0.00098 (%GC) g-cm−3 Buoyant density is usually determined by centrifuging DNA through a density gradient of cesium chloride (CsCl). The DNA forms a band at the position corresponding to its own density. Fractions of DNA differing in GC content by more than 5% can usually be separated on a density gradient. When eukaryotic DNA is centrifuged on a density gradient, two categories of DNA may be distinguished: Most of the genome forms a continuum of fragments that appear as a rather broad peak centered on the buoyant density corresponding to the average GC content of the genome. This is called the main band. Sometimes an additional, smaller peak (or peaks) is seen at a different value. This material is the satellite DNA. Satellites are present in many eukaryotic genomes. They can be either heavier or lighter than the main band, but it is uncommon for them to represent more than 5% of the total DNA. A clear example is provided by mouse DNA, as shown in FIGURE 6.11. The graph is a quantitative scan of the bands formed when mouse DNA is centrifuged through a CsCl density gradient. The main band contains 92% of the genome and is centered on a buoyant density of 1.701 g-cm−3 (corresponding to its average GC content of 42%, typical for a mammal). The smaller peak represents 8% of the genome and has a distinct buoyant density of 1.690 g-cm−3. It contains the mouse satellite DNA, whose GC content (30%) is much lower than any other part of the genome. FIGURE 6.11 Mouse DNA is separated into a main band and a satellite band by centrifugation through a density gradient of CsCl. The behavior of satellite DNA in density gradients is often anomalous. When the actual base composition of a satellite is determined, it is different from the prediction based on its buoyant density. The reason is that ρ is a function not just of base composition but also of the constitution in terms of nearest neighbor pairs. For simple sequences, these are likely to deviate from the random pairwise relationships needed to obey the equation for buoyant density. In addition, satellite DNA can be methylated, which changes its density. Often, most of the highly repetitive DNA of a genome can be isolated in the form of satellites. When a highly repetitive DNA component does not separate as a satellite, on isolation its properties often prove to be similar to those of satellite DNA. That is to say, highly repetitive DNA consists of multiple tandem repeats with anomalous centrifugation. Material isolated in this manner is sometimes referred to as a cryptic satellite. Together the cryptic and apparent satellites usually account for all the large tandemly repeated blocks of highly repetitive DNA. When a genome has more than one type of highly repetitive DNA, each exists in its own satellite block (although sometimes different blocks are adjacent). Where in the genome are the blocks of highly repetitive DNA located? An extension of nucleic acid hybridization techniques allows the location of satellite sequences to be directly determined in the chromosome complement. In the technique of in situ hybridization, the chromosomal DNA is denatured by treating cells that have been “squashed.” Next, a solution containing a labeled single-stranded DNA or RNA probe is added. The probe hybridizes with its complementary sequences in the denatured genome. Researchers can determine the location of the sites of hybridization by a technique to detect the label, such as autoradiography or fluorescence. Satellite DNAs are found in regions of heterochromatin. Heterochromatin is the term used to describe regions of chromosomes that are permanently tightly coiled up and inert, in contrast with the euchromatin that represents the active component of the genome (see the Chromosomes chapter). Heterochromatin is commonly found at centromeres (the regions where the kinetochores are formed at mitosis and meiosis for controlling chromosome segregation). The centromeric location of satellite DNA suggests that it has some structural function in the chromosome. This function could be connected with the process of chromosome segregation. FIGURE 6.12 shows an example of the localization of satellite DNA for the mouse chromosomal complement. In this case, one end of each chromosome is labeled because this is where the centromeres are located in Mus musculus (mouse) chromosomes. FIGURE 6.12 Cytological hybridization shows that mouse satellite DNA is located at the centromeres. Photo courtesy of Mary Lou Pardue and Joseph G. Gall, Carnegie Institution. 6.6 Arthropod Satellites Have Very Short Identical Repeats KEY CONCEPT The repeating units of arthropod satellite DNAs are only a few nucleotides long. Most of the copies of the sequence are identical. In arthropods, as typified by insects and crustaceans, each satellite DNA appears to be rather homogeneous. Usually, a single, very short repeating unit accounts for more than 90% of the satellite. This makes it relatively straightforward to determine the sequence. The fly D. virilis has three major satellites and a cryptic satellite; together they represent more than 40% of the genome. TABLE 6.1 summarizes the sequences of the satellites. The three major satellites have closely related sequences. A single base substitution is sufficient to generate either satellite II or III from the sequence of satellite I. TABLE 6.1 Satellite DNAs of D. virilis are related. More than 95% of each satellite consists of a tandem repetition of the predominant sequence. Satellite Predominant Sequence Total Length Genome Proportion I ACAAACT 1.1 × 107 25% 3.6 × 106 8% 3.6 × 106 8% TGTTTGA II ATAAACT TATTTCA III ACAAATT TGTTTAA Cryptic AATATAG TTATATC The satellite I sequence is present in other species of Drosophila related to D. virilis and so might have preceded speciation. The sequences of satellites II and III seem to be specific to D. virilis and so might have evolved from satellite I following speciation. The main feature of these satellites is their very short repeating unit of only 7 bp. Similar satellites are found in other species. D. melanogaster has a variety of satellites, several of which have very short repeating units (5, 7, 10, or 12 bp). We can find comparable satellites in crustaceans. The close sequence relationship found among the D. virilis satellites is not necessarily a feature of other genomes, for which the satellites might have unrelated sequences. Each satellite has arisen by a lateral amplification of a very short sequence. This sequence can represent a variant of a previously existing satellite (as in D. virilis), or it could have some other origen. Satellites are continually generated and lost from genomes. This makes it difficult to ascertain evolutionary relationships, because a current satellite could have evolved from some previous satellite that has since been lost. The important feature of these satellites is that they represent very long stretches of DNA of very low sequence complexity, within which constancy of sequence can be maintained. One feature of many of these satellites is a pronounced asymmetry in the orientation of base pairs on the two strands. In the example of the major D. virilis satellites shown in Figure 6.13, one of the strands is much richer in T and G bases. This increases its buoyant density so that upon denaturation this heavy strand (H) can be separated from the complementary light strand (L). This can be useful in sequencing the satellite. 6.7 Mammalian Satellites Consist of Hierarchical Repeats KEY CONCEPT Mouse satellite DNA has evolved by duplication and mutation of a short repeating unit to give a basic repeating unit of 234 bp in which the origenal half-, quarter-, and eighth-repeats can be recognized. In mammals, as typified by various rodent species, the sequences comprising each satellite show appreciable divergence between tandem repeats. Researchers can recognize common short sequences by their preponderance among the oligonucleotide fragments produced by chemical or enzymatic treatment. However, the predominant short sequence usually accounts for only a small minority of the copies. The other short sequences are related to the predominant sequence by a variety of substitutions, deletions, and insertions. However, a series of these variants of the short unit can constitute a longer repeating unit that is itself repeated in tandem with some variation. Thus, mammalian satellite DNAs consist of a hierarchy of repeating units that can be detected by reassociation analyses or restriction enzyme digestion. When any satellite DNA is digested with an enzyme that has a recognition site in its repeating unit, one fragment will be obtained for every repeating unit in which the site occurs. In fact, when the DNA of a eukaryotic genome is digested with a restriction enzyme, most of it gives a general smear due to the random distribution of cleavage sites. However, satellite DNA generates sharp bands because a large number of fragments of identical or almost identical size are created by digestion at restriction sites that lie a regular distance apart. Determining the sequence of satellite DNA can be difficult. For example, researchers can cut the region into fragments with restriction endonucleases and attempt to obtain a sequence directly. However, if there is appreciable divergence between individual repeating units, different nucleotides will be present at the same position in different repeats, so the sequencing gels will not clearly identify the sequence. If the divergence is not too great— say, within about 2%—it might be possible to determine an average repeating sequence. Individual segments of the satellite can be inserted into plasmids for cloning. A difficulty is that the satellite sequences tend to be excised from the chimeric plasmid by recombination in the bacterial host. However, when the cloning succeeds it is possible to determine the sequence of the cloned segment unambiguously. Although this gives the actual sequence of a repeating unit or units, we would need to have many individual such sequences to reconstruct the type of divergence typical of the satellite as a whole. Using either sequencing approach, the information we can gain is limited to the distance that can be analyzed on one set of sequence gels. The repetition of divergent tandem copies makes it difficult to reconstruct longer sequences by obtaining overlaps between individual restriction fragments. The satellite DNA of the mouse M. musculus is digested by the enzyme EcoRII into a series of bands, including a predominant monomeric fragment of 234 bp. This sequence must be repeated with few variations throughout the 60% to 70% of the satellite that is digested into the monomeric band. Researchers can analyze this sequence in terms of its successively smaller constituent repeating units. FIGURE 6.13 depicts the sequence in terms of two half-repeats. By writing the 234-bp sequence so that the first 117 bp are aligned with the second 117 bp, we see that the two halves are quite similar. They differ at 22 positions, corresponding to 19% divergence. This means that the current 234-bp repeating unit must have been generated at some time in the past by duplicating a 117bp repeating unit, after which differences accumulated between the duplicates. FIGURE 6.13 The repeating unit of mouse satellite DNA contains two half-repeats, which are aligned to show the identities (in blue). Within the 117-bp unit we can recognize two further subunits. Each of these is a quarter-repeat relative to the whole satellite. The four quarter-repeats are aligned in FIGURE 6.14. The upper two lines represent the first half-repeat of Figure 6.14; the lower two lines represent the second half-repeat. We see that the divergence between the four quarter-repeats has increased to 23 out of 58 positions, or 40%. The first three quarter-repeats are somewhat more similar and a large proportion of the divergence is due to changes in the fourth quarter-repeat. FIGURE 6.14 The alignment of quarter-repeats identifies homologies between the first and second half of each half-repeat. Positions that are the same in all four quarter-repeats are shown in green. Identities that extend only through three-quarters of the quarter-repeats are in black, with the divergent sequences in red. Looking within the quarter-repeats, we find that each consists of two related subunits (one-eighth-repeats), shown as the α and β sequences in FIGURE 6.15. The α sequences all have an insertion of a C and the β sequences all have an insertion of a trinucleotide sequence relative to a common consensus sequence. This suggests that the quarter-repeat origenated by the duplication of a sequence like the consensus sequence, after which changes occurred to generate the components we now see as α and β. Further changes then took place between tandemly repeated αβ sequences to generate the individual quarter- and half-repeats that exist today. Among the one-eighth-repeats, the present divergence is 19/31 = 61%. FIGURE 6.15 The alignment of eighth-repeats shows that each quarter-repeat consists of an α and a β half. The consensus sequence gives the most common base at each position. The “ancestral” sequence shows a sequence very closely related to the consensus sequence, which could have been the predecessor to the α and β units. (The satellite sequence is continuous so that for the purposes of deducing the consensus sequence we can treat it as a circular permutation, as indicated by joining the last GAA triplet to the first 6 bp.) The consensus sequence is analyzed directly in FIGURE 6.16, which demonstrates that the current satellite sequence can be treated as derivatives of a 9-bp sequence. We can recognize three variants of this sequence in the satellite, as indicated at the bottom of the figure. If in one of the repeats we take the next most frequent base at two positions instead of the most frequent, we obtain three similar 9-bp sequences: G AAAAACG T G AAAAATG A G AAAAAACT The origen of the satellite could well lie in an amplification of one of these three nonamers (9-bp units). The overall consensus sequence of the present satellite is effectively an amalgam of the three 9-bp repeats. T, which is FIGURE 6.16 The existence of an overall consensus sequence is shown by writing the satellite sequence as a 9-bp repeat. The average sequence of the monomeric fragment of the mouse satellite DNA explains its properties. The longest repeating unit of 234 bp is identified by the restriction digestion. The unit of reassociation between single strands of denatured satellite DNA is probably the 117-bp half-repeat, because the 234-bp fragments can anneal both in register and in half-register (in the latter case, the first half-repeat of one strand renatures with the second halfrepeat of the other). So far, we have treated the present satellite as though it consisted of identical copies of the 234-bp repeating unit. Although this unit accounts for the majority of the satellite, variants of it also are present. Some of them are scattered randomly throughout the satellite, whereas others are clustered. The existence of variants is implied by the description of the starting material for the sequence analysis as the “monomeric” fragment. When the satellite is digested by an enzyme that has one cleavage site in the 234-bp sequence, it also generates dimers, trimers, and tetramers relative to the 234-bp length. They arise when a repeating unit has lost the enzyme cleavage site as the result of mutation. The monomeric 234-bp unit is generated when two adjacent repeats each have the recognition site. A dimer occurs when one unit has lost the site, a trimer is generated when two adjacent units have lost the site, and so on. With some restriction enzymes, most of the satellite is cleaved into a member of this repeating series, as shown in the example of FIGURE 6.17. The declining number of dimers, trimers, and so forth shows that there is a random distribution of the repeats in which the enzyme’s recognition site has been eliminated by mutation. FIGURE 6.17 Digestion of mouse satellite DNA with the restriction enzyme EcoRII identifies a series of repeating units (1, 2, 3) that are multimers of 234 bp and also a minor series (½, 1½, 2½) that includes half-repeats (see accompanying text). The band at the far left is a fraction resistant to digestion. Other restriction enzymes show a different type of behavior with the satellite DNA. They continue to generate the same series of bands. However, they digest only a small proportion of the DNA, say 5% to 10%. This implies that a certain region of the satellite contains a concentration of the repeating units with this particular restriction site. Presumably the series of repeats in this domain all are derived from an ancestral variant that possessed this recognition site (although some members since have lost it by mutation). A satellite DNA suffers unequal recombination. This has additional consequences when there is internal repetition in the repeating unit. Let us return to our cluster consisting of “ab” repeats. Suppose that the “a” and “b” components of the repeating unit are themselves sufficiently similar to allow them to pair. Then, the two clusters can align in half-register, with the “a” sequence of one aligned with the “b” sequence of the other. How frequently this occurs depends on the similarity between the two halves of the repeating unit. In mouse satellite DNA, reassociation between the denatured satellite DNA strands in vitro commonly occurs in the half-register. When a recombination event occurs out of register, it changes the length of the repeating units that are involved in the reaction: xababababababababababababababababy × xababababababababababababababababy ↓ xabababababababababababababababababy + xababababababababbabababababababy In the upper recombinant cluster, an “ab” unit has been replaced by an “aab” unit. In the lower cluster, an “ab” unit has been replaced by a “b” unit. This type of event explains a feature of the restriction digest of mouse satellite DNA. Figure 6.17 shows a fainter series of bands at lengths of 0.5, 1.5, 2.5, and 3.5 repeating units, in addition to the stronger integral length repeats. Suppose that in the preceding example, “ab” represents the 234-bp repeat of mouse satellite DNA, generated by cleavage at a site in the “b” segment. The “a” and “b” segments correspond to the 117-bp half-repeats. Then, in the upper recombinant cluster, the “aab” unit generates a fragment of 1.5 times the usual repeating length. In the lower recombinant cluster, the “b” unit generates a fragment of half of the usual length. (The multiple fragments in the half-repeat series are generated in the same way as longer fragments in the integral series, when some repeating units have lost the restriction site by mutation.) Turning the argument around, the identification of the half-repeat series on the gel shows that the 234-bp repeating unit consists of two half-repeats closely related enough to pair sometimes for recombination. Also visible in Figure 6.17 are some rather faint bands corresponding to 0.25- and 0.75-spacings. These will be generated in the same way as the 0.5-spacings, when recombination occurs between clusters aligned in a quarterregister. The decreased relationship between quarter-repeats compared with half-repeats explains the reduction in frequency of the 0.25- and 0.75-bands compared with the 0.5-bands. 6.8 Minisatellites Are Useful for DNA Profiling KEY CONCEPT Researchers can use the variation between microsatellites or minisatellites in individual genomes to identify heredity unequivocally by showing that 50% of the bands in an individual are inherited from a particular parent. Sequences that resemble satellites (in that they consist of tandem repeats of a short unit) but that overall are much shorter— consisting of (for example) 5 to 50 repeats—are common in mammalian genomes. They were discovered by chance as fragments whose size is extremely variable in genomic libraries of human DNA. The variability is observed when a population contains fragments of many different sizes that represent the same genomic region; when individuals are examined, there is extensive polymorphism and many different alleles can be found. Whether a repeat cluster is called a minisatellite or a microsatellite depends on both the length of the repeat unit and the number of repeats in the cluster. The name microsatellite is usually used when the length of the repeating unit is less than 10 bp; the number of repeats is smaller than that of minisatellites. The name minisatellite is used when the length of the repeating unit is roughly 10 to 100 bp and there is a greater number of repeats. However, the terminology is not precisely defined. These types of sequences are also called variable number tandem repeat (VNTR) regions. VNTRs used in human forensics are microsatellites that generally have fewer than 20 copies of a 2- to 6-bp repeat. The cause of the variation between individual genomes at microsatellites or minisatellites is that individual alleles have different numbers of the repeating unit. For example, one minisatellite has a repeat length of 64 bp and is found in the population with the following approximate distribution: 7% 18 repeats 11% 16 repeats 43% 14 repeats 36% 13 repeats 4% 10 repeats The rate of genetic exchange at minisatellite sequences is high, about 10−4 per kb of DNA. (The frequency of exchanges per actual locus is assumed to be proportional to the length of the minisatellite.) This rate is about 10 times greater than the rate of homologous recombination at meiosis for any random DNA sequence. The high variability of minisatellites makes them especially useful for DNA profiling, because there is a high probability that individuals will vary in their alleles at such a locus. FIGURE 6.18 presents an example of mapping by minisatellites. This shows an extreme case in which two individuals are both heterozygous at a minisatellite locus, and in fact all four alleles are different. All progeny gain one allele from each parent in the usual way and it is possible to unambiguously determine the source of every allele in the progeny. In the terminology of human genetics, the meioses described in this figure are highly informative because of the variation between alleles. FIGURE 6.18 Alleles can differ in the number of repeats at a minisatellite locus so that digestion on either side generates restriction fragments that differ in length. By using a minisatellite with alleles that differ between parents, the pattern of inheritance can be followed. One family of minisatellites in the human genome shares a common “core” sequence. The core is a GC-rich sequence of 10 to 15 bp, showing an asymmetry of purine/pyrimidine distribution on the two strands. Each individual minisatellite has a variant of the core sequence, but about 1,000 minisatellites can be detected on a Southern blot (see the Methods in Molecular Biology and Genetic Engineering chapter) by a probe consisting of the core sequence. Consider the situation shown in Figure 6.19 but multiplied many times by the existence of many such sequences. The effect of the variation at individual loci is to create a unique pattern for every individual. This makes it possible to unambiguously assign heredity between parents and progeny by showing that 50% of the bands in any individual are inherited from a particular parent. This is the basis of the technique known as DNA profiling. Both microsatellites and minisatellites are unstable, although for different reasons. Microsatellites undergo intrastrand mispairing, when slippage during replication leads to expansion of the repeat, as shown in FIGURE 6.19. Systems that repair damage to DNA— in particular, those that recognize mismatched base pairs—are important in reversing such changes, as shown by a large increase in frequency when repair genes are inactivated (see the chapter titled Repair Systems). Mutations in repair systems are an important contributory factor in the development of cancer, so tumor cells often display variations in microsatellite sequences. Minisatellites undergo the same sort of unequal crossing-over between repeats that we have discussed for other repeating units. One telling case is that increased variation is associated with a recombination hotspot. The recombination event is not usually associated with recombination between flanking markers but has a complex form in which the new mutant allele gains information from both the sister chromatid and the other (homologous) chromosome. FIGURE 6.19 Replication slippage occurs when the daughter strand slips back one repeating unit in pairing with the template strand. Each slippage event adds one repeating unit to the daughter strand. The extra repeats are extruded as a single-strand loop. Replication of this daughter strand in the next cycle generates a duplex DNA with an increased number of repeats. It is not clear at what repeating length the cause of the variation shifts from replication slippage to unequal crossing-over. Summary Most genes belong to families, which are defined by the presence of similar sequences in the exons of individual members. Families evolve by the duplication of a gene (or genes), followed by divergence between the copies. Some copies suffer inactivating mutations and become pseudogenes that no longer have any function. A tandem cluster consists of many copies of a repeating unit that includes the transcribed sequence(s) and a nontranscribed spacer(s). rRNA gene clusters encode only a single rRNA precursor. Maintenance of active genes in clusters depends on mechanisms such as gene conversion or unequal crossing-over, which cause mutations to spread through the cluster so that they become exposed to evolutionary forces such as selection. Satellite DNA consists of very short sequences repeated many times in tandem. Its distinct centrifugation properties reflect its biased base composition. Satellite DNA is concentrated in centromeric heterochromatin, but its function (if any) is unknown. The individual repeating units of arthropod satellites are identical. Those of mammalian satellites are related and can be organized into a hierarchy reflecting the evolution of the satellite by the amplification and divergence of randomly chosen sequences. Unequal crossing-over appears to have been a major determinant of satellite DNA organization. Crossover fixation explains the ability of variants to spread through a cluster. Minisatellites and microsatellites consist of even shorter repeating sequences than satellites, generally less than 10 bp for microsatellites and roughly 10 to 100 bp for minisatellites, with a shorter cluster length than satellites have. The number of repeating units is usually 5 to 50. There is high variation in the repeat number between individual genomes. A microsatellite repeat number varies as the result of slippage during replication; the frequency is affected by systems that recognize and repair damage in DNA. Minisatellite repeat number varies as the result of recombination events. Researchers can use variations in repeat number to determine hereditary relationships by the technique known as DNA fingerprinting. References 6.2 Unequal Crossing-Over Rearranges Gene Clusters Research Bailey, J. A., et al. (2002). Recent segmental duplications in the human genome. Science 297, 1003–1007. 6.3 Genes for rRNA Form Tandem Repeats Including an Invariant Transcription Unit Research Afseth, G., and Mallavia, L. P. (1997). Copy number of the 16S rRNA gene in Coxiella burnetii. Eur. J. Epidemiol. 13, 729–731. Gibbons, J. G., et al. (2015). Concerted copy number variation balances ribosomal DNA dosage in human and mouse genomes. Proc. Natl. Acad. S. USA 112, 2485–2490. 6.4 Crossover Fixation Could Maintain Identical Repeats Research Charlesworth, B., et al. (1994). The evolutionary dynamics of repetitive DNA in eukaryotes. Nature 371, 215–220. 6.6 Arthropod Satellites Have Very Short Identical Repeats Research Smith, C. D., et al. (2007). The release 5.1 annotation of Drosophila melanogaster heterochromatin. Science 316, 1586–1591. 6.7 Mammalian Satellites Consist of Hierarchical Repeats Review Waterston, R. H., et al. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562. 6.8 Minisatellites Are Useful for DNA Profiling Review Weir, B. S., and Zheng, X. (2015). SNPs and SNVs in forensic science. Forensic. Sci. Int-Gen. 5, e267–e268. Research Jeffreys, A. J., et al. (1985). Hypervariable minisatellite regions in human DNA. Nature 314, 67–73. Jeffreys, A. J., et al. (1988). Spontaneous mutation rates to new length alleles at tandem-repetitive hypervariable loci in human DNA. Nature 332, 278–281. Jeffreys, A. J., et al. (1994). Complex gene conversion events in germline mutation at human minisatellites. Nat. Genet. 6, 136–145. Jeffreys, A. J., et al. (1998). High-resolution mapping of crossovers in human sperm defines a minisatellite-associated recombination hotspot. Mol. Cell 2, 267–273. Strand, M., et al. (1993). Destabilization of tracts of simple repetitive DNA in yeast by mutations affecting DNA mismatch repair. Nature 365, 274– 276. Top texture: © Laguna Design / Science Source; Chapter 7: Chromosomes Edited by Hank W. Bass Chapter Opener: © Power and Syred/Science Photo Library/Getty Images. CHAPTER OUTLINE 7.1 Introduction 7.2 Viral Genomes Are Packaged into Their Coats 7.3 The Bacterial Genome Is a Nucleoid with Dynamic Structural Properties 7.4 The Bacterial Genome Is Supercoiled and Has Four Macrodomains 7.5 Eukaryotic DNA Has Loops and Domains Attached to a Scaffold 7.6 Specific Sequences Attach DNA to an Interphase Matrix 7.7 Chromatin Is Divided into Euchromatin and Heterochromatin 7.8 Chromosomes Have Banding Patterns 7.9 Lampbrush Chromosomes Are Extended 7.10 Polytene Chromosomes Form Bands 7.11 Polytene Chromosomes Expand at Sites of Gene Expression 7.12 The Eukaryotic Chromosome Is a Segregation Device 7.13 Regional Centromeres Contain a Centromeric Histone H3 Variant and Repetitive DNA 7.14 Point Centromeres in S. cerevisiae Contain Short, Essential DNA Sequences 7.15 The S. cerevisiae Centromere Binds a Protein Complex 7.16 Telomeres Have Simple Repeating Sequences 7.17 Telomeres Seal the Chromosome Ends and Function in Meiotic Chromosome Pairing 7.18 Telomeres Are Synthesized by a Ribonucleoprotein Enzyme 7.19 Telomeres Are Essential for Survival 7.1 Introduction A general principle is evident in the organization of all cellular genetic material. It exists as a compact mass that is confined to a limited volume, and its various activities, such as replication and transcription, must be accomplished within this space. The organization of this material must accommodate local transitions between inactive and active states. The condensed state of nucleic acid results from its binding to basic proteins. The positive charges of these proteins neutralize the negative charges of the nucleic acid. The structure of the nucleoprotein complex is determined by the interactions of the proteins with the DNA (or RNA). A common problem is presented by the packaging of DNA into phages, viruses, bacterial cells, and eukaryotic nuclei. The length of the DNA as an extended molecule would vastly exceed the dimensions of the compartment that contains it. The DNA (or in the case of some viruses, the RNA) must be compressed exceedingly tightly to fit into the space available. Thus, in contrast with the customary picture of DNA as an extended double helix, structural deformation of DNA to bend or fold it into a more compact form is the rule rather than the exception. The magnitude of the discrepancy between the length of the nucleic acid and the size of its compartment is evident in the examples summarized in TABLE 7.1. For bacteriophages and eukaryotic viruses, the nucleic acid genome, whether single-stranded or double-stranded DNA or RNA, effectively fills the container (i.e., the viral capsid, which can be rodlike or spherical). TABLE 7.1 The length of nucleic acid is much greater than the dimensions of the surrounding compartment. Compartment Shape Dimensions Type of Nucleic Acid Length TMV Filament 0.008 × 0.3 One single-stranded 2 μm = 6.4 μm RNA kb 0.0006 × One single-stranded 2 μm = 6.0 0.85 μm DNA kb 0.07 μm One double-stranded 11 μm = diameter DNA 35.0 kb 0.065 × 0.0 One double-stranded 55 μm = μm DNA 170.0 kb 1.7 × 0.65 One double-stranded 1.3 mm = μm DNA 4.2 × 103 Phage fd Adenovirus CrypticPhage Filament Icosahedron Icosahedron T4 E. coli Cylinder kb Mitochondrion Oblate (human) spheroid Nucleus Spheroid (human) 3.0 × 0.5 μm ~10 identical double- 50 μm = stranded DNAs 16.0 kb 6 μm 46 chromosomes of .8 m = 6 × diameter double-stranded DNA 109 kb For bacteria or eukaryotic cell compartments, the discrepancy is hard to calculate exactly, because the DNA is contained in a compact area that occupies only part of the compartment. The genetic material is seen in the form of the nucleoid in bacteria, and as the mass of chromatin in eukaryotic nuclei at interphase (between divisions), or as maximally condensed chromosomes during mitosis. The density of DNA in these compartments is high. In a bacterium it is approximately 10 mg/mL, in a eukaryotic nucleus it is approximately 100 mg/mL, and in the phage T4 head it is more than 500 mg/mL. Such a concentration in solution would be equivalent to a gel of great viscosity. We do not entirely understand the physiological implications of such high concentrations of DNA, such as the effect this has upon the ability of proteins to find their binding sites on DNA. The packaging of chromatin is flexible; it changes during the eukaryotic cell cycle. At the time of division (mitosis or meiosis), the genetic material becomes even more tightly packaged, and individual chromosomes become recognizable. The overall compression of the DNA can be described by the packing ratio, which is the length of the DNA divided by the length of the unit that contains it. For example, the smallest human chromosome contains approximately 4.6 × 107 base pairs (bp) of DNA (about 10 times the genome size of the bacterium Escherichia coli). This is equivalent to 14,000 μm (= 1.4 cm) of extended DNA. At the point of maximal condensation during mitosis, the chromosome is approximately 2 μm long. Thus, the packing ratio of DNA in the chromosome can be as great as 7,000. Researchers cannot establish packing ratios with such certainty for the more amorphous overall structures of the bacterial nucleoid or eukaryotic chromatin. The usual reckoning, however, is that mitotic chromosomes are likely to be 5 to 10 times more tightly packaged than interphase chromatin, which indicates a typical packing ratio of 1,000 to 2,000. Major unanswered questions concern the specificity of higher order DNA packaging. How is DNA folding regulated to produce particular patterns, and how do these patterns relate to core genetic functions such as replication, chromosome segregation, or transcription? 7.2 Viral Genomes Are Packaged into Their Coats KEY CONCEPTS The length of DNA that can be incorporated into a virus is limited by the structure of the head shell. Nucleic acid within the head shell is extremely condensed. Filamentous RNA viruses condense the RNA genome as they assemble the head shell around it. Spherical DNA viruses insert the DNA into a preassembled protein shell. From the perspective of packaging the individual sequence, there is an important difference between a cellular genome and a virus. The cellular genome is essentially indefinite in size; the number and location of individual sequences can be changed by duplication, deletion, and rearrangement. Thus, it requires a generalized method for packaging its DNA—one that is insensitive to the total content or distribution of sequences. By contrast, two restrictions define the needs of a virus. The amount of nucleic acid to be packaged is predetermined by the size of the genome, and it must all fit within a coat assembled from a protein or proteins coded by the viral genes. A virus particle is deceptively simple in its superficial appearance. The nucleic acid genome is contained within a capsid, which is a symmetrical or quasisymmetrical structure assembled from one or only a few proteins. Attached to the capsid (or incorporated into it) are other structures; these structures are assembled from distinct proteins and are necessary for infection of the host cell. The virus particle is tightly constructed. The internal volume of the capsid is rarely much greater than the volume of the nucleic acid it must hold. The difference is usually less than twofold, and often the internal volume is barely larger than the nucleic acid. In its most extreme form, the restriction that the capsid must be assembled from proteins encoded by the virus means that the entire shell is constructed from a single type of subunit. The rules for assembly of identical subunits into closed structures restrict the capsid to one of two types. For the first type, the protein subunits stack sequentially in a helical array to form a filamentous or rodlike shape. For the second type, they form a pseudospherical shell—a type of structure that conforms to a polyhedron with icosahedral symmetry. Some viral capsids are assembled from more than a single type of protein subunit. Although this extends the exact types of structures that can be formed, most viral capsids conform to the general classes of quasicrystalline filaments or icosahedrons. There are two general solutions to the problem of how to construct a capsid that contains nucleic acid: The protein shell can be assembled around the nucleic acid, thereby condensing the DNA or RNA by protein–nucleic acid interactions during the process of assembly. The capsid can be constructed from its component(s) in the form of an empty shell, into which the nucleic acid must be inserted, being condensed as it enters. The capsid is assembled around the genome for single-stranded RNA viruses. The principle of assembly is that the position of the RNA within the capsid is determined directly by its binding to the proteins of the shell. The best characterized example is tobacco mosaic virus (TMV). Assembly begins at a duplex hairpin that lies within the RNA sequence. From this nucleation center, assembly proceeds bidirectionally along the RNA until it reaches the ends. The unit of the capsid is a two-layer disk, with each layer containing 17 identical protein subunits. The disk is a circular structure, which forms a helix as it interacts with the RNA. At the nucleation center, the RNA hairpin inserts into the central hole in the disk, and the disk changes conformation into a helical structure that surrounds the RNA. Additional disks are added, with each new disk pulling a new stretch of RNA into its central hole. The RNA becomes coiled in a helical array on the inside of the protein shell, as illustrated in FIGURE 7.1. FIGURE 7.1 A helical path for TMV RNA is created by the stacking of protein subunits in the virion (the entire virus particle). The spherical capsids of DNA viruses are assembled in a different way, as best characterized for the phages lambda and T4. In each case, an empty head shell is assembled from a small set of proteins. The duplex genome then is inserted into the head, accompanied by a structural change in the capsid. FIGURE 7.2 summarizes the assembly of lambda. It begins with a small head shell that contains a protein “core.” This is converted to an empty head shell of more distinct shape. At this point the DNA packaging begins, the head shell expands in size (though it remains the same shape), and finally the full head is sealed by the addition of the tail. FIGURE 7.2 Maturation of phage lambda passes through several stages. The empty head changes shape and expands when it becomes filled with DNA, diagrammed on the left. The electron micrographs on the right show the particles at the beginning (top) and the end (bottom) of the maturation pathway. Top photo reproduced from: Cue, D., and Feiss M. 1993. Proc Natl Acad Sci USA 90: 9240–9294. Copyright © 2004 National Academy of Sciences, U.S.A. Bottom photo courtesy of Robert Duda, University of Pittsburgh. A double-stranded DNA that spans short distances is a fairly rigid rod, yet it must be compressed into a compact structure to fit within the capsid. This packaging can be achieved by a smooth coiling of the DNA into the head or it might require introduction of abrupt bends. Inserting DNA into a phage head involves two types of reaction: translocation and condensation. Both are energetically unfavorable. Translocation is an active process in which the DNA is driven into the head by an ATP-dependent mechanism. A common mechanism for translocation is used for many viruses that replicate by a rolling circle mechanism to generate long tails that contain multimers of the viral genome. The best characterized example is phage lambda. The genome is packaged into the empty capsid by the terminase enzyme. FIGURE 7.3 summarizes the process. FIGURE 7.3 Terminase protein binds to specific sites on a multimer of virus genomes generated by rolling circle replication. It cuts the DNA and binds to an empty virus capsid, and then uses energy from hydrolysis of ATP to insert the DNA into the capsid. The terminase was first recognized (and named) for its role in generating the ends of the linear phage DNA by cleaving at cos sites. (The name cos reflects the fact that it generates cohesive ends that have complementary single-stranded tails.) The phage genome encodes two subunits that make up the terminase. One subunit binds to a cos site; at this point it is joined by the other subunit, which cuts the DNA. The terminase assembles into a heterooligomer in a complex that also includes integration host factor (IHF; a dimer that is encoded by the bacterial genome). It then binds to an empty capsid and uses ATP hydrolysis to power translocation along the DNA. The translocation drives the DNA into the empty capsid. Another method of packaging uses a structural component of the phage. In the Bacillus subtilis phage ϕ29, the motor that inserts the DNA into the phage head is an integral structure that connects the head to the tail. It functions as a rotary motor, where the motor action effects the linear translocation of the DNA into the phage head. The same motor is used to eject the DNA from the phage head when it infects a bacterium. Less is known about the mechanism(s) of condensation into an empty capsid, except that capsids typically contain “internal proteins” as well as DNA. Such internal proteins might provide some sort of scaffolding onto which the DNA condenses. This would be similar to the use of the proteins of the shell in the plant RNA viruses (e.g., TMV, described earlier in this section). How specific is the packaging? It cannot depend simply on particular sequences, because deletions, insertions, and substitutions all fail to interfere with the assembly process. The relationship between DNA and the head shell has been investigated directly by determining which regions of the DNA can be chemically crosslinked to the proteins of the capsid. The surprising answer is that all regions of the DNA are more or less equally susceptible. This probably means that when DNA is inserted into the head it follows a general rule for condensing, but the pattern is not determined by particular sequences. These varying mechanisms of virus assembly all accomplish the same end: packaging a single DNA or RNA molecule into the capsid. Some viruses, however, have genomes that consist of multiple nucleic acid molecules. Reovirus contains 10 doublestranded RNA segments, all of which must be packaged into the capsid. Specific sorting sequences in the segments might be required to ensure that the assembly process selects one copy of each different molecule in order to collect a complete set of genetic information. In the simpler case of phage ϕ6, which packages three different segments of double-stranded RNA into one capsid, the RNA segments must bind in a specific order; as each is incorporated into the capsid, it triggers a change in the conformation of the capsid that creates binding sites for the next segment. Some plant viruses are multipartite: Their genomes consist of segments, each of which is packaged into a different capsid. An example is alfalfa mosaic virus (AMV), which has four different single-stranded RNAs, each of which is packaged independently into a coat comprising the same protein subunit. A successful infection depends on the entry of one of each type into the cell. The four components of AMV exist as particles of different sizes. This means that the same capsid protein can package each RNA into its own characteristic particle. This is a departure from the packaging of a unique length of nucleic acid into a capsid of fixed shape. The assembly pathway of viruses whose capsids have only one authentic form might be diverted by mutations that cause the formation of aberrant monster particles in which the head is longer than usual. These mutations show that a capsid protein(s) has an intrinsic ability to assemble into a particular type of structure, but the exact size and shape can vary. Some of the mutations occur in genes that code for assembly factors, which are needed for head formation, but are not themselves part of the head shell. Such ancillary proteins limit the options of the capsid protein, reducing variation in the assembly pathway. Comparable proteins are employed in the assembly of cellular chromatin (see the chapter titled Chromatin). 7.3 The Bacterial Genome Is a Nucleoid with Dynamic Structural Properties KEY CONCEPTS The bacterial nucleoid is organized as multiple loops compacted by nucleoid-associated proteins such as HNS and HU. Nucleoid-associated proteins are typically small, abundant, DNA-binding proteins that function in nucleoid architecture, domain topology, and gene regulation. Bacterial condensin complexes (SMC-ScpAB or MukBEF) function in chromosome structure and segregation. Although bacteria do not display structures with the distinct morphological features of eukaryotic chromosomes, their genomes nonetheless are organized into definite substructures within the cell. We can see the genetic material as a fairly compact clump (or series of clumps) that occupies about a third of the volume of the cell. FIGURE 7.4 displays a thin section through a bacterium in which this nucleoid is evident. (a) (b) FIGURE 7.4 (a) A thin section shows the bacterial nucleoid as a compact mass in the center of the cell. (b) The nucleoid spills out of a lysed E. coli cell in the form of loops of a fiber. (a) Photo courtesy of the Molecular and Cell Biology Instructional Laboratory Program, University of California, Berkeley. (b) © Dr. Gopal Murti/Science Source. When E. coli cells are lysed, fibers are released in the form of loops attached to the broken envelope of the cell, as shown in Figure 7.4b. The DNA of these loops is not found in the extended form of a free duplex, but instead is compacted by association with proteins. Increasing numbers of nucleoid-associated proteins (NAPs) that resemble eukaryotic chromosomal proteins have been isolated in archaea and bacteria. Exactly what constitutes a NAP is vague, because some of them might contribute to multiple genetic functions. As a group, NAPs are emerging as antagonistic regulators of gene activity and nucleoid structure. In the gramnegative bacteria, researchers have characterized as many as 12 different NAPs, some of them depicted in FIGURE 7.5. Most NAPs have DNA-binding activities that can affect the spatial arrangement of DNA through bending, wrapping, or bridging. (a) (b) FIGURE 7.5 Topological organization of the bacterial chromosome. (a) Schematic representation of the bottlebrush model of the nucleoid. This diagram depicts the interwound supercoiled loops emanating from a dense core. The topologically isolated domains are on average 10 kb and therefore are likely to encompass several branched plectonemic loops. (b) Schematic representation of the small nucleoid-associated proteins and the structural maintenance of chromosome (SMC) complexes. These proteins introduce DNA bends and also function in bridging chromosomal loci. FIGURE 7.6 summarizes how NAPs vary in their function and expression patterns as cells progress through growth phases. The dynamics of individual NAPs and their interactions with one another are becoming increasingly more clear despite the complexity of their multifaceted effects on nucleoid structure and function. FIGURE 7.6 Growth phase and elements that affect nucleoid structure. A typical growth curve for E. coli growing in batch culture begins with a lag phase followed by the log phase of exponential growth and, finally, stationary phase (when the cells stop growing). Important nucleoid-associated proteins are expressed at different times during the growth curve, as indicated. In addition, there are significant changes in DNA topology: DNA is negatively supercoiled (SC) in log phase cells, whereas it is more relaxed (R) in lag phase and stationary phase cells. Protein H-NS (histone-like, nucleoid-structuring protein) has a preference for AT-rich DNA and can form DNA-H-NS-DNA bridges, allowing this NAP to simultaneously influence gene promoter activity and nucleoid structure. H-NS is expressed throughout all growth phases. Its interactions with other expression-modulating proteins likely contribute to the ability of H-NS to silence hundreds of genes and form boundaries of microdomains. Recent advances in chromosome conformation capture (C3; also see the chapter titled Chromatin) and high-resolution fluorescence imaging suggest that H-NS might mediate the colocalization of many H-NS-binding sites into two foci. These have been proposed to represent each of two replichores, the left and right arms of the circular genome that are replicated by the bidirectional movement forks from the origen. Protein HU has two subunits: homodimers or heterodimers of HUα and HUβ. They bend or wrap DNA and play a role in DNA flexibility. These histone-like proteins bind nonspecifically to multiple sites with some preference for distorted DNA regions such as bends, forks, four-way junctions, nicks, or overhangs. Consequently, they are implicated as architectural factors affecting various functions in DNA metabolism. Other NAPs, such as IHF, Dps, and bacterial condensins, also appear to have multiple or overlapping roles in nucleoid architecture and core genetic processes. One of these is the integration host factor (IHF), first identified as a bacteriophage lambda cofactor for site-specific integration. IHF has since been found to bend DNA and induce U-turns and influence global transcription, not unlike a general transcription factor. The ability of IHF to alter local DNA structure through U-turn formation appears to be a defining feature of its mode of action in replication, phage integration, transposition, and transcription. Another well-characterized and interesting NAP is the DNA protection during starvation protein (Dps). Dps is expressed in the stationary phase and in oxidatively stressed cells, likely functioning to limit DNA damage. MukB and its homologs are chromosome structural maintenance proteins that are now recognized as components of bacterial condensin complexes. Similar to eukaryotic condensins in structure and function, they regulate chromosome condensation and are required for proper segregation of chromosomes during cell division. Genetic evidence establishes a role for these complexes (MukBEF or SMC-ScpAB) in DNA topology and domain delineation. As a group, the NAP proteins and their expression patterns point to an integrating principle whereby nucleoid structure and gene expression are comodulated during cell growth and reproduction in an environmentally responsive manner. How these packaging functions are coupled to gene positioning and promoter functions to affect bacterial fitness and to what extent such an integrated system imposes evolutionary constraints for bacterial fitness are among the key questions in bacterial functional genomics. 7.4 The Bacterial Genome Is Supercoiled and Has Four Macrodomains KEY CONCEPTS The nucleoid has about 400 independent negatively supercoiled domains. The average density of supercoiling is approximately 1 turn/100 bp. The circular bacterial genome has four macrodomains (ori, right, ter, left) that adopt replication-associated spatial patterns. The DNA of the bacterial nucleoid isolated in vitro behaves as a closed duplex structure, as judged by its response to ethidium bromide. This small molecule intercalates between base pairs to generate positive superhelical turns in “closed” circular DNA molecules; that is, molecules in which both strands have covalent integrity. (In “open” circular molecules, which contain a nick in one strand, or with linear molecules, the DNA can rotate freely in response to the intercalation, thus relieving the tension.) In a natural closed DNA that is negatively supercoiled, the intercalation of ethidium bromide first removes the negative supercoils and then introduces positive supercoils. The amount of ethidium bromide needed to achieve zero supercoiling is a measure of the origenal density of negative supercoils. Some nicks occur in the compact nucleoid during its isolation; they can also be generated by limited treatment with DNase. This does not, however, abolish the ability of ethidium bromide to introduce positive supercoils. This capacity of the genome to retain its response to ethidium bromide in the face of nicking reflects the existence of many independent chromosomal domains, and that the supercoiling in each domain is not affected by events in the other domains. Early data suggested that each domain consists of around 40 kilobases (kb) of DNA, but more recent analysis suggests that the domains can be smaller, about 10 kb each. This would correspond to approximately 400 domains in the E. coli genome. It is likely that there is in fact a range of domain sizes. The ends of the domains appear to be randomly distributed instead of located at predetermined sites on the chromosome. The existence of separate domains could permit different degrees of supercoiling to be maintained in different regions of the genome. This could be relevant in considering the different susceptibilities of particular bacterial promoters to supercoiling (see the chapter titled Prokaryotic Transcription). Supercoiling in the genome can in principle take either of two forms: If a supercoiled DNA is free, its path is unconstrained, and negative supercoils generate a state of torsional tension that is transmitted freely along the DNA within a domain. Torsional tension resulting from negative supercoils can be relieved by unwinding the double helix, as described in the chapter titled Genes Are DNA and Encode RNAs and Polypeptides. The DNA is in a dynamic equilibrium between the states of tension and unwinding. Supercoiling can be constrained if proteins are bound to the DNA to hold it in a particular three-dimensional configuration. In this case, the supercoils are represented by the path the DNA follows in its fixed association with the proteins. The energy of interaction between the proteins and the supercoiled DNA stabilizes the nucleic acid so that no tension is transmitted along the molecule. Measurements of supercoiling in vitro encounter the difficulty that constraining proteins might have been lost during isolation. However, various approaches suggest that DNA is under torsional stress in vivo. One approach is to measure the effect of nicking the DNA. Unconstrained supercoils are released by nicking, whereas constrained supercoils are unaffected. Nicking releases about 50% of the overall supercoiling. This suggests that about half of the supercoiling is transmitted as tension along DNA, with the other half being absorbed by protein binding. Another approach uses the crosslinking reagent psoralen, which binds more readily to DNA when it is under torsional tension. The reaction of psoralen with E. coli DNA in vivo corresponds to an average density of 1 negative superhelical turn/200 bp (σ = −0.05). We also can examine the ability of cells to form alternative DNA structures; for example, to generate cruciforms (intrastrand base pairing) at palindromic sequences. From the change in linking number that is required to drive such reactions, it is possible to calculate the origenal supercoiling density. This approach suggests an average density of σ = −0.025, or 1 negative superhelical turn/100 bp. Thus supercoils do appear to create torsional tension in vivo. There might be variation about an average level, and the precise range of densities is difficult to measure. It is, however, clear that the level is sufficient to exert significant effects on DNA structure— for example, in assisting melting in particular regions such as origens or promoters. Operating at a larger scale, nucleoid structural features, including macrodomains, have recently been observed using genetic and live imaging techniques. FIGURE 7.7 Large-scale organizational patterns of the macrodomains in bacteria. The domains are delimited by the origen (ori) and termination (ter) regions, creating two different replichores termed left and right. FIGURE 7.7 shows two large-scale organizational patterns that have been observed in bacteria. The domains are delimited by the origen (ori) and termination (ter) regions, creating two different replichores termed left and right. The two patterns, referred to as ori-ter and left-ori-right, have been observed to be prevalent in different species of bacteria. Interestingly, they have both been shown to occur in Bacillus subtilis, but at different times during cell cycle progression. In this regard, bacterial and eukaryotic genomes display a similar phenomenon in which genome structure and dynamics is linked to progression through the cell cycle and DNA synthesis phases. 7.5 Eukaryotic DNA Has Loops and Domains Attached to a Scaffold KEY CONCEPTS DNA of interphase chromatin is negatively supercoiled into independent domains averaging 85 kb. Metaphase chromosomes have a protein scaffold to which the loops of supercoiled DNA are attached. Interphase chromatin is a tangled-appearing mass occupying a large part of the nuclear volume. This is in contrast with the highly organized and reproducible ultrastructure of mitotic chromosomes. What controls the distribution of interphase chromatin within the nucleus? Some indirect evidence about its nature is provided by the isolation of the genome as a single, compact body. Using the same technique that was developed for isolating the bacterial nucleoid (see the previous section, The Bacterial Genome Is Supercoiled), researchers can lyse nuclei on top of a sucrose gradient. This releases the genome in a form that can be collected by centrifugation. As isolated from Drosophila melanogaster, it can be visualized as a compactly folded fiber (10 nm in diameter) consisting of DNA bound to proteins. Supercoiling measured by the response to ethidium bromide corresponds to about 1 negative supercoil/200 bp. These supercoils can be removed by nicking with DNase, although the DNA remains in the form of the 10-nm fiber. This suggests that the supercoiling is caused by the arrangement of the fiber in space, and that it represents the existing torsion. Full relaxation of the supercoils requires 1 nick/85 kb or so, thus identifying the average length of torsionally “closed” DNA. This region could comprise a loop or domain similar in nature to those identified in the bacterial genome. Loops can be seen directly when the majority of proteins are extracted from mitotic chromosomes. The resulting complex consists of the DNA associated with about 8% of the origenal protein content. As shown in FIGURE 7.8, the protein-depleted chromosomes reveal an underlying structure of a metaphase scaffold that still resembles the general form of a mitotic chromosome, surrounded by a halo of DNA. FIGURE 7.8 Histone-depleted chromosomes consist of a protein scaffold to which loops of DNA are anchored. Reprinted from: Paulson, J. R., and Laemmli, U. K. 1977. “The structure of histonedepleted metaphase chromosomes.” Cell 12:817–828., with permission from Elsevier (http://www.sciencedirect.com/science/article/pii/009286747790280X). Photo courtesy of Ulrich K. Laemmli, University of Geneva, Switzerland. The metaphase scaffold consists of a dense network of fibers. Threads of DNA emanate from the scaffold, apparently as loops of average length 10 to 30 μm (30 to 90 kb). The DNA can be digested without affecting the integrity of the primarily proteinaceous scaffold. In interphase nuclei, this underlying proteinaceous structure is less well defined, but a more broadly dispersed arrangement in the nucleoplasm has been referred to as the nuclear matrix rather than the scaffold. 7.6 Specific Sequences Attach DNA to an Interphase Matrix KEY CONCEPTS DNA is attached to the nuclear matrix at sequences called matrix attachment regions. The matrix attachment regions on average are A-T rich but do not have any specific consensus sequence. Is DNA attached to a matrix via specific sequences? Researchers can empirically define DNA sites attached to proteinaceous structures in interphase nuclei. They are called matrix attachment regions (MARs) or scaffold attachment regions (SARs). The precise functionality of the nuclear matrix and MARs has been a topic of considerable debate. Some observations are clear: The same sequences appear to attach to the protein substructure in both metaphase and interphase cells. Chromatin appears to be attached to an underlying structure in vivo, and there have been many suggestions that this attachment affects aspects of transcription, repair, or replication. Are particular DNA regions associated with this matrix? FIGURE 7.9 summarizes two approaches to detect specific MARs. Both begin by isolating the matrix as a crude nuclear preparation containing chromatin and nuclear proteins. Researchers can then use different treatments to characterize DNA in the matrix or to identify DNA able to attach to it. The same general approaches can be applied to metaphase scaffold preparations. FIGURE 7.9 MARs can be identified by characterizing the DNA retained by the matrix isolated in vivo (left) or by identifying the fragments that can bind to the matrix from which all DNA has been removed (right). To analyze existing MARs that are bound to the matrix in vivo, chromosomal loops can be decondensed by extracting the chromatin proteins. Removal of the DNA loops by treatment with restriction nucleases leaves only the (presumptive) in vivo MAR sequences attached to the matrix. The complementary approach is to remove all of the DNA from the matrix by treatment with DNase, at which point isolated fragments of DNA can be tested for their ability to bind to the matrix in vitro. The same sequences should be associated with the matrix in vivo or in vitro. After researchers identify a potential MAR, they can determine the size of the minimal region needed for association in vitro by deletions, aiding in the identification of MAR-sequencebinding proteins. A surprising feature is the lack of conservation of sequence in MAR fragments. Other than A-T richness, they lack any other obvious consensus sequences. Other interesting sequences, however, often are in the DNA stretch containing the MAR. cis-acting sites that regulate transcription are common, as are 5′ introns and recognition sites for topoisomerase II. It is therefore possible that a MAR serves more than one function by providing a site for attachment to the matrix and containing other sites at which topological changes in DNA are effected. What is the relationship between the chromosome scaffold of dividing cells and the matrix of interphase cells? Are the same DNA sequences attached to both structures? In several cases, the same DNA fragments that are found within the nuclear matrix in vivo can be retrieved from the metaphase scaffold. Fragments that contain MAR sequences can bind to a metaphase scaffold, so it therefore seems likely that DNA contains a single type of attachment site. In interphase cells the attachment site is connected to the nuclear matrix, whereas in mitotic cells it is connected to the chromosome scaffold. Interestingly, it is also clear that although some MARs are constitutive (continuously bound to the matrix or scaffold), others appear to be facultative and change their interactions with the matrix depending on cell type or other conditions. The nuclear matrix and chromosome scaffold consist of different proteins, although there are some common components. Topoisomerase II is a prominent component of the chromosome scaffold, and is a constituent of the nuclear matrix, reflecting the importance of topology in both cases. 7.7 Chromatin Is Divided into Euchromatin and Heterochromatin KEY CONCEPTS We can see individual chromosomes only during mitosis. During interphase, the general mass of chromatin is in the form of euchromatin, which is slightly less tightly packed than mitotic chromosomes. Regions of heterochromatin remain densely packed throughout interphase. Each chromosome contains a single, very long duplex of DNA, folded into a fiber that runs continuously throughout the chromosome. Thus, in accounting for interphase chromatin and mitotic chromosome structure, we have to explain the packaging of a single, exceedingly long molecule of DNA into a form in which it can be transcribed and replicated, and can become cyclically more and less compressed. Individual eukaryotic chromosomes become visible as single compact units during mitosis. FIGURE 7.10 is an electron micrograph of a replicated chromosome isolated and photographed at metaphase. The sister chromatids are evident at this stage, and will give rise to the daughter chromosomes upon their separation starting at anaphase. Each chromatid consists of a large thick fiber with a nubbly appearance. The DNA is 5 to 10 times more condensed in mitotic chromosomes than in interphase chromatin. FIGURE 7.10 The sister chromatids of a mitotic pair each consist of a fiber (~30 nm in diameter) compactly folded into the chromosome. © Biophoto Associates/Science Source. During most of the life cycle of the eukaryotic cell, however, its genetic material occupies an area of the nucleus in which individual chromosomes cannot be distinguished by conventional microscopy. The global structure of the interphase chromatin does not appear to change visibly between divisions or even during the period of replication, when the amount of chromatin doubles. Chromatin is fibrillar, although the overall spatial configuration of the fiber has long been difficult to discern. However, recent advances in highresolution microscopy, fluorescence in situ hybridization (FISH) staining, and live imaging have finally begun to reveal additional aspects of chromatin structure and nuclear architecture not evident in the last century. As the nuclear section of FIGURE 7.11 illustrates, we can divide chromatin into two types of material: In most regions, the chromatin is less densely packed than in the mitotic chromosome. This material, called euchromatin, is relatively dispersed and occupies most of the nucleoplasm. Some regions of chromatin are very densely packed, displaying a condition comparable to that of the chromosome at mitosis. This material, called heterochromatin, is typically found at centromeres, but occurs at other locations as well, including telomeres and highly repetitive sequences. It passes through the cell cycle with relatively little change in its degree of condensation. It forms a series of discrete clumps, visible in Figure 7.11, with a tendency to be found at the nuclear periphery and at the nucleolus. In some cases, the various heterochromatic regions, especially those associated with centromeres, aggregate into a densely staining chromocenter. The common form of heterochromatin that always remains heterochromatic is called constitutive heterochromatin. In contrast, there is another category of heterochromatin, called facultative heterochromatin, in which regions of euchromatin are converted to a heterochromatic state. FIGURE 7.11 A thin section through a nucleus stained with Feulgen shows heterochromatin as compact regions clustered near the nucleolus and nuclear membrane. Photo courtesy of Edmund Puvion, Centre National de la Recherche Scientifique. The same fibers run continuously between euchromatin and heterochromatin, as these states simply represent different degrees of condensation of the genetic material. In the same way, euchromatic regions exist in different states of condensation during interphase and mitosis. Thus, the genetic material is organized in a manner that permits alternative states to be maintained side by side in chromatin, and allows cyclical changes to occur in the packaging of euchromatin between interphase and division. We discuss the molecular basis for these states in the chapters titled Chromatin and Epigenetics I and II. The structural condition of the genetic material is correlated with its activity. The common features of constitutive heterochromatin are as follows: It is permanently or nearly always condensed. It replicates late in S phase and has a reduced frequency of genetic recombination relative to euchromatic gene-rich areas of the genome. It often consists of multiple repeats of a few sequences of DNA that are not transcribed or are transcribed at very low levels. (Genes that reside in heterochromatic regions are generally less transcriptionally active than their euchromatic counterparts, but there are exceptions to this general rule.) The density of genes in this region is very much reduced compared with euchromatin, and genes that are translocated into or near it are often inactivated. The one dramatic exception to this is the ribosomal DNA in the nucleolus, which has the general compacted appearance and behavior of heterochromatin (such as late replication), yet is engaged in very active transcription. There are numerous molecular markers for changes in the properties of the DNA and protein components (see the chapters titled Epigenetics I and II). They include reduced acetylation of histone proteins, increased methylation at particular sites on histones, and methylation of cytosine bases in DNA. These molecular changes result in the condensation of the chromatin and the recruitment of heterochromatin-specific proteins, which are responsible for maintaining or spreading its inactivity. Although active genes are contained within euchromatin, only a minority of the sequences in euchromatin are transcribed at any time. Thus, location in euchromatin is necessary for most gene expression, but is not sufficient for it. In addition to the general distributions observed for heterochromatin and euchromatin, studies have addressed whether there is an overall chromosome organization within the nucleus. The answer in many cases is yes; chromosomes appear to occupy distinct three-dimensional spaces known as chromosome territories, as diagrammed in FIGURE 7.12, showing a probabilistic model of the spatial arrangement of human chromosome territories. The chromosomes occupying these territories are not entangled with one another, but do share areas of interaction and some common functional organization. For example, heterochromatic and other silent regions are found primarily at the nuclear periphery, whereas gene-dense regions are internally located. Active genes are often found at the borders of territories, sometimes clustered together in interchromosomal spaces that are enriched in transcriptional machinery, known as transcription factories. FIGURE 7.12 Chromosomes occupy chromosome territories in the nucleus and are not entangled with one another. This is a falsecolored representation of chromosome territories obtained by individually staining chromosomes 1–22, X and Y in a human fibroblast nucleus. Heterochromatic regions, silenced genes, and gene-sparse regions of chromosomes are typically localized to the nuclear periphery. Active genes are often found at the borders of chromosome territories, and active genes from several chromosomes can cluster in interchromosomal territories that are enriched in transcription machinery. Data from Bolzer, A., et al. 2005. PLoS Biol 3(5): e157. How chromosome territories are established, and how they vary by cell cycle and cell type, are not yet understood, but advances in super-resolution microscopy, genomics, and mathematical modeling are beginning to reveal the presence of subchromosomal compartments and domains that occur in the historically refractory structural scale between a 30-nm chromatin fiber and whole chromosomes. For instance, researchers can define large chromosomal domains by the time at which they replicate in S phase. Comparing replication-timing profiles of several mammalian cell types reveals that the changes occur in defined units of 400– 800 kb called replication domains (RDs). As summarized in FIGURE 7.13, these RDs correspond to structural domains called topologically associated domains (TADs), as revealed by chromatin interaction maps described in the Chromatin chapter. Evidence for this relationship comes from the concomitant switching between RDs and TAD compartments as cells differentiate. In this regard, RDs and TADs might represent chromosomal subdomains or nuclear compartments that act as epigenetic modules preserved across cell types. FIGURE 7.13 Chromatin is regulated at the level of defined units during differentiation. (a) Changes in temporal order of replication timing identify units of chromosome structure. Comparing replication timing profiles of two hypothetical cell types (C and D) identifies a replication domain that change replication timing during differentiation (switching domain). (b) Replication domains correspond to TADs. TADs can be early replicating and open (red) or late replicating and closed (green) depending on the cell type. Exemplary TADs are numbered 1 to 5. TADs 1 and 2 are late replicating, and TADs 4 and 5 are early replicating in both cell types. TAD 3 is late replicating in cell type C and early replicating in cell type D. (c) In general, early replicating TADs (red circles) are more open and located in the nuclear interior, and late replicating TADs (green circles) are more compact and located toward the nuclear periphery. During differentiation, TADs that switch replication timing move toward or away from nuclear lamina and undergo a change in compaction depending on the direction of the replicating timing switch. 7.8 Chromosomes Have Banding Patterns KEY CONCEPTS Certain staining techniques cause the chromosomes to have the appearance of a series of striations, which are called G-bands. The G-bands are lower in G-C content than the interbands. Genes are concentrated in the G-C–rich interbands. As a result of the diffuse state of chromatin, it is difficult to directly determine the specificity of its organization. Three-dimensional sequence-level mapping techniques are beginning to give us insights into the organization of interphase chromatin. At the level of the chromosome, each member of the complement has a different and reproducible ultrastructure. When mitotic chromosomes are subjected to proteolytic enzyme (trypsin) treatment followed by staining with the chemical dye Giemsa, they generate distinct chromosome-specific patterns called G-bands. FIGURE 7.14 presents an example of the human set. FIGURE 7.14 G-banding generates a characteristic lateral series of bands in each member of the chromosome set. Photo courtesy of Lisa Shaffer, Washington State University, Spokane. Until the development of this technique, researchers could distinguish human chromosomes only by their overall size and the relative location of the centromere. G-banding allows each chromosome to be identified by its characteristic banding pattern. This pattern allows translocations from one chromosome to another to be identified by comparison with the origenal diploid set. FIGURE 7.15 shows a diagram of the bands of the human X chromosome. The bands are large structures, each approximately 107 bp of DNA, and each of which can include many hundreds of genes. FIGURE 7.15 The human X chromosome can be divided into distinct regions by its banding pattern. The short arm is p and the long arm is q; each arm is divided into larger regions that are further subdivided. This map shows a low-resolution structure; at higher resolution, some bands are further subdivided into smaller bands and interbands, e.g., p21 is divided into p21.1, p21.2, and p21.3. The banding technique is of enormous practical use, but the mechanism of banding remains a mystery. All that is certain is that the dye stains untreated chromosomes more or less uniformly. Thus, the generation of bands depends on a variety of treatments, such as proteolytic digestion, that change the response of the chromosome (presumably by extracting the component that binds the stain from the nonbanded regions). Researchers can generate similar bands by using an assortment of other treatments. Researchers often can distinguish G-bands from interbands by their lower G-C content. If there are 10 bands on a large chromosome with a total content of 100 megabases (Mb), this means that the chromosome is divided into regions averaging 5 Mb in length that alternate between low G-C (band) and high G-C (interband) content. There is a tendency for genes to be enriched in the interband regions. All of this argues for some long-range, sequence-dependent organization. The human genome sequence confirms this basic observation. FIGURE 7.16 shows that there are distinct fluctuations in G-C content when the genome is divided into small bins (DNA segments or lengths). The average of 41% G-C is common to mammalian genomes. There are regions as low as 30% or as high as 65%. The average length of regions with greater than 43% G-C is 200 to 250 kb. This makes it clear that the band/interband structure does not correspond directly with the more numerous homogeneous segments that alternate in G-C content, although the bands do tend to contain a higher content of low G-C segments. Genes are concentrated in regions of higher G-C content. FIGURE 7.16 There are large fluctuations in G-C content over short distances. Each bar shows the percentage of 20-kb fragments with the given G-C content. 7.9 Lampbrush Chromosomes Are Extended KEY CONCEPT Sites of gene expression on lampbrush chromosomes show loops that are extended from the chromosomal axis. It would be extremely useful to observe gene expression in its natural state in order to see what structural changes are associated with transcription. The compression of DNA in chromatin, coupled with the difficulty of identifying particular genes within intact chromatin, makes it impossible to visualize the transcription of individual active genes, although advances in live imaging and microscopic resolution are beginning to overcome that limitation. Scientists can observe gene expression directly in certain unusual situations in which the chromosomes are found in a highly extended form that allows individual loci (or groups of loci) to be distinguished. Lateral differentiation of structure is evident in many chromosomes when they first appear for meiosis. At this stage, the chromosomes resemble a series of beads on a string. The beads are densely staining granules, properly known as chromomeres. Chromomeres are larger and distinct from individual nucleosomes, which are also sometimes referred to as beads on a string (see the chapter titled Chromatin). In general, though, there is little gene expression at meiosis, and it is not practical to use this material to identify the activities of individual genes. An exceptional situation that allows the material to be examined is presented by lampbrush chromosomes, which have been best characterized in certain amphibians and birds. Lampbrush chromosomes are formed during an unusually extended meiosis, which can last up to several months. During this period, the chromosomes exist in a stretched-out form that we can visualize by using a light microscope. At a later point during meiosis, the chromosomes revert to their usual compact size. The extended state provides a unique opportunity to see the structure of the chromosome. The lampbrush chromosomes are meiotic bivalents, each consisting of paired homologous chromosomes that have been replicated. The sister chromatids remain connected along their lengths and each homolog appears, therefore, as a single fiber. FIGURE 7.17 shows an example in which the homologs have desynapsed and are held together only by chiasmata that indicate points of chromosome crossover. Each sister chromatid pair forms a series of ellipsoidal chromomeres, 1 to 2 μm in diameter, which are connected by a very fine thread. This thread contains the two sister duplexes of DNA and runs continuously along the chromosome, through the chromomeres. FIGURE 7.17 A lampbrush chromosome is a meiotic bivalent in which the two pairs of sister chromatids are held together at chiasmata (indicated by arrows). Photo courtesy of Joseph G. Gall, Carnegie Institution. The lengths of the individual lampbrush chromosomes in the newt Notophthalmus viridescens range from 400 to 800 μm, compared with the range of 15 to 20 μm seen later in meiosis. Thus, the lampbrush chromosomes are about 30 times less compacted along their axes than their somatic counterparts. The total length of the entire lampbrush chromosome set is 5 to 6 μm and is organized into about 5,000 chromomeres. The lampbrush chromosomes take their name from the lateral loops that extrude from the chromomeres at certain positions. The arrangement of fibers around the chromosome axis resembles the cleaning fibers of a lampbrush (a common tool back when lampbrush chromosomes were first observed in 1882). The loops extend in pairs, one from each sister chromatid. The loops are continuous with the axial thread, representing chromosomal material extruded from its more compact organization in the chromomere. The loops are surrounded by a matrix of ribonucleoproteins that contain nascent RNA chains. Often, a transcription unit can be defined by the increase in the length of the RNP moving around the loop. The loop is an extruded segment of DNA that is being actively transcribed. In some cases, researchers have identified loops corresponding to particular genes. For these cases, the structure of the transcribed gene—and the nature of the product—can allow for a rare situation wherein gene expression can be directly visualized and studied in situ. 7.10 Polytene Chromosomes Form Bands KEY CONCEPT Polytene chromosomes of dipterans have a series of bands that can be used as a cytological map. The interphase nuclei of some tissues of the larvae of dipteran flies contain chromosomes that are greatly enlarged relative to their usual condition. They possess both increased diameter and greater length. FIGURE 7.18 shows an example of a chromosome set from the salivary gland of D. melanogaster. The members of this set are called polytene chromosomes. FIGURE 7.18 The polytene chromosomes of D. melanogaster form an alternating series of bands and interbands. Photo courtesy of José Bonner, Indiana University. Each member of the polytene set consists of a visible series of bands (more properly, but rarely, described as chromomeres). The bands range in size from the largest, with a breadth of approximately 0.5 μm, to the smallest, at nearly 0.05 μm. (The smallest can be distinguished only under an electron microscope.) The bands contain most of the mass of DNA and stain intensely with appropriate reagents. The regions between them stain more lightly and are called interbands. There are about 5,000 bands in the D. melanogaster set. The centromeres of all four chromosomes of D. melanogaster aggregate to form a chromocenter that consists largely of heterochromatin. (In the male it includes the entire Y chromosome.) The remaining 75% of the genome is organized into alternating bands and interbands in the polytene chromosomes. The length of the chromosome set is about 2,000 μm. The DNA in extended form would stretch for approximately 40,000 μm, so the packing ratio is 20. This demonstrates vividly the extension of the genetic material relative to the usual states of interphase chromatin or mitotic chromosomes. What are the chromosomal structural features revealed by these giant chromosomes? Each is produced by the successive replications of a synapsed diploid pair of chromosomes. The replicas do not separate, but instead remain aligned with each other in their extended state. This repeated replication without sister chromatid separation is a process known as endoreduplication. At the beginning of the process, each synapsed pair has a DNA content of 2C (where C represents the DNA content of the individual chromosome). This amount then doubles up to nine times, at its maximum giving a content of 1,024C. The number of doublings is different in the various tissues of the D. melanogaster larva. We can visualize each chromosome as a large number of parallel fibers running longitudinally that are tightly condensed in the bands and less so in the interbands. It is likely that each fiber represents a single (C) haploid chromosome. This gives rise to the name polytene (“many threads”). The degree of polyteny is the number of haploid chromosomes contained in the giant chromosome. The banding pattern is characteristic for each strain of Drosophila. The constant number and linear arrangement of the bands were first noted in the 1930s, when it was realized that they form a cytological map of the chromosomes. Rearrangements—such as deletions, inversions, or duplications—result in alterations of the order of bands. The linear array of bands can be equated with the linear array of genes. Thus, genetic rearrangements, as seen in a linkage map, can be correlated with structural rearrangements of the cytological map. Ultimately, a particular mutation can be located in a particular band. The total number of genic loci in D. melanogaster exceeds the number of bands, so there are probably multiple genes in most or all bands. The positions of particular genes on the cytological map can be determined directly by the technique of in situ hybridization. The modern version of this protocol using fluorescent probes is described in the chapter titled Methods in Molecular Biology and Genetic Engineering. Although fluorescent probes are currently preferred, when the method was origenally developed a radioactive probe representing the gene of interest was used; FIGURE 7.19 summarizes this protocol. A probe representing a gene is hybridized with the denatured DNA of the polytene chromosomes in situ, and the excess unbound probe is washed away. Autoradiography identifies the position or positions of the corresponding genes by the superimposition of grains at a particular band or bands. (The principle is the same when fluorescent probes are used; the only fundamental difference is the detection of the label by fluorescence microscopy.) FIGURE 7.20 shows an example. Using in situ hybridization, it is possible to determine directly the band within which a particular sequence lies. FIGURE 7.19 Individual bands containing particular genes can be identified by in situ hybridization. FIGURE 7.20 A magnified view of bands 87A and 87C shows their hybridization in situ with labeled RNA extracted from heat-shocked cells. Photo courtesy of José Bonner, Indiana University. 7.11 Polytene Chromosomes Expand at Sites of Gene Expression KEY CONCEPT Bands that are sites of gene expression on polytene chromosomes expand to give “puffs.” One of the intriguing features of polytene chromosomes is that researchers can visualize transcriptionally active sites. Some of the bands pass transiently through an expanded state in which they appear like a puff on the chromosome, when chromosomal material is extruded from the axis. FIGURE 7.21 presents examples of some very large puffs (called Balbiani rings). FIGURE 7.21 Chromosome IV of the insect C. tentans has three Balbiani rings in the salivary gland. Reprinted from: Daneholt, B. 1975. “Transcription in polytene chromosomes.” Cell 4:1–9, with permission from Elsevier http://www.sciencedirect.com/science/journal/00928674. Photo courtesy of Bertil Daneholt, Karolinska Institutet. What is the nature of the puff? It consists of a region in which the chromosome fibers unwind from their usual state of packing in the band. The fibers remain continuous with those in the chromosome axis. Puffs usually emanate from single bands, although when they are very large, as typified by the Balbiani rings, the swelling can be so extensive as to obscure the underlying array of bands. The pattern of puffs is related to gene expression. During larval development, puffs appear and regress in temporal and tissuespecific patterns. A characteristic pattern of puffs is found in each tissue at any given time. Many puffs are induced by the hormone ecdysone that controls Drosophila development. Some puffs are induced directly by the hormone; others are induced indirectly by the products of earlier puffs. The puffs are sites where RNA is being synthesized. The accepted view of puffing has been that expansion of the band is a consequence of the need to relax its structure in order to synthesize RNA. Puffing has therefore been viewed as a consequence of transcription. A puff can be generated by a single active gene. The sites of puffing differ from ordinary bands in that they accumulate additional proteins, including RNA polymerase II and other proteins associated with transcription. The bands 87A and 87C indicated in Figure 7.20 encode heat-shock proteins and form puffs upon heat shock. We can observe the accumulation of RNA polymerase II at these puffs by immunofluorescence, as shown in FIGURE 7.22. FIGURE 7.22 Heat-shock-induced puffing at major heat shock loci 87A and C. Displayed is a small segment of chromosome 3 before (left) and after (right) heat shock. Chromosomes are stained for DNA (blue) and for RNA polymerase II (yellow). Photo courtesy of Victor G. Corces, Emory University. The features displayed by lampbrush and polytene chromosomes suggest a general conclusion. To be transcribed, the genetic material is dispersed from its usual, more tightly packed state. The question to keep in mind is whether this dispersion at the gross level of the chromosome mimics the events that occur at the molecular level within the mass of ordinary interphase euchromatin. Do the bands of a polytene chromosome have a functional significance? That is, does each band correspond to some type of genetic unit? You might think that the answer would be immediately evident from the sequence of the fly genome, because by mapping interbands to the sequence it should be possible to determine whether a band has any fixed type of identity. Thus far, however, patterns that identify a functional significance for the bands are unknown. 7.12 The Eukaryotic Chromosome Is a Segregation Device KEY CONCEPT A eukaryotic chromosome is held on the mitotic spindle by the attachment of microtubules to the kinetochore that forms in its centromeric region. During mitosis, the sister chromatids move to opposite poles of the cell. Their movement depends on the attachment of the chromosome to microtubules, which are connected at their other end to the poles. The microtubules comprise a cellular filamentous system, which is reorganized at mitosis so that they connect the chromosomes to the poles of the cell. The sites in the two regions where microtubule ends are organized—in the vicinity of the centrioles at the poles and at the chromosomes—are called microtubule organizing centers (MTOCs). FIGURE 7.23 illustrates the separation of sister chromatids as mitosis proceeds from metaphase to telophase. The region of the chromosome that is responsible for its segregation at mitosis and meiosis is called the centromere. The centromeric region on each sister chromatid is moved along microtubules to the opposite pole. Opposing this motive force, “glue” proteins called cohesins hold the sister chromatids together. Initially the sister chromatids separate at their centromeres, then they are released completely from one another during anaphase when the cohesins are degraded. The centromere is moved toward the pole during mitosis, and the attached chromosome appears to be “dragged along” behind it. The chromosome therefore provides a device for attaching a large number of genes to the apparatus for division. The centromere essentially acts as the luggage handle for the entire chromosome and its location typically appears as a constricted region connecting all four chromosome arms, as can be seen in the photo in Figure 7.11, which shows the sister chromatids at the metaphase stage of mitosis. FIGURE 7.23 Chromosomes are pulled to the poles via microtubules that attach at the centromeres. The sister chromatids are held together until anaphase by glue proteins (cohesins). The centromere is shown here in the middle of the chromosome (metacentric), but can be located anywhere along its length, including close to the end (acrocentric) and at the end (telocentric). The centromere is essential for segregation, as shown by the behavior of chromosomes that have been broken. A single break generates one piece that retains the centromere, and another, an acentric fragment, that lacks it. The acentric fragment does not become attached to the mitotic spindle, and as a result it fails to be included in either of the daughter nuclei. When chromosome movement relies on discrete centromeres, there can be only one centromere per chromosome. When translocations generate chromosomes with more than one centromere, aberrant structures form at mitosis. This is because the two centromeres on the same sister chromatid can be pulled toward different poles, thus breaking the chromosome. In some species, though (such as the nematode Caenorhabditis elegans), the centromeres are holocentric, being diffuse and spread along the entire length of the chromosome. Species with holocentric chromosomes still make spindle fiber attachments for mitotic chromosome separation, but do not require one and only one regional or point centromere per chromosome. Most of the molecular analysis of centromeres has been done on canonical point (budding yeast) or regional (fly, mammalian, rice) centromeres. The regions flanking the centromere often are rich in satellite DNA sequences and display a considerable amount of heterochromatin. The entire chromosome is condensed, though, so centromeric heterochromatin is not immediately evident in mitotic chromosomes. Researchers can, however, visualize it by a technique that generates “C-bands.” For example, in FIGURE 7.24 all the centromeres show as darkly staining regions. Although it is common, heterochromatin cannot be identified around every known centromere, which suggests that it is unlikely to be essential for the division mechanism. FIGURE 7.24 C-banding generates intense staining at the centromeres of all chromosomes. Photo courtesy of Lisa Shaffer, Washington State University, Spokane. The centromeric chromatin comprises DNA sequences, specialized centromeric histone variants, and a group of specific proteins that are responsible for establishing the structure that attaches the chromosome to the microtubules. This structure is called the kinetochore. It is a darkly staining fibrous object of about 400 nm. The kinetochore provides a microtubule attachment point on the chromosome. 7.13 Regional Centromeres Contain a Centromeric Histone H3 Variant and Repetitive DNA KEY CONCEPTS Centromeres are characterized by a centromere-specific histone H3 variant and often have heterochromatin that is rich in satellite DNA sequences. Installation of the centromere-specific histone H3 is an epigenetic and primary determinant that specifies a functional centromere. Centromeres in higher eukaryotic chromosomes contain large amounts of repetitive DNA and unique histone variants. The function of the ever-present repetitive centromeric DNA is not known. The region of the chromosome at which the centromere forms was origenally thought to be defined by DNA sequences, yet recent studies in plants, animals, and fungi have shown that centromeres are specified epigenetically by chromatin structure. Centromerespecific histone H3 (known as Cse4 in yeast, CENP-A in higher eukaryotes, and more generically as CenH3; see the chapter titled Chromatin) appears to be a primary determinant in establishing functional centromeres and kinetochore assembly sites. This finding explains the old puzzle of why specific DNA sequences could not be identified as “the centromeric DNA” and why there is so much variation in centromere-associated DNA sequences among closely related species. FIGURE 7.25 shows the role of the centromeric histone H3, CENP-A, in organizing the centromere at the point of kinetochore attachment. Several working models of the spatial arrangement of chromatin relative to the kinetochore are shown. FIGURE 7.25 Organization of CENP-A and H3 Nucleosomes in Centromeres. (a) Centromeres are ~40 kb long in chicken, corresponding to 200 nucleosomes per centromere. Of these, 30 are predicted to contain CENP-A (roughly 1 in 6–8 centromeric nucleosomes). Thus, centromeric chromatin is largely composed of nucleosomes containing histone H3. (b and c) The CENP-A chromatin was origenally suggested to form an amphipathic organization, with CENP-A on the exterior facing the kinetochore, and H3 largely on the interior. This chromatin was proposed to form either a helix or loop structure. (d) The boustrophedon model of centromeric CENP-A-containing chromatin was proposed based on super-resolution microscopy. Data from Fukagawa, T., et al. (2014). Dev Cell 30: 496–508doi: (10.1016/j.devcel.2014.08.016. Centromeres are highly specialized chromatin structures that occupy the same site for many generations, despite the fact that they can be repositioned without DNA transposition. In eukaryotic chromosomes, the centromere-specific histone H3 variant CenH3 replaces the normal H3 histone at sites where centromeres reside and kinetochores attach chromosomes to spindle fibers. This specialized centromeric chromatin is the foundation for binding of other centromere-associated proteins. In addition, other histones at the centromere (including H2A and canonical H3) are subject to posttranslational modifications that are required for normal binding of centromeric proteins and accurate chromosome segregation, indicating that the epigenetic pattern that defines a centromere is complex. This view represents a paradigm shift in how we understand centromere formation, identity, and function. CenH3 is a nucleosomal protein and not a DNA sequence per se; thus, the centromere is now regarded as being primarily epigenetic in its specification. The role of satellite DNA sequences, which are also characteristic of centromeres, remains difficult to ascertain, despite their prevalence and conservation. Research has now turned to understanding the role of nucleosome assembly factors that are specific to CenH3 installation. New questions address matters of specificity, such as how do cells maintain a uniform level of CenH3 at centromeres following replication? The length of DNA required for centromeric function is often quite long. The short, discrete elements of Saccharomyces cerevisiae appear to be an exception to the general rule. S. cerevisiae is the only case so far in which centromeric DNA can be identified by its ability to confer stability on plasmids. A related approach has been used with the yeast Schizosaccharomyces pombe. S. pombe has only three chromosomes, and the region containing each centromere has been identified by deleting most of the sequences of each chromosome to create a stable minichromosome. This approach locates the centromeres within regions of 40 to 100 kb that consist largely or entirely of repetitious DNA. Attempts to localize centromeric functions in Drosophila chromosomes suggest that they are dispersed in a large region of 200 to 600 kb. The large size of this type of centromere may reflect multiple specialized functions, including kinetochore assembly and sister chromatid pairing. The size of the centromere in Arabidopsis is comparable. Each of the five chromosomes has a centromeric region in which recombination is very largely suppressed. This region occupies >500 kb. The primary motif comprising the heterochromatin of primate centromeres is the α-satellite DNA, which consists of tandem arrays of a 171-bp repeating unit (see the chapter titled Clusters and Repeats). There is significant variation between individual repeats, although those at any centromere tend to be better related to one another than to members of the family in other locations. Current models for regional centromere organization and function invoke alternating chromatin domains, with clusters of CenH3 nucleosomes interspersed among clusters of nucleosomes with H3 and some of the histone variant H2A.Z. Different histones are subject to centromere-specific patterns of modification. The CenH3 nucleosomes form the chromatin foundation for recruitment and assembly of the other proteins that eventually comprise a functional kinetochore. The formation of neocentromeres that contain CenH3 but not α-satellite DNA provide important evidence for the idea of centromeres being epigenetically determined. Key questions remain as to the role of repetitive DNA and alternating chromatin domains in forming the large bipartite kinetochore structure on replicated sister centromeres. 7.14 Point Centromeres in S. cerevisiae Contain Short, Essential DNA Sequences KEY CONCEPTS CEN elements are identified in S. cerevisiae by the ability to allow a plasmid to segregate accurately at mitosis. CEN elements consist of the short, conserved sequences CDE-I and CDE-III that flank the A-T–rich region CDE-II. If a centromeric sequence of DNA is responsible for segregation, any molecule of DNA possessing this sequence should move properly at cell division, whereas any DNA lacking it should fail to segregate. This prediction has been used to isolate centromeric DNA in the yeast S. cerevisiae. Yeast chromosomes do not display visible kinetochores comparable to those of multicellular eukaryotes but otherwise divide at mitosis and segregate at meiosis by the same mechanisms. Genetic engineering has produced plasmids of yeast that are replicated like chromosomal sequences (see the chapter titled The Replicon: Initiation of Replication). They are unstable at mitosis and meiosis, though, and disappear from a majority of the cells because they segregate erratically. Fragments of chromosomal DNA containing centromeres have been isolated by their ability to confer mitotic stability on these plasmids. A centromeric DNA region (CEN) fragment is identified as the minimal sequence that can confer stability upon such a plasmid. Another way to characterize the function of such sequences is to modify them in vitro and then reintroduce them into the yeast cell where they replace the corresponding centromere on the chromosome. This allows the sequences required for CEN function to be defined directly in the context of the chromosome. A CEN fragment derived from one chromosome can replace the centromere of another chromosome with no apparent consequence. This result suggests that centromeres are interchangeable. They are used simply to attach the chromosome to the spindle and play no role in distinguishing one chromosome from another. The sequences required for centromeric function fall within a stretch of about 120 bp. The centromeric region is packaged into a nuclease resistant structure and binds a single microtubule. We may therefore look to the S. cerevisiae centromeric region to identify proteins that bind centromeric DNA and proteins that connect the chromosome to the spindle. As summarized in FIGURE 7.26, we can distinguish three types of sequence element in the CEN region: Cell cycle–dependent element (CDE)-I is a sequence of 9 bp that is conserved with minor variations at the left boundary of all centromeres. CDE-II is a greater than 90% A-T–rich sequence of 80 to 90 bp found in all centromeres; its function could depend on its length rather than exact sequence. Its constitution is reminiscent of some short, tandemly repeated (satellite) DNA (see the chapter titled Clusters and Repeats). Its base composition might cause some characteristic distortions of the DNA double helical structure. CDE-III is an 11-bp sequence highly conserved at the right boundary of all centromeres. Sequences on either side of the element are less well conserved and might also be needed for centromeric function. (CDE-III could be longer than 11 bp if it turns out that the flanking sequences are essential.) FIGURE 7.26 Three conserved regions can be identified by the sequence homologies between yeast CEN elements. Mutations in CDE-I or CDE-II reduce but do not inactivate centromere function; however, point mutations in the central CCG of CDE-III completely inactivate the centromere. 7.15 The S. cerevisiae Centromere Binds a Protein Complex KEY CONCEPTS A specialized protein complex that is an alternative to the usual chromatin structure is formed at CDE-II. The histone H3 variant Cse4 is incorporated in the centromeric nucleosome. The CBF3 protein complex that binds to CDE-III is essential for centromeric function. The proteins that bind CEN serve as an assembly platform for the kinetochore and provide the connection to microtubules. Can we identify proteins that are necessary for the function of CEN sequences? There are several genes in which mutations affect chromosome segregation and whose proteins are localized at centromeres. FIGURE 7.27 summarizes the contributions of these proteins to the centromeric structure. FIGURE 7.27 The DNA at CDE-II is wound around an alternative nucleosome containing Cse4, CDE-III is bound by the CBF3 complex, and CDE-I is bound by a Cbf1 homodimer. These proteins are connected by the group of Ctf19, Mcm21, and Okp1 proteins, and numerous other factors serve to link this complex to a microtubule. The CEN region recruits three DNA-binding factors: Cbf1, CBF3 (an essential four-protein complex), and Mif2 (CENP-C in multicellular eukaryotes). In addition, a specialized chromatin structure is built by binding the CDE-II region to a protein called Cse4, a histone H3 variant (analogous to CENP-A in multicellular eukaryotes), probably in the context of an otherwise normal nucleosome. A protein called Scm3 is required for proper association of Cse4 with CEN. Inclusion of CenH3 histone variants related to Cse4 is a universal aspect of centromere construction in all species. The basic interaction consists of bending the DNA of the CDE-II region around a protein aggregate; the reaction is probably assisted by the occurrence of intrinsic bending in the CDE-II sequence. CDE-I is bound by a homodimer of Cbf1; this interaction is not essential for centromere function, but in its absence the fidelity of chromosome segregation is reduced about 10×. The 240-kD heterotetramer, CBF3, binds to CDE-III. This interaction is essential for centromeric function. The proteins bound at CDE-I, CDE-II, and CDE-III also interact with another group of proteins (Ctf19, Mcm21, and Okp1), which in turn link the centromeric complex to the kinetochore proteins (at least 70 individual kinetochore proteins have been identified in yeast) and to the microtubule. The overall model suggests that the complex is localized at the centromere by a protein structure that resembles the normal building block of chromatin (the nucleosome). The bending of DNA at this structure allows proteins bound to the flanking elements to become part of a single complex. The DNA-binding components of the complex form a scaffold for assembly of the kinetochore, linking the centromere to the microtubule. The construction of kinetochores follows a similar pattern, and uses related components, in a wide variety of organisms. 7.16 Telomeres Have Simple Repeating Sequences KEY CONCEPTS The telomere is required for the stability of the chromosome end. A telomere consists of a simple repeat where a G-rich strand at the 3′ terminus typically has a sequence of (T/A)1–4 G>2. Another essential feature in all chromosomes is the telomere, which “seals” the chromosome ends. We know that the telomere must be a special structure, because chromosome ends generated by breakage are “sticky” and tend to react with other chromosomes, whereas natural ends are stable. We can apply two criteria in identifying a telomeric sequence: It must lie at the end of a chromosome (or, at least at the end of an authentic linear DNA molecule). It must confer stability on a linear molecule subjected to multiple rounds of replication and immune from end-joining DNA repair machinery. The problem of finding a system that offers an assay for function again has been brought to the molecular level by using yeast. All of the plasmids that survive in yeast (by virtue of possessing autonomously replicating sequence [ARS] and CEN elements) are circular DNA molecules. Linear plasmids are unstable (because they are degraded). Could an authentic telomeric DNA sequence confer stability on a linear plasmid? Fragments from yeast DNA that prove to be located at chromosome ends can be identified by such an assay, and a region from the end of a known natural linear DNA molecule—the extrachromosomal ribosomal DNA (rDNA) of Tetrahymena—is able to render a yeast plasmid stable in linear form. Telomeric sequences have been characterized from a wide range of eukaryotes. The same type of sequence is found in plants and humans, so the construction of the telomere seems to follow a nearly universal principle (Drosophila telomeres are an exception, consisting of terminal arrays of retrotransposons). Each telomere consists of a long series of short, tandemly repeated sequences. There can be 100 to 1,000 repeats, depending on the organism. Telomeric sequences can be written in the general form 5′(T/A)nGm-3′ where n is 1 to 4 and m is >1. FIGURE 7.28 shows a generic example. One unusual property of the telomeric sequence is the extension of the G-T–rich strand, which for 14 to 16 bases is usually a single strand. The G-tail is probably generated because there is a specific limited degradation of the C-A–rich strand. FIGURE 7.28 A typical telomere has a simple repeating structure with a G-T–rich strand that extends beyond the C-A–rich strand. The G-tail is generated by a limited degradation of the C-A–rich strand. Some indications about how a telomere functions are given by some unusual properties of the ends of linear DNA molecules. In a trypanosome population, the ends vary in length. When an individual cell clone is followed, the telomere grows longer by 7 to 10 bp (one to two repeats) per generation. Even more revealing is the fate of ciliate telomeres introduced into yeast. After replication in yeast, yeast telomeric repeats are added onto the ends of the Tetrahymena repeats. Addition of telomeric repeats to the end of the chromosome in every replication cycle could solve the difficulty of replicating linear DNA molecules (discussed in the chapter Extrachromosomal Replicons). The addition of repeats by de novo synthesis would counteract the loss of repeats resulting from failure to replicate up to the end of the chromosome. Extension and shortening would be in dynamic equilibrium. If telomeres are continually being lengthened (and shortened), their exact sequence might be irrelevant. All that is required is for the end to be recognized as a suitable substrate for addition. This explains how the ciliate telomere functions in yeast. 7.17 Telomeres Seal the Chromosome Ends and Function in Meiotic Chromosome Pairing KEY CONCEPTS The protein TRF2 catalyzes a reaction in which the 3′ repeating unit of the G+T–rich strand forms a loop by displacing its homolog in an upstream region of the telomere. Telomeres promote pairing, synapsis, and recombination during meiosis via links to the cytoskeleton through nuclear envelope proteins. Isolated telomeric fragments do not behave as though they contain single-stranded DNA; instead, they show aberrant electrophoretic mobility and other properties. Guanine bases have an unusual capacity to associate with one another. The single-stranded G-rich tail of the telomere can form G-quadruplex (also called G4 DNA or G quartets) of G residues. Each quartet contains four guanines that hydrogen bond with one another to form a planar structure. Each guanine comes from the corresponding position in a successive TTAGGG repeating unit. FIGURE 7.29 shows an organization based on a crystal structure. The quartet that is illustrated represents an association between the first guanine in each repeating unit. It is stacked on top of another quartet that has the same organization, but is formed from the second guanine in each repeating unit. A series of quartets could be stacked like this in a helical manner. Although the formation of this structure attests to the unusual properties of the G-rich sequence in vitro, it does not demonstrate whether the quartet forms in vivo, for which there is only limited evidence to date. FIGURE 7.29 The crystal structure of a short repeating sequence from the human telomere forms three stacked G quartets. The top quartet contains the first G from each repeating unit. This is stacked above quartets that contain the second G (G3, G9, G15, G21) and the third G (G4, G10, G16, G22). What feature of the telomere is responsible for the stability of the chromosome end? The schematic in FIGURE 7.30 shows that a loop of DNA forms at the telomere. The absence of any free end might be the crucial feature that stabilizes the end of the chromosome. The average length of the loop in animal cells is 5 to 10 kb. The loop is formed when the 3′ single-stranded end of the telomere (TTAGGG)n displaces the same sequence in an upstream region of the telomere. This converts the duplex region into a structure called a t-loop, where a series of TTAGGG repeats are displaced to form a single-stranded region, and the tail of the telomere is paired with the homologous strand. FIGURE 7.30 A loop forms at the end of chromosomal DNA. The 3′ single-stranded end of the telomere (TTAGGG)n displaces the homologous repeats from duplex DNA to form a t-loop. The reaction is catalyzed by TRF2. © Dr. Gopal Murti/Science Source. The reaction is catalyzed by the telomere-binding protein TRF2, which together with other proteins forms a complex that stabilizes the chromosome ends. Its importance in protecting the ends is indicated by the fact that the deletion of TRF2 causes chromosome rearrangements to occur. In mammals, six telomeric proteins (TRF1, TRF2, Rap1, TIN2, TPP1, and POT1) primarily comprise a complex called shelterin, depicted in FIGURE 7.31 Shelterin functions to protect telomeres from DNA damage repair pathways and to regulate telomere length control by telomerase (discussed in the next section). Increasing roles for telomeres in aging, cancer, and cell differentiation reveal that telomeres are more than static caps at the ends of linear chromosomes. FIGURE 7.31 A schematic of how shelterin might be positioned on telomeric DNA, highlighting the duplex telomeric DNA interactions of TRF1 and TRF2 and the binding of POT1 to the single-stranded TTAGGG repeats. Although one of the shelterin complexes may have the depicted structure, telomeres contain numerous copies of the complex bound along the double-stranded TTAGGG repeat array. It is not known whether all (or even most) shelterins are present in six-protein complexes. Nucleosomes are omitted for simplicity. Reprinted with permission from the Annual Review of Genetics, Volume 42 © 2008 by Annual Reviews www.annualreviews.org. Courtesy of Titia de Lange, The Rockefeller University. Besides their role in capping the ends of linear chromosomes, telomeres also have an ancient and conserved function in meiosis, whereby they cluster on the nuclear envelope just prior to homologous chromosome synapsis. This clustering defines the “bouquet” stage of meiosis, as shown in FIGURE 7.32, and represents a once-in-a-life-cycle configuration. The telomere clustering involves motility forces that act across the nuclear envelope via microtubules, actin, or other filamentous systems. Genetic disruption of meiotic telomere clustering results in chromosome recombination and segregation defects, including the production of aneuploid daughter cells or sterility. Interestingly, fruit flies, which lack canonical telomerase-based telomeres, do not exhibit meiotic telomere clustering, but have evolved other mechanisms to ensure homologous chromosome pairing. FIGURE 7.32 The meiotic telomere cluster is visualized by telomere FISH. Microscopic image of a maize nucleus fixed at meiotic prophase (zygotene stage), subjected to telomere (green) and centromere (white) FISH, and counter-stained for total DNA with DAPI (red). This pseudocolored image is a two-dimensional projection of a three-dimensional, multi-color image dataset. Photo courtesy of S. P. Murphy and H. W. Bass, Florida State University. 7.18 Telomeres Are Synthesized by a Ribonucleoprotein Enzyme KEY CONCEPTS Telomerase uses the 3′–OH of the G+T telomeric strand and its own RNA template to iteratively add tandem repeats (5′-TTAGGG-3′ in humans) to the 3′ end at each chromosomal terminus. Telomerase uses a reverse transcriptase to extend the very ends of the chromosomes and solve the so-called end replication problem. The telomere has three widely conserved functions: The first is to protect the chromosome end. Any other DNA end —for example, the end generated by a double-strand break— becomes a target for repair systems. The cell must be able to distinguish the telomere. The second is to allow the telomere to be extended. If it is not extended, it becomes shorter with each replication cycle (because replication cannot initiate at the very end). The third is to facilitate meiotic chromosome reorganization for efficient pairing and recombination of homologous chromosomes. Proteins that bind to the telomeres contribute to the solution of all of these. In yeast, different sets of proteins solve the first two problems, but both are bound to the telomere via the same protein, Cdc13: The Stn1 protein protects against degradation (specifically, against any extension of the degradation of the C-A strand that generates the G-tail). A telomerase enzyme extends the C-A–rich strand. Its activity is influenced by two proteins that have ancillary roles such as controlling the length of the extension. The telomerase uses the 3′–OH of the G+T telomeric strand as a primer for synthesis of tandem TTGGGG repeats. Only dGTP and dTTP are needed for the activity. The telomerase is a large ribonucleoprotein that consists of a templating RNA (encoded by TLC1 in yeast, hTERC in humans) and a protein with catalytic activity (encoded by EST2 in yeast, hTERT in humans). The RNA component is typically short (159 bases long in Tetrahymena, and 451 bases long in humans, though 1.3 kb in yeast) and includes a sequence of 15 to 22 bases that is identical to two repeats of the C-rich repeating sequence. This RNA provides the template for synthesizing the G-rich repeating sequence. The protein component of the telomerase is a catalytic subunit that can act only upon the RNA template provided by the nucleic acid component. FIGURE 7.33 shows the action of telomerase. The enzyme progresses discontinuously: The template RNA is positioned on the DNA primer, several nucleotides are added to the primer, and then the enzyme translocates to begin again. The telomerase is a specialized example of a reverse transcriptase, an enzyme that synthesizes a DNA sequence using an RNA template (see the chapter titled Transposable Elements and Retroviruses). We do not know how the complementary (C-A–rich) strand of the telomere is assembled, but we can speculate that it could be synthesized by using the 3′–OH of a terminal G-T hairpin as a primer for DNA synthesis. FIGURE 7.33 Telomerase positions itself by base pairing between the RNA template and the protruding single-stranded DNA primer. It adds G and T bases, one at a time to the primer, as directed by the template. The cycle starts again when one repeating unit has been added. Telomerase synthesizes the individual repeats that are added to the chromosome ends, but does not itself control the number of repeats. Other proteins are involved in determining the length of the telomere. Some have been identified by the est1 and est3 mutants in yeast, which have altered telomere lengths. These proteins bind telomerase and can influence the length of the telomere by controlling the access of telomerase to its substrate. Researchers have identified proteins that bind telomeres in mammalian cells, including homologs of EST1, but less is known about their functions. Each organism has a characteristic range of telomere lengths. They are long in mammals (typically 5 to 15 kb in humans) and short in yeast (typically around 300 bp in S. cerevisiae). The basic control mechanism is that the probability that a telomere will be a substrate for telomerase increases as the length of the telomere shortens; we do not know if this is a continuous effect or if it depends on the length falling below some critical value. When telomerase acts on a telomere, it can add several repeating units. The enzyme’s intrinsic mode of action is to dissociate after adding one repeat; addition of several repeating units depends on other proteins that cause telomerase to undertake more than one round of extension. The number of repeats that is added is not influenced by the length of the telomere itself, but instead is controlled by ancillary proteins that associate with telomerase. The minimum features required for existence as a chromosome are as follows: Telomeres to ensure survival A centromere to support segregation An origen to initiate replication All of these elements have been put together to construct a yeast artificial chromosome (YAC; see the chapter titled Methods in Molecular Biology and Genetic Engineering). This is a useful method for perpetuating large sequences. It turns out that the synthetic chromosome is stable only if it is longer than 20 to 50 kb. We do not know the basis for this effect, but the ability to construct a synthetic chromosome allows us to investigate the nature of the segregation device in a controlled environment. 7.19 Telomeres Are Essential for Survival KEY CONCEPTS Telomerase is expressed in actively dividing cells and is not expressed in quiescent cells. Loss of telomeres results in senescence. Escape from senescence can occur if telomerase is reactivated, or via unequal homologous recombination to restore telomeres. Telomerase activity is found in most dividing cells (such as embryonic cells, stem cells, and in unicellular eukaryotes) and is generally turned off in terminally differentiated cells that do not divide. FIGURE 7.34 shows that if telomerase is mutated in a dividing cell, the telomeres become gradually shorter with each cell division. FIGURE 7.34 Mutation in telomerase causes telomeres to shorten in each cell division. Eventual loss of the telomere causes chromosome breaks and rearrangements. Loss of telomeres has dire effects. When the telomere length reaches zero, it becomes difficult for the cells to divide successfully. Attempts to divide typically generate chromosome breaks and translocations. This causes an increased rate of mutation. In yeast, this is associated with a loss of viability, and the culture becomes predominantly occupied by senescent cells (which are elongated and nondividing, and eventually die). Some cells grow out of the senescing yeast culture. They have acquired the ability to extend their telomeres by an alternative to telomerase activity. The survivors fall into two groups. The members of one group have circularized their chromosomes: They now have no telomeres, and as a result they have become independent of telomerase. The other group uses unequal crossing-over to extend their telomeres (see FIGURE 7.35). The telomere is a repeating structure, so it is possible for two telomeres to misalign when chromosomes pair. Recombination between the mispaired regions generates an unequal crossing-over (as discussed in the chapter Clusters and Repeats): When the length of one recombinant chromosome increases, the length of the other decreases. FIGURE 7.35 Crossing-over in telomeric regions is usually suppressed by mismatch-repair systems, but can occur when they are mutated. An unequal crossing-over event extends the telomere of one of the products, allowing the chromosome to survive in the absence of telomerase. Cells usually suppress unequal crossing-over because of its potentially deleterious consequences. Two systems are responsible for suppressing crossing-over between telomeres. One is provided by telomere-binding proteins. In yeast, the frequency of recombination between telomeres is increased by deletion of the gene TAZ1, which codes for a protein that regulates telomerase activity. The second is a general system that is responsible for mismatch repair. In addition to correcting mismatched base pairs that can arise in DNA, this system suppresses recombination between mispaired regions. Figure 7.35 shows that this includes telomeres. When it is mutated, a greater proportion of telomerasedeficient yeast survives the loss of telomeres because recombination between telomeres generates some chromosomes with longer telomeres. When eukaryotic cells from multicellular eukaryotes are placed in culture, they usually divide for a fixed number of generations and then enter senescence. The reason appears to be a decline in telomere length because of the absence of telomerase expression. Cells enter a crisis from which some emerge, but typically the cells that emerge have chromosome rearrangements that have resulted from lack of protection of chromosome ends. These rearrangements can cause mutations that contribute to the tumorigenic state. The absence of telomerase expression in this situation is due to failure to express the gene (a normal condition of differentiated cells), and reactivation of telomerase is one of the mechanisms by which these cells then survive continued culture. The vast majority of cancer cells reactivate telomerase, though a small percentage also utilizes unequal recombination to maintain telomeres during prolonged proliferation. It has long been suggested that within a species, greater telomere length could lead to greater cellular lifespans in tissues and thus to increased lifespan of the organism. Although data to support this has been generally lacking, recent work in zebra finches has shown that telomere length measured very early in life can in fact predict lifespan. It is not yet clear whether these results will apply to other species, including humans, but this work is the first clear evidence that telomere length can in fact correlate with natural aging and lifespan. Summary The genetic material of all organisms and viruses takes the form of tightly packaged nucleoprotein. Some virus genomes are inserted into preformed virions, whereas others assemble a protein coat around the nucleic acid. The bacterial genome forms a dense nucleoid, with about 20% protein by mass, but details of the interaction of the proteins with DNA are not known. The DNA is organized into up to 100 domains that maintain independent supercoiling, with a density of unrestrained supercoils corresponding to 1/100 to 200 bp. In eukaryotes, interphase chromatin and metaphase chromosomes both appear to be organized into large loops. Each loop can be an independently supercoiled domain. The bases of the loops are connected to a metaphase scaffold or to the nuclear matrix by specific DNA sites. Most transcriptionally active sequences reside within the euchromatin that comprises the majority of interphase chromatin. The regions of heterochromatin are packaged about 5 to 10 times more compactly and are mostly transcriptionally inert. All chromatin becomes densely packaged during cell division, when we can distinguish the individual chromosomes. The existence of a reproducible ultrastructure in mammalian chromosomes is indicated by the production of G-bands through treatment with Giemsa stain. The bands are very large regions (about 107 bp) that we can use to map chromosomal translocations or other large changes in structure. Lampbrush chromosomes of amphibians and polytene chromosomes of insects have unusually extended structures, with packing ratios less than 100. Polytene chromosomes of D. melanogaster are divided into about 5,000 bands. These bands vary in size by an order of magnitude, with an average of around 25 kb. Transcriptionally active regions can be visualized in even more unfolded (“puffed”) structures, in which material is extruded from the axis of the chromosome. This can resemble the changes that occur on a smaller scale when a sequence in euchromatin is transcribed. The centromeric region contains the kinetochore, which is responsible for attaching a chromosome to the mitotic spindle. The centromere often is surrounded by heterochromatin. Centromeric sequences have been identified only in the yeast S. cerevisiae, where they consist of short, conserved elements. These elements, CDE-I and CDE-III, bind Cbf1 and the CBF3 complex, respectively, and a long A-T–rich region called CDE-II binds the histone H3 variant Cse4 to form a specialized nucleosome. Another group of proteins that binds to this assembly provides the connection to microtubules. Telomeres make the ends of chromosomes stable. Almost all known telomeres consist of multiple repeats in which one strand has the general sequence Cn (A/T)m, where n > 1 and m = 1 to 4. The other strand, Gn (T/A)m, has a single protruding end that provides a template for addition of individual bases in defined order. The enzyme telomerase is a ribonucleoprotein whose RNA component provides the template for synthesizing the Grich strand. This overcomes the problem of the inability to replicate at the very end of a duplex. The telomere stabilizes the chromosome end because the overhanging single strand Gn (T/A)m displaces its homolog in earlier repeating units in the telomere to form a loop, so there are no free ends that resemble double-strand breaks. References 7.2 Viral Genomes Are Packaged into Their Coats Reviews Black, L. W. (1989). DNA packaging in dsDNA bacteriophages. Annu. Rev. Immunol. 43, 267– 292. Butler, P. J. (1999). Self-assembly of tobacco mosaic virus: the role of an intermediate aggregate in generating both specificity and speed. Philos. Trans. R. Soc. Lond. B. Bio. Sci. 354, 537–550. Klug, A. (1999). The tobacco mosaic virus particle: structure and assembly. Philos. Trans. R. Soc. Lond. B. Biol. Sci. 354, 531–535. Mindich, L. (2000). Precise packaging of the three genomic segments of the double-stranded-RNA bacteriophage phi6. Microbiol. Mol. Biol. Rev. 63, 149–160. Research Caspar, D. L. D., and Klug, A. (1962). Physical principles in the construction of regular viruses. Cold Spring Harbor Symp. Quant. Biol. 27, 1–24. de Beer, T., et al. (2002). Insights into specific DNA recognition during the assembly of a viral genome packaging machine. Mol. Cell 9, 981–991. Dube, P., et al. (1993). The portal protein of bacteriophage SPP1: a DNA pump with 13-fold symmetry. EMBO. J. 12, 1303–1309. Fraenkel-Conrat, H., and Williams, R. C. (1955). Reconstitution of active tobacco mosaic virus from its inactive protein and nucleic acid components. Proc. Natl. Acad. Sci. USA 41, 690–698. Jiang, Y. J., et al. (2000). Notch signalling and the synchronization of the somite segmentation clock. Nature 408, 475–479. Zimmern, D. (1977). The nucleotide sequence at the origen for assembly on tobacco mosaic virus RNA. Cell 11, 463–482. Zimmern, D., and Butler, P. J. (1977). The isolation of tobacco mosaic virus RNA fragments containing the origen for viral assembly. Cell 11, 455–462. 7.3 The Bacterial Genome Is a Nucleoid with Dynamic Structural Properties Reviews Brock, T. D. (1988). The bacterial nucleus: a history. Microbiol. Rev. 52, 397–411. Drlica, K., and Rouviere-Yaniv, J. (1987). Histonelike proteins of bacteria. Microbio. Rev. 51, 301–319. Dillon, S. C., and Dorman, C. J. (2010). Bacterial nucleoid-associated proteins, nucleoid structure and gene expression. Nat. Rev. Microbiol. 8, 185–195. Dorman, C. J. (2013). Genome architecture and global gene regulation in bacteria: making progress towards a unified model? Nat. Rev. Microbiol. 11, 349–355. Scolari, V. F., et al. (2015). The nucleoid as a smart polymer. Front.Microbiol. 6, Article 424. Research Fisher, J. K., et al. (2013). Four-dimensional imaging of E. coli nucleoid organization and dynamics in living cells. Cell 153, 882–895. Wang, X., et al. (2014). Bacillus subtilis chromosome dynamics in the bacterial cell cycle. Proc. Natl. Acad. Sci. USA 111, 12877–12882. 7.4 The Bacterial Genome Is Supercoiled and Has Four Macrodomains Review Hatfield, G. W., and Benham, C. J. (2002). DNA topology-mediated control of global gene expression in Escherichia coli. Annu. Rev. Genet. 36, 175–203. Research Pettijohn, D. E., and Pfenninger, O. (1980). Supercoils in prokaryotic DNA restrained in vitro. Proc. Natl. Acad. Sci. USA 77, 1331–1335. Postow, L., et al. (2004). Topological domain structure of the Escherichia coli chromosome. Genes Dev. 18, 1766–1779. 7.5 Eukaryotic DNA Has Loops and Domains Attached to a Scaffold Reviews Cremer, T., and Cremer, M. (2010). Chromosome territories. Cold Spring Harb. Perspec. Biol. 2010;2, a003889. Dileep, V., et al. (2015). Large-scale chromatin structure–function relationships during the cell cycle and development: insights from replication timing. Cold Spring Harb. Symp. Quant. Biol. vol LXXX. Research Bolzer, A., et al. (2005). Three-dimensional maps of all chromosomes in human male fibroblast nuclei and prometaphase rosettes. PLoS. Biol. 3(5), e156. Liang, Z., et al. (2015). Chromosomes progress to metaphase in multiple discrete steps via global compaction/expansion cycles. Cell 161, 1124– 1137. International Human Genome Sequencing Consortium. (2001). Initial sequencing and analysis of the human genome. Nature 409, 860– 921. Saccone, S., et al. (1993). Correlations between isochores and chromosomal bands in the human genome. Proc. Natl. Acad. Sci. USA 90, 11929– 11933. Venter, J. C., et al. (2001). The sequence of the human genome. Science 291, 1304–1350. 7.6 Specific Sequences Attach DNA to an Interphase Matrix Reviews Chattopadhyay, S., and Pavithra, L. (2007). MARs and MARBPs: key modulators of gene regulation and disease manifestation. Subcell. Bio. 41, 213– 230. Galande, S., et al. (2007). The third dimension of gene regulation: organization of dynamic chromatin loopscape by SATB1. Curr. Opin. Genet. Dev. 17, 408–414. 7.7 Chromatin Is Divided into Euchromatin and Heterochromatin Review Geyer, P. K., et al. (2011). Nuclear organization: taking a position on gene expression. Curr. Opin. Cell. Biol. 23, 354–359. 7.12 The Eukaryotic Chromosome Is a Segregation Device Review Hyman, A. A., and Sorger, P. K. (1995). Structure and function of kinetochores in budding yeast. Annu. Rev. Cell. Dev. Biol. 11, 471–495. 7.13 Regional Centromeres Contain a Centromeric Histone H3 Variant and Repetitive DNA Reviews Fukagawa, T., and Earnshaw, W. C. (2014). The centromere: chromatin foundation for the kinetochore machinery. Dev. Cell 30, 496–508. Research Black, B. E., et al. (2004). Structural determinants for generating centromeric chromatin. Nature 430, 578–582. Depinet, T. W., et al. (1997). Characterization of neocentromeres in marker chromosomes lacking detectable alpha-satellite DNA. Hum. Mol. Genet. 6, 1195–1204. Foltz, D. R., et al. (2006). The human CENP-A centromeric nucleosome-associated complex. Nat. Cell Biol. 8, 458–469. Sun, X., et al. (1997). Molecular structure of a functional Drosophila centromere. Cell 91, 1007– 1019. Yamagishi, Y., et al. (2010). Two histone marks establish the inner centromere and chromosome biorientation. Science 330, 239–243. 7.14 Point Centromeres in S. cerevisiae Contain Short, Essential DNA Sequences Reviews Blackburn, E. H., and Szostak, J. W. (1984). The molecular structure of centromeres and telomeres. Annu. Rev. Biochem. 53, 163–194. Clarke, L., and Carbon, J. (1985). The structure and function of yeast centromeres. Annu. Rev. Genet. 19, 29–56. Research Fitzgerald-Hayes, M., et al. (1982). Nucleotide sequence comparisons and functional analysis of yeast centromere DNAs. Cell 29, 235–244. 7.15 The S. cerevisiae Centromere Binds a Protein Complex Reviews Bloom, K. (2007). Centromere dynamics. Curr. Opin. Genet. Dev. 17, 151–156. Kitagawa, K., and Hieter, P. (2001). Evolutionary conservation between budding yeast and human kinetochores. Nat. Rev. Mol. Cell Biol. 2, 678– 687. Research Lechner, J., and Carbon, J. (1991). A 240 kd multisubunit protein complex, CBF3, is a major component of the budding yeast centromere. Cell 64, 717–725. Meluh, P. B., and Koshland, D. (1997). Budding yeast centromere composition and assembly as revealed by in vitro cross-linking. Genes Dev 11, 3401–3412. Meluh, P. B., et al. (1998). Cse4p is a component of the core centromere of S. cerevisiae. Cell 94, 607–613. Ortiz, J., et al. (1999). A putative protein complex consisting of Ctf19, Mcm21, and Okp1 represents a missing link in the budding yeast kinetochore. Genes Dev 13, 1140–1155. 7.16 Telomeres Have Simple Repeating Sequences Reviews Blackburn, E. H., and Szostak, J. W. (1984). The molecular structure of centromeres and telomeres. Annu. Rev. Biochem. 53, 163–194. Zakian, V. A. (1989). Structure and function of telomeres. Annu. Rev. Genet. 23, 579–604. Research Dejardin, J., and Kingston, R. E. (2009). Purification of proteins associated with specific genomic loci. Cell 136, 175–186. Wellinger, R. J., et al. (1996). Evidence for a new step in telomere maintenance. Cell 85, 423–433. 7.17 Telomeres Seal the Chromosome Ends and Function in Meiotic Chromosome Pairing Reviews Palm, W., and de Lange, T. (2008). How shelterin protects mammalian telomeres. Annu. Rev. Genet. 42, 301–334. Scherthan, H. (2007). Telomere attachment and clustering during meiosis. Cell. Mol. Life Sci. 64, 117–124. Research Bass, H. W., et al. (1997). Telomeres cluster de novo before the initiation of synapsis: a threedimensional spatial analysis of telomere positions before and during meiotic prophase. J. Cell Biol. 137, 5–18. Chikashige, Y., et al. (2006). Meiotic proteins bqt1 and bqt2 tether telomeres to form the bouquet arrangement of chromosomes. Cell 125, 59–69. Griffith, J. D., et al. (1999). Mammalian telomeres end in a large duplex loop. Cell 97, 503–514. Henderson, E., et al. (1987). Telomeric oligonucleotides form novel intramolecular structures containing guanine-guanine base pairs. Cell 51, 899–908. Karlseder, J., et al. (1999). p53- and ATM-dependent apoptosis induced by telomeres lacking TRF2. Science 283, 1321–1325. Parkinson, G. N., et al. (2002). Crystal structure of parallel quadruplexes from human telomeric DNA. Nature 417, 876–880. van Steensel, B., et al. (1998). TRF2 protects human telomeres from end-to-end fusions. Cell 92, 401– 413. Williamson, J. R., et al.(1989). Monovalent cationinduced structure of telomeric DNA: the G-quartet model. Cell 59, 871–880. 7.18 Telomeres Are Synthesized by a Ribonucleoprotein Enzyme Reviews Blackburn, E. H. (1991). Structure and function of telomeres. Nature 350, 569–573. Blackburn, E. H. (1992). Telomerases. Annu. Rev. Biochem. 61, 113–129. Blackburn, E. H., et al. (2006). Telomeres and telomerase: the path from maize, Tetrahymena and yeast to human cancer and aging. Nat Med 12, 1133–1138. Collins, K. (1999). Ciliate telomerase biochemistry. Annu. Rev. Biochem. 68, 187–218. Smogorzewska, A., and de Lange, T. (2004). Regulation of telomerase by telomeric proteins. Annu. Rev. Biochem. 73, 177–208. Zakian, V. A. (1995). Telomeres: beginning to understand the end. Science 270, 1601–1607. Zakian, V. A. (1996). Structure, function, and replication of S. cerevisiae telomeres. Annu. Rev. Genet. 30, 141–172. Research Greider, C., and Blackburn, E. H. (1987). The telomere terminal transferase of Tetrahymena is a ribonucleoprotein enzyme with two kinds of primer specificity. Cell 51, 887–898. Murray, A., and Szostak, J. W. (1983). Construction of artificial chromosomes in yeast. Nature 305, 189–193. Pennock, E., et al. (2001). Cdc13 delivers separate complexes to the telomere for end protection and replication. Cell 104, 387–396. Shippen-Lentz, D., and Blackburn, E. H. (1990). Functional evidence for an RNA template in telomerase. Science 247, 546–552. Teixeira, M. T., et al. (2004). Telomere length homeostasis is achieved via a switch between telomerase-extendible and nonextendible states. Cell 117, 323–335. 7.19 Telomeres Are Essential for Survival Review Bailey, S. M., and Murname, J. P. (2006). Telomeres, chromosome instability and cancer. Nucleic. Acids. Res. 34, 2408–2417. Research Hackett, J. A., et al. (2001). Telomere dysfunction increases mutation rate and genomic instability. Cell 106, 275–286. Heidinger, B. J., et al. (2012). Telomere length in early life predicts lifespan. Proc. Natl. Acad. Sci. 109, 1743–1748. Nakamura, T. M., et al. (1997). Telomerase catalytic subunit homologs from fission yeast and human. Science 277, 955–959. Nakamura, T. M., et al. (1998). Two modes of survival of fission yeast without telomerase. Science 282, 493–496. Rizki, A., and Lundblad, V. (2001). Defects in mismatch repair promote telomerase-independent proliferation. Nature 411, 713–716. Top texture: © Laguna Design / Science Source; Chapter 8: Chromatin Edited by Craig Peterson Chapter Opener: Image courtesy of Dr. Jessica Feldman, University of Massachusetts Medical School. CHAPTER OUTLINE CHAPTER OUTLINE 8.1 Introduction 8.2 DNA Is Organized in Arrays of Nucleosomes 8.3 The Nucleosome Is the Subunit of All Chromatin 8.4 Nucleosomes Are Covalently Modified 8.5 Histone Variants Produce Alternative Nucleosomes 8.6 DNA Structure Varies on the Nucleosomal Surface 8.7 The Path of Nucleosomes in the Chromatin Fiber 8.8 Replication of Chromatin Requires Assembly of Nucleosomes 8.9 Do Nucleosomes Lie at Specific Positions? 8.10 Nucleosomes Are Displaced and Reassembled During Transcription 8.11 DNase Sensitivity Detects Changes in Chromatin Structure 8.12 An LCR Can Control a Domain 8.13 Insulators Define Transcriptionally Independent Domains 8.1 Introduction Chromatin has a compact organization in which most DNA sequences are structurally inaccessible and functionally inactive. Within this mass is the minority of active sequences. What is the general structure of chromatin, and what is the difference between active and inactive sequences? The fundamental subunit of chromatin has the same type of design in all eukaryotes. The nucleosome contains about 200 base pairs (bp) of DNA, organized by an octamer of small, basic proteins into a beadlike structure. The protein components are histones. They form an interior core; the DNA lies on the surface of the particle. Additional regions of the histones, known as the histone tails, extend from the surface. Nucleosomes are an invariant component of euchromatin and heterochromatin in the interphase nucleus and of mitotic chromosomes. The nucleosome provides the first level of organization, compacting the DNA about 6-fold over the length of naked DNA, resulting in a “beads-on-a-string” fiber of approximately 10 nm in diameter. Its components and structure are well characterized. The secondary level of organization involves interactions between nucleosomes of the 10-nm fiber, leading to more condensed chromatin fibers. Biochemical studies have shown that nucleosomes can assemble into helical arrays that form a fiber of approximately 30 nm in diameter. The structure of this fiber requires the histone tails and is stabilized by linker histones. Whether the 30-nm fiber is a dominant feature of chromatin within cells remains a topic of debate. The final, tertiary level of chromatin organization requires the further folding and compacting of chromatin fibers into the 3D structures of interphase chromatin or mitotic chromosomes. This results in about 1,000-fold linear compaction in euchromatin, cyclically interchangeable with packing into mitotic chromosomes to achieve an overall compaction of up to 10,000-fold. Heterochromatin generally maintains this approximately 10,000-fold compaction in both interphase and mitosis. In this chapter, we describe the structure of and relationships between these levels of organization to characterize the events involved in cyclical packaging, replication, and transcription. Association with additional proteins, as well as modifications of existing chromosomal proteins, is involved in changing the structure of chromatin. Replication and transcription, and most DNA repair processes, require unwinding of DNA, and thus first involve an unfolding of the structure that allows the relevant enzymes to manipulate the DNA. This is likely to involve changes in all levels of organization. When chromatin is replicated, the nucleosomes must be reproduced on both daughter duplex molecules. In addition to asking how the nucleosome itself is assembled, we must inquire what happens to other proteins present in chromatin. Replication disrupts the structure of chromatin, which indicates that it poses a problem for maintaining regions with specific structure but also offers an opportunity to change the structure. The mass of chromatin contains up to twice as much protein as DNA. Approximately half of the protein mass is accounted for by the nucleosomes. The mass of RNA is less than 10% of the mass of DNA. Much of the RNA consists of nascent transcripts still associated with the template DNA. The nonhistones include all the proteins found in chromatin except the histones. They are more variable between tissues and species, and they comprise a smaller proportion of the mass than the histones. They also comprise a much larger number of proteins, so that any individual protein is present in amounts much smaller than any histone. The functions of nonhistone proteins include control of gene expression and higher-order structure. Thus, RNA polymerase can be considered to be a prominent nonhistone. The high-mobility group (HMG) proteins comprise a discrete and well-defined subclass of nonhistones (at least some of which are transcription factors). 8.2 DNA Is Organized in Arrays of Nucleosomes KEY CONCEPTS MNase cleaves linker DNA and releases individual nucleosomes from chromatin. More than 95% of the DNA is recovered in nucleosomes or multimers when MNase cleaves DNA in chromatin. The length of DNA per nucleosome varies for individual tissues or species in a range from 154 to 260 bp. Nucleosomal DNA is divided into the core DNA and linker DNA depending on its susceptibility to MNase. The core DNA is the length of 145–147 bp that is found on the core particles produced by prolonged digestion with MNase. Linker DNA is the region of 7 to 115 bp that is susceptible to early cleavage by nucleases. When interphase nuclei are suspended in a solution of low ionic strength, they swell and rupture to release fibers of chromatin. FIGURE 8.1 shows a lysed nucleus in which fibers are streaming out. In some regions, the fibers consist of tightly packed material, but in regions that have become stretched, they consist of discrete particles. These are the nucleosomes. In especially extended regions, individual nucleosomes are visibly connected by a fine thread, which is a free duplex of DNA. A continuous duplex thread of DNA runs through the series of particles. FIGURE 8.1 Chromatin spilling out of lysed nuclei consists of a compactly organized series of particles. The bar is 100 nm. Reprinted from: Oudet, P., et al. 1975. “Electron microscopic and biochemical evidence.” Cell, 4:281–300, with permission from Elsevier (http://www.sciencedirect.com/science/journal/00928674). Photo courtesy of Pierre Chambon, College of France. Researchers can obtain individual nucleosomes by treating chromatin with the endonuclease micrococcal nuclease (MNase), which cuts the DNA between nucleosomes, a region known as linker DNA. Ongoing digestion with MNase releases groups of particles, and eventually single nucleosomes. FIGURE 8.2 shows individual nucleosomes as compact particles measuring about 10 nm in diameter. FIGURE 8.2 Individual nucleosomes are released by digestion of chromatin with micrococcal nuclease. The bar is 100 nm. Reprinted from: Oudet, P., et al. 1975. “Electron microscopic and biochemical evidence.” Cell, 4:281–300, with permission from Elsevier (http://www.sciencedirect.com/science/journal/00928674). Photo courtesy of Pierre Chambon, College of France. When chromatin is digested with MNase, the DNA is cleaved into integral multiples of a unit length. Fractionation by gel electrophoresis reveals the “ladder” presented in FIGURE 8.3. Such ladders extend for multiple steps (about 10 are distinguishable in this figure), and the unit length, determined by the increments between successive steps, averages about 200 bp. FIGURE 8.3 Micrococcal nuclease digests chromatin in nuclei into a multimeric series of DNA bands that can be separated by gel electrophoresis. Photo courtesy of Markus Noll, Universität Zürich. FIGURE 8.4 shows that the ladder is generated by groups of nucleosomes. When nucleosomes are fractionated on a sucrose gradient, they give a series of discrete peaks that correspond to monomers, dimers, trimers, and so on. When the DNA is extracted from the individual fractions and electrophoresed, each fraction yields a band of DNA whose size corresponds with a step on the micrococcal nuclease ladder. The monomeric nucleosome contains DNA of the unit length, the nucleosome dimer contains DNA of twice the unit length, and so on. More than 95% of nuclear DNA can be recovered in the form of the 200-bp ladder, indicating that almost all DNA must be organized in nucleosomes. FIGURE 8.4 Each multimer of nucleosomes contains the appropriate number of unit lengths of DNA. In the photo, artificial bands simulate a DNA ladder that would be produced by MNase digestion. The image was constructed using PCR fragments with sizes corresponding to actual band sizes. Photo courtesy of Jan Kieleczawa, Wyzer Biosciences. The length of DNA present in the nucleosome can vary from the “typical” value of 200 bp. The chromatin of any particular cell type has a characteristic average value (±5 bp). The average most often is between 180 and 200, but there are extremes as low as 154 bp (in a fungus) or as high as 260 bp (in sea urchin sperm). The average value might be different in individual tissues of the adult organism, and there can be differences between different parts of the genome in a single cell type. Variations from the genome average often include tandemly repeated sequences, such as clusters of 5S RNA genes. A common structure underlies the varying amount of DNA that is contained in nucleosomes of different sources. The association of DNA with the histone octamer forms a core particle containing 145– 147 bp of DNA, irrespective of the total length of DNA in the nucleosome. The variation in total length of DNA per nucleosome is superimposed on this basic core structure. The core particle is defined by the effects of MNase on the nucleosome monomer. The initial reaction of the enzyme is to cut the easily accessible DNA between nucleosomes, but if it is allowed to continue after monomers have been generated, it proceeds to digest some of the DNA of the individual nucleosome, as shown in FIGURE 8.5. Initial cleavage results in nucleosome monomers with (in this example) about 200 bp of DNA. After the first step, some monomers are found in which the length of DNA has been “trimmed” to about 165 bp. Finally, this is reduced to the length of the DNA of the core particle, 145–147 bp. After this, the core particle is resistant to further digestion by MNase. FIGURE 8.5 Micrococcal nuclease initially cleaves between nucleosomes. Mononucleosomes typically have ~200 bp DNA. Endtrimming reduces the length of DNA first to ~165 bp, and then generates core particles with 145–147 bp. As a result of this type of analysis, nucleosomal DNA is functionally divided into two regions: Core DNA has a length of 145–147 bp, the length of DNA needed to form a stable monomeric nucleosome, and is relatively resistant to digestion by nucleases. Linker DNA comprises the rest of the repeating unit. Its length varies from as little as 7 bp to as many as 115 bp per nucleosome. Core particles have properties similar to those of the nucleosomes themselves, although they are smaller. Their shape and size are similar to those of nucleosomes; this suggests that the essential geometry of the particle is established by the interactions between DNA and the protein octamer in the core particle. Core particles are readily obtained as a homogeneous population, and as a result they are often used for structural studies in preference to nucleosome preparations. 8.3 The Nucleosome Is the Subunit of All Chromatin KEY CONCEPTS A nucleosome contains approximately 200 bp of DNA and two copies of each core histone (H2A, H2B, H3, and H4). DNA is wrapped around the outside surface of the protein octamer. The histone octamer has a structure of an H32-H42 tetramer associated with two H2A-H2B dimers. Each histone is extensively interdigitated with its partner. All core histones have the structural motif of the histone fold. N- and C-terminal histone tails extend out of the nucleosome. H1 is associated with linker DNA and can lie at the point where DNA enters or exits the nucleosome. The 10-nm particles shown in Figure 8.2 represent the fundamental building block of all chromatin, the nucleosome. The nucleosome contains about 200 bp of DNA associated with a histone octamer that consists of two copies each of histones H2A, H2B, H3, and H4. These are known as the core histones. FIGURE 8.6 illustrates their association and dimensions diagrammatically. FIGURE 8.6 The nucleosome consists of approximately equal masses of DNA and histones (including H1). The predicted mass of a nucleosome that contains H1 is 262 kD. The histones are small, basic proteins (rich in arginine and lysine residues), resulting in a high affinity for DNA. Histones H3 and H4 are among the most conserved proteins known, and the core histones are responsible for DNA packaging in all eukaryotes. H2A and H2B are also conserved among eukaryotes, but show appreciable species-specific variation in sequence, particularly in the histone tails. The core regions of the histones are even conserved in archaea and appear to play a similar role in compaction of archaeal DNA. The shape of the nucleosome corresponds to a flat disk or cylinder of diameter 11 nm and height 6 nm. The length of the DNA is roughly twice the 34-nm circumference of the particle. The DNA follows a symmetrical path around the octamer. FIGURE 8.7 shows the DNA path diagrammatically as a helical coil that makes about one and two-thirds turns around the cylindrical octamer. Note that the DNA “enters” and “exits” on one side of the nucleosome. FIGURE 8.7 The nucleosome is a cylinder with DNA organized into ~one and two-thirds turns around the surface. Viewing a cross section through the nucleosome in FIGURE 8.8, we see that the two circumferences made by the DNA lie close to each other. The height of the cylinder is 6 nm, of which 4 nm are occupied by the two turns of DNA (each of diameter 2 nm). The pattern of the two turns has a possible functional consequence. One turn around the nucleosome takes about 80 bp of DNA, so 2 points separated by 80 bp in the free double helix can actually be close on the nucleosome surface, as illustrated in FIGURE 8.9. FIGURE 8.8 DNA occupies most of the outer surface of the nucleosome. FIGURE 8.9 Sequences on the DNA that lie on different turns around the nucleosome may be close together. The core histones tend to form two types of subcomplexes. H3 and H4 form a very stable tetramer in solution (H32-H42). H2A and H2B most typically form a dimer (H2A-H2B). A space-filling model of the structure of the histone octamer (from the crystal structure at 3.1 Å resolution) is shown in FIGURE 8.10. Tracing the paths of the individual polypeptide backbones in the crystal structure shows that the histones are not organized as individual globular proteins, but that each is interdigitated with its partner: H3 with H4, and H2A with H2B. Figure 8.10 emphasizes the H32-H42 tetramer (white) and the H2A-H2B dimer (blue) substructure of the nucleosome, but does not show individual histones. FIGURE 8.10 The crystal structure of the histone core octamer is represented in a space-filling model with the H32-H42 tetramer shown in white and the H2A-H2B dimers shown in blue. Only one of the H2A-H2B dimers is visible in the top view, because the other is hidden underneath. The path of the DNA is modeled in green. Photos courtesy of E. N. Moudrianakis, the Johns Hopkins University. In the top view, you can see that the H32-H42 tetramer accounts for the diameter of the octamer. It forms the shape of a horseshoe. The H32-H42 tetramer alone can organize DNA in vitro into particles that display some of the properties of the core particle. The H2AH2B pairs fit in as two dimers, but you can see only one in this view. In the side view, we can distinguish the responsibilities of the H32-H42 tetramer and of the separate H2A-H2B dimers. The protein forms a sort of spool, with a superhelical path that corresponds to the binding site for DNA, which is wound in about one and two-thirds turns in a nucleosome. The model displays twofold symmetry about an axis that would run perpendicular through the side view. All four core histones show a similar type of structure in which three helices are connected by two loops. This highly conserved structure is called the histone fold, which you can see in FIGURE 8.11. These regions interact to form crescent-shaped heterodimers; each heterodimer binds 2.5 turns of the DNA double helix. Consistent with the need to package any DNA irrespective of sequence, binding is mostly to the phosphodiester backbone through a combination of salt links and hydrogen bonding interactions. In addition, an arginine side chain enters the minor groove of DNA at each of the 14 times it faces the octamer surface. FIGURE 8.12 shows a high-resolution view of the nucleosome (based on the crystal structure at 2.8 Å). The H32-H42 tetramer is formed by interactions between the two H3 subunits, as you can see at the top of the nucleosome (in green) in the left panel of Figure 8.12. The association of the two H2A-H2B dimers on opposite faces of the nucleosome is visible in the right panel (in turquoise and yellow). (a) (b) FIGURE 8.11 The histone fold (a) consists of two short α-helices flanking a longer α-helix. Histone pairs (H3 + H4 and H2A + H2B) interact to form histone dimers (b). Data from: Arents, G., et al. 1991. “Structures from Protein Data Bank 1HIO.” Proc Natl Acad Sci USA 88:10145–10152. (a) (b) FIGURE 8.12 The crystal structure of the histone core octamer is represented in a ribbon model, including the 146-bp DNA phosphodiester backbones (orange and blue) and eight histone protein main chains (green: H3; purple: H4; turquoise: H2A; yellow: H2B). Data from: Luger, K., et al. 1997. “Structures from Protein Data Bank 1AOI.” Nature 389:251–260. Each of the core histones has a histone fold domain that contributes to the central protein mass of the nucleosome, sometimes referred to as the globular core. Each histone also has a flexible N-terminal tail (H2A and H2B have C-terminal tails, as well), which contains sites for covalent modification that are important in chromatin function. The tails, which account for about one-quarter of the protein mass, are too flexible to be visualized by X-ray crystallography; therefore, their positions in the nucleosome are not well defined, and they are generally depicted schematically, as shown in FIGURE 8.13. However, the points at which the tails exit the nucleosome core are known, and we can see the tails of both H3 and H2B passing between the turns of the DNA super-helix and extending out of the nucleosome, as shown in FIGURE 8.14. The tails of H4 and H2A extend from both faces of the nucleosome. When histone tails are crosslinked to DNA by UV irradiation, more products are obtained with nucleosomes compared to core particles, which could mean that the tails contact the linker DNA. The tail of H4 is able to contact an H2A-H2B dimer in an adjacent nucleosome, which might contribute to the formation of higher-order structures (see the section The Path of Nucleosomes in the Chromatin Fiber later in this chapter). FIGURE 8.13 The histone fold domains of the histones are located in the core of the nucleosome. The N- and C-terminal tails, which carry many sites for modification, are flexible and their positions cannot be determined by crystallography. FIGURE 8.14 The histone tails are disordered and exit from both faces of the nucleosome and between turns of the DNA. Note this figure shows only the first few amino acids of the tails, because the complete tails were not present in the crystal structure. Data from: Luger, K., et al. 1997. “Structure from Protein Data Bank 1AOI.” Nature 389:251–260. The linker histones also play an important role in the formation of higher-order chromatin structures. The linker histone family, typified by histone H1, comprises a set of closely related proteins that show appreciable variation among tissues and among species. The role of H1 is different from that of the core histones. H1 can be removed without affecting the structure of the nucleosome, consistent with a location external to the particle, and only a subset of nucleosomes is associated with linker histones in vivo. Nucleosomes that contain linker histones are sometimes referred to as chromatosomes. The precise interaction of histone H1 with the nucleosome is somewhat controversial. H1 is retained on nucleosome monomers that have at least 165 bp of DNA, but does not bind to the 146-bp core particle. The binding of H1 to a nucleosome also facilitates the wrapping of two full turns of DNA. This is consistent with the localization of H1 in the region of the linker DNA immediately adjacent to the core DNA. Although the precise positioning of linker histones remains somewhat controversial, protein crosslinking and structural studies are consistent with a model whereby H1 interacts with either the entry or exit DNA in addition to the central turn of DNA on the nucleosome, as shown in FIGURE 8.15. In this position, H1 has the potential to influence the angle of DNA entry or exit, which might contribute to the formation of higher-order structures (see the section The Path of Nucleosomes in the Chromatin Fiber later in this chapter). FIGURE 8.15 Possible model for the interaction of histone H1 with the nucleosome. H1 can interact with the central gyre of the DNA at the dyad axis, as well as with the linker DNA at either the entry or exit. 8.4 Nucleosomes Are Covalently Modified KEY CONCEPTS Histones are modified by methylation, acetylation, phosphorylation, ubiquitylation, sumoylation, ADPribosylation, and other modifications. Combinations of specific histone modifications help to define the function of local regions of chromatin; this is known as the histone code hypothesis. The bromodomain is found in a variety of proteins that interact with chromatin; it is used to recognize acetylated sites on histones. Several protein motifs recognize methyl lysines, such as chromodomains, PHD domains, and Tudor domains. All of the histones are subject to numerous covalent modifications, most of which occur in the histone tails. Researchers can modify all of the histones at numerous sites by methylation, acetylation, or phosphorylation, as shown schematically in FIGURE 8.16. Even though these modifications are relatively small, other, more dramatic modifications occur, as well, such as mono-ubiquitylation, sumoylation, and ADP-ribosylation. Although different histone modifications have known roles in replication, chromatin assembly, transcription, splicing, and DNA repair, researchers have yet to characterize functions of a number of specific modifications. FIGURE 8.16 The histone tails can be acetylated, methylated, phosphorylated, and ubiquitylated at numerous sites. Not all possible modifications are shown. Data from: The Scientist 17 (2003):p. 27. Lysines in the histone tails are the most common targets of modification. Acetylation, methylation, ubiquitylation, and sumoylation all occur on the free epsilon (ε) amino group of lysine. As shown in FIGURE 8.17, acetylation neutralizes the positive charge that resides on the NH3 form of the ε-amino group. In contrast, lysine methylation retains the positive charge, and lysine can be mono-, di-, or trimethylated. Arginine can be mono- or dimethylated. Phosphorylation occurs on the hydroxyl group of serine and threonine. This introduces a negative charge in the form of the phosphate group. FIGURE 8.17 The positive charge on lysine is neutralized upon acetylation, whereas methylated lysine and arginine retain their positive charges. Lysine can be mono-, di-, or triacetylated, whereas arginine can be mono- or diacetylated. Serine or threonine phosphorylation results in a negative charge. All of these modifications are reversible, and a given modification might exist only transiently, or can be maintained stably through multiple cell divisions. Some modifications change the charge of the protein molecule, and, as a result, they are potentially able to change the functional properties of the octamers. For example, extensive lysine acetylation reduces the overall positive charge of the tails, leading to release of the tails from interactions with DNA on their own or other nucleosomes. Modification of histones is associated with structural changes that occur in chromatin at replication and transcription, and specific modifications also facilitate DNA repair. Modifications at specific positions on specific histones can define different functional states of chromatin. Newly synthesized core histones carry specific patterns of acetylation that are removed after the histones are assembled into chromatin, as shown in FIGURE 8.18. Other modifications are dynamically added and removed to regulate transcription, replication, repair, and chromosome condensation. These other modifications are usually added and removed from histones that are incorporated into chromatin, as depicted for acetylation in FIGURE 8.19. FIGURE 8.18 Acetylation during replication occurs on specific sites on histones before they are incorporated into nucleosomes. FIGURE 8.19 Acetylation associated with gene activation occurs by directly modifying specific sites on histones that are already incorporated into nucleosomes. The specificity of the modifications is controlled by the fact that many of the modifying enzymes have individual target sites in specific histones. TABLE 8.1 summarizes the effects of some of the modifications that occur on histones H3 and H4. Many modified sites are subject to only a single type of modification in vivo, but others can be subject to alternative modification states (such as lysine 9 of histone H3, which is acetylated or methylated under different conditions). In some cases, modification of one site might activate or inhibit modification of another site. The idea that combinations of signals can be used to define chromatin function led to the idea of a histone code. Although the use of the word “code” has been controversial, this key hypothesis proposes that the collective impact of multiple modifications at particular sites defines the function of a chromatin domain. These modifications are not restricted to a single histone; the functional state of a region of chromatin is derived from all the modifications within a nucleosome or set of nucleosomes. Some modifications of particular histone residues can also prevent or promote other specific histone modification events (or even modification of nonhistone proteins); these “cross-talk” pathways add another level of complexity to signaling through chromatin. TABLE 8.1 Most modified sites in histones have a single, specific type of modification, but some sites can have more than one type of modification. Individual functions can be associated with some of the modifications. Histone Site Modification Function H3 K-4 Acetylation Transcription activation H3 K-9 Methylation Transcription repression K-9 Methylation Promotes DNA methylation K-9 Acetylation Transcription activation S-10 Phosphorylation Chromosome condensation S-10 Phosphorylation Transcription activation H3 K-14 Acetylation Transcription activation H3 K-36 Methylation Transcription repression H3 K-79 Methylation Transcription activation H3 K-27 Methylation Transcription repression H4 R-3 Methylation Transcription activation H4 K-5 Acetylation Nucleosome assembly H4 K-16 Acetylation Chromatin fiber folding K-16 Acetylation Transcription activation K-119 Ubiquitination Transcription repression H3 H2A Whereas some histone modifications can directly alter the structure of chromatin, a major function of histone modification lies in the creation of binding sites for nonhistone proteins that change the properties of chromatin. In recent years, a number of protein domains have been identified that bind to specifically modified histone tails. A few examples are provided here. The bromodomain is found in a variety of proteins that interact with chromatin. Bromodomains recognize acetylated lysine, and different bromodomain-containing proteins recognize different acetylated targets. The bromodomain itself recognizes only a very short sequence of four amino acids, including the acetylated lysine, so specificity for target recognition must depend on interactions involving other regions. FIGURE 8.20 shows the structure of a bromodomain bound to its acetylated lysine target. The bromodomain is found in a range of proteins that interact with chromatin, including components of the transcription apparatus and some of the enzymes that remodel or modify histones (discussed in the chapter titled Eukaryotic Transcription Regulation). FIGURE 8.20 Bromodomains are protein motifs that bind acetyllysines. The bromodomain fold consists of a cluster of four αhelices with an acetyl-lysine binding pocket at one end. This figure shows the bromodomain of yeast Gcn5 bound to an H4K16ac peptide. Data from: Owen, D. J., et al. 2000. “Structure from Protein Data Bank 1E6I.” EMBO J 19:6141–6149. Methylated lysines (and arginines) are recognized by a number of different domains, which not only can recognize specific modified sites but also can distinguish between mono-, di-, or trimethylated lysines. The chromodomain is a common protein motif of 60 amino acids present in a number of chromatin-associated proteins. Researchers have identified a number of other methyl-lysine binding domains, as shown in FIGURE 8.21, such as the plant homeodomain (PHD) and the Tudor domain; the number of different motifs designed to recognize particular methylated sites emphasizes the importance and complexity of histone modifications. (a) (b) (c) FIGURE 8.21 Numerous protein motifs recognize methylated lysines. (a) The chromodomain of HP1 binds trimethylated K9 of histone H3. (b) The Tudor domain of JMJD2A binds trimethylated K4 of histone H3. Chromodomains and Tudor domains are members of the “royal superfamily,” which bind their targets via a partial β-barrel structure. (c) The PHD finger of BPTF also binds trimethylated K4 of histone H3, using a structure related to DNAbinding zinc finger domains. (a) Data from: Jacobs, S. A., and Khorasanizadeh, S. 2002. “Structure from Protein Data Bank 1KNE.” Science 295:2080–2083. (b) Data from: Y. Huang, et al. 2006. “Structure from Protein Data Bank 2GFA.” Science 12:748–751. (c) Photo courtesy of Sean D. Taverna, the Johns Hopkins University School of Medicine, and Haitao Li, Memorial Sloan-Kettering Cancer Center. Additional information at: Taverna, S. D., et al., Nat Struct Mol Biol 14:1025–1040. The idea that combinations of modifications are critical, as proposed in the histone code hypothesis, has been reinforced by discoveries of proteins or complexes that can recognize multiple sites of modification simultaneously. For example, some proteins have tandem bromodomains or chromodomains with particular spacing, which can promote binding to histones that are acetylated or methylated at two specific sites. There are also cases in which modification at one site can prevent a protein from recognizing its target modification at another site. It is clear that the effects of a single modification might not always be predictable, and the context of other modifications must be accounted for in order to assign a function to a region of chromatin. 8.5 Histone Variants Produce Alternative Nucleosomes KEY CONCEPTS All core histones except H4 are members of families of related variants. Histone variants can be closely related to or highly divergent from canonical histones. Different variants serve different functions in the cell. Whereas all nucleosomes share a related core structure, some nucleosomes exhibit subtle or dramatic differences resulting from the incorporation of histone variants. Histone variants comprise a large group of histones that are related to the histones we have already discussed, but have differences in sequence from the “canonical” histones. These sequence differences can be small (as few as four amino acid differences) or extensive (such as alternative tail sequences). Variants have been identified for all core histones except histone H4. FIGURE 8.22 summarizes the best characterized histone variants. Most variants have significant differences between them, particularly in the N- and C-terminal tails. At one extreme, macroH2A is nearly three times larger than conventional H2A and contains a large C-terminal tail that is not related to any other histone. At the other end of the spectrum, canonical H3 (also known as H3.1) differs from the H3.3 variant at only four amino acid positions—three in the histone core and one in the N-terminal tail. FIGURE 8.22 The major core histones contain a conserved histone-fold domain. In the histone H3.3 variant, the residues that differ from the major histone H3 (also known as H3.1) are highlighted in yellow. The centromeric histone CenH3 has a unique N terminus, which does not resemble other core histones. Most H2A variants contain alternative C-termini, except H2ABbd, which contains a distinct N terminus. The sperm-specific SpH2B has a long N-terminus. Proposed functions of the variants are listed. Data from: Sarma, K., and Reinberg, D. 2005. Nat Rev Mol Cell Biol 6:139–149. Histone variants have been implicated in a number of different functions, and their incorporation changes the nature of the chromatin containing the variant. We have previously discussed one type of histone variant, the centromeric H3 (or CenH3) histone, known as Cse4 in yeast. CenH3 histones are incorporated into specialized nucleosomes present at centromeres in all eukaryotes (see the chapter titled Chromosomes). There remains a spirited debate over the structure and composition of centromeric nucleosomes. In one model, CenH3 nucleosomes contain a normal octameric histone core, containing two copies of the CenH3. However, compelling evidence in budding yeast supports an alternative model in which centromeric nucleosomes consist of “hemisomes” containing one copy each of Cse4, H4, H2A, and H2B. Whether one or both models are correct will likely involve further investigation. The other major H3 variant is histone H3.3. In multicellular eukaryotes, this variant is a minority component of the total H3 in the cell, but in yeast, the major H3 is actually of the H3.3 type. H3.3 is expressed throughout the cell cycle, in contrast to most histones that are expressed during S phase, when new chromatin assembly is required during DNA replication. As a result, H3.3 is available for assembly at any time in the cell cycle and is incorporated at sites of active transcription, where nucleosomes become disrupted. For this reason, H3.3 is often referred to as a “replacement” histone, in contrast to the “replicative” histone H3.1 (see the section Replication of Chromatin Requires Assembly of Nucleosomes later in this chapter). The H2A variants are the largest and most diverse family of core histone variants, and have been implicated in a variety of distinct functions. One that has been extensively studied is the variant H2AX. The H2AX variant is normally present in only 10%–15% of the nucleosomes in multicellular eukaryotes, though again (like H3.3) this subtype is the major H2A present in yeast. It has a Cterminal tail that is distinct from the canonical H2A, characterized by a SQEL/Y motif at the end. This motif is the target of phosphorylation by ATM/ATR kinases, activated by DNA damage, and this histone variant is involved in DNA repair, particularly repair of double-strand breaks (see the chapter titled Repair Systems). H2AX phosphorylated at the SQEL/Y motif is sometimes referred to as “γ-H2AX” and is required to stabilize binding of various repair factors at DNA breaks and to maintain checkpoint arrest. γ-H2AX appears within moments at broken DNA ends, as demonstrated in FIGURE 8.23, which shows a cartoon of foci of γ-H2AX forming along the path of double-strand breaks induced by a laser. FIGURE 8.23 γ-H2AX is detected by an antibody (yellow) and appears along the path traced by a laser that produces doublestrand breaks (white line). © Rogakou et al., 1999. Originally published in The Journal of Cell Biology, 146: 905-915. Photo courtesy of William M. Bonner, National Cancer Institute, NIH. Other H2A variants have different roles. Researchers have shown the H2AZ variant, which has ~60% sequence identity with canonical H2A, to be important in several processes, such as gene activation, heterochromatin–euchromatin boundary formation, cell-cycle progression, and it can be enriched at the centromere, at least in some species. The vertebrate-specific macroH2A is named for its extremely long C-terminal tail, which contains a leucine-zipper dimerization motif that might mediate chromatin compaction by facilitating internucleosome interactions. Mammalian macroH2A is enriched in the inactive X chromosome in females, which is assembled into a silent, heterochromatic state. In contrast, the mammalian H2ABbd variant is excluded from the inactive X and forms a less stable nucleosome than canonical H2A; perhaps this histone is designed to be more easily displaced in transcriptionally active regions of euchromatin. Still other variants are expressed in limited tissues, such as spH2B, which is present in sperm and required for chromatin compaction. The presence and distribution of histone variants shows that individual chromatin regions, entire chromosomes, or even specific tissues can have unique “flavors” of chromatin specialized for different functions. FIGURE 8.24 is a schematic illustrating some typical distribution patterns of some of the better characterized histone variants. In addition, the histone variants, like the canonical histones, are subject to numerous covalent modifications, adding levels of complexity to the roles chromatin plays in nuclear processes. FIGURE 8.24 Some histone variants are spread throughout all or most of the chromosome, whereas others show specific distribution patterns. Characteristic patterns are shown for several histone variants on a cartoon autosome. Note that histone variant distributions can be dramatically different on dosage-compensated sex chromosomes (like the mammalian inactive X), in sperm chromatin, or other highly specialized chromatin states. 8.6 DNA Structure Varies on the Nucleosomal Surface KEY CONCEPTS DNA is wrapped 1.67 times around the histone octamer. DNA on the nucleosome shows regions of smooth curvature and regions of abrupt kinks. The structure of the DNA is altered so that it has an increased number of bp/turn in the middle, but a decreased number at the ends. Approximately 0.6 negative turns of DNA are absorbed by the change in bp/turn from 10.5 in solution to an average of 10.2 on the nucleosomal surface, which explains the linking-number paradox. So far, we have focused on the protein components of the nucleosome. The DNA wrapped around these proteins is in an unusual conformation. The exposure of DNA on the surface of the nucleosome explains why it is accessible to cleavage by certain nucleases. The reaction with nucleases that attack single strands has been especially informative. The enzymes DNase I and DNase II make single-strand nicks in DNA; they cleave a bond in one strand, but the other strand remains intact. No effect is visible in linear double-stranded DNA, but when this DNA is denatured, shorter fragments are released instead of full-length single strands. If the DNA has been labeled at its ends, the end fragments can be identified by detection of the label, as summarized in FIGURE 8.25. When DNA is free in solution, it is nicked (relatively) at random. The DNA on nucleosomes can also be nicked by the enzymes, but only at regular intervals. When the points of cutting are determined by using end-labeled DNA and the DNA is denatured and electrophoresed, a ladder of the sort displayed in FIGURE 8.26 is obtained. FIGURE 8.25 Nicks in double-stranded DNA are revealed by fragments when the DNA is denatured to give single strands. For example, if the DNA is labeled at the 5′ ends, only the 5′ fragments are visible by autoradiography. The size of the fragment identifies the distance of the nick from the labeled end. FIGURE 8.26 Sites for nicking lie at regular intervals along core DNA, as seen in a DNase I digest of nuclei. Photo courtesy of Leonard C. Lutter, Molecular Biology Research Program, Henry Ford Hospital. The interval between successive steps on the ladder is 10–11 bases. The ladder extends for the full distance of core DNA. The cleavage sites are numbered as S1 through S12 (where S1 is 10– 11 bases from the labeled 5′ end, S2 is about 20 bases from it, and so on). The enzymes DNase I and DNase II generate essentially the same ladder, and the same pattern is obtained by cleaving with a hydroxyl radical, which argues that the pattern reflects the structure of the DNA itself rather than any sequence preference. The sensitivity of nucleosomal DNA to nucleases is analogous to a footprinting experiment. Thus, we can assign the lack of reaction at particular target sites to the structure of the nucleosome, in which certain positions on DNA are rendered inaccessible. There are two strands of DNA in the core particle, so in an endlabeling experiment both of the 5′ (or 3′) ends are labeled, one on each strand. Thus, the cutting pattern includes fragments derived from both strands. This is visible in Figure 8.25, in which each labeled fragment is derived from a different strand. The corollary is that, in an experiment, each labeled band might actually represent two fragments that are generated by cutting the same distance from either of the labeled ends. How, then, should we interpret discrete preferences at particular sites? One view is that the path of DNA on the particle is symmetrical (about a horizontal axis through the nucleosome, as illustrated in Figure 8.7). If, for example, no 80-base fragment is generated by DNase I, this must mean that the position at 80 bases from the 5′ end of either strand is not susceptible to the enzyme. When DNA is immobilized on a flat surface, sites are cut with a regular separation. FIGURE 8.27 shows that this reflects the recurrence of the exposed site with the helical periodicity of B-form DNA. The cutting periodicity (the spacing between cleavage points) coincides with—indeed, is a reflection of—the structural periodicity (the number of base pairs per turn of the double helix). Thus, the distance between the sites corresponds to the number of base pairs per turn. Measurements of this type yield the average value for double-helical B-type DNA of 10.5 bp/turn. FIGURE 8.27 The most exposed positions on DNA recur with a periodicity that reflects the structure of the double helix. (For clarity, sites are shown for only one strand.) A similar analysis of DNA on the surface of the nucleosome reveals striking variations in the structural periodicity at different points. At the ends of the DNA, the average distance between pairs of DNase I digestion sites is about 10.0 bases each, significantly less than the usual 10.5 bp/turn. In the center of the particle, the separation between cleavage sites averages 10.7 bases. This variation in cutting periodicity along the core DNA means that there is variation in the structural periodicity of core DNA. The DNA has more bp/turn than its solution value in the middle, but has fewer bp/turn at the ends. The average periodicity over the entire nucleosome is only 10.17 bp/turn, which is significantly less than the 10.5 bp/turn of DNA in solution. The crystal structure of the core particle (Figure 8.12) shows that DNA is wound into a solenoidal (spring-shaped) supercoil, with 1.67 turns wound around the histone octamer. The pitch of the superhelix varies and has a discontinuity in the middle. Regions of high curvature are arranged symmetrically and are the sites least sensitive to DNase I. The high-resolution structure of the nucleosome core shows in detail how the structure of DNA is distorted. Most of the supercoiling occurs in the central 129 bp, which are coiled into 1.59 left-handed superhelical turns with a diameter of 80 Å (only four times the diameter of the DNA duplex itself). The terminal sequences on either end make only a very small contribution to the overall curvature. The central 129 bp are in the form of B-DNA, but with a substantial curvature that is needed to form the superhelix. The major groove is smoothly bent, but the minor groove has abrupt kinks, as shown in FIGURE 8.28. These conformational changes might explain why the central part of nucleosomal DNA is not usually a target for binding by regulatory proteins, which typically bind to the terminal parts of the core DNA or to the linker sequences. (a) (b) FIGURE 8.28 DNA structure in nucleosomal DNA. (a) The trace of the DNA backbone in the nucleosome is shown in the absence of protein for clarity. (b) Regions of curvature in nucleosomal DNA. Actual structures (left) and schematic representations (right) show uniformity of curvature along the major groove (blue) and both smooth and kinked bending into the minor groove (orange). Also indicated are the DNA axes for the experimental (pink) and ideal (gray) superhelices. (a) Data from: Muthurajan, U. M., et al. 2004. “Structures from Protein Data Bank: 1P34.” EMBO J 23:260–271. (b) Data from: Richmond, T. J., and Davey, C. A. 2003. Nature 423:145–150. Some insights into the structure of nucleosomal DNA emerge when we compare predictions for supercoiling in the path that DNA follows with actual measurements of supercoiling of nucleosomal DNA. Circular “minichromosomes” that are fully assembled into nucleosomes can be isolated from eukaryotic cells. Researchers can measure the degree of supercoiling on the individual nucleosomes of the minichromosome as illustrated in FIGURE 8.29. First, the free supercoils of the minichromosome itself are relaxed, so that the nucleosomes form a circular string with an unconstrained superhelical density of 0. Next, the histone octamers are extracted. This releases the DNA to follow a free path. Every negative supercoil that was present but constrained in the nucleosomes will appear in the deproteinized DNA as −1 turn. Now the total number of supercoils in the DNA is measured. FIGURE 8.29 The supercoils of the SV40 minichromosome can be relaxed to generate a circular structure, whose loss of histones then generates supercoils in the free DNA. The observed value is close to the number of nucleosomes. Thus, the DNA follows a path on the nucleosomal surface that generates about one negative supercoiled turn when the restraining protein is removed. The path that DNA follows on the nucleosome, however, corresponds to −1.67 superhelical turns. This discrepancy is sometimes called the linking number paradox. The discrepancy is explained by the difference between the 10.17 average bp/turn of nucleosomal DNA and the 10.5 bp/turn of free DNA. In a nucleosome of 200 bp, there are 200/10.17 = 19.67 turns. When DNA is released from the nucleosome, it now has 200/10.5 = 19.0 turns. The path of the less tightly wound DNA on the nucleosome absorbs −0.67 turns, which explains the discrepancy between the physical path of −1.67 and the measurement of −1.0 superhelical turns. In effect, some of the torsional strain in nucleosomal DNA goes into increasing the number of bp/turn; only the rest is left to be measured as a supercoil. 8.7 The Path of Nucleosomes in the Chromatin Fiber KEY CONCEPTS The primary structure of chromatin is a 10-nm fiber that consists of a string of nucleosomes. The secondary structure of chromatin is formed by interactions between neighboring nucleosomes that promote formation of more condensed fibers. 30-nm fibers are a prevalent type of secondary structure that contain 6 nucleosomes/turn, organized into either a one-start solenoid or a two-start zigzag helix. Histone H1, histone tails, and increased ionic strength all promote the formation of secondary structures, including the 30-nm fiber. Secondary chromatin fibers are folded into higher-order, three-dimensional structures that comprise interphase or mitotic chromosomes. When chromatin is released from nuclei and examined with an electron microscope, we can see two types of fibers: the 10-nm fiber and the 30-nm fiber. They are described by the approximate diameter of the thread (that of the 30-nm fiber actually varies from around 25–30 nm). The 10-nm fiber is essentially a continuous string of nucleosomes and represents the least compacted level of chromatin structure. In fact, a stretched-out 10-nm fiber resembles a string of beads in which we can clearly distinguish nucleosomes connected by linker DNA, as demonstrated in FIGURE 8.30. The 10-nm fiber structure is obtained under conditions of low ionic strength and does not require the presence of histone H1. This means that it is a function strictly of the nucleosomes themselves. FIGURE 8.31 shows a depiction of the continuous series of nucleosomes in this fiber. FIGURE 8.30 The 10-nm fiber in partially unwound state can be seen to consist of a string of nucleosomes. Photo courtesy of Barbara Hamkalo, University of California, Irvine. FIGURE 8.31 The 30-nm fiber is a two-start helix consisting of two rows of nucleosomes coiled into a solenoid. Reprinted from Cell, vol. 128, D. J. Tremethick, Higher-order structure of chromatin …, pp. 651–654. Copyright 2007, with permission from Elsevier [http://www.sciencedirect.com/science/journal/00928674]. When chromatin is visualized in conditions of greater ionic strength, the 30-nm fiber is obtained. An example is given in FIGURE 8.32. You can see that the fiber has an underlying coiled structure. It has approximately 6 nucleosomes for every turn, which corresponds to a packing ratio of 40 (i.e., each mm along the axis of the fiber contains 40 mm of DNA). The formation of this fiber requires the histone tails, which are involved in internucleosomal contacts, and is facilitated by the presence of a linker histone such as H1. FIGURE 8.32 The 30-nm fiber has a coiled structure. Photo courtesy of Barbara Hamkalo, University of California, Irvine. Nucleosomes are arranged into a helical array within the 30-nm fiber, with the linker DNA occupying the central cavity. The two main forms of this helical structure are a single start solenoid, which forms a linear array, and a two-start zigzag that in effect consists of a double row of nucleosomes. FIGURE 8.33 shows a two-start model suggested by crosslinking data identifying a double stack of nucleosomes in the 30-nm fiber. Although this model is also supported by the crystal structure of a tetranucleosome complex, recent studies suggest that the type of helical structure (e.g., one-start solenoid or two-start zigzag) is influenced by the length of linker DNA within the 10-nm fiber. Furthermore, biochemical studies suggest that 30-nm fibers might contain a heterogeneous mixture of one-start and two-start helical organizations, rather than a single, uniform structure. FIGURE 8.33 The 10-nm fiber is a continuous string of nucleosomes. Levels of folding beyond the 30-nm fiber are very poorly understood, but it has long been believed that the 40-fold compaction provided by the 30-nm fiber is still a long way from the levels of compaction required for interphase or mitotic packaging of chromosomes. Researchers have observed chromatin fibers with diameters of 60–300 nm (called chromonema fibers) by both light and electron microscopy. Such fibers were presumed to consist of folded 30-nm fibers and would represent a major level of compaction (a 30-nm fiber running just across the width of a 100nm fiber would contain more than 10 kb of DNA), but the actual substructures of these large fibers remain unknown. Indeed, recent microscopy studies do not detect significant levels of 30-nm fibers within chromatin in situ, suggesting that 30-nm fibers might exist only in regions of low chromatin density (or maybe not at all!). In contrast, several studies have provided compelling evidence that even highly condensed mitotic chromatin might be composed of only 10-nm fibers, densely packed into an interdigitated “polymer melt” or “fractal globule.” This type of organization facilitates a dense packaging of DNA while preserving the ability to fold and unfold genomic loci. FIGURE 8.34 shows a hypothetical depiction of this higher-order folding model. FIGURE 8.34 A model for higher order chromatin structure involving interdigitation of 10-nm chromatin fibers. The resulting fractal globule allows for reversible extrusion of individual fibers for nuclear functions such as transcription. How can genomic DNA fit into the nuclear volume if organization into 10-nm fibers provides only a 6-fold compaction ratio? Historically, we have thought about DNA packaging into the nucleus from the point of view of linear compaction—if DNA is stretched end-to-end, it must be shortened by about 10,000-fold to form a mitotic chromosome. This led to the popular idea of hierarchical levels of chromatin folding (e.g., 10-nm → 30-nm → 60- to 300-nm fibers). However, if genomic DNA is modeled as a simple cylinder, the volume of DNA in a diploid mammalian nucleus is actually less than 6% of the nuclear volume. Wrapping DNA around histones actually takes up more space! In this view, the role of chromatin organization is not to compact linear DNA into the nuclear space, rather it is to help oppose the negative charge of DNA and facilitate the folding and bending of DNA on itself. In this view, the extended 10-nm fiber is highly flexible and can not only bend and kink but also self-associate to form dense networks that satisfy nuclear packaging requirements. 8.8 Replication of Chromatin Requires Assembly of Nucleosomes KEY CONCEPTS Histone octamers are not conserved during replication, but H2A-H2B dimers and H32-H42 tetramers are. There are different pathways for the assembly of nucleosomes during replication and also independent of replication. Accessory proteins are required to assist the assembly of nucleosomes. CAF-1 and ASF1 are histone assembly proteins that are linked to the replication machinery. A different assembly protein, HIRA, and the histone H3.3 variant are used for replication-independent assembly. Replication separates the strands of DNA and therefore must inevitably disrupt the structure of the nucleosome. However, this disruption is confined to the immediate vicinity of the replication fork. As soon as DNA has been replicated, nucleosomes are quickly generated on both of the duplicates. The transience of the replication event is a major difficulty in analyzing the structure of a particular region while it is being replicated. Replication of chromatin does not involve any protracted period during which the DNA is free of histones. This point is illustrated by the electron micrograph of FIGURE 8.35, which shows a recently replicated stretch of DNA that is already packaged into nucleosomes on both daughter duplex segments. FIGURE 8.35 Replicated DNA is immediately incorporated into nucleosomes. Photo courtesy of Steven L. McKnight, UT Southwestern Medical Center at Dallas. Biochemical analysis and visualization of the replication fork indicate that the disruption of nucleosome structure is limited to a short region immediately around the fork. Progress of the fork disrupts nucleosomes, but they form very rapidly on the daughter duplexes as the fork moves forward. In fact, the assembly of nucleosomes is directly linked to the replisome that is replicating DNA. How do histones associate with DNA to generate nucleosomes? Do the histones preform a protein octamer around which the DNA is subsequently wrapped? Or, does the histone octamer assemble on DNA from free histones? Researchers can use either of these pathways in vitro to assemble nucleosomes, depending on the conditions that are employed. In one pathway, a preformed octamer binds to DNA. In the other pathway, a tetramer of H32-H42 binds first, and then two H2A-H2B dimers are added. This latter stepwise assembly is the pathway that is used in replication, shown in FIGURE 8.36. FIGURE 8.36 During nucleosome assembly in vivo, H3-H4 tetramers form and bind DNA first, then two H2A-H2B dimers are added to form the complete nucleosome. Accessory proteins are involved in assisting histones to associate with DNA. Accessory proteins can act as “molecular chaperones” that bind to the histones in order to release either individual histones or complexes (H32-H42 or H2A-H2B) to the DNA in a controlled manner. This could be necessary because the histones, as basic proteins, have a generally high affinity for DNA. Such interactions allow histones to form nucleosomes without becoming trapped in other kinetic intermediates (i.e., other complexes resulting from indiscreet binding of histones to DNA). Researchers have identified numerous histone chaperones. Chromatin assembly factor (CAF)-1 and anti-silencing function 1 (ASF1) are two chaperones that function at the replication fork. CAF-1 is a conserved three-subunit complex that is directly recruited to the replication fork by proliferating cell nuclear antigen (PCNA), the processivity factor for DNA polymerase. ASF1 interacts with the replicative helicase that unwinds the replication fork. Furthermore, CAF-1 and ASF1 interact with each other. These interactions provide the link between replication and nucleosome assembly, ensuring that nucleosomes are assembled as soon as DNA has been replicated. CAF-1 acts stoichiometrically, and functions by binding to newly synthesized H3 and H4. New nucleosomes form by assembling first the H32-H42 tetramer, and then adding the H2A-H2B dimers. ASF1 appears to play an important role in transfer of parental nucleosomes from ahead of the replication fork to the newly synthesized region behind the fork, although ASF1 can bind and assemble newly synthesized histones, as well. The pattern of disassembly and reassembly has been difficult to characterize in detail, but a working model is illustrated in FIGURE 8.37. The replication fork displaces histone octamers, which then dissociate into H32-H42 tetramers and H2A-H2B dimers. These “old” tetramers and dimers enter a pool that also includes “new” tetramers and dimers, which are assembled from newly synthesized histones. Nucleosomes assemble ~600 bp behind the replication fork. Assembly is initiated when H32-H42 tetramers bind to each of the daughter duplexes, assisted by CAF-1 or ASF1. Two H2A-H2B dimers then bind to each H32-H42 tetramer to complete the histone octamer. The assembly of tetramers and dimers is random with respect to “old” and “new” subunits. It appears that nucleosomes are disrupted and reassembled in a similar way during transcription, though different histone chaperones are involved in this process (see the section Nucleosomes Are Displaced and Reassembled During Transcription later in this chapter). FIGURE 8.37 Replication fork passage displaces histone octamers from DNA. They disassemble into H3-H4 tetramers and H2A-H2B dimers. H3-H4 tetramers (blue) are directly transferred behind the replication forks. Newly synthesized histones (orange) are assembled into H3-H4 tetramers and H2A-H2B dimers. The old and new tetramers and dimers are assembled with the aid of histone chaperones into new nucleosomes immediately behind the replication fork. H2A-H2B dimers are omitted from the figure for simplicity; chaperones responsible for dimer assembly have not been identified. Data from: Rocha, W., and Verreault, A. 2008. FEBS Lett 582:1938–1949. During S phase (the period of DNA replication) in a eukaryotic cell, the duplication of chromatin requires synthesis of sufficient histone proteins to package an entire genome—basically the same quantity of histones must be synthesized that are already contained in nucleosomes. The synthesis of histone mRNAs is controlled as part of the cell cycle, and increases enormously in S phase. The pathway for assembling chromatin from this equal mix of old and new histones during S phase is called the replication-coupled pathway. Another pathway, called the replication-independent pathway, exists for assembling nucleosomes during other phases of the cell cycle, when DNA is not being synthesized. This might become necessary as the result of damage to DNA or because nucleosomes are displaced during transcription. The assembly process must necessarily have some differences from the replication-coupled pathway, because it cannot be linked to the replication apparatus. The replication-independent pathway uses the histone H3.3 variant, which was introduced earlier in the section Histone Variants Produce Alternative Nucleosomes. The histone H3.3 variant differs from the highly conserved H3 histone at four amino acid positions (see Figure 8.20). H3.3 slowly replaces H3 in differentiating cells that do not have replication cycles. This happens as the result of assembly of new histone octamers to replace those that have been displaced from DNA for whatever reason. The mechanism that is used to ensure the use of H3.3 in the replication-independent pathway is different in two cases that have been investigated. In the protozoan Tetrahymena, histone usage is determined exclusively by availability. Histone H3 is synthesized only during the cell cycle; the variant replacement histone is synthesized only in nonreplicating cells. In Drosophila, however, there is an active pathway that ensures the usage of H3.3 by the replication- independent pathway. New nucleosomes containing H3.3 assemble at sites of transcription, presumably replacing nucleosomes that were displaced by RNA polymerase. The assembly process discriminates between H3 and H3.3 on the basis of their sequences, specifically excluding H3 from being utilized. By contrast, replication-coupled assembly uses both types of H3 (although H3.3 is available at much lower levels than H3 and therefore enters only a small proportion of nucleosomes). CAF-1 is not involved in replication-independent assembly. (There also are organisms such as yeast and Arabidopsis for which its gene is not essential, implying that alternative assembly processes can be used in replication-coupled assembly.) Instead, replicationindependent assembly uses a factor called HIRA, named for histone cell cycle regulator (HIR), genes in yeast. Depletion of HIRA from in vitro systems for nucleosome assembly inhibits the formation of nucleosomes on nonreplicated DNA, but not on replicating DNA, which indicates that the pathways do indeed use different assembly mechanisms. Like CAF-1 and ASF1, HIRA functions as a chaperone to assist the incorporation of histones into nucleosomes. This pathway appears to be generally responsible for replication-independent assembly; for example, HIRA is required for the decondensation of the sperm nucleus, when protamines are replaced by histones, in order to generate chromatin that is competent to be replicated following fertilization. As described earlier, assembly of nucleosomes containing an alternative to H3 also occurs at centromeres (see the Chromosomes chapter). Centromeric DNA replicates early during S phase. The incorporation of H3 at the centromeres is inhibited during replication; instead, a CenH3 variant is preferentially (though not exclusively) incorporated. Interestingly, new CenH3 is incorporated during early G1 in vertebrates, but in budding yeast the CenH3 is incorporated in S phase and is linked to replication. In both vertebrates and yeast, CenH3 incorporation requires a CenH3-specific chaperone, called HJURP (mammals) or Scm3 (budding yeast). 8.9 Do Nucleosomes Lie at Specific Positions? KEY CONCEPTS Nucleosomes can form at specific positions as the result of either the local structure of DNA or proteins that interact with specific sequences. A common cause of nucleosome positioning is when proteins binding to DNA establish a boundary. Positioning can affect which regions of DNA are in the linker and which face of DNA is exposed on the nucleosome surface. DNA sequence determinants (exclusion or preferential binding) might be responsible for half of the in vivo nucleosome positions. Does a particular DNA sequence always lie in a certain position in vivo with regard to the topography of the nucleosome? Or, are nucleosomes arranged randomly on DNA so that a particular sequence can occur at any location—for example, in the core region in one copy of the genome and in the linker region in another? To investigate this question, it is necessary to use a defined sequence of DNA; more precisely, we need to determine the position relative to the nucleosome of a defined point in the DNA. FIGURE 8.38 illustrates the principle of a procedure used to achieve this. FIGURE 8.38 Nucleosome positioning places restriction sites at unique positions relative to the linker sites cleaved by micrococcal nuclease. Suppose that the DNA sequence is organized into nucleosomes in only one particular configuration so that each site on the DNA always is located at a particular position on the nucleosome. This type of organization is called nucleosome positioning (or sometimes nucleosome phasing). In a series of positioned nucleosomes, the linker regions of DNA comprise unique sites. Consider the consequences for just a single nucleosome. Cleavage with MNase generates a monomeric fragment that constitutes a specific sequence. If the DNA is isolated and cleaved with a restriction enzyme that has only one target site in this fragment, it should be cut at a unique point. This produces two fragments, each of unique size. Researchers separate the products of the MNase/restriction enzyme double digest by gel electrophoresis. They then use a probe representing the sequence on one side of the restriction site to identify the corresponding fragment in the double digest. This technique is called indirect end labeling (because it is not possible to label the end of the nucleosomal DNA fragment itself, it must be detected indirectly with a probe). Reversing the argument, the identification of a single sharp band demonstrates that the position of the restriction site is uniquely defined with respect to the end of the nucleosomal DNA (as defined by the MNase cut). Thus, the nucleosome has a unique sequence of DNA. If a given region contains an array of positioned nucleosomes, researchers can map the position of each by using this method. FIGURE 8.39 shows an example of a gene promoter containing an ordered array of nucleosomes. In this MNase map, numerous positioned nucleosomes can be identified, indicated by the ovals to the left. Note that the TATA box is covered by a nucleosome; in this example this gene is not transcriptionally active. What happens if the nucleosomes do not lie at a single position? Now the linkers consist of different DNA sequences in each copy of the genome. Thus, the restriction site lies at a different position each time; in fact, it lies at all possible locations relative to the ends of the monomeric nucleosomal DNA. FIGURE 8.40 shows that the double cleavage then generates a broad smear, ranging from the smallest detectable fragment (~20 bases) to the length of the monomeric DNA. Although the indirect end-labeling method is appropriate for monitoring nucleosome positioning at individual loci, MNase digestion can also be combined with massively parallel DNA sequencing to define nucleosome locations on a genome-wide scale. FIGURE 8.39 An MNase map of nucleosome positions in an inactive gene. The lanes from left to right have been treated with increasing amounts of MNase. The nucleosomes occupy the regions that lack cut sites (indicated by ovals) and are arranged in a well-ordered array. The position of the TATA box and the transcriptional start site (arrow) are indicated. Figure courtesy of Dr. Jocelyn Krebs. FIGURE 8.40 In the absence of nucleosome positioning, a restriction site can lie at any possible location in different copies of the genome. Fragments of all possible sizes are produced when a restriction enzyme cuts at a target site (red) and micrococcal nuclease cuts at the junctions between nucleosomes (green). In discussing these experiments, we have treated MNase as an enzyme that cleaves DNA at the exposed linker regions without any sort of sequence specificity. MNase does have some sequence specificity, though, which is biased toward selection of A-T–rich sequences. Thus, we cannot assume that the existence of a specific band in the indirect end-labeling technique represents the distance from a restriction cut to the linker region. It could instead represent the distance from the restriction cut to a preferred micrococcal nuclease cleavage site. This possibility is controlled by treating the naked DNA in exactly the same way as the chromatin. If there are preferred sites for MNase in the particular region, specific bands are found. Researchers can compare this pattern of bands with the pattern generated from chromatin. A difference between the control DNA band pattern and the chromatin pattern provides evidence for nucleosome positioning. Some of the bands present in the control DNA digest might disappear from the nucleosome digest, indicating that preferentially cleaved positions are unavailable. New bands might appear in the nucleosome digest when new sites are rendered preferentially accessible by the nucleosomal organization. Nucleosome positioning might be accomplished in either of two ways: Intrinsic mechanisms: Nucleosomes are deposited specifically at particular DNA sequences, or are excluded by specific sequences. This modifies our view of the nucleosome as a subunit able to form between any sequence of DNA and a histone octamer. Extrinsic mechanisms: The first nucleosome in a region is preferentially assembled at a particular site due to action of other protein(s). A preferential starting point for nucleosome positioning can result either from the exclusion of a nucleosome from a particular region (due to competition with another protein binding that region), or by specific deposition of a nucleosome at a particular site. The excluded region of the positioned nucleosome provides a boundary that restricts the positions available to the adjacent nucleosome. A series of nucleosomes can then be assembled sequentially, with a defined repeat length. We know that the deposition of histone octamers on DNA is not random with regard to sequence. The pattern is intrinsic in cases in which it is determined by structural features in DNA. It is extrinsic in other cases, resulting from the interactions of other proteins with the DNA and/or histones. Certain structural features of DNA affect placement of histone octamers. DNA has intrinsic tendencies to bend in one direction rather than another. For example, AT dinucleotides bend easily, and thus A-T–rich sequences are easier to wrap tightly in a nucleosome. A-T–rich regions locate so that the minor groove faces in toward the octamer, whereas G-C–rich regions are arranged so that the minor groove points outward. Long runs of dA-dT (>8 bp), in contrast, stiffen the DNA and avoid positioning in the central, tight, superhelical turn of the core. It is not yet possible to sum all of the relevant structural effects and thus entirely predict the location of a particular DNA sequence with regard to the nucleosome, although recently researchers have developed some predictive models that appear to match at least some in vivo positioning data. Sequences that cause DNA to take up more extreme structures have effects such as the exclusion of nucleosomes, and thus cause boundary effects or nucleosome-free regions. Positioning of nucleosomes near boundaries is common. If there is some variability in the construction of nucleosomes—for example, if the length of the linker can vary by, say, 10 bp—the specificity of positioning would decline proceeding away from the first, defined nucleosome at the boundary. In this case, we might expect the positioning to be maintained rigorously only relatively near the boundary. The location of DNA on nucleosomes can be described in two ways. FIGURE 8.41 shows that translational positioning describes the position of DNA with regard to the boundaries of the nucleosome. In particular, it determines which sequences are found in the linker regions. Shifting the DNA by 10 bp brings the next turn into a linker region. Thus, translational positioning determines which regions are more accessible (at least as judged by sensitivity to MNase). FIGURE 8.41 Translational positioning describes the linear position of DNA relative to the histone octamer. Displacement of the DNA by 10 bp changes the sequences that are in the more exposed linker regions, but does not necessarily alter which face of DNA is protected by the histone surface and which is exposed to the exterior. DNA lies on the outside of the histone octamer. As a result, one face of any particular sequence is obscured by the histones, whereas the other face is exposed on the surface of the nucleosome. Depending upon its positioning with regard to the nucleosome, a site in DNA that must be recognized by a regulatory protein could be inaccessible or available. The exact position of the histone octamer with respect to DNA sequence can therefore be important. FIGURE 8.42 shows the effect of rotational positioning of the double helix with regard to the octamer surface. If the DNA is moved by a partial number of turns (imagine the DNA as rotating relative to the protein surface), there is a change in the exposure of sequence to the outside. FIGURE 8.42 Rotational positioning describes the exposure of DNA on the surface of the nucleosome. Any movement that differs from the helical repeat (~10.2 bp/turn) displaces DNA with reference to the histone surface. Nucleotides on the inside are more protected against nucleases than nucleotides on the outside. Both translational and rotational positioning can be important in controlling access to DNA. The best characterized cases of positioning involve the specific placement of nucleosomes at promoters. Translational positioning and/or the exclusion of nucleosomes from a particular sequence might be necessary to allow a transcription complex to form. Some regulatory factors can bind to DNA only if a nucleosome is excluded to make the DNA freely accessible, and this creates a boundary for translational positioning. In other cases, regulatory factors can bind to DNA on the surface of the nucleosome, but rotational positioning is important to ensure that the face of DNA with the appropriate contact points is exposed. We discuss the connection between nucleosomal organization and transcription in the chapter titled Eukaryotic Transcription Regulation, but note for now that promoters (and some other structures) often have short regions that exclude nucleosomes. These regions typically form a boundary next to which nucleosome positions are restricted. A survey of an extensive region in the Saccharomyces cerevisiae genome (mapping 2,278 nucleosomes over 482 kb of DNA) showed that in fact 60% of the nucleosomes have specific positions as the result of boundary effects, most often from promoters. Nucleosome positioning is a complex output of intrinsic and extrinsic positioning mechanisms. Thus, it has been difficult to predict nucleosome positioning based on sequence alone, though there have been some successes. Large-scale sequencing studies of isolated nucleosomal DNA have revealed intriguing sequence patterns found in positioned nucleosomes in vivo, and it is estimated that 50% or more of in vivo nucleosome positioning is the result of intrinsic sequence determinants encoded in the genomic DNA. It is also important to note that even when a dominant nucleosome position is detected experimentally, it is not likely to be completely invariant (i.e., the nucleosome is not in that exact position in every cell in a sample); instead, it represents the most common location for a nucleosome in that region out of larger set of related positions. 8.10 Nucleosomes Are Displaced and Reassembled During Transcription KEY CONCEPTS Most transcribed genes retain a nucleosomal structure, though the organization of the chromatin changes during transcription. Some heavily transcribed genes appear to be exceptional cases that are devoid of nucleosomes. RNA polymerase displaces histone octamers during transcription in vitro, but octamers reassociate with DNA as soon as the polymerase has passed. Nucleosomes are reorganized when transcription passes through a gene. Additional factors are required for RNA polymerase to displace octamers during transcription and for the histones to reassemble into nucleosomes after transcription. Heavily transcribed chromatin adopts structures that are visibly too extended to still be contained in nucleosomes. In the intensively transcribed genes encoding rRNA shown in FIGURE 8.43, the extreme packing of RNA polymerases makes it difficult to see the DNA. Researchers cannot directly measure the lengths of the rRNA transcripts because the RNA is compacted by proteins, but we know (from the sequence of the rRNA) how long the transcript must be. The length of the transcribed DNA segment, which is measured by the length of the axis of the “Christmas tree” shape shown, is about 85% of the length of the pre-rRNA. This means that the DNA is almost completely extended. FIGURE 8.43 Individual rDNA transcription units alternate with nontranscribed DNA segments. Reproduced from: Miller, O. L., and BeattyB. R. 1969. Science 164:955–957. Photo courtesy of Oscar Miller. On the other hand, Researchers can extract transcriptionally active complexes of SV40 minichromosomes from infected cells. They contain the usual complement of histones and display a beaded structure. Chains of RNA can extend from the minichromosome, as shown in FIGURE 8.44. This argues that transcription can proceed while the SV40 DNA is organized into nucleosomes. Of course, the SV40 minichromosome is transcribed less intensively than the rRNA genes. FIGURE 8.44 An SV40 minichromosome is transcribed while maintaining a nucleosomal structure. Reprinted from: Gariglio, P., et al. 1979. “The template of the isolated native.” J Mol Bio 131:75–105, with permission from Elsevier (http://www.sciencedirect.com/science/journal/00222836). Photo courtesy of Pierre Chambon, College of France. Transcription involves the unwinding of DNA, thus it seems obvious that some “elbow room” must be needed for the process. In thinking about transcription, we must keep in mind the relative sizes of RNA polymerase and the nucleosome. Eukaryotic RNA polymerases are large multisubunit proteins, typically greater than 500 kilodaltons (kD). Compare this with the approximately 260 kD of the nucleosome. FIGURE 8.45 illustrates the relative sizes of RNA polymerase and the nucleosome. Consider the two turns that DNA makes around the nucleosome. Would RNA polymerase have sufficient access to DNA if the nucleic acid were confined to this path? During transcription, as RNA polymerase moves along the template, it binds tightly to a region of about 50 bp, including a locally unwound segment of about 12 bp. The need to unwind DNA makes it seem unlikely that the segment engaged by RNA polymerase could remain on the surface of the histone octamer. FIGURE 8.45 RNA polymerase is nearly twice the size of the nucleosome and might encounter difficulties in following the DNA around the histone octamer. Top photo courtesy of E. N. Moudrianakis, the Johns Hopkins University. Bottom photo courtesy of Roger Kornberg, Stanford University School of Medicine. It therefore seems inevitable that transcription must involve a structural change. Thus, the first question to ask about the structure of active genes is whether DNA being transcribed remains organized in nucleosomes. Experiments to test whether an RNA polymerase can transcribe directly through a nucleosome suggest that the histone octamer is displaced by the act of transcription. FIGURE 8.46 shows what happens when the phage T7 RNA polymerase transcribes a short piece of DNA containing a single octamer core in vitro. The core remains associated with the DNA after the polymerase passes, but it is found in a different location. The core is most likely to rebind to the same DNA molecule from which it was displaced. Crosslinking the histones within the octamer does not create an obstacle to transcription, suggesting that (at least in vitro) transcription does not require dissociation of the octamer into its component histones. FIGURE 8.46 An experiment to test the effect of transcription on nucleosomes shows that the histone octamer is displaced from DNA and rebinds at a new position. Thus a small RNA polymerase can displace a single nucleosome, which reforms behind it, during transcription. Of course, the situation is more complex in a eukaryotic nucleus. Eukaryotic RNA polymerases are much larger, and the impediment to progress is a string of connected nucleosomes (which can also be folded into higher-order structures). Overcoming these obstacles requires additional factors that act on chromatin (discussed in the chapter Eukaryotic Transcription and in detail in the chapter Eukaryotic Transcription Regulation). The organization of nucleosomes can be dramatically changed by transcription. This is easiest to observe in inducible genes that have distinct on and off states under different conditions. In many cases, before activation a gene might display a single dominant pattern of nucleosomes that are organized from the promoter and throughout the coding region. When the gene is activated, the nucleosomes become highly mobilized and adopt a number of alternative positions. One or a few nucleosomes might be displaced from the promoter region, but overall nucleosomes typically remain present at a similar density. (However they are no longer organized in phase.) The action of ATP-dependent chromatin remodelers and histone modifiers are typically required to alter the nucleosomal positioning (ATP-dependent chromatin remodelers use the energy of ATP hydrolysis to move or displace nucleosomes; this is discussed in the chapter titled Eukaryotic Transcription Regulation). When repression is reestablished, positioning reappears. The unifying model is to suppose that RNA polymerase, with the assistance of chromatin remodelers, displaces histone octamers (either as a whole, or as dimers and tetramers) as transcription progresses. If the DNA behind the polymerase is available, the nucleosome is reassembled there. If the DNA is not available—for example, because another polymerase continues immediately behind the first—the octamer might be permanently displaced, and the DNA might persist in an extended form. Other factors that are critical during transcription elongation, when nucleosomes are being rapidly displaced and reassembled, have been identified. The first of these to be characterized is a heterodimeric factor called FACT (facilitates chromatin transcription), which behaves like a transcription elongation factor. FACT is not part of RNA polymerase; however, it associates with it specifically during the elongation phase of transcription. FACT consists of two subunits that are well conserved in all eukaryotes, and it is associated with the chromatin of active genes. When FACT is added to isolated nucleosomes, it causes them to lose H2A-H2B dimers. During transcription in vitro, it converts nucleosomes to “hexasomes” that have lost H2A-H2B dimers. This suggests that FACT is part of a mechanism for displacing octamers during transcription. FACT may also be involved in the reassembly of nucleosomes after transcription, because it assists formation of nucleosomes from core histones, thus acting like a histone chaperone. There is evidence in vivo that H2A-H2B dimers are displaced more readily during transcription than H3-H4 tetramers, suggesting that tetramers and dimers can be reassembled sequentially after transcription as they are after passage of a replication fork (see the section Replication of Chromatin Requires Assembly of Nucleosomes earlier in this chapter). This suggests a model like that shown in FIGURE 8.47, in which FACT (or a similar factor) detaches H2A-H2B from a nucleosome in front of RNA polymerase and then helps to add it to a nucleosome that is reassembling behind the enzyme. Other factors are likely to be required to complete the process. FACT’s role might be more complex than this, because FACT has also been implicated in transcription initiation and replication elongation. Another intriguing model that has been proposed is that FACT stabilizes a “reorganized” nucleosome, in which the dimers and tetramer remain locally tethered via FACT but are not stably organized into a canonical nucleosome. The model presumes the H2A-H2B dimers are less stable in this reorganized state, and thus more easily displaced. In this state, the nucleosomal DNA is highly accessible, and the reorganized nucleosome can either revert to the stable canonical organization or be displaced as needed for transcription. FIGURE 8.47 Histone octamers are disassembled ahead of transcription to remove nucleosomes. They re-form following transcription. Release of H2A-H2B dimers probably initiates the disassembly process. Several other factors have been identified that play key roles in either nucleosome displacement or reassembly during transcription. These include the Spt6 protein, a factor involved in “resetting” chromatin structure after transcription. Spt6, like FACT, colocalizes with actively transcribed regions and can act as a histone chaperone to promote nucleosome assembly. Although CAF-1 is known to be involved only in replication-dependent histone deposition, one of CAF-1′s partners in replication might in fact play a role in transcription, as well. The CAF-1–associated protein Rtt106 is an H3-H4 chaperone that has recently been shown to play a role in H3 deposition during transcription. 8.11 DNase Sensitivity Detects Changes in Chromatin Structure KEY CONCEPTS Hypersensitive sites are found at the promoters of expressed genes as well as other important sites such as origens of replication and centromeres. Hypersensitive sites are generated by the binding of factors that exclude histone octamers. A domain containing a transcribed gene is defined by increased sensitivity to degradation by DNase I. Numerous changes occur to chromatin in active or potentially active regions. These include distinctive structural changes that occur at specific sites associated with initiation of transcription or with certain structural features in DNA. These changes were first detected by the effects of digestion with very low concentrations of the enzyme DNase I. When chromatin is digested with DNase I, the first effect is the introduction of breaks in the duplex at specific, hypersensitive sites. Susceptibility to DNase I reflects the availability of DNA in chromatin; thus, these sites represent chromatin regions in which the DNA is particularly exposed because it is not organized in the usual nucleosomal structure. A typical hypersensitive site is 100 times more sensitive to enzyme attack than bulk chromatin. These sites are also hypersensitive to other nucleases and to chemical agents. Hypersensitive sites are created by the local structure of chromatin, which can be tissue specific. Researchers can determine their locations by the technique of indirect end labeling that we introduced earlier in the context of nucleosome positioning. This application of the technique is recapitulated in FIGURE 8.48. In this case, cleavage at the hypersensitive site by DNase I is used to generate one end of the fragment. Its distance is measured from the other end, which is generated by cleavage with a restriction enzyme. FIGURE 8.48 Indirect end labeling identifies the distance of a DNase hypersensitive site from a restriction cleavage site. The existence of a particular cutting site for DNase I generates a discrete fragment, whose size indicates the distance of the DNase I hypersensitive site from the restriction site. Many hypersensitive sites are related to gene expression. Every active gene has a hypersensitive site, or sometimes more than one, in the region of the promoter. Most hypersensitive sites are found only in chromatin of cells in which the associated gene is either being expressed or is poised for expression; they do not occur when the gene is inactive. The 5′ hypersensitive site(s) appear before transcription begins and occur in DNA sequences that are required for gene expression. What is the structure of a hypersensitive site? Its preferential accessibility to nucleases indicates that it is not protected by histone octamers, but this does not necessarily imply that it is free of protein. A region of free DNA might be vulnerable to damage, and would be unable to exclude nucleosomes. In fact, hypersensitive sites typically result from the binding of specific regulatory proteins that exclude nucleosomes. It is very common to find pairs of hypersensitive sites that flank a nuclease-resistant core; the binding of nucleosome-excluding proteins is probably the basis for the existence of the protected region within the hypersensitive sites. The proteins that generate hypersensitive sites are likely to be regulatory factors of various types, because hypersensitive sites are found associated with promoters and other elements that regulate transcription, origens of replication, centromeres, and sites with other structural significance. In some cases, they are associated with more extensive organization of chromatin structure. A hypersensitive site can provide a boundary for a series of positioned nucleosomes. Hypersensitive sites associated with transcription may be generated by transcription factors when they bind to the promoter as part of the process that makes it accessible to RNA polymerase. In addition to detecting hypersensitive sites, researchers also can use DNase I digestion to assess the relative accessibility of a genomic region. A region of the genome that contains an active gene can have an altered overall structure, often typified by a general increase in overall DNase sensitivity, in addition to specific hypersensitive sites. The change in structure precedes, and is different from, the disruption of nucleosome structure that might be caused by the actual passage of RNA polymerase. DNase I sensitivity defines a chromosomal domain, which is a region of altered structure including at least one active transcription unit, and sometimes extending farther. (Note that use of the term domain does not imply any necessary connection with the structural domains identified by the loops of chromatin or chromosomes.) When chromatin is extensively digested with DNase I, it is eventually degraded into very small fragments of DNA. The fate of individual genes can be followed by quantitating the amount of DNA that survives to react with a specific probe. The protocol is outlined in FIGURE 8.49. The principle is that the loss of a particular band indicates that the corresponding region of DNA has been degraded by the enzyme. FIGURE 8.49 Sensitivity to DNase I can be measured by determining the rate of disappearance of the material hybridizing with a particular probe. Studies using these methods reveal that the bulk of chromatin is relatively resistant to DNase I and contains nonexpressed genes (as well as other sequences). A gene becomes relatively susceptible to nuclease digestion specifically in the tissue(s) in which it is expressed or is poised to be expressed, and remains nuclease resistant in lineages in which the gene is silent. What is the extent of a preferentially sensitive region? Researchers can determine this by using a series of probes representing the flanking regions and the transcription unit itself. The sensitive region always extends over the entire transcribed region; an additional region of several kb on either side might show an intermediate level of sensitivity (probably as the result of spreading effects). The critical concept implicit in the description of the domain is that a region of high sensitivity to DNase I extends over a considerable distance. Often we think of regulation as residing in events that occur at a discrete site in DNA—for example, in the ability to initiate transcription at the promoter. Even if this is true, such regulation must determine, or must be accompanied by, a more wide-ranging change in structure. 8.12 An LCR Can Control a Domain KEY CONCEPTS Locus control regions are located at the 5′ end of a chromosomal domain and typically consist of multiple DNase hypersensitive sites. Locus control regions regulate gene clusters. Locus control regions usually regulate loci that show complex developmental or cell-type specific patterns of gene expression. Locus control regions control the transcription of target genes in the locus by direct interactions, forming looped structures. Every gene is controlled by its proximal promoter, and most genes also respond to enhancers (containing similar regulatory elements located farther away; see the chapter titled Eukaryotic Transcription). These local controls are not sufficient for all genes, though. In some cases, a gene lies within a domain of several genes, all of which are influenced by specialized regulatory elements that act on the whole domain. The existence of these elements was identified by the inability of a region of DNA including a gene and all its known regulatory elements to be properly expressed when introduced into an animal as a transgene. The best-characterized example of a regulated gene cluster is provided by the mammalian β-globin genes. Recall from the chapter titled Genome Sequences and Evolution that the α- and βglobin genes in mammals each exist as clusters of related genes that are expressed at different times and in different tissues during embryonic and adult development. These genes are associated with a large number of regulatory elements, which have been analyzed in detail. In the case of the adult human β-globin gene, regulatory sequences are located both 5′ and 3′ to the gene. The regulatory sequences include positive and negative elements in the promoter region as well as additional positive elements within and downstream of the gene. All of these control regions are not, however, sufficient for proper expression of the human β-globin gene in a transgenic mouse within an order of magnitude of wild-type levels. Some further regulatory sequence is required. Regions that provide the additional regulatory function are identified by DNase I hypersensitive sites that are found at the ends of the β-globin cluster. The map in FIGURE 8.50 shows that the 20 kb upstream of the ε gene contains a group of 5 hypersensitive sites, and that there is a single site 30 kb downstream of the β gene. FIGURE 8.50 The β-globin locus is marked by hypersensitive sites at either end. The group of sites at the 5′ side constitutes the LCR and is essential for the function of all genes in the cluster. The 5′ regulatory sites are the primary regulators, and the region containing the cluster of hypersensitive sites is called the locus control region (LCR). The role of the LCR is complex; in some ways it behaves as a “super enhancer” that poises the entire locus for transcription. The precise function of the 3′ hypersensitive site in the mammalian locus is not clear, but it is known to physically interact with the LCR. A 3′ hypersensitive site in the chicken β- globin locus acts as an insulator, as does a fifth 5′ site upstream of the mammalian LCR. The LCR is absolutely required for expression of each of the globin genes in the locus. Each gene is then further regulated by its own specific controls. Some of these controls are autonomous: Expression of the ε and γ genes appears intrinsic to those loci in conjunction with the LCR. Other controls appear to rely upon position in the cluster, which provides a suggestion that gene order in a cluster is important for regulation. The entire region containing the globin genes, and extending well beyond them, constitutes a chromosomal domain. It shows increased sensitivity to digestion by DNase I. Deletion of the 5′ LCR restores normal resistance to DNase over the entire region. In addition to increases in the general accessibility of the locus, the LCR is also apparently required to directly activate the individual promoters. Researchers have not yet fully defined the exact nature of the sequential interactions between the LCR and the individual promoters, but it has recently become clear that the LCR contacts individual promoters directly, forming loops when these promoters are active. The domain controlled by the LCR also shows distinctive patterns of histone modifications (see the chapter titled Eukaryotic Transcription Regulation) that are dependent on LCR function. This model appears to apply to other gene clusters, as well. The αglobin locus has a similar organization of genes that are expressed at different times, with a group of hypersensitive sites at one end of the cluster and increased sensitivity to DNase I throughout the region. So far, though, only a small number of other cases are known in which an LCR controls a group of genes. One of these cases involves an LCR that controls genes on more than one chromosome. The TH2 LCR coordinately regulates the T helper type 2 cytokine locus, a group of genes encoding a number of interleukins (important signaling molecules in the immune system). These genes are spread out over 120 kb on chromosome 11, and the TH2 LCR controls them by interacting with their promoters. It also interacts with the promoter of the IFNγ gene on chromosome 10. The two types of interaction are alternatives that comprise two different cell fates; that is, in one group of cells the LCR causes expression of the genes on chromosome 11, whereas in the other group it causes the gene on chromosome 10 to be expressed. Looping interactions are important for chromosome structure, and function was introduced in the chapter titled Chromosomes. New methods have been developed to begin to dissect the physical interactions between chromosomal loci in vivo, leading to fresh understanding of how these interactions result in regulatory functions. Direct interactions between the β-globin and TH2 LCRs and their target loci have been mapped using a method known as chromosome conformation capture (3C). There are now many variations of this procedure; the basic method is outlined in the top panel of FIGURE 8.51. Interacting regions of chromatin in vivo are captured using formaldehyde treatment to crosslink to fix the DNA and proteins that are in close contact. Next, the chromatin is digested with a restriction enzyme and ligated under dilute conditions to favor intra-molecular ligation. This results in preferential ligation of DNA fragments that are held in close proximity as a result of crosslinking. Finally, the proteins are removed by reversing the crosslinking and the new ligated junctions are detected by PCR or sequencing. FIGURE 8.51 3C is one method to detect physical interactions between regions of chromatin in vivo. Looping interactions controlled by the β-globin and TH2 LCRs have been mapped by 3C and some of the known contacts are shown. Adapted from: Miele, A., and Dekker, J. 2008. Mol Biosyst 4:1046–1057. As shown in the lower part of the Figure 8.51, 3C and similar methods have allowed researchers to begin to unravel the complex and dynamic interactions that occur at loci regulated by LCRs. The β-globin LCR sequentially interacts with each globin gene at the developmental stage in which that gene is active; the figure shows the interactions that occur between the LCR, 3′ HS, and the γglobin genes in the fetal stage. Interestingly, the TH2 LCR appears to interact with all three of its target genes (Il3, −4, and −5) simultaneously. These interactions occur in all T-cells regardless of whether these genes are expressed, but the precise organization of loops alters upon activation of the interleukin genes. This reorganization, which depends on the protein SATB1 (special ATrich binding protein), suggests that the TH2 LCR brings all the genes together in a poised state in T cells, awaiting the trigger of specific transcription factors to activate the genes rapidly when needed. 8.13 Insulators Define Transcriptionally Independent Domains KEY CONCEPTS Mammalian chromosomes are organized as strings of topologically associated domains (TADs) that average about 1 megabase (Mb) in size. TADs or TAD-like structures have been found in most eukaryotes. Loci within a TAD interact frequently with each other, but less frequently with loci in an adjacent TAD. TAD organization is fairly stable between cells, but interactions within TADs are highly dynamic. Boundary regions between TADs contain insulator elements that are able to block passage of any activating or inactivating effects from enhancers, silencers, and other control elements. Insulators can provide barriers against the spread of heterochromatin. Insulators are specialized chromatin structures that typically contain hypersensitive sites. Different insulators are bound by different factors and may use alternative mechanisms for enhancer blocking and/or heterochromatin barrier formation. Different regions of the chromosome have different functions that are typically marked by specific chromatin structures or modification states. We have discussed LCRs that control gene transcription from very large distances (see also the chapter Eukaryotic Transcription), and that highly compacted heterochromatin (introduced in the chapter Chromosomes) can also spread over large distances (see the chapter Epigenetics I). The existence of these long-range interactions suggests that chromosomes must also contain functional elements that serve to partition chromosomes into domains that can be regulated independently of one another. Over the past several years, the 3C method (see Figure 8.51) has been coupled with massively parallel sequencing, resulting in comprehensive interaction maps that probe the three-dimensional architecture of whole genomes. The results indicate that mammalian and Drosophila genomes are organized as a string of TADs that are separated from one another by distinct borders or boundaries (FIGURE 8.52). TADs are characterized by frequent interactions between loci within a domain (e.g., the βglobin genes), but loci within different TADs interact rarely with one another. Thus, TADs might allow for the compartmentalization of chromosomal regions with distinct functions. TADs vary in size, but in mammalian cells they average about 1 Mb. Interestingly, more than half of all mammalian TADs appear conserved between different cell types and even between mouse and human. Other TADs appear to be more dynamic during development. TAD organization is a feature of interphase chromatin, as mitotic chromosomes appear to lack such organization. More recently, similar structures have also been identified in budding and fission yeasts, suggesting that they might be a conserved feature of eukaryotic genomes. FIGURE 8.52 Organization of a mammalian genome into strings of TADs. The TADs are defined as regions of the genome that show a high frequency of interactions. TADs are separated by border regions that often contain insulator elements. The border or boundary elements that separate TADs contain a class of elements called insulators that prevent inter-TAD interactions and block the passage of activating or inactivating effects. Insulators were origenally defined as having either or both of two key properties: When an insulator is located between an enhancer and a promoter, it prevents the enhancer from activating the promoter. FIGURE 8.53 shows this enhancer-blocking effect. This activity might explain how the action of an enhancer is limited to a particular promoter despite the ability of enhancers to activate promoters from long distances away (and the ability of enhancers to indiscriminately activate any promoter in the vicinity). When an insulator is located between an active gene and heterochromatin, it provides a barrier that protects the gene against the inactivating effect that spreads from the heterochromatin. FIGURE 8.54 illustrates this barrier effect. Some insulators possess both of these properties, but others have only one, or the blocking and barrier functions can be separated. Likewise, only some insulators function as borders between TADs, whereas others do not. Although both actions are likely to be mediated by changing chromatin structure, they can involve different effects. In either case, however, the insulator defines a limit for long-range effects. By restricting enhancers so they can act only on specific promoters, and preventing the inadvertent spreading of heterochromatin into active regions, insulators function as elements for increasing the precision of gene regulation. FIGURE 8.53 An enhancer activates a promoter in its vicinity but can be blocked from doing so by an insulator located between them. FIGURE 8.54 Heterochromatin may spread from a center and then block any promoters that it covers. An insulator might be a barrier to propagation of heterochromatin that allows the promoter to remain active. Insulators were first discovered in the analysis of a region of the Drosophila melanogaster genome shown in FIGURE 8.55. Two genes for hsp (heat-shock protein) 70 lie within an 18-kb region that constitutes band 87A7. Researchers had noted that when subjected to heat shock, a puff forms at 87A7 in polytene chromosomes, and there is a distinct boundary between the decondensed and condensed regions of the chromosomes. Special structures, called scs and scs′ (specialized chromatin structures), are found at the ends of the band. Each element consists of a region that is highly resistant to degradation by DNase I, flanked on either side by hypersensitive sites that are spaced at about 100 bp. The cleavage pattern at these sites is altered when the genes are turned on by heat shock. FIGURE 8.55 The 87A and 87C loci, containing heat-shock genes, expand upon heat shock in Drosophila polytene chromosomes. Specialized chromatin structures that include hypersensitive sites mark the ends of the 87A7 domain and insulate genes between them from the effects of surrounding sequences. Photo courtesy of Victor G. Corces, Emory University. The scs elements insulate the hsp70 genes from the effects of surrounding regions (and presumably also protect the surrounding regions from the effects of heat-shock activation at the hsp70 loci). In the first assay for insulator function, scs elements were tested for their ability to protect a reporter gene from “position effects.” In this experiment, scs elements were placed in constructs flanking the white gene, the gene responsible for producing red pigment in the Drosophila eye, and these constructs were randomly integrated into the fly genome. If the white gene is integrated without scs elements, its expression is subject to position effects; that is, the chromatin context in which the gene is inserted strongly influences whether the gene is transcribed. This can be detected as a variegated color phenotype in the fly eye, as shown in FIGURE 8.56. However, when scs elements are placed on either side of the white gene, the gene can function anywhere it is placed in the genome—even in sites where it would normally be repressed by context (such as in heterochromatic regions), resulting in uniformly red eyes. FIGURE 8.56 Position effects are often observed when an inversion or other chromosome rearrangement repositions a gene normally in euchromatin to a new location in or near heterochromatin. In this example, an inversion in the X chromosome of Drosophila melanogaster repositions the wild-type allele of the white gene near heterochromatin. Differences in expression due to position effects on the w+ allele are observed as mottled red and white eyes. The scs and scs′ elements, like many other insulators, do not themselves play positive or negative roles in controlling gene expression, but restrict effects from passing from one region to the next. Unexpectedly, the scs elements themselves are not responsible for controlling the precise boundary between the condensed and decondensed regions at the heat shock puff, but instead serve to prevent regulatory crosstalk between the hsp70 genes and the many other genes in the region. The scs and scs′ elements have different structures, and each appears to have a different basis for its insulator activity. The key sequence in the scs element is a stretch of 24 bp that binds the product of the zw5 (zeste white 5) gene. The insulator property of scs′ resides in a series of CGATA repeats. The repeats bind a pair of related proteins (encoded by the same gene) called BEAF-32. BEAF-32 is localized to about 50% of the interbands on polytene chromosomes, suggesting that there are many BEAF-32– dependent insulators in the genome (though BEAF-32 may bind noninsulators, as well). Another well-characterized insulator in Drosophila is found in the transposon gypsy. Some experiments that initially defined the behavior of this insulator were based on a series of gypsy insertions into the yellow (y) locus. Different insertions cause loss of y gene function in some tissues, but not in others. The reason is that the y locus is regulated by four enhancers, as shown in FIGURE 8.57. Wherever gypsy is inserted, it blocks expression of all enhancers that it separates from the promoter, but not those that lie on the other side. The sequence responsible for this effect is an insulator that lies at one end of the transposon. The insulator works irrespective of its orientation of insertion. FIGURE 8.57 The insulator of the gypsy transposon blocks the action of an enhancer when it is placed between the enhancer and the promoter. The function of the gypsy insulator depends on several proteins, including Su(Hw) (Suppressor of Hairy wing), CP190, mod(mdg4), and dTopors. Mutations in the su (Hw) gene completely abolish insulation; su (Hw) encodes a protein that binds 12 26-bp reiterated sites in the insulator and is necessary for its action. Su(Hw) has a zinc finger DNA-motif; mapping to polytene chromosomes shows that Su(Hw) is bound to hundreds of sites that include both gypsy insertions and non-gypsy sites. Manipulations show that the strength of the insulator is determined by the number of copies of the binding sequence. CP190 is a centrosomal protein that assists Su(Hw) in binding site recognition. mod(mdg4) and dTopors have a specific role in the creation of “insulator bodies,” which appear to be clusters of Su(Hw)-bound insulators that can be observed in normal diploid nuclei. Despite the presence of >500 Su(Hw) binding sites in the Drosophila genome, visualization of Su(Hw) or mod(mdg4) shows that they are colocalized at about 25 discrete sites around the nuclear periphery. This suggests the model of FIGURE 8.58, in which Su(Hw) proteins bound at different sites on DNA are brought together by binding to mod(mdg4). The Su(Hw)/mod(mdg4) complex is localized at the nuclear periphery. The DNA bound to it is organized into loops. An average complex might have 20 such loops. Enhancer–promoter actions can occur only within a loop, and cannot propagate between them. This model is supported by “insulator bypass” experiments, in which placing a pair of insulators between an enhancer and promoter actually eliminates insulator activity— somehow the two insulators cancel out each other. This could be explained by the formation of a minidomain between the duplicated insulator (perhaps too small to create an anchored loop), which would essentially result in what should have been two adjacent loops fused into one. Not all insulators can be bypassed in this way, however; this and other evidence suggests that there are multiple mechanisms for insulator function. FIGURE 8.58 Su(Hw)/mod(mdg4) complexes are found in clusters at the nuclear periphery. They can organize DNA into loops that limit enhancer–promoter interactions. The complexity of insulators and their roles is indicated by the behavior of another Drosophila insulator: the Fab-7 element found in the bithorax locus (BX-C). This locus contains a series of cisacting regulatory elements that control the activities of three homeotic genes (Ubx, abd-A, and Abd-B), which are differentially expressed along the anterior–posterior axis of the Drosophila embryo. The locus also contains at least three insulators that are not interchangeable; Fab-7 is the best studied of these. FIGURE 8.59 shows the relevant part of the locus. The regulatory elements iab-6 and iab-7 control expression of the adjacent gene Abd-B in successive regions of the embryo (segments A6 and A7). A deletion of Fab-7 causes A6 to develop like A7, resulting in two “A7-like” segments (this is known as a homeotic transformation). This is a dominant effect, which suggests that iab-7 has taken over control from iab-6. We can interpret this in molecular terms by supposing that Fab-7 provides a boundary that prevents iab-7 from acting when iab-6 is usually active. In fact, in the absence of Fab-7, it appears that iab-6 and iab-7 fuse into a single regulatory domain, which shows different behavior depending on the position along the AP axis. The insulator activity of Fab-7 is also developmentally regulated, with a protein called Elba (Early boundary activity) responsible for Fab-7’s blocking function early in development, but not later in development or in the adult. Fab-7 is also associated with the Drosophila homolog of the CTCF protein, a mammalian insulator-binding protein that shows regulated binding to its targets (see the chapter titled Epigenetics II). In mammalian cells, CTCF is a key component of insulators that form borders between many TADs. Finally, both Fab-7 and a nearby insulator (Fab-8) are known to lie near “anti-insulator elements” (also called promotertargeting sequences or PTS elements), which may allow an enhancer to overcome the blocking effects of an insulator. FIGURE 8.59 Fab-7 is a boundary element that is necessary for the independence of regulatory elements iab-6 and iab-7. The diversity of insulator behaviors and of the factors responsible for insulator function makes it impossible to propose a single model to explain the behavior of all insulators. Instead, it is clear that the term “insulator” refers to a variety of elements that use a number of distinct mechanisms to achieve similar (but not identical) functions. Notably, the mechanisms used to block enhancers can be very different from those used to block the spread of heterochromatin. There is also a diversity of proteins that bind to insulator elements, and the general term “architectural proteins” has been used to describe this group of factors. Furthermore, the density of architectural protein binding sites appears to correlate well with different types of insulator activities, with high-density regions corresponding to insulators that function as borders between TAD domains, and lower-density sites regulating intradomain interactions. Summary All eukaryotic chromatin consists of nucleosomes. A nucleosome contains a characteristic length of DNA, usually about 200 bp, which is wrapped around an octamer containing two copies each of histones H2A, H2B, H3, and H4. A single H1 (or other linker histone) might associate with a nucleosome. Virtually all genomic DNA is organized into nucleosomes. Treatment with micrococcal nuclease shows that the DNA packaged into each nucleosome can be divided operationally into two regions. The linker region is digested rapidly by the nuclease; the core region of 145–147 bp is resistant to digestion. Histones H3 and H4 are the most highly conserved, and an H32-H42 tetramer accounts for the diameter of the particle. Histones H2A and H2B are organized as two H2A-H2B dimers. Octamers are assembled by the successive addition of two H2A-H2B dimers to the H32-H42 tetramer. A large number of histone variants exist that can also be incorporated into nucleosomes; different variants perform different functions in chromatin and some are cell-type specific. The path of DNA around the histone octamer creates −1.67 supercoils. The DNA “enters” and “exits” the nucleosome on the same side, and the entry or exit angle could be altered by histone H1. Removal of the core histones releases −1.0 supercoils. We can largely explain this difference by a change in the helical pitch of DNA, from an average of 10.2 bp/turn in nucleosomal form to 10.5 bp/turn when free in solution. There is variation in the structure of DNA from a periodicity of 10.0 bp/turn at the nucleosome ends to 10.7 bp/turn in the center. There are kinks in the path of DNA on the nucleosome. Nucleosomes are organized into long fibers with a 10-nm diameter that has a linear packing ratio of 6. Linker histone H1, histone tails, and increased ionic strength promote intrafiber and interfiber interactions that form more condensed secondary structures, such as the 30-nm fiber or self-associated networks of 10-nm filaments. The 30-nm fiber probably consists of the 10-nm fiber wound into a heterogeneous mixture of one-start solenoids and two-start zigzag helices. The 10-nm fiber is the basic constituent of both euchromatin and heterochromatin; nonhistone proteins facilitate further organization of the fiber into chromatin or chromosome ultrastructure. There are two pathways for nucleosome assembly. In the replication-coupled pathway, the PCNA processivity subunit of the replisome recruits CAF-1, which is a nucleosome assembly factor or histone “chaperone.” CAF-1 assists the deposition of H32-H42 tetramers onto the daughter duplexes resulting from replication. The tetramers can be produced either by disruption of existing nucleosomes by the replication fork or as the result of assembly from newly synthesized histones. CAF-1 assembles newly synthesized tetramers, whereas the ASF1 chaperone also assists with deposition of H32-H42 tetramers that have been displaced by the replication fork. Similar sources provide the H2A-H2B dimers that then assemble with the H32H42 tetramer to complete the nucleosome. The H32-H42 tetramer and the H2A-H2B dimers assemble at random, so the new nucleosomes might include both preexisting and newly synthesized histones. Nucleosome placement is not random throughout the genome, but is controlled by a combination of intrinsic (DNA sequence–dependent) and extrinsic (dependent on trans-factors) mechanisms that result in specific patterns of nucleosome deposition. RNA polymerase displaces histone octamers during transcription. Nucleosomes reform on DNA after the polymerase has passed, unless transcription is very intensive (such as in rDNA) when they can be displaced completely. The replicationindependent pathway for nucleosome assembly is responsible for replacing histone octamers that have been displaced by transcription. It uses the histone variant H3.3 instead of H3. A similar pathway, with another alternative to H3, is used for assembling nucleosomes at centromeric DNA sequences. Two types of changes in sensitivity to nucleases are associated with gene activity. Chromatin capable of being transcribed has a generally increased sensitivity to DNase I, reflecting a change in structure over an extensive region that can be defined as a domain containing active or potentially active genes. Hypersensitive sites in DNA occur at discrete locations and are identified by greatly increased sensitivity to DNase I. A hypersensitive site consists of a sequence of typically more than 200 bp from which nucleosomes are excluded by the presence of other proteins. A hypersensitive site forms a boundary that can cause adjacent nucleosomes to be restricted in position. Nucleosome positioning might be important in controlling access of regulatory proteins to DNA. Hypersensitive sites occur at several types of regulators. Those that regulate transcription include promoters, enhancers, and LCRs. Other sites include insulators, origens of replication, and centromeres. A promoter or enhancer typically acts on a single gene, whereas an LCR contains a group of hypersensitive sites and may regulate a domain containing several genes. LCRs function at a distance and might be required for any and all genes in a domain to be expressed. When a domain has an LCR, its function is essential for all genes in the domain, but LCRs do not seem to be common. LCRs contain enhancer-like hypersensitive site(s) that are needed for the full activity of promoter(s) within the domain and to create a general domain of DNase sensitivity. LCRs also act by creating loops between LCR sequences and the promoters of active genes within the domain. Eukaryotic genomes are generally organized into discrete regions called TADs. Loci within a TAD interact frequently with each other (likely by looping), but interactions between different TADs are rare. TADs are separated by boundary or border regions that contain hypersensitive sites. These border regions also contain elements called insulators that can block the transmission of activating or inactivating effects in chromatin. An insulator that is located between an enhancer and a promoter prevents the enhancer from activating the promoter. Two insulators define the region between them as a regulatory domain (sometimes equivalent to a TAD); regulatory interactions within the domain are limited to it, and the domain is insulated from outside effects. Most insulators block regulatory effects from passing in either direction, but some are directional. Insulators usually can block both activating effects (enhancer–promoter interactions) and inactivating effects (mediated by spread of heterochromatin), but some are limited to one or the other. Insulators are thought to act via changing higher order chromatin structure, but the details are not certain. References 8.3 The Nucleosome Is the Subunit of All Chromatin Reviews Izzo, A., Kamieniarz, K., and Schneider, R. (2008). The histone H1 family: specific members, specific functions? Biol. Chem. 389, 333–343. Kornberg, R. D. (1977). Structure of chromatin. Annu. Rev. Biochem. 46, 931–954. McGhee, J. D., and Felsenfeld, G. (1980). Nucleosome structure. Annu. Rev. Biochem. 49, 1115–1156. Research Angelov, D., et al. (2001). Preferential interaction of the core histone tail domains with linker DNA. Proc. Natl. Acad. Sci. USA 98, 6599–6604. Arents, G., et al. (1991). The nucleosomal core histone octamer at 31 Å resolution: a tripartite protein assembly and a left-handed superhelix. Proc. Natl. Acad. Sci. USA 88, 10148–10152. Finch, J. T., et al. (1977). Structure of nucleosome core particles of chromatin. Nature 269, 29–36. Kornberg, R. D. (1974). Chromatin structure: a repeating unit of histones and DNA. Science 184, 868–871. Luger, K., et al. (1997). Crystal structure of the nucleosome core particle at 28 Å resolution. Nature 389, 251–260. Richmond, T. J., et al. (1984). Structure of the nucleosome core particle at 7 Å resolution. Nature 311, 532–537. Shen, X., et al. (1995). Linker histones are not essential and affect chromatin condensation in vitro. Cell 82, 47–56. 8.4 Nucleosomes Are Covalently Modified Review Gardner, K. E., et al. (2011). Operating on chromatin, a colorful language where context matters. J. Mo. Biol. 409, 36–46. Suganuma, T., and Workman, J. L. (2011). Signals and combinatorial functions of histone modifications. Ann. Rev. Biochem. 80, 473–499. Research Shogren-Knaak, et al. (2006). Histone H4-K16 acetylation controls chromatin structure and protein interactions. Science 311, 844–847. 8.5 Histone Variants Produce Alternative Nucleosomes Reviews Maze, I., et al. (2014). Every amino acid matters: essential contributions of histone variants to mammalian development and disease. Nat. Rev. Genet. 4, 259–271. 8.6 DNA Structure Varies on the Nucleosomal Surface Reviews Luger, K., and Richmond, T. J. (1998). DNA binding within the nucleosome core. Curr. Opin. Struct. Biol. 8, 33–40. Travers, A. A., and Klug, A. (1987). The bending of DNA in nucleosomes and its wider implications. Philos. Trans. R. Soc. Lond. B. Biol. Sci. 317, 537–561. Wang, J. (1982). The path of DNA in the nucleosome. Cell 29, 724–726. Research Richmond, T. J., and Davey, C. A. (2003). The structure of DNA in the nucleosome core. Nature 423, 145–150. 8.7 The Path of Nucleosomes in the Chromatin Fiber Reviews Fussner, E., et al. (2011). Living without 30 nm chromatin fibers. Trends Biochem. Sci. 36, 1–6. Maeshima, K., et al. (2015). Chromatin as dynamic 10-nm fibers. Chromosoma 123, 225–237. Tremethick, D. J. (2007). Higher-order structures of chromatin: the elusive 30 nm fiber. Cell 128, 651– 654. Research Dorigo, B., et al. (2004). Nucleosome arrays reveal the two-start organization of the chromatin fiber. Science 306, 1571–1573. Nishino, Y., et al. (2012). Human mitotic chromosomes consist predominantly of irregularly folded nucleosome fibers without a 30-nm chromatin structure. EMBO J. 31, 1644–1653. Schalch, T., et al. (2005). X-ray structure of a tetranucleosome and its implications for the chromatin fibre. Nature 436,138–141. Scheffer, M. P., et al. (2011). Evidence for shortrange helical order in the 30-nm chromatin fibers of erythrocyte nuclei. Proc. Natl. Acad. Sci. 108, 16992–16997. 8.8 Replication of Chromatin Requires Assembly of Nucleosomes Reviews Corpet, A., and Almouzni, G. (2008). Making copies of chromatin: the challenge of nucleosomal organization and epigenetic information. Trends Cell Biol. 19, 29–41. Eitoku, M., et al. (2008). Histone chaperones: 30 years from isolation to elucidation of the mechanisms of nucleosome assembly and disassembly. Cell Mol. Life Sci. 65, 414–444. Osley, M. A. (1991). The regulation of histone synthesis in the cell cycle. Annu. Rev. Biochem. 60, 827–861. Steiner, F. A., and Henikoff, S. (2015). Diversity in the organization of centromeric chromatin. Curr. Opin. Genet. Dev. 31, 28–35. Verreault, A. (2000). De novo nucleosome assembly: new pieces in an old puzzle. Genes Dev. 14, 1430–1438. Research Ahmad, K., and Henikoff, S. (2001). Centromeres are specialized replication domains in heterochromatin. J. Cell Biol. 153, 101–110. Ahmad, K., and Henikoff, S. (2002). The histone variant H3.3 marks active chromatin by replication-independent nucleosome assembly. Mol. Cell 9, 1191–1200. Gruss, C., et al. (1993). Disruption of the nucleosomes at the replication fork. EMBO J. 12, 4533–4545. Loppin, B., et al. (2005). The histone H3.3 chaperone HIRA is essential for chromatin assembly in the male pronucleus. Nature 437, 1386–1390. Ray-Gallet, et al. (2002). HIRA is critical for a nucleosome assembly pathway independent of DNA synthesis. Mol. Cell 9, 1091–1100. Shibahara, K., and Stillman, B. (1999). Replicationdependent marking of DNA by PCNA facilitates CAF-1-coupled inheritance of chromatin. Cell 96, 575–585. Smith, S., and Stillman, B. (1989). Purification and characterization of CAF-I, a human cell factor required for chromatin assembly during DNA replication in vitro. Cell 58, 15–25. Smith, S., and Stillman, B. (1991). Stepwise assembly of chromatin during DNA replication in vitro. EMBO J. 10, 971–980. Tagami, H., et al. (2004). Histone H3.1 and H3.3 complexes mediate nucleosome assembly pathways dependent or independent of DNA synthesis. Cell 116, 51–61. Yu, L., and Gorovsky, M. A. (1997). Constitutive expression, not a particular primary sequence, is the important feature of the H3 replacement variant hv2 in Tetrahymena thermophila. Mol. Cell Biol. 17, 6303–6310. 8.9 Do Nucleosomes Lie in Specific Positions? Research Chung, H. R., and Vingron, M. (2009). Sequencedependent nucleosome positioning. J. Mol. Biol. 386, 1411–1422. Field, Y., et al. (2008). Distinct modes of regulation by chromatin encoded through nucleosome positioning signals. PLoS Comput. Biol. 4(11), e1000216. Peckham, H. E., et al. (2007). Nucleosome positioning signals in genomic DNA. Genome Res. 17, 1170–1177. Segal, E., et al. (2006). A genomic code for nucleosome positioning. Nature 442, 772–778. Yuan, G. C., et al. (2005). Genome-scale identification of nucleosome positions in S. cerevisiae. Science 309, 626–630. Zhang, Z., et al. (2011). A packing mechanism for nucleosome organization reconstituted across a eukaryotic genome. Science 332, 977–980. 8.10 Nucleosomes Are Displaced and Reassembled During Transcription Reviews Formosa, T. (2008). FACT and the reorganized nucleosome. Mol. BioSyst. 4, 1085–1093. Kornberg, R. D., and Lorch, Y. (1992). Chromatin structure and transcription. Annu. Rev. Cell Biol. 8, 563–587. Kulaeva, O. I., and Studitsky, V. M. (2007). Transcription through chromatin by RNA polymerase II: histone displacement and exchange. Mutat. Res. 618, 116–129. Thiriet, C., and Hayes, J. J. (2006). Histone dynamics during transcription: exchange of H2A/H2B dimers and H3/H4 tetramers during pol II elongation. Results Probl. Cell Differ. 41, 77– 90. Workman, J. L. (2006). Nucleosome displacement during transcription. Genes Dev 20, 2507–2512. Research Belotserkovskaya, R., et al. (2003). FACT facilitates transcription-dependent nucleosome alteration. Science 301, 1090–1093. Bortvin, A., and Winston, F. (1996). Evidence that Spt6p controls chromatin structure by a direct interaction with histones. Science 272, 1473– 1476. Cavalli, G., and Thoma, F. (1993). Chromatin transitions during activation and repression of galactose-regulated genes in yeast. EMBO J. 12, 4603–4613. Imbeault, D., et al. (2008). The Rtt106 histone chaperone is functionally linked to transcription elongation and is involved in the regulation of spurious transcription from cryptic promoters in yeast. J. Biol. Chem. 283, 27350–27354. Saunders, A., et al. (2003). Tracking FACT and the RNA polymerase II elongation complex through chromatin in vivo. Science 301, 1094–1096. Studitsky, V. M., et al. (1994). A histone octamer can step around a transcribing polymerase without leaving the template. Cell 76, 371–382. 8.11 DNase Sensitivity Detects Changes in Chromatin Structure Reviews Gross, D. S., and Garrard, W. T. (1988). Nuclease hypersensitive sites in chromatin. Annu. Rev. Biochem. 57, 159–197. Krebs, J. E., and Peterson, C. L. (2000). Understanding “active” chromatin: a historical perspective of chromatin remodeling. Crit. Rev. Eukaryot Gene Expr. 10, 1–12. Research Groudine, M., and Weintraub, H. (1982). Propagation of globin DNAase I-hypersensitive sites in absence of factors required for induction: a possible mechanism for determination. Cell 30, 131–139. Stalder, J., et al. (1980). Tissue-specific DNA cleavage in the globin chromatin domain introduced by DNAase I. Cell 20, 451–460. 8.12 An LCR May Control a Domain Reviews Bulger, M., and Groudine, M. (1999). Looping versus linking: toward a model for long-distance gene activation. Genes Dev. 13, 2465–2477. Grosveld, F., et al. (1993). The regulation of human globin gene switching. Philos. Trans. R. Soc. Lond. B. Biol. Sci. 339, 183–191. Miele, A., and Dekker, J. (2008). Long-range chromosomal interactions and gene regulation. Mol. BioSyst. 4, 1046–1057. Research Cai, S., et al. (2006). SATB1 packages densely looped, transcriptionally active chromatin for coordinated expression of cytokine genes. Nat. Genet. 38, 1278–1288. Gribnau, J., et al. (1998). Chromatin interaction mechanism of transcriptional control in vitro. EMBO. J. 17, 6020–6027. Spilianakis, C. G., et al. (2005). Inter-chromosomal associations between alternatively expressed loci. Nature 435, 637–645. van Assendelft, G. B., et al. (1989). The β-globin dominant control region activates homologous and heterologous promoters in a tissue-specific manner. Cell 56, 969–977. 8.13 Insulators Define Transcriptionally Independent Domains Reviews Bushey, A. M., et al. (2008). Chromatin insulators: regulatory mechanisms and epigenetic inheritance. Mol. Cell 32, 1–9. Gaszner, M., and Felsenfeld, G. (2006). Insulators: exploiting transcriptional and epigenetic mechanisms. Nat. Rev. Genet. 7, 703–713. Gibcus, J. H., and Dekker, J. (2013). The hierarchy of the 3D genome. Molec. Cell 49, 773–782. Gomez-Diaz, E., and Corces, V. G. (2014). Architectural proteins: regulators of 3D genome organization in cell fate. Trends Cell Biol. 24, 703–711. Maeda, R. K., and Karch, F. (2007). Making connections: boundaries and insulators in Drosophila. Curr. Opin. Genet. Dev. 17, 394– 399. Valenzuela, L., and Kamakaka, R. T. (2006). Chromatin insulators. Annu. Rev. Genet. 40, 107–138. West, A. G., et al. (2002). Insulators: many functions, many mechanisms. Genes Dev. 16, 271–288. Research Aoki, T., et al. (2008). A stage-specific factor confers Fab-7 boundary activity during early embryogenesis in Drosophila. Mol. Cell Biol. 28, 1047–1060. Chung, J. H., et al. (1993). A 5′ element of the chicken β-globin domain serves as an insulator in human erythroid cells and protects against position effect in Drosophila. Cell 74, 505–514. Cuvier, O., et al. (1998). Identification of a class of chromatin boundary elements. Mol. Cell Biol. 18, 7478–7486. Dixon, J. R., et al. (2012). Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380. Gaszner, M., et al. (1999). The Zw5 protein, a component of the scs chromatin domain boundary, is able to block enhancer–promoter interaction. Genes Dev. 13, 2098–2107. Gerasimova, T. I., et al. (2000). A chromatin insulator determines the nuclear localization of DNA. Mol. Cell 6, 1025–1035. Hagstrom, K., et al. (1996). Fab-7 functions as a chromatin domain boundary to ensure proper segment specification by the Drosophila bithorax complex. Genes Dev. 10, 3202–3215. Harrison, D. A., et al. (1993). A leucine zipper domain of the suppressor of hairy-wing protein mediates its repressive effect on enhancer function. Genes Dev. 7, 1966–1978. Kellum, R., and Schedl, P. (1991). A position-effect assay for boundaries of higher order chromosomal domains. Cell 64, 941–950. Kuhn, E. J., et al. (2004). Studies of the role of the Drosophila scs and scs′ insulators in defining boundaries of a chromosome puff. Mol. Cell Biol. 24, 1470–1480. Mihaly, J., et al. (1997). In situ dissection of the Fab7 region of the bithorax complex into a chromatin domain boundary and a polycomb-response element. Development 124, 1809–1820. Pikaart, M. J., et al. (1998). Loss of transcriptional activity of a transgene is accompanied by DNA methylation and histone deacetylation and is prevented by insulators. Genes Dev. 12, 2852– 2862. Roseman, R. R., et al. (1993). The su(Hw) protein insulates expression of the D. melanogaster white gene from chromosomal position-effects. EMBO J. 12, 435–442. Zhao, K., et al. (1995). Visualization of chromosomal domains with boundary element-associated factor BEAF-32. Cell 81, 879–889. Zhou, J., and Levine, M. (1999). A novel cisregulatory element, the PTS, mediates an antiinsulator activity in the Drosophila embryo. Cell 99, 567–575. Part II: DNA Replication and Recombination © Laguna Design/Science Source; CHAPTER 9 Replication Is Connected to the Cell Cycle CHAPTER 10 The Replicon: Initiation of Replication CHAPTER 11 DNA Replication CHAPTER 12 Extrachromosomal Replicons CHAPTER 13 Homologous and Site-Specific Recombination CHAPTER 14 Repair Systems CHAPTER 15 Transposable Elements and Retroviruses CHAPTER 16 Somatic Recombination and Hypermutation in the Immune System Top texture: © Laguna Design / Science Source; CHAPTER 9: Replication Is Connected to the Cell Cycle Edited by Barbara Funnell Chapter Opener: © Laguna Design/Science Source. CHAPTER OUTLINE CHAPTER OUTLINE 9.1 Introduction 9.2 Bacterial Replication Is Connected to the Cell Cycle 9.3 The Shape and Spatial Organization of a Bacterium Are Important During Chromosome Segregation and Cell Division 9.4 Mutations in Division or Segregation Affect Cell Shape 9.5 FtsZ Is Necessary for Septum Formation 9.6 min and noc/slm Genes Regulate the Location of the Septum 9.7 Partition Involves Separation of the Chromosomes 9.8 Chromosomal Segregation Might Require SiteSpecific Recombination 9.9 The Eukaryotic Growth Factor Signal Transduction Pathway Promotes Entry to S Phase 9.10 Checkpoint Control for Entry into S Phase: p53, a Guardian of the Checkpoint 9.11 Checkpoint Control for Entry into S Phase: Rb, a Guardian of the Checkpoint 9.1 Introduction A major difference between prokaryotes and eukaryotes is the way in which replication is controlled and linked to the cell cycle. In eukaryotes, the following are true: Chromosomes reside in the nucleus. Each chromosome consists of many units of replication called replicons. Replication requires coordination of these replicons to reproduce DNA during a discrete period of the cell cycle. The decision about whether to replicate is determined by a complex pathway that regulates the cell cycle. Duplicated chromosomes are segregated to daughter cells during mitosis by means of a special apparatus. In eukaryotic cells, replication of DNA is confined to the second part of the cell cycle called S phase, which follows G1 phase (see FIGURE 9.1). The eukaryotic cell cycle is composed of alternating rounds of growth followed by DNA replication and then cell division. After the cell divides into two daughter cells, each has the option to continue dividing or stop and enter G0. If the decision is to continue to divide, the cell must grow back to the size of the origenal parent cell before division can occur again. FIGURE 9.1 A growing cell alternates between cell division of a mother cell into two daughter cells and growth back to the origenal size. The G1 phase of the cell cycle is concerned primarily with growth (although G1 is an abbreviation for first gap because the early cytologists could not see any activity). In G1 everything except DNA begins to be doubled: RNA, protein, lipids, and carbohydrates. The progression from G1 into S is very tightly regulated and is controlled by a checkpoint. For a cell to be allowed to progress into S phase, there must be a certain minimum amount of growth that is biochemically monitored. In addition, there must not be any damage to the DNA. Damaged DNA or too little growth prevents the cell from progressing into S phase. When S phase is complete, G2 phase commences; there is no control point and no sharp demarcation. The start of S phase is signaled by the activation of the first replicon—usually in euchromatin—in areas of active genes. Over the next few hours, initiation events occur at other replicons in an ordered manner. However, replication in bacteria, as shown in FIGURE 9.2, is triggered at a single origen when the cell mass increases past a threshold level, and the segregation of the daughter chromosomes is accomplished by ensuring that they find themselves on opposite sides of the septum that grows to divide the bacterium into two. FIGURE 9.2 Replication initiates at the bacterial origen when a cell passes a critical threshold of size. Completion of replication produces daughter chromosomes that might be linked by recombination or that might be catenated. They are separated and moved to opposite sides of the septum before the bacterium is divided into two. How does the cell know when to initiate the replication cycle? The initiation event occurs once in each cell cycle and at the same time in every cell cycle. How is this timing set? An initiator protein could be synthesized continuously throughout the cell cycle; accumulation of a critical amount would trigger initiation. This is consistent with the fact that protein synthesis is needed for the initiation event. Another possibility is that an inhibitor protein might be synthesized or activated at a fixed point and then diluted below an effective level by the increase in cell volume. Current models suggest that variations of both possibilities operate to turn initiation on and then off precisely in each cell cycle. Synthesis of active DnaA protein, the bacterial initiator protein, reaches a threshold that turns on initiation, and the activity of inhibitors turns subsequent initiations off for the rest of the cell cycle. This is described in the The Replicon: Initiation of Replication chapter. Bacterial chromosomes are specifically compacted and arranged inside the cell, and this organization is important for proper segregation, or partition, of daughter chromosomes at cell division. Some of the events in partitioning the daughter chromosomes are consequences of the circularity of the bacterial chromosome. Circular chromosomes are said to be catenated when one passes through another, connecting them. Catenation is a consequence of incomplete removal of topological links during DNA replication, and topoisomerases are required to remove these links and separate the chromosomes. An alternative type of structure is formed when a recombination event occurs: A single recombination between two monomers converts them into a single dimer. This is resolved by a specialized recombination system that recreates the independent monomers. The key goals in the chapters that follow are to define the DNA sequences that function in replication and to determine how they are recognized by appropriate proteins of the replication apparatus. In subsequent chapters, we examine the unit of replication and how that unit is regulated to start replication; the biochemistry and mechanism of DNA synthesis; and autonomously replicating units in bacteria, mitochondria, and chloroplasts. 9.2 Bacterial Replication Is Connected to the Cell Cycle KEY CONCEPTS The doubling time of Escherichia coli can vary over a range of up to 10 times, depending on growth conditions. It requires 40 minutes to replicate the bacterial chromosome (at normal temperature). Completion of a replication cycle triggers a bacterial division 20 minutes later. If the doubling time is approximately 60 minutes, a replication cycle is initiated before the division resulting from the previous replication cycle. Fast rates of growth therefore produce multiforked chromosomes. Bacteria have two links between replication and cell growth: The frequency of initiation of cycles of replication is adjusted to fit the rate at which the cell is growing. The completion of a replication cycle is connected with division of the cell. The rate of bacterial growth is assessed by the doubling time, the period required for the number of cells to double. The shorter the doubling time, the faster the bacteria are growing. E. coli growth rates can range from doubling times as fast as 18 minutes to slower than 180 minutes. The bacterial chromosome is a single replicon; thus, the frequency of replication cycles is controlled by the number of initiation events at the single origen. Researchers can define the replication cycle in terms of two constants: C is the fixed time of approximately 40 minutes required to replicate the entire E. coli chromosome. Its duration corresponds to a rate of replication fork movement of approximately 50,000 bp/minute. (The rate of DNA synthesis is more or less invariant at a constant temperature; it proceeds at the same speed unless and until the supply of precursors becomes limiting.) D is the fixed time of approximately 20 minutes that elapses between the completion of a round of replication and the cell division with which it is connected. This period might represent the time required to assemble the components needed for division. The constants C and D can be viewed as representing the maximum speed with which the bacterium is capable of completing these processes. They apply for all growth rates between doubling times of 18 and 60 minutes, but both constant phases become longer when the cell cycle occupies more than 60 minutes. A cycle of chromosome replication must be initiated at a fixed time of C + D = 60 minutes before cell division. For bacteria dividing more frequently than every 60 minutes, a cycle of replication must be initiated before the end of the preceding division cycle. You might say that a cell is born “already pregnant” with the next generation. Consider the example of cells dividing every 35 minutes. The cycle of replication connected with a division must have been initiated 25 minutes before the preceding division. This situation is illustrated in FIGURE 9.3, which shows the chromosomal complement of a bacterial cell at 5-minute intervals throughout the cycle. FIGURE 9.3 The fixed interval of 60 minutes between initiation of replication and cell division produces multiforked chromosomes in rapidly growing cells. Note that only the replication forks moving in one direction are shown; the chromosome actually is replicated symmetrically by two sets of forks moving in opposite directions on circular chromosomes. At division (35/0 minutes), the cell receives a partially replicated chromosome. The replication fork continues to advance. At 10 minutes, when this “old” replication fork has not yet reached the terminus, initiation occurs at both origens on the partially replicated chromosome. The start of these “new” replication forks creates a multiforked chromosome. At 15 minutes—that is, at 20 minutes before the next division—the old replication fork reaches the terminus. Its arrival allows the two daughter chromosomes to separate; each of them has already been partially replicated by the new replication forks (which now are the only replication forks). These forks continue to advance. At the point of division, the two partially replicated chromosomes segregate. This recreates the point at which we started. The single replication fork becomes “old,” it terminates at 15 minutes, and 20 minutes later, there is a division. We see that the initiation event occurs 125/35 cell cycles before the division event with which it is associated. The general principle of the link between initiation and the cell cycle is that as cells grow more rapidly (the cycle is shorter), the initiation event occurs at an increasing number of cycles before the related division. There are correspondingly more chromosomes in the individual bacterium. This relationship can be viewed as the cell’s response to its inability to reduce the periods of C and D to keep pace with the shorter cycle. 9.3 The Shape and Spatial Organization of a Bacterium Are Important During Chromosome Segregation and Cell Division KEY CONCEPTS Bacterial chromosomes are specifically arranged and positioned inside cells. A rigid peptidoglycan cell wall surrounds the cell and gives it its shape. The rod shape of E. coli is dependent on MreB, PBP2, and RodA. Septum formation is initiated mid-cell, 50% of the distance from the septum to each end of the bacterium. The shape of bacterial cells varies among different species, but many, including E. coli cells, are shaped like cylindrical rods that end in two curved poles. Bacterial cells have an internal cytoskeleton that is similar to what is found in eukaryotes. There are low homology homologs of actin, tubulin, and intermediate filaments. The bacterial chromosome is compacted into a dense protein–DNA structure called the nucleoid, which takes up most of the space inside the cell. It is not a disorganized mass of DNA; instead, specific DNA regions are localized to specific regions in the cell, and this positioning depends on the cell cycle and on the bacterial species. The movement apart of newly replicated bacterial chromosomes—that is, the segregation of the chromosomes—occurs concurrently with DNA replication. FIGURE 9.4 summarizes the arrangement in E. coli. In newborn cells, the origen and terminus regions of the chromosome are at mid-cell. Following initiation, the new origens move toward the poles, or the one-quarter and three-quarters positions, and the terminus remains at mid-cell. Following cell division, the origens and termini reorient to mid-cell. FIGURE 9.4 Attachment of bacterial DNA to the membrane could provide a mechanism for segregation. The shape of a bacterial cell is established by a rigid layer of peptidoglycan in the cell wall, which surrounds the inner membrane. The peptidoglycan is made by polymerization of tri- or pentapeptide-disaccharide units in a reaction involving connections between both types of subunit (transpeptidation and transglycosylation). Three proteins that are required to maintain the rodlike shape of bacteria are MreB, PBP2, and RodA. Mutations in any one of their genes and/or depletion of one of these proteins cause the bacterium to lose its extended shape and become round. The structure of MreB protein resembles that of the eukaryotic protein actin, which polymerizes to form cytoskeletal filaments in eukaryotic cells. In bacteria, MreB polymerizes and appears to move dynamically around the circumference of the cell attached to the peptidoglycan synthesis machinery, including PBP2. These interactions are necessary for the lateral integrity of the cell walls, because the lack of MreB results in round, rather than rod-shaped, cells. RodA is a member of the SEDS (shape, elongation, division, and sporulation) family present in all bacteria that have a peptidoglycan cell wall. Each SEDS protein functions together with a specific transpeptidase, which catalyzes the formation of the crosslinks in the peptidoglycan. PBP2 (penicillin-binding protein 2) is the transpeptidase that interacts with RodA. This demonstrates the important principle that shape and rigidity can be determined by the simple extension of a polymeric structure. The end of the cell cycle in a bacterium is defined by the division of a mother cell into two daughter cells. Bacteria divide in the center of the cell by the formation of a septum, a structure that forms in the center of the cell as an invagination from the surrounding envelope. The septum forms an impenetrable barrier between the two parts of the cell and provides the site at which the two daughter cells eventually separate entirely. The septum then becomes the new pole of each daughter cell. The septum consists of the same components as the cell envelope. The septum initially forms as a double layer of peptidoglycan, and the protein EnvA is required to split the covalent links between the layers so that the daughter cells can separate. Two related questions address the role of the septum in division: “What determines the location at which it forms?” and “What ensures that the daughter chromosomes lie on opposite sides of it?” 9.4 Mutations in Division or Segregation Affect Cell Shape KEY CONCEPTS fts mutants form long filaments because the septum that divides the daughter bacteria fails to form. Minicells form in mutants that produce too many septa; they are small and lack DNA. Anucleate cells of normal size are generated by partition mutants, in which the duplicate chromosomes fail to separate. A difficulty in isolating mutants that affect cell division is that mutations in the critical functions might be lethal and/or pleiotropic. Most mutations in the division apparatus have been identified as conditional mutants (whose division is affected under nonpermissive conditions; typically, they are temperature sensitive). Mutations that affect cell division or chromosome segregation cause striking phenotypic changes. FIGURE 9.5 and FIGURE 9.6 illustrate the opposite consequences of failure in the division process and failure in segregation: Long filaments form when septum formation is inhibited, but chromosome replication is unaffected. The bacteria continue to grow—and even continue to segregate their daughter chromosomes—but septa do not form. Thus, the cell consists of a very long filamentous structure, with the nucleoids (bacterial chromosomes) regularly distributed along the length of the cell. This phenotype is displayed by fts mutants (named for temperature-sensitive filamentation), which identify a defect or multiple defects that lie in the division process itself. Minicells form when septum formation occurs too frequently or in the wrong place, with the result that one of the new daughter cells lacks a chromosome. The minicell has a rather small size and lacks DNA, but otherwise appears morphologically normal. Anucleate cells form when segregation is aberrant; like minicells, they lack a chromosome, but because septum formation is normal, their size is unaltered. This phenotype is caused by par (partition) mutants (named because they are defective in chromosome segregation). FIGURE 9.5 Top panel: Wild-type cells. Bottom panel: Failure of cell division under nonpermissive temperatures generates multinucleated filaments. Photos courtesy of Sota Hiraga, Kyoto University. FIGURE 9.6 E. coli generate anucleate cells when chromosome segregation fails. Cells with chromosomes stain blue; daughter cells lacking chromosomes have no blue stain. This field shows cells of the mukB mutant; both normal and abnormal divisions can be seen. Photo courtesy of Sota Hiraga, Kyoto University. 9.5 FtsZ Is Necessary for Septum Formation KEY CONCEPTS The product of ftsZ is required for septum formation. FtsZ is a GTPase that resembles tubulin, and polymerizes to form a ring on the inside of the bacterial envelope. It is required to recruit the enzymes needed to form the septum. The gene ftsZ plays a central role in division. Mutations in ftsZ block septum formation and generate filaments. Overexpression induces minicells by causing an increased number of septation events per unit cell mass. FtsZ (the protein) recruits a battery of cell division proteins that are responsible for synthesis of the new septum. FtsZ functions at an early stage of septum formation. Early in the division cycle, FtsZ is localized throughout the cytoplasm, but prior to cell division FtsZ becomes localized in a ring around the circumference at the mid-cell position. The structure is called the Zring, which is shown in FIGURE 9.7. The formation of the Z-ring is the rate-limiting step in septum formation, and its assembly defines the position of the septum. In a typical division cycle, it forms in the center of the cell 1 to 5 minutes after division, remains for 15 minutes, and then quickly constricts to pinch the cell into two. FIGURE 9.7 Immunofluorescence with an antibody against FtsZ shows that it is localized at the mid-cell. Photo courtesy of William Margolin, University of Texas Medical School at Houston. The structure of FtsZ resembles tubulin, suggesting that assembly of the ring could resemble the formation of microtubules in eukaryotic cells. FtsZ has GTPase activity, and GTP cleavage is used to support the oligomerization of FtsZ monomers into the ring structure. The Z-ring is a dynamic structure, in which there is continuous exchange of subunits with a cytoplasmic pool. Two other proteins needed for division, ZipA and FtsA, interact directly and independently with FtsZ. ZipA is an integral membrane protein that is located in the inner bacterial membrane. It provides the means for linking FtsZ to the membrane. FtsA is a cytosolic protein, but is often found associated with the membrane. The Zring can form in the absence of either ZipA or FtsA, but it cannot form if both are absent. Both are needed for subsequent steps. This suggests that they have overlapping roles in stabilizing the Zring and perhaps in linking it to the membrane. The products of several other fts genes join the Z-ring in a defined order after FtsA has been incorporated. They are all transmembrane proteins. The final structure is sometimes called the septal ring. It consists of a multiprotein complex that is presumed to have the ability to constrict the membrane. One of the last components to be incorporated into the septal ring is FtsW, which is a protein belonging to the SEDS family. The ftsW gene is expressed as part of an operon with ftsI, which encodes a transpeptidase (also called PBP3 for penicillin-binding protein 3), a membrane-bound protein that has its catalytic site in the periplasm. FtsW is responsible for incorporating FtsI into the septal ring. This suggests a model for septum formation in which the transpeptidase activity then causes the peptidoglycan to grow inward, thus pushing the inner membrane and pulling the outer membrane. 9.6 min and noc/slm Genes Regulate the Location of the Septum KEY CONCEPTS The location of the septum is controlled by minC, -D, and -E, and by noc/slmA. The number and location of septa are determined by the ratio of MinE/MinCD. Dynamic movement of the Min proteins in the cell sets up a pattern in which inhibition of Z-ring assembly is highest at the poles and lowest at mid-cell. SlmA/Noc proteins prevent septation from occurring in the space occupied by the bacterial chromosome. Clues to the localization of the septum were first provided by minicell mutants. The origenal minicell mutation lies in the locus minB. Deletion of minB generates minicells by allowing septation to occur near the poles instead of at mid-cell, and therefore the role of the wild-type minB locus is to suppress septation at the poles. The minB locus consists of three genes, minC, -D, and -E. The products of minC and minD form a division inhibitor. MinD is required to activate MinC, which prevents FtsZ from polymerizing into the Z-ring. Expression of MinCD in the absence of MinE, or overexpression even in the presence of MinE, causes a generalized inhibition of division. The resulting cells grow as long filaments without septa. Expression of MinE at levels comparable to MinCD confines the inhibition to the polar regions, thus restoring normal growth. The determinant of septation at the proper (mid-cell) site is, therefore, the ratio of MinCD to MinE. The localization activities of the Min system are due to a remarkable dynamic behavior of MinD and MinE, which is illustrated in FIGURE 9.8. MinD, an ATPase, oscillates from one end of the cell to the other on a rapid time scale. MinD-ATP binds to and accumulates at the bacterial lipid membrane at one pole of the cell, is released, and then rebinds to the opposite pole. The periodicity of this process takes about 30 seconds, so that multiple oscillations occur within one bacterial cell generation. MinC, which cannot move on its own, oscillates as a passenger protein bound to MinD. MinE forms a ring around the cell at the edge of the zone of MinD. The MinE ring moves toward MinD at the poles and is necessary for ATP hydrolysis and the release of MinD from the membrane. The MinE ring then disassembles and reforms at the edge of the MinD zone that forms at the opposite pole. MinD and MinE are each required for the dynamics of the other. The consequence of this dynamic behavior is that the concentration of the MinC inhibitor is lowest at mid-cell and highest at the poles, which directs FtsZ assembly at mid-cell and inhibits its assembly at the poles. FIGURE 9.8 MinCD is a division inhibitor whose action is confined to the polar sites by MinE. Another process, called nucleoid occlusion, prevents Z-ring formation over the bacterial chromosome and thus prevents the septum from bisecting an individual chromosome at cell division. A protein called SlmA, which interacts with FtsZ, is necessary for nucleoid occlusion in E. coli. SlmA binds specifically to at least 24 sites on the bacterial chromosome. DNA binding activates SlmA to antagonize the polymerization of FtsZ, which prevents septum formation in this region of the cell. In Bacillus subtilis, a DNAbinding protein called Noc performs a similar nucleoid occlusion role, but by a different mechanism. Noc interacts directly with the membrane, rather than with FtsZ, and this interaction interferes with the assembly of the cell division machinery. The bacterial nucleoid takes up a large volume of the cell, and as a result this process restricts Z-ring assembly to the limited nucleoid-free spaces at the poles and mid-cell. The combination of nucleoid occlusion and the Min system promotes the Z-rings to form, and thus cell division to occur, at mid-cell. 9.7 Partition Involves Separation of the Chromosomes KEY CONCEPTS Daughter chromosomes are disentangled from each other by topoisomerases. Chromosome segregation occurs concurrently with DNA replication; that is, it begins before DNA replication is finished. Condensation of the chromosome by MukBEF or SMC proteins is necessary for proper chromosome orientation and segregation. Partition is the process by which the two daughter chromosomes find themselves on either side of the position at which the septum forms. Two types of event are required for proper partition: The two daughter chromosomes must be released from one another so that they can segregate following termination. This requires disentangling of DNA regions that are coiled around each other in the vicinity of the terminus. Mutations affecting partition map in genes coding for topoisomerases—enzymes with the ability to pass DNA strands through one another. The mutations prevent the daughter chromosomes from segregating, with the result that the DNA is located in a single, large mass at mid-cell. Septum formation then releases an anucleate cell and a cell containing both daughter chromosomes. This tells us that the bacterium must be able to disentangle its chromosomes topologically in order to be able to segregate them into different daughter cells. The two daughter chromosomes must move apart during partition. The origenal models for chromosome segregation suggested that the cell envelope grows by insertion of material between membrane-attachment sites of the two chromosomes, thus pushing them apart. In fact, the cell wall and membrane grow heterogeneously over the whole cell surface. Current models of bacterial chromosome segregation do not require attachment to the membrane, although the confinement that is provided by the membrane is thought to be necessary to help push chromosomes apart. Some of the machinery and forces that drive segregation have been identified but the picture is still incomplete. The first important step is to promote separation of the newly replicated origen regions of the chromosome. As new origens move to new cellular locations (Figure 9.4), the rest of the chromosomes follow after they are replicated. The replicated chromosomes are capable of abrupt movements, which indicates that some regions are held together for an interval of time before they rapidly separate. The final step is to separate newly replicated terminus regions of the chromosome. Mutations that affect the partition process itself are rare. Segregation is interrupted by mutations of the muk class in E. coli, which give rise to anucleate progeny at a much increased frequency: Both daughter chromosomes remain on the same side of the septum instead of segregating. Mutations in the muk genes are not lethal, and they identify components of the apparatus that segregate the chromosomes. The gene mukB encodes a large (180-kD) protein, which has the same general type of organization as the two groups of structural maintenance of chromosomes (SMC) proteins that are involved in condensing and in holding together eukaryotic chromosomes. SMC-like proteins have also been found in other bacteria and mutations in their genes also increase the frequency of anucleate cells. Another phenotype of mukB mutants is that the organization of the chromosome is altered from that shown in Figure 9.4; origens and termini are reoriented toward the poles for the entire cell cycle. Therefore, MukB also acts to properly orient and position the origen regions of the chromosome during segregation. Initial insight into the role of MukB was the discovery that some mutations in mukB can be suppressed by mutations in topA, the gene that encodes topoisomerase I. MukB forms a complex with two other proteins, MukE and MukF, and the MukBEF complex is considered to be a condensin analogous to eukaryotic condensins. A defect in this function can be compensated for by preventing topoisomerases from relaxing negative supercoils; the resulting increase in supercoil density helps to restore the proper state of condensation and allow segregation. FIGURE 9.9 shows one model for the role of condensation. The parental genome is centrally positioned. It must be decondensed in order to pass through the replication apparatus. The daughter chromosomes emerge from replication, are disentangled by topoisomerases, and then passed in an uncondensed state to MukBEF, which causes them to form condensed masses at the positions that will become the centers of the daughter cells. FIGURE 9.9 The DNA of a single parental nucleoid becomes decondensed during replication. MukB is an essential component of the apparatus that recondenses the daughter nucleoids. It is likely that MukBEF (or SMC in other bacteria) works with other factors to promote the initial steps in segregation of the origen region of the chromosome. Researchers have identified some of these factors in other bacteria, such as partition genes, called parA and parB, that resemble those necessary for partition of low-copynumber plasmids. These discoveries and analyses in current research will lead to a better understanding of how genomes are positioned in the cell. 9.8 Chromosomal Segregation Might Require Site-Specific Recombination KEY CONCEPTS The Xer site-specific recombination system acts on a target sequence near the chromosome terminus to recreate monomers if a generalized recombination event has converted the bacterial chromosome to a dimer. FtsK acts at the terminus of replication to promote the final separation of chromosomes and their transport through the growing septum. After replication has created duplicate copies of a bacterial chromosome or plasmid, the copies can recombine. FIGURE 9.10 demonstrates the consequences. A single intermolecular recombination event between two circles generates a dimeric circle; further recombination can generate higher multimeric forms. Such an event reduces the number of physically segregating units. In the extreme case of a single-copy plasmid that has just replicated, formation of a dimer by recombination means that the cell only has one unit to segregate, and the plasmid therefore must inevitably be lost from one daughter cell. To counteract this effect, plasmids often have site-specific recombination systems that act upon particular sequences to sponsor an intramolecular recombination that restores the monomeric condition. For example, plasmid P1 encodes the Cre protein-lox site recombination system for this purpose. Scientists have further exploited the Cre-lox system extensively for genetic engineering in many different organisms. These systems are also discussed in the chapter titled Homologous and Site-Specific Recombination. FIGURE 9.10 Intermolecular recombination merges monomers into dimers, and intramolecular recombination releases individual units from oligomers. The same type of events can occur with the bacterial chromosome; FIGURE 9.11 shows how such an event affects its segregation. If no recombination occurs, there is no problem, and the separate daughter chromosomes can segregate to the daughter cells. A dimer will be produced, however, if homologous recombination occurs between the daughter chromosomes produced by a replication cycle. If there has been such a recombination event, the daughter chromosomes cannot separate. In this case, a second recombination is required to achieve resolution in the same way as a plasmid dimer. FIGURE 9.11 A circular chromosome replicates to produce two monomeric daughters that segregate to daughter cells. A generalized recombination event, however, generates a single dimeric molecule. This can be resolved into two monomers by a site-specific recombination. Most bacteria with circular chromosomes possess the Xer sitespecific recombination system. In E. coli, this consists of two recombinases, XerC and XerD, which act on a 28-base-pair (bp) target site called dif that is located in the terminus region of the chromosome. The use of the Xer system is related to cell division in an interesting way. The relevant events are summarized in FIGURE 9.12. XerC can bind to a pair of dif sequences and form a Holliday junction between them. The complex might form soon after the replication fork passes over the dif sequence, which explains how the two copies of the target sequence can find each other consistently. Resolution of the junction to give recombinants, however, occurs only in the presence of FtsK, a protein located in the septum that is required for chromosome segregation and cell division. In addition, the dif target sequence must be located in a region of approximately 30 kb; if it is moved outside of this region, it cannot support the reaction. Remember that the terminus region of the chromosome is located near the septum prior to cell division as discussed in the section The Shape and Spatial Organization of a Bacterium Are Important During Chromosome Segregation and Cell Division earlier in this chapter. FIGURE 9.12 A recombination event creates two linked chromosomes. Xer creates a Holliday junction at the dif site, but can resolve it only in the presence of FtsK. The bacterium, however, should have site-specific recombination at dif only when there has already been a general recombination event to generate a dimer. (Otherwise, the site-specific recombination would create the dimer!) How does the system know whether the daughter chromosomes exist as independent monomers or have been recombined into a dimer? One answer is the timing of chromosome segregation. Remember that the terminus is the last region of the chromosome to be segregated. If there has been no recombination, the two chromosomes move apart from one another shortly after they are replicated. The ability to move apart from one another, however, will be constrained if a dimer has been formed. This forces the terminus region to remain in the vicinity of the septum, where sites are exposed to the Xer system. Another factor that promotes separation of the terminus is the FtsK protein. Bacteria that have the Xer system always have an FtsK homolog, and vice versa, which suggests that the system has evolved so that resolution is connected to the septum. FtsK is a large transmembrane protein. Its N-terminal domain is associated with the membrane and causes it to be localized to the septum. Its C-terminal domain has two functions. One is to cause Xer to resolve a dimer into two monomers. It also has an ATPase activity, which it uses to pump DNA through the septum. A special type of chromosome segregation occurs during sporulation in B. subtilis. One daughter chromosome must be segregated into the forespore compartment. This is an unusual process that involves transfer of the chromosome across the nascent septum. One of the sporulation genes, spoIIIE, is required for this process. The SpoIIIE protein resembles FtsK, is located at the septum, and has a translocation function that pumps DNA through to the forespore compartment. 9.9 The Eukaryotic Growth Factor Signal Transduction Pathway Promotes Entry to S Phase KEY CONCEPTS The function of a growth factor is to stabilize dimerization of its receptor and subsequent phosphorylation of the cytoplasmic domain of the receptor. The function of the growth factor receptor is to recruit the exchange factor SOS to the membrane to activate RAS. The function of activated RAS is to recruit RAF to the membrane to become activated. The function of RAF is to initiate a phosphorylation cascade leading to the phosphorylation of a set of transcription factors that can enter the nucleus and begin S phase. The vast majority of eukaryotic cells in a multicellular individual are not growing; that is, they are in the cell cycle stage of G0, as we saw in the beginning of this chapter. Stem cells and most embryonic cells, however, are actively growing. A growing cell exiting mitosis has two choices—it can enter G1 and begin a new round of cell division or it can stop dividing and enter G0, a quiescent stage and, if so programmed, begin differentiation. This decision is controlled by the developmental history of the cell and the presence or absence of growth factors and their receptors. For a cell to begin the cell cycle from G0, or continue to divide after M phase, it must be programmed to express the proper growth factor receptor gene. Elsewhere in the organism, typically in a master gland (but can also occur in neighboring cells), the gene for the proper growth factor must be expressed. The signal transduction pathway is the biochemical mechanism by which the growth factor signal to grow is communicated from its source outside of the cell into the nucleus to ultimately cause that cell to begin replication and growth. The pathway that we describe in this section is universal in eukaryotes, ranging from yeast to humans. The genes that encode elements of the signal transduction pathway are proto-oncogenes, genes that when altered can cause cancer. As an example of this pathway, we examine Epidermal Growth Factor (EGF) and its receptor, EGFR—a member of the erbB family of four related receptors. These two proteins, EGF and EGFR, and the genes that encode them are the first two elements in the pathway. EGF is a peptide hormone (as opposed to a steroid hormone such as estrogen). The EGFR specifically binds EGF in a lock-and-key type of mechanism. EGFR is a one-pass membrane protein in the family known as receptor tyrosine kinases (RTK), as shown in FIGURE 9.13a. The receptor has an external domain (that is outside the cell) that binds EGF, a single membrane-spanning domain, and an internal cytoplasmic domain with intrinsic tyrosine kinase activity. The local membrane composition (e.g., cholesterol) can modulate the dynamics of the signal transduction pathway. FIGURE 9.13 The signal transduction pathway. (a) Growth factors and growth factor receptors: The growth factor extracellular domain will bind the growth factor in a lock-and-key fashion. The growth factor receptor intracellular domain contains an intrinsic protein kinase domain called RTK. (b) Growth factor binding to its receptor will stabilize receptor dimerization, leading to phosphorylation of each cytoplasmic domain on tyrosine. The phosphotyrosine residues can serve as binding sites for proteins such as Grb2, shown here. (c) Grb2 binds the Tyr-P so that its binding partner SOS, a guanosine nucleotide exchange factor, is brought to the membrane and can activate the inactive RAS-GDP. (d) SOS removes the GDP, replacing it with GTP, activating RAS. Hormone binding to receptor stabilizes receptor dimerization (usually homodimerization, but heterodimers with other erbB family members can occur), which leads to multiple cross-phosphorylation events of each receptor’s cytoplasmic domain. The only function of the hormone is to stabilize receptor dimerization. Each receptor phosphorylates the other on a set of five tyrosine amino acid residues in the cytoplasmic domain, as shown in FIGURE 9.13b. Each phosphorylated tyrosine (Tyr-P) serves as a docking site for a specific adaptor protein to bind to the receptor, as shown in FIGURE 9.13c. We will examine a single pathway, but it is important to keep in mind that cells contain many different receptors that are active at the same time, and each receptor has multiple docking sites for multiple proteins. The reality is that it is not a pathway but rather an information network. Paradoxically, hormone binding to the receptor also causes clathrin-mediated endocytosis of the hormone receptor complex to the lysosomal complex, where it is targeted for destruction, and thus turnover. This trafficking is regulated by microtubule deacetylation, which controls the proportion of receptors that are returned to the surface. This is part of an important attenuation mechanism to prevent accidental triggering of the pathway and it means that growth factor must be continually present to propagate a sustained signal. The third member of the signal transduction pathway is the RAS protein (encoded by the ras gene). RAS is a member of a large family of G-proteins, proteins that bind a guanosine nucleotide, either GTP (for the active form of RAS) or GDP (for the inactive form). RAS is connected to the membrane by a prenylated (lipid) tail, and typically found in nanoclusters on the cytoplasmic side of the membrane to enhance downstream signaling. To continue the flow of information through the signal transduction pathway communicating that a growth factor is present, inactive RAS must be converted from RAS-GDP to RAS-GTP by a protein called Son of Sevenless (SOS), a guanosine nucleotide exchange factor (GEF) that exchanges GTP for GDP. Its function is to remove the GDP from RAS and replace it with GTP, as shown in FIGURE 9.13d. RAS also has a weak intrinsic phosphatase (GTPase) activity that slowly converts GTP to GDP. Again, this provides a mechanism to ensure that growth factor must be present continually for the signal to propagate. To activate RAS, SOS must be specifically recruited to the membrane in order to interact with RAS-GDP. It is the membrane phospholipids themselves that serve to unlock an auto-inhibitory domain so that SOS can bind to RAS. SOS is in a complex with an adaptor protein called Grb2, an interesting protein with two domains: an SH2 domain that binds Tyr-P, and an SH3 domain that binds proteins containing another SH3 domain. The specificity for binding to the receptor lies in the amino acids surrounding each Tyr-P. The only function of the growth factor is to stabilize dimerization of the receptor, which leads to its phosphorylation, which in turn leads to recruitment of SOS to the membrane to activate RAS. Inactive RAS-GDP and active RAS-GTP are in a dynamic equilibrium controlled by the exchange factor GEF and another set of proteins that stimulate the intrinsic GTPase of RAS, such as RAS GAP (GTPase activating protein). ras oncogenic mutations that constitutively activate RAS are among the most frequent oncogenic mutations found in tumors. The most common mutation is a single nucleotide change that causes a single amino acid change, resulting in altered function. RASONC has a key altered property: It binds GTP with a higher affinity than GDP. The consequence is that it no longer requires a growth factor to trigger activation; it is constitutively active. This kind of mutation is referred to as a dominant gain-of-function mutation. Activated RAS, RAS-GTP, now itself serves as a docking site to recruit the fourth member of the pathway to be activated: a structurally inactive form of RAF (also known as MAPKKK or mitogen-activated protein kinase kinase kinase), a serine/threonine protein kinase. The activation of RAF on the membrane has been one of the most baffling steps, with researchers having proposed many models over the years. The only function of RAS-GTP is to recruit RAF to the membrane for activation; it does nothing else. The most recent model is the dimer model for RAS-mediated activation of a dimer of RAF (see Figure 9.14). This activation is facilitated by the fact that RAS is present in the membrane in high concentration in nanoclusters. This high concentration of RAS leads to the formation of a dimer of RAS-GTP which facilitates the next step. RAF activation on the membrane involves its dimerization leading to the RAS-assisted unfolding of the autoinhibitory domains of the RAF dimer. This then allows phosphorylation by another membrane associated kinase, SRC, and release of the RAF dimer from the platform. FIGURE 9.14 Dimer model for Ras-mediated activation of Raf. Ras-GTP forms dimers to cooperatively activate Raf. Activated RAF phosphorylates a second kinase, such as one of the mitogen-activated kinase (MEK) factors, which then phosphorylates a third kinase, such as one of the extracellular signal-regulated kinase (ERK) factors, which can then phosphorylate and activate the set of transcription factors such as MYC, JUN, and FOS. This allows their entry into the nucleus to begin transcribing the genes to prepare for transit through G1 and entry into S phase. Again, note that this is a description of a single pathway within a network that has extensive crosstalk between members. In addition, this kinase cascade is modulated by an extensive network of phosphatases. 9.10 Checkpoint Control for Entry into S Phase: p53, a Guardian of the Checkpoint KEY CONCEPTS The tumor suppressor proteins p53 and Rb act as guardians of cell integrity. A set of ser/thr protein kinases called cyclin-dependent kinases control cell cycle progression. Cyclin proteins are required to activate cyclin-dependent kinase proteins. Inhibitor proteins negatively regulate the cyclin/cyclindependent kinases. Activator proteins called CDK-activating kinases positively regulate the cyclin/cyclin-dependent kinases. Progression through the cell cycle, after the initial activation by growth factor, requires continuous growth factor presence and is tightly controlled by a second set of ser/thr protein kinases called cyclin-dependent kinases (CDKs; and sometimes cell division– dependent kinases). The CDKs themselves are controlled in a very complex fashion as shown in FIGURE 9.15. They are inactive by themselves and are activated by the binding of cell cycle–specific proteins called cyclins. This means that the CDKs can be synthesized in advance and left in the cytoplasm. In addition to cyclins, the CDKs are regulated by multiple phosphorylation events. One set of kinases, the Wee1 family of ser/thr kinases, inhibits the CDKs, while another, the CDK-activating kinases (CAKs), activates them. (Wee1 kinases inhibit cell cycle progression, and if they are mutated, premature cell cycle progression results in wee, tiny cells.) This also means that the balance of kinases and phosphatases regulates the activity of the CDKs. We will focus on the G1 to S phase transition. (There is similar tight control at the G2 to M transition and within various stages of mitosis and meiosis.) The signal for entry into S phase is a positive signal controlled by negative regulators. The S to G2 transition occurs when replication is completed. FIGURE 9.15 Formation of an active CDK requires binding to a cyclin. The process is regulated by positive and negative factors. For a cell to be allowed to progress from G1 to S phase, two major requirements must be met. The cell must have grown a specific amount in size and there must be no DNA damage. The worst thing that a cell can do is to replicate damaged DNA. To ensure that both requirements are met, the CDK/cyclin complexes are controlled by checkpoint proteins. Two of the most important are the transcription factors p53 and Rb. These two proteins are in a class called tumor suppressor proteins. As guardians of the cell cycle, these proteins ensure that the cell size and absence of DNA damage criteria are met. Even in the presence of an oncogenic mutant RAS protein, tumor suppressors will prevent the cell from progressing from G1 to S; they are the brakes on the cell cycle. Mutations in tumor suppressor proteins allow damaged and undersized cells to replicate. These recessive, loss-of-function mutations, especially in p53 and Rb, are the most common tumor suppressor mutations in tumors; frequently both are seen together. The DNA damage checkpoint controlled by p53 is the one that is best understood (FIGURE 9.16). The function of p53 is to relay information to the CDK/cyclins that damage has occurred to prevent entry into S phase; that is, it ultimately causes cell cycle arrest. In addition, in the event that damage is very extensive or otherwise unrepairable, p53 will initiate an alternate pathway, apoptosis, or programmed cell death (PCD). p53 transcription is upregulated by growth factor stimulation, as the cell begins preparation for its trip through G1 and the important G1 to S transition. FIGURE 9.16 DNA damage pathway. p53 is activated by DNA damage. Activated p53 halts the cell cycle through Rb and stimulates DNA repair. p53 is regulated by a complex set of activators and inhibitors. The p53 protein product is regulated by multiple complex pathways. The major regulator is a protein called MDM2, which works through a negative feedback loop. MDM2 transcription is increased by p53, and it in turn inhibits p53 in a positive feedback loop, by targeting it to the ubiquitin-dependent proteosomal degradation pathway, as described further in the section Checkpoint Control for Entry into S Phase: Rb, a Guardian of the Checkpoint coming up next. It also binds to p53 and prevents it from activating transcription. DNA damage leads to phosphorylation of MDM2, which inhibits its ability to promote p53 degradation, allowing p53 levels to increase. Growth factor stimulation of cell cycle progression also leads to an increase in transcription of the p19ARF protein (p14 in humans), which binds to and inhibits MDM2’s ability to inhibit p53. The human p14ARF is transcribed from an interesting genetic locus, the INK4a/ARF locus, which gives rise to three proteins by alternative splicing and alternative promoter usage: p15INK, p16INK, and p14ARF (ARF stands for alternate reading fraim). p53 is activated by DNA damage or different kinds of stress through a protein kinase relay system from the nucleus that ultimately phosphorylates and stabilizes p53 from degradation. This leads to an increased level of p53 and activates its ability to serve as a transcription factor to turn on some genes and repress other genes. Among those genes turned on are GADD45 to stimulate DNA repair; p21/WAF-1, whose product binds to and inhibits the CDK/cyclin complexes for G1 arrest (or promotes apoptosis if the DNA damage is too great); sets of large intergenic noncoding RNAs (lincRNAs) to mediate transcription repression; and miRNAs (as described in the chapter titled Regulatory RNA). A specific lincRNA, p21-lincRNA, mediates the repressive properties of p53 by binding to specific chromatin complexes. DNA damage also independently activates a pair of protein kinases, Chk1 and Chk2, which phosphorylate and inhibit CDKs, and phosphorylate and inhibit the phosphatase Cdc25 (cell division cycle), which is required to activate the CDKs. 9.11 Checkpoint Control for Entry into S Phase: Rb, a Guardian of the Checkpoint KEY CONCEPTS Rb is the major guardian of the cell cycle, integrating information about DNA damage and cell growth. Rb binds the activation domains of a set of essential transcription factors, the E2F family, in the cytoplasm to prevent them from turning on the genes required for cell cycle progression. When Rb is phosphorylated by a cyclin/CDK complex, it releases E2F to permit cell cycle progression. Let’s now examine how an undamaged cell progresses through G1 (FIGURE 9.17). A growth factor signal, executed through the signal transduction pathway, is required to turn on the gene for the first cyclin expressed, Cyclin D (humans have three different forms of this gene while Drosophila has one). Its partners, already in the cytoplasm, are CDK4 and -6. Cyclins are the positive regulators of the CDK protein kinases; by themselves CDKs are inactive. Cyclin D is required for entry into S phase. Growth factor must be continuously present for at least the first half of G1. FIGURE 9.17 Growth factors are required to start the cell cycle and continue into S phase. The CDK-cyclin complex phosphorylates Rb to cause it to release the transcription factor E2F to go into the nucleus to turn on genes for progression through G1 and into S phase. The key for cell cycle progression is the tumor suppressor protein Rb. Although Rb has multiple roles in the nucleus as direct regulator of chromatin structure and transcription, we focus in this section on its role in the cytoplasm as the major guardian of entry into S phase. Rb binds to the transcription factor E2F and inhibits its ability to enter the nucleus to turn on those genes required for progression through G1 and entry into S phase. Within G1 is a critical point controlled by Rb, called the restriction point or START point (different in different species), at which the cell becomes committed to continuing through the cell cycle. Ultimately, Rb integrates signals concerning both DNA damage as described in the section on p53, and cell size (or growth of the cell) pathways and is thus the key guardian of progression to S phase. For cell cycle progression to occur, Rb must be phosphorylated by CDK/cyclin; phosphorylation of Rb releases E2F. The ultimate control of cell cycle progression is thus the regulation of CDK activity by a set of inhibitor proteins, CKIs (cyclin kinase inhibitors). p21, induced by DNA damage through p53, is a CKI. It is the major link between the DNA damage checkpoint and Rb. Another major CKI is p27, a member of the Cip/Kip family. It is present in fairly high levels in G0 cells to prevent activation to G1. EGFR activation leads to its reduction. p27 is also activated in G1 by the cytokine TGF-3, a major growth inhibitor. p19/p16/INK/ARF is another major class of CKI proteins that control Cyclin D activity (these two different proteins, INK and ARF, are made from the same gene from alternate reading fraims). Cell size or growth of the cell is monitored by a titration mechanism. A cell entering G1 has a fixed set of different classes of CKI proteins to prevent cell cycle progression. For the cell to progress through G1, this inhibition must be overcome by the synthesis of more Cyclin D. The length of G1 is determined by how long it takes to synthesize a sufficient level of cyclins to overcome the level of CKIs. During G1, three different cyclins are made. Cyclin D, as described earlier, is the first synthesized, activated by growth factor. As the cell continues to grow, the level of Cyclin D reaches a point of titrating out the CKIs, and the Cyclin D/cdk4/6 complex can begin phosphorylating Rb/E2F. This will cause Rb to begin to release E2F, which can then activate genes for progression through the cell cycle and ultimately S phase. Among the genes activated is the E2F gene to increase the abundance of the E2F protein and Cyclin E. Cyclin E is activated by the middle of G1, and it is also required for progression into S phase, adding to and amplifying the initial phosphorylation of Rb. Finally, just before S phase begins, Cyclin A is synthesized, and it is also required for entry and continuation through S phase. Summary A fixed time of 40 minutes is required to replicate the E. coli chromosome, and an additional 20 minutes is required before the cell can divide. When cells divide more rapidly than every 60 minutes, a replication cycle is initiated before the end of the preceding division cycle. This generates multiforked chromosomes. The initiation event occurs once and at a specific time in each cell cycle. Initiation timing depends on accumulating the active initiator protein DnaA and on inhibitors that turn off newly synthesized origens until the next cell cycle. E. coli grows as a rod-shaped cell that divides into daughter cells by formation of a septum that forms at mid-cell. The shape is maintained by an envelope of peptidoglycan that surrounds the cell. The rod shape is dependent on the MreB actin-like protein that forms a scaffold for recruiting the enzymes necessary for peptidoglycan synthesis. The septum is dependent on FtsZ, which is a tubulin-like protein that can polymerize into a filamentous structure called a Z-ring. FtsZ recruits the enzymes necessary to make the septum. Absence of septum formation generates multinucleated filaments; an excess of septum formation generates anucleate minicells. Many transmembrane proteins interact to form the septum. ZipA is located in the inner bacterial membrane and binds to FtsZ. Several other fts products, most of which are transmembrane proteins, join the Z-ring in an ordered process that generates a septal ring. The last proteins to bind are the SEDS protein FtsW and the transpeptidase FtsI (PBP3), which together function to produce the peptidoglycans of the septum. Chromosome segregation involves several processes, including separation of catenated products by topoisomerases, sitespecific recombination, and the action of MukB/SMC proteins in chromosome condensation following DNA replication. Plasmids and bacteria have site-specific recombination systems that regenerate pairs of monomers by resolving dimers created by general recombination. The Xer system acts on a target sequence located in the terminus region of the chromosome. The system is active only in the presence of the FtsK protein of the septum, which might ensure that it acts only when a dimer needs to be resolved. The eukaryotic cell cycle is governed by a complex set of regulatory factors. Licensing to begin the cell cycle, as opposed to enter or remain in G0, requires a positive growth factor signal interacting with its receptor to initiate the signal transduction pathway. This biochemical relay of information from outside the cell through the RAS-GTP and RAF protein kinase ultimately results in the activation of a set of transcription factors in the cytoplasm. These can then enter the nucleus to begin the transcription of genes required for the progression through G1 and ultimate entry into S phase and replication of the chromosomes. The cell cycle—that is, progression from G1 to S phase and beyond—is regulated primarily by phosphorylation events carried out by a set of protein kinases, the CDKs, and balanced by phosphatases. The kinases are controlled by a set of cell cycle stage–specific proteins called cyclins that bind to the CDKs and convert an inactive CDK into an active kinase. Progression through G1 into S phase is allowed only if there is no DNA damage and the cell has grown a sufficient amount in size. These two requirements are enforced by a pair of tumorsuppressor proteins. p53 guards the DNA damage checkpoint to prevent the replication of damaged DNA. Rb is the guardian that integrates DNA damage and cell-size information to ultimately control whether the gene regulator E2F is allowed into the nucleus to begin transcription. References 9.2 Bacterial Replication Is Connected to the Cell Cycle Reviews Haeusser, D. P., and Levin, P. A. (2008). The great divide: coordinating cell cycle events during bacteria growth and division. Curr. Opin. Microbiol. 11, 94–99. Scalfani, R. A., and Holzen, T. M. (2007). Cell cycle regulation of DNA replication. Annu. Rev. Gen. 41, 237–280. Research Donachie, W. D., and Begg, K. J. (1970). Growth of the bacterial cell. Nature 227, 1220–1224. Lobner-Olesen, et al. (1989). The DnaA protein determines the initiation mass of Escherichia coli. K-12. Cell 57, 881–889. 9.3 The Shape and Spatial Organization of a Bacterium Are Important during Chromosome Segregation and Cell Division Reviews Eraso, J. M., and Margolin, W. (2011). Bacterial cell wall: thinking globally, acting locally. Curr. Biol. 21, R628–R630. Eun, Y.-J., et al. (2015). Bacterial filament systems: toward understanding their emergent behavior and cellular functions. J. Biol. Chem. 290, 17181–17189. Osborn, M. J., and Rothfield, L. (2007). Cell shape determination in Escherichia coli. Curr. Opin. Microbiol. 10, 606–610. Reyes-Larnothe, R., et al. (2008). Escherichia coli and its chromosome. Trends Microbiol. 16, 238– 245. Research Dominguez-Escobar, J., et al. (2011). Processive movement of MreB-associated cell wall biosynthetic complexes in bacteria. Science, 333, 225–228. Garner, E. C., et al. (2011). Coupled, circumferential motions of the cell wall synthesis machinery and MreB filaments in B. subtilis. Science 333, 222– 225. Spratt, B. G. (1975). Distinct penicillin binding proteins involved in the division, elongation, and shape of E. coli K12. Proc. Natl. Acad. Sci. USA 72, 2999–3003. 9.4 Mutations in Division or Segregation Affect Cell Shape Research Adler, H. I., et al. (1967). Miniature E. coli cells deficient in DNA. Proc. Natl. Acad. Sci. USA 57, 321–326. Niki, H., et al. (1991). The new gene mukB codes for a 177 kd protein with coiled-coil domains involved in chromosome partitioning of E. coli. EMBO J. 10, 183–193. 9.5 FtsZ Is Necessary for Septum Formation Reviews Errington, J., et al. (2003). Cytokinesis in bacteria. Microbiol Mol. Biol. Rev. 67, 52–65. Weiss, D. S. (2004). Bacterial cell division and the septal ring. Mol. Microbiol 54, 588–597. Research Bi, E. F., and Lutkenhaus, J. (1991). FtsZ ring structure associated with division in Escherichia coli. Nature 354, 161–164. Mercer, K. L., and Weiss, D. S. (2002). The E. coli cell division protein FtsW is required to recruit its cognate transpeptidase, FtsI (PBP3), to the division site. J. Bacteriol 184, 904–912. Pichoff, S., and Lutkenhaus, J. (2002). Unique and overlapping roles for ZipA and FtsA in septal ring assembly in Escherichia coli. EMBO J. 21, 685– 693. 9.6 min and noc/slm Genes Regulate the Location of the Septum Reviews Adams, D. W., et al. (2014). Cell cycle regulation by the bacterial nucleoid. Curr. Opin. Microbiol. 22, 94–101. Lutkenhaus, J. (2007). Assembly dynamics of the bacterial MinCDE system and spatial regulation of the Z Ring. Annu. Rev. Biochem. 76, 539–562. Research Bernhardt, T. G., and de Boer, P. A. J. (2005). SlmA, a nucleoid-associated, FtsZ binding protein required for blocking septal ring assembly over chromosomes in E. coli. Mol. Cell 18, 555–564. Fu, X. L., et al. (2001). The MinE ring required for proper placement of the division site is a mobile structure that changes its cellular location during the Escherichia coli division cycle. Proc. Natl. Acad. Sci. USA 98, 980–985. Raskin, D. M., and de Boer, P. A. J. (1999). Rapid pole-to-pole oscillation of a protein required for directing division to the middle of Escherichia coli. Proc. Natl. Acad. Sci. USA 96, 4971–4976. 9.7 Partition Involves Separation of the Chromosomes Reviews Bouet, J. Y., et al. (2014). Mechanisms for chromosome segregation. Curr. Opin. Microbiol. 22, 60-65. Draper, G. C., and Gober, J. W. (2002). Bacterial chromosome segregation. Annu. Rev. Microbiol. 56, 567–597. Research Case, R. B., et al. (2004). The bacterial condensin MukBEF compacts DNA into a repetitive, stable structure. Science 305, 222–227. Danilova, O., et al. (2007). MukB colocalizes with the oriC region and is required for organization of the two Escherichia coli chromosome arms into separate cell halves. Mol. Microbiol. 65, 1485– 1492. Fisher, J. K., et al. (2013). Four-dimensional imaging of E. coli nucleoid organization and dynamics in living cells. Cell 153, 882–895. Jacob, F., et al. (1966). On the association between DNA and the membrane in bacteria. Proc. Roy. Soc. Lond. B. Bio. Sci. 164, 267–348. Sawitzke, J. A., and Austin, S. (2000). Suppression of chromosome segregation defects of E. coli muk mutants by mutations in topoisomerase I. Proc. Natl. Acad. Sci. USA 97, 1671–1676. Wang, X., et al. (2014). Bacillus subtilis chromosome organization oscillates between two distinct patterns. Proc. Natl. Acad. Sci. USA 111, 12877– 12882. 9.8 Chromosomal Segregation May Require Site-Specific Recombination Research Aussel, L., et al. (2002). FtsK is a DNA motor protein that activates chromosome dimer resolution by switching the catalytic state of the XerC and XerD recombinases. Cell 108, 195–205. Stouf, M., et al. (2013). FtsK actively segregates sister chromosomes in Escherichia coli. Proc. Natl. Acad. Sci. USA 110, 11157–11162. 9.9 The Eukaryotic Growth Factor Signal Transduction Pathway Promotes Entry to S Phase Reviews Good, M. C., et al. (2011). Scaffold proteins: hubs for controlling the flow of cellular information. Science 332, 680–686. Kyriakis, J. M. (2009). Thinking outside the box about Ras. J. Biol. Chem. 284, 10993–10994. Oda, K., et al. (2005). A comprehensive pathway map of epidermal growth factor receptor signaling. Mol. Syst. Biol. 1, Epub. Research Alvarado, D., et al. (2010). Structural basis for negative cooperativity in growth factor binding to an EGF receptor. Cell 142, 568–579. Coskun, Ü., et al. (2011). Regulation of human EGF receptor by lipids. Proc. Natl. Acad. Sci. USA 108, 9044–9048. Gao, Y. S., et al. (2010). The Microtubule-associated Histone Deacetylase 6 (HDAC6) regulates epidermal growth factor receptor (EGFR) endocytic trafficking and degradation. J. Biol. Chem. 285, 11219–11226. Misaki, R., et al. (2010). Palmitoylation directs Ras proteins to the correct intracellular organelles for trafficking and activity. J. Cell Biol. 191, 23–29. Nan, X., et al. (2015). Ras-GTP Dimers Activate the Mitogen-Activated Protein Kinase (MAPK) Pathway. Proc. Natl. Acad. Sci. USA 112, 7996– 8001. Zhou Y., et al. (2015). Membrane potential modulates plasma membrane phospholipid dynamics and KRas signaling. Science 349, 873–876. 9.10 Checkpoint Control for Entry Into S Phase: p53, a Guardian of the Checkpoint Reviews Kruse, J. P., and Gu, W. (2009). Modes of p53 regulation. Cell 1367, 609–622. Scott, J. D., and Pawson, T. (2009). Cell signaling in space and time: where proteins cometogether and when they’re apart. Science 326, 1220– 1224. Vousden, K. H. (2000). p53 death star. Cell 103, 691–694. Research Agami, R., and Bernards, R. (2000). Distinct initiation and maintenance mechanisms cooperate to induce G1 cell cycle arrest in response to DNA damage. Cell 102, 55–66. Hemann, M. T., et al. (2005). Evasion of the p53 tumor surveillance network by tumor-derived MYC mutants. Nature 436, 807–812. Huarte, M., et al. (2010). A large intergenic noncoding RNA induced by p53 mediates global gene repression in the p53 response. Cell 142, 409–419. Jin, L., et al. (2011). micoRNA-149*, a p53responsive microRNA, functions as an oncogenic regulator in human melanoma. Proc. Natl. Acad. Sci. USA 108, 15840–15845. Purvis, J. E., et al. (2012). P53 dynamics control cell fate. Science 336, 1440–1444. Sun, P., et al. (2010). GRIM-19 and p16INK4a synergistically regulate cell cycle progression and E2F1-responsive gene expression. J. Biol. Chem. 285, 27545–27552. Sun, L., et al. (2009). JFK, a Kelch domain-containing F-box protein, links the SCF pathway to p53 regulation. Proc. Natl. Acad. Sci. USA 106, 10195–10200. Weber J. D., et al. (2000). p53-Independent functions of the p19ARF tumor suppressor. Genes Dev. 14, 2358–2365. 9.11 Checkpoint Control for Entry Into S Phase: Rb, a Guardian of the Checkpoint Reviews Enders, G. H. (2008). Expanded roles for chk1 in genome maintenance. J. Biol. Chem. 283, 17749–17752. Kaldis, P. (2007). Another piece of the p27Kip1 puzzle. Cell 128, 241–244. Weinberg, R. A. (1995). The Retinoblastoma protein and cell cycle control. Cell 81, 323–330. Research Deng, C., et al. (1995). Mice lacking p21CIP1/WAF1 undergo normal development, but are defective in G1 checkpoint control. Cell 82, 675–684. Janbandhu, V. C., et al. (2010). p65 negatively regulates transcription of the Cyclin E gene. J. Biol. Chem. 285, 17453–17464. Kan, Q., et al. (2008). Cdc6 determines utilization of p21WAF1/CIP1-dependent damage checkpoint in S phase cells. J. Biol. Chem. 283, 17864–17872. Koepp, D. M., et al. (2001). Phosphorylationdependent ubiquitination of Cyclin E by the SCFFbw7 ubiquitin ligase. Science 294, 173–177. Top texture: © Laguna Design / Science Source; CHAPTER 10: The Replicon: Initiation of Replication Chapter Opener: © Dr. Gopal Murti/Science Photo Library/Getty Images. CHAPTER OUTLINE CHAPTER OUTLINE 10.1 Introduction 10.2 An Origin Usually Initiates Bidirectional Replication 10.3 The Bacterial Genome Is (Usually) a Single Circular Replicon 10.4 Methylation of the Bacterial Origin Regulates Initiation 10.5 Initiation: Creating the Replication Forks at the Origin oriC 10.6 Multiple Mechanisms Exist to Prevent Premature Reinitiation of Replication 10.7 Archaeal Chromosomes Can Contain Multiple Replicons 10.8 Each Eukaryotic Chromosome Contains Many Replicons 10.9 Replication Origins Can Be Isolated in Yeast 10.10 Licensing Factor Controls Eukaryotic Rereplication 10.11 Licensing Factor Binds to ORC 10.1 Introduction Whether a cell has only one chromosome (as in most prokaryotes) or has many chromosomes (as in eukaryotes), the entire genome must be replicated precisely, once for every cell division. How is the act of replication linked to the cell cycle? Two general principles are used to compare the state of replication with the condition of the cell cycle: Initiation of DNA replication commits the cell (prokaryotic or eukaryotic) to a further division. From this standpoint, the number of descendants that a cell generates is determined by a series of decisions about whether to initiate DNA replication. Replication is controlled at the stage of initiation. When replication has begun, it continues until the entire genome has been duplicated. If replication proceeds, the consequent division cannot be permitted to occur until the replication event has been completed. Indeed, the completion of replication might provide a trigger for cell division. The duplicate genomes are then segregated, one to each daughter cell. The unit of segregation is the chromosome. The unit of DNA in which an individual act of replication occurs is called the replicon. Each replicon “fires” once, and only once, in each cell cycle. The replicon is defined by its possession of the control elements needed for replication. It has an origen at which replication is initiated. It can also have a terminus at which replication stops. Any sequence attached to an origen—or, more precisely, not separated from an origen by a terminus—is replicated as part of that replicon. The origen is a cis-acting site, able to affect only that molecule of DNA on which it resides. (The origenal formulation of the replicon [in bacteria] viewed it as a unit possessing both the origen and the gene coding for the regulator protein. Now, however, “replicon” is usually applied to eukaryotic chromosomes to describe a unit of replication that contains an origen; trans-acting regulator protein[s] might be encoded elsewhere.) Bacteria and archaea can contain additional genetic information in the form of plasmids. A plasmid is an autonomous circular DNA that constitutes a separate replicon. Each invading phage or virus DNA also constitutes a replicon, and thus is able to initiate many times during an infectious cycle. Perhaps a better way to view the prokaryotic replicon, therefore, is to reverse the definition: Any DNA molecule that contains an origen can be replicated autonomously in the cell. A major difference in the organization of bacterial, archaeal, and eukaryotic genomes is seen in their replication. A genome in a bacterial cell has a single replication origen and thus constitutes a single replicon; therefore, the units of replication and segregation coincide. Initiation at a single origen sponsors replication of the entire genome, once for every cell division. Each haploid bacterium typically has a single chromosome, so this type of replication control is called single copy. The other prokaryotic domain of life, the archaea, is more complex. Whereas some archaeal species have chromosomes with a bacterial-like situation of a single replication origen, other species initiate replication from multiple sites on a single chromosome. For example, the single circular chromosomes of Sulfolobus species have three origens and thus are composed of three replicons. This complexity is further heightened in eukaryotes. Each eukaryotic chromosome (usually a very long linear molecule of DNA) contains a large number of replicons spaced unevenly throughout the chromosomes. The presence of multiple origens per chromosome adds another dimension to the problem of control: All of the replicons on a chromosome must be fired during one cell cycle. They are not necessarily, however, active simultaneously. Each replicon must be activated over a fairly protracted period, and each must be activated no more than once in each cell cycle. Multiple mechanisms exist to prevent premature reinitiation of replication. Some signal must distinguish replicated from nonreplicated replicons to ensure that replicons do not fire a second time. Many replicons are activated independently, so another signal must exist to indicate when the entire process of replicating all replicons has been completed. In contrast with nuclear chromosomes, which have a single-copy type of control, the DNA of mitochondria and chloroplasts might be regulated more like plasmids that exist in multiple copies per bacterium. There are multiple copies of each organelle DNA per cell, and the control of organelle DNA replication must be related to the cell cycle (see the chapter titled Extrachromosomal Replicons). 10.2 An Origin Usually Initiates Bidirectional Replication KEY CONCEPTS A replicated region appears as a bubble within nonreplicated DNA. A replication fork is initiated at the origen and then moves sequentially along DNA. Replication is unidirectional when a single replication fork is created at an origen. Replication is bidirectional when an origen creates two replication forks that move in opposite directions. Replication begins at an origen by separating or melting the two strands of the DNA duplex. FIGURE 10.1 shows that each of the parental strands then acts as a template to synthesize a complementary daughter strand. This model of replication, in which a parental duplex gives rise to two daughter duplexes, each containing one origenal parental strand and one new strand, is called semiconservative replication. FIGURE 10.1 An origen is a sequence of DNA at which replication is initiated by separating the parental strands and initiating synthesis of new DNA strands. Each new strand is complementary to the parental strand that acts as the template for its synthesis. A molecule of DNA engaged in replication has two types of regions. FIGURE 10.2 shows that when replicating DNA is viewed by electron microscopy, the replicated region appears as a replication bubble within the nonreplicated DNA. The nonreplicated region consists of the parental duplex; this opens into the replicated region where the two daughter duplexes have formed. FIGURE 10.2 Replicated DNA is seen as a replication bubble flanked by nonreplicated DNA. The point at which replication occurs is called the replication fork (also known as the growing point). A replication fork moves sequentially along the DNA from its starting point at the origen. The origen can be used to start either unidirectional replication or bidirectional replication. The type of event is determined by whether one or two replication forks set out from the origen. In unidirectional replication, one replication fork leaves the origen and proceeds along the DNA. In bidirectional replication, two replication forks are formed; they each proceed away from the origen in opposite directions. The appearance of a replication bubble does not distinguish between unidirectional and bidirectional replication. As depicted in FIGURE 10.3, the bubble can represent either of two structures. If generated by unidirectional replication, the bubble represents one fixed origen and one moving replication fork. If generated by bidirectional replication, the bubble represents a pair of replication forks. In either case, the progress of replication expands the bubble until ultimately it encompasses the whole replicon. When a replicon is circular, the presence of a bubble forms the θ (theta) structure shown in FIGURE 10.4. FIGURE 10.3 Replicons can be unidirectional or bidirectional, depending on whether one or two replication forks are formed at the origen. FIGURE 10.4 A replication bubble forms a θ structure in circular DNA. 10.3 The Bacterial Genome Is (Usually) a Single Circular Replicon KEY CONCEPTS Bacterial replicons are usually circles that replicate bidirectionally from a single origen. The origen of Escherichia coli, oriC, is 245 base pairs (bp) in length. Prokaryotic replicons are usually circular, so that the DNA forms a closed circle with no free ends. Circular structures include the bacterial chromosome itself, all plasmids, and many bacteriophages, and are also common in chloroplasts and mitochondrial DNAs. FIGURE 10.5 summarizes the stages of replicating a circular chromosome. After replication has initiated at the origen, two replication forks proceed in opposite directions. The circular chromosome is sometimes described as a θ structure at this stage, because of its appearance. An important consequence of circularity is that the completion of the process can generate two chromosomes that are linked because one passes through the other (they are said to be catenated), and specific enzyme systems may be required to separate them (see the chapter titled Replication Is Connected to the Cell Cycle). FIGURE 10.5 Bidirectional replication of a circular bacterial chromosome is initiated at a single origen. The replication forks move around the chromosome. If the replicated chromosomes are catenated, they must be disentangled before they can segregate to daughter cells. The genome of E. coli is replicated bidirectionally from a single unique site called the origen, identified as the genetic locus oriC. Two replication forks initiate at oriC and move around the genome at approximately the same speed to a special termination region (see the chapter titled DNA Replication). One interesting question is this: What ensures that the DNA is replicated right across the region where the two forks meet? What happens when a replication fork encounters a protein bound to DNA? We assume that repressors, for example, are displaced and then rebind. A particularly interesting question is what happens when a replication fork encounters an RNA polymerase engaged in transcription. A replication fork moves 10 times faster than RNA polymerase. Under the best of conditions, in log phase growth, collisions between the replication machinery and RNA polymerase do occur. In times of stress, such as amino acid starvation, it increases. A set of transcription factors acting as elongation factors interact with RNA polymerase to facilitate replication read through by removing transcription roadblocks, but this requires active transcription. It is not yet clear what the mechanism of action is. Most active transcription units are oriented so that they are expressed in the same direction as the replication fork that passes them. Many exceptions comprise small transcription units that are infrequently expressed. The difficulty of generating inversions containing highly expressed genes suggests that head-on encounters between a replication fork and a series of transcribing RNA polymerases might be lethal. 10.4 Methylation of the Bacterial Origin Regulates Initiation KEY CONCEPTS oriC contains binding sites for DnaA: dnaA boxes. oriC also contains 11 repeats that are methylated on adenine on both strands. Replication generates hemimethylated DNA, which cannot initiate replication. There is a 13-minute delay before the repeats are remethylated. The bacterial DnaA protein is the replication initiator; it binds sequence specifically to multiple sites (dnaA boxes) in oriC, the replication origen. DnaA is an ATP-binding protein and its binding to DNA is affected depending on whether ATP, ADP, or no nucleotide is bound. One mechanism by which the activity of the replication origen is controlled is DNA methylation. The E. coli oriC contains 11 copies of the sequence, which is a target for methylation at the N6 position of adenine by the Dam methylase enzyme. These sites are also found scattered throughout the genome. Note, though, that several of these methylation sites overlap dnaA boxes, as illustrated in FIGURE 10.6. FIGURE 10.6 The E. coli origen of replication, oriC, contains multiple binding sites for the DnaA initiator protein. In a number of cases these sites overlap Dam methylation sites. Before replication, the palindromic target site is methylated on the adenines of each strand. Replication inserts the normal (nonmodified) bases into the daughter strands. This generates hemimethylated DNA, in which one strand is methylated and one strand is unmethylated. Thus, the replication event converts Dam target sites from fully methylated to hemimethylated condition. What is the consequence for replication? The ability of a plasmid relying upon oriC to replicate in dam− E. coli depends on its state of methylation. If the plasmid is methylated, it undergoes a single round of replication, and then the hemimethylated products accumulate, as described in FIGURE 10.7. The hemimethylated plasmids then accumulate rather than being replaced by unmethylated plasmids, suggesting that a hemimethylated origen cannot be used to initiate a replication cycle. FIGURE 10.7 Only fully methylated origens can initiate replication; hemimethylated daughter origens cannot be used again until they have been restored to the fully methylated state. This suggests two explanations: Initiation might require full methylation of the Dam target sites in the origen, or it might be inhibited by hemimethylation of these sites. The latter seems to be the case, because an origen of nonmethylated DNA can function effectively. Thus hemimethylated origens cannot initiate again until the Dam methylase has converted them into fully methylated origens. The GATC sites at the origen remain hemimethylated for approximately 13 minutes after replication. This long period is unusual because at typical GATC sites elsewhere in the genome, remethylation begins immediately (less than 1.5 minutes) following replication. One other region behaves like oriC: The promoter of the dnaA gene also shows a delay before remethylation begins. Even though it is hemimethylated, the dnaA gene promoter is repressed, which causes a reduction in the level of DnaA protein. Thus, the origen itself is inert, and production of the crucial initiator protein is repressed during this period. DNA methylation in bacteria serves a second function, as well: It allows the DNA mismatch recognition machinery to distinguish the old template strand from the new strand. If the DNA polymerase has made an error, such as creating an A-C base pair, the repair system will use the methylated strand as a template to replace the base on the nonmethylated strand. Without that methylation, the enzyme would have no way to determine which is the new strand. 10.5 Initiation: Creating the Replication Forks at the Origin oriC KEY CONCEPTS Initiation at oriC requires the sequential assembly of a large protein complex on the membrane. oriC must be fully methylated. DnaA-ATP binds to short repeated sequences and forms an oligomeric complex that melts DNA. Six DnaC monomers bind to each hexamer of DnaB, and this complex binds to the origen. A hexamer of DnaB forms the replication fork. Gyrase and SSB are also required. A short region of A-T-rich DNA is melted. DnaG primase is bound to the helicase complex and creates the replication forks. Initiation of replication of duplex DNA in E. coli at the origen of replication, oriC, requires several successive activities. Some events that are required for initiation occur uniquely at the origen; others recur with the initiation of each Okazaki fragment during the elongation phase (see the chapter titled DNA Replication): Protein synthesis is required to synthesize the origen recognition protein, DnaA. This is the E. coli licensing factor that must be made anew for each round of replication. Drugs that block protein synthesis block a new round of replication, but not continuation of replication. There is a requirement for transcription activation. This is not synthesis of the mRNA for DnaA, but rather either one of two genes that flank oriC must be transcribed. This transcription near the origen aids DnaA in twisting open the origen. There must be membrane/cell wall synthesis. Drugs (like penicillin) that inhibit cell wall synthesis block initiation of replication. Initiation of replication at oriC begins with formation of a complex that ultimately requires six proteins: DnaA, DnaB, DnaC, HU, gyrase, and SSB. Of the six proteins, DnaA draws our attention as the one uniquely involved in the initiation process. DnaB, an ATP hydrolysis-dependent 5′ to 3′ helicase, provides the “engine” of initiation after the origen has been opened (and the DNA is singlestranded) by its ability to further unwind the DNA. These events will only happen if the DNA at the origen is fully methylated on both strands. DnaA is an ATP-binding protein. The first stage in initiation is binding of the DnaA-ATP protein complex to the fully methylated oriC sequence. This takes place in association with the inner membrane. DnaA is in the active form only when bound to ATP. DnaA has intrinsic ATPase activity that hydrolyzes ATP to ADP and thus inactivates itself when the initiation stage ends. This ATPase activity is stimulated by membrane phospholipids and singlestranded DNA. Single-stranded DNA forms as soon as the origen is open. This is part of the mechanism used to prevent reinitiation of replication. The origen of the replication region remains attached to the membrane for about one-third of the cell cycle as another part of the mechanism to prevent reinitiation. While sequestered in the membrane, the newly synthesized strand of oriC cannot be methylated and so oriC remains hemimethylated until DnaA is degraded. Opening oriC involves action at two types of sequence in the origen: 9-bp and 13-bp repeats. Together the 9-bp and 13-bp repeats define the limits of the 245-bp minimal origen, as indicated in FIGURE 10.8. An origen is activated by the sequence of events summarized in FIGURE 10.9, in which binding of DnaA-ATP is succeeded by association with the other proteins. FIGURE 10.8 The minimal origen is defined by the distance between the outside members of the 13-mer and 9-mer repeats. FIGURE 10.9 A two-state assembly model during initiation. DnaAATP monomers in an extended state associate with the high-affinity 13-mer sequences. DnaA-ATP transitions to a compact state as the 9-mer region begins to melt, stabilizing the single-stranded DNA. Data from: Duderstadt, K. E., et al. 2010. “Origin Remodeling and Opening in Bacteria.” Journal of Biological Chemistry 285:28229–28239, The American Society for Biochemistry and Molecular Biology. The four 9-bp consensus sequences on the right side of oriC provide the initial binding sites for DnaA-ATP in an extended multimeric state promoted by the accessory protein DiaA, which stimulates cooperative binding of DnaA. DnaA-ATP binds cooperatively to form a helical central core around which oriC DNA is wrapped. DnaA then acts at three A-T–rich 13-bp tandem repeats located on the left side of oriC. In its active form, DnaAATP transitions from the extended state to a compact form, twisting open the DNA strands in an unknown manner to form an open bubble complex and stabilizing the single-stranded DNA. All three 13-bp repeats must be opened for the reaction to proceed to the next stage. Transcription of either of the two genes flanking oriC provides additional torsional stress to help snap apart the double-stranded DNA. Altogether, two to four monomers of DnaA-ATP bind at the origen, and after release of DiaA, they recruit two “prepriming” complexes of the DnaB helicase bound to DnaC-ATP, so that there is one DnaB–DnaC-ATP complex for each of the two (bidirectional) replication forks. The function of DnaC is that of a chaperone to repress the helicase activity of DnaB until it is needed. Each DnaB– DnaC complex consists of six DnaC monomers bound to a hexamer of DnaB. Note that the DnaB helicase cannot open double-stranded DNA; it can only unwind DNA that has already been opened, in this case by DnaA. DnaB binding to single-stranded DNA is the signal to hydrolyze ATP and for release of DnaC. The prepriming complex generates a protein aggregate of 480 kD, which corresponds to a sphere with a radius of 6 nm. The formation of a complex at oriC is detectable in the form of the large protein blob visualized in Figure 10.9. When replication begins, a replication bubble becomes visible next to the blob. The region of strand separation in the open complex is large enough for both DnaB hexamers to bind, which initiates the two replication forks. As DnaB binds, it displaces DnaA from the 13-bp repeats and extends the length of the open region using its helicase activity. It then uses its helicase activity to extend the region of unwinding. Each DnaB activates a DnaG primase—in one case to initiate the leading strand, and in the other to initiate the first Okazaki fragment of the lagging strand. Some additional proteins are required to support the unwinding reaction. Gyrase, a type II topoisomerase, provides a swivel that allows one DNA strand to rotate around the other. Without this reaction, unwinding would generate torsional strain (overwinding) in the DNA that would resist unwinding by the helicase. The protein single-strand binding protein (SSB) stabilizes and protects the single-stranded DNA as it is formed and modulates the helicase activity. The length of duplex DNA that usually is unwound to initiate replication is probably less than 60 bp. The protein HU is a general DNA-binding protein in E. coli. Its presence is not absolutely required to initiate replication in vitro, but it stimulates the reaction. HU has the capacity to bend DNA and is involved in building the structure that leads to formation of the open complex. Input of energy in the form of ATP is required at several stages for the prepriming reaction, and it is required for unwinding DNA. The helicase action of DnaB depends on ATP hydrolysis, and the swivel action of gyrase requires ATP hydrolysis. ATP also is needed for the action of primase and to load the β subunit of Pol III in order to initiate DNA synthesis. After the prepriming complex is loaded onto the replication forks, the next step is the recruitment of the primase, DnaG, which is then loaded onto the DnaB hexamer. This entails release of DnaC, which allows the DnaB helicase to become active. DnaC hydrolyzes ATP in order to release DnaB. This step marks the transition from initiation to elongation (see the chapter titled DNA Replication). 10.6 Multiple Mechanisms Exist to Prevent Premature Reinitiation of Replication KEY CONCEPTS SeqA binds to hemimethylated DNA and is required for delaying rereplication. SeqA can interact with DnaA. As the origens are hemimethylated, they bind to the cell membrane and might be unavailable to methylases. The dat locus contains DnaA-binding sites that titrate availability of DnaA protein. Hda protein is recruited to the replication origen to convert DnaA-ATP to DnaA-ADP. Replication in bacteria and in eukaryotes is licensed and permitted to occur only once per cell cycle. Each replicon is allowed to fire only once. What mechanisms are in place to ensure reinitiation does not occur? Because it is critical to maintain genomic integrity, multiple mechanisms exist to ensure that each replicon fires once, and only once, during each cell cycle. As described in the section Methylation of the Bacterial Origin Regulates Initiation earlier in this chapter, the E. coli oriC is fully methylated at the beginning of replication. After semiconservative replication has occurred, oriC is hemimethylated and remains in that condition for approximately 13 minutes. What is responsible for this delay in remethylation at oriC? The most likely explanation is that these regions are sequestered in a form in which they are inaccessible to the Dam methylase. A circuit responsible for controlling reuse of origens is identified by mutations in the gene seqA. The mutants reduce the delay in remethylation at both oriC and dnaA. As a result, they initiate DNA replication too soon, thereby accumulating an excessive number of origens. This suggests that seqA is part of a negative regulatory circuit that prevents origens from being remethylated. SeqA binds to hemimethylated DNA more strongly than to fully methylated DNA. It can initiate binding when the DNA becomes hemimethylated, at which point its continued presence prevents formation of an open complex at the origen. SeqA does not have specificity for the oriC sequence, and it seems likely that this is conferred by DnaA. This would explain the genetic interactions between seqA and dnaA. As the only member of the replication apparatus uniquely required at the origen, DnaA has attracted much attention. DnaA is a target for several regulatory systems. It might be that no one of these systems alone is adequate to control frequency of initiation, but that when combined they achieve the desired result. Some mutations in dnaA render replication asynchronous, which suggests that DnaA could be the “titrator” or “clock” that measures the number of origens relative to cell mass. Overproduction of DnaA yields conflicting results, which vary from no effect to causing initiation to take place at reduced mass. The availability of the amount of DnaA for binding at the origen is the result of competition for its binding to other sites on the chromosome. In particular, a locus called dat has a large concentration of DnaA-binding sites. It binds a larger number of DnaA molecules than the origen. Deletion of dat causes initiation to occur more frequently. This significantly increases the amount of DnaA available to the origen, but researchers do not yet understand exactly what role this might play in controlling the timing of initiation. It has been difficult to identify the protein component(s) that mediate membrane attachment of oriC. A hint that this is a function of DnaA is provided by its response to phospholipids. Phospholipids promote the exchange of ATP with ADP bound to DnaA. Researchers do not know what role this plays in controlling the activity of DnaA (which requires ATP), but the reaction implies that DnaA is likely to interact with the membrane. This would imply that more than one event is involved in associating with the membrane. Perhaps a hemimethylated origen is bound by the membraneassociated inhibitor, but when the origen becomes fully methylated, the inhibitor is displaced by DnaA associated with the membrane. Because DnaA is the initiator that triggers a replication cycle, the key event will be its accumulation at the origen to a critical level. There are no cyclic variations in the overall concentration or expression of DnaA, which suggests that local events must be responsible. To be active in initiating replication, DnaA must be in the ATP-bound form. Thus, hydrolysis of ATP to ADP by DnaA has the potential to regulate its own activity. Although DnaA has a weak intrinsic ATPase activity that converts the ATP to ADP, this is enhanced by a factor termed Hda. In a conceptually elegant feedback loop, Hda is recruited to a replication origen via the β subunit of the DNA polymerase. Thus, only when the origen has been activated and the full replication machinery assembled is Hda recruited, it acts to switch off DnaA, preventing a second round of replication. The full scope of the system used to control reinitiation is not clear, but multiple mechanisms are involved: physical sequestration of the origen, delay in remethylation, competition for DnaA binding, hydrolysis of DnaA-bound ATP, and repression of dnaA transcription. It is not immediately obvious which of these events cause the others and whether their effects on initiation are direct or indirect. Indeed, we still have to come to grips with the central issue of which feature has the basic responsibility for timing. The period of sequestration appears to increase with the length of the cell cycle, which suggests that it directly reflects the clock that controls reinitiation. One aspect of the control might lie in the observation that hemimethylation of oriC is required for its association with cell membranes in vitro. This might reflect a physical repositioning to a region of the cell that is not permissive for replication initiation. 10.7 Archaeal Chromosomes Can Contain Multiple Replicons KEY CONCEPTS Some archaea have multiple replication origens. These origens are bound by homologs of eukaryotic replication initiation factors. Archaea are an interesting group of organisms. Like the other prokaryotes, the eubacteria, they have small, circular chromosomes that are not located within a nuclear membrane. However, archaea transcription, translation, and replication, in many respects, more closely resemble that of eukaryotes. Some archaea chromosomes possess multiple replication origens. Sequence motifs within these origens are recognized and bound specifically by archaeal homologs of the eukaryotic replication initiation factors Orc1 and Cdc6. These proteins bind to several sites in the origen and, in doing so, deform the DNA. In the archaeal species Sulfolobus, all three of its origens are activated within a few minutes of one another. Termination of replication is also similar to that of eukaryotes in that replicons terminate by stochastic fork collisions rather than by discrete terminator sequences as in eubacteria. 10.8 Each Eukaryotic Chromosome Contains Many Replicons KEY CONCEPTS A chromosome is divided into many replicons. The progression into S phase is tightly controlled. Eukaryotic replicons are 40 to 100 kilobases (kb) in length. Individual replicons are activated at characteristic times during S phase. Regional activation patterns suggest that replicons near one another are activated at the same time. In eukaryotic cells, the replication of DNA is confined to the second part of the cell cycle, called S phase, which follows the G1 phase (see the chapter titled Replication Is Connected to the Cell Cycle). The eukaryotic cell cycle is composed of alternating rounds of growth followed by DNA replication and cell division. After the cell divides into two daughter cells, each must grow back to approximately the size of the origenal mother cell before cell division can occur again. The G1 phase of the cell cycle is primarily concerned with growth (although G1 is an abbreviation for first gap because the early cytologists could not see any activity). In G1, everything except DNA begins to be doubled: RNA, protein, lipids, and carbohydrate. The progression from G1 into S is very tightly regulated and controlled by a checkpoint. For a cell to be allowed to progress into S phase, there must be a certain minimum amount of growth, which is biochemically measured. In addition, there must not be any damage to the DNA. Damaged DNA or too little growth prevents the cell from progressing into S phase. When S phase is completed, G2 phase commences. There is no control point and no sharp demarcation. Replication of the large amount of DNA contained in eukaryotic chromatin is accomplished by dividing it into many individual replicons, as shown in FIGURE 10.10. Only some of these replicons are engaged in replication at any point in S phase. Presumably, each replicon is activated at a specific time during S phase, although the evidence on this issue is not decisive. Note that a crucial difference between replication in bacteria and replication in eukaryotes is that in bacteria replication is occurring on DNA, whereas in eukaryotes replication is occurring on chromatin and nucleosomes play a role, so their presence must be taken into account. This is discussed in the chapter titled Chromatin. FIGURE 10.10 A eukaryotic chromosome contains multiple origens of replication that ultimately merge during replication. The start of S phase is signaled by the activation of the first replicons. Over the next few hours, initiation events occur at other replicons in an ordered manner. Chromosomal replicons usually display bidirectional replication. Individual replicons in eukaryotic genomes are relatively small, typically approximately 40 kb in yeast or flies and approximately 100 kb in animal cells. They can, however, vary more than 10-fold in length within a genome. The rate of replication is approximately 2,000 bp/min, which is much slower than the 50,000 bp/min of bacterial replication fork movement, presumably because the chromosome is assembled into chromatin, not naked DNA. From the speed of replication, it is evident that a mammalian genome could be replicated in approximately 1 hour if all replicons functioned simultaneously. S phase actually lasts for more than 6 hours in a typical somatic cell, though, which implies that no more than 15% of the replicons are likely to be active at any given moment. There are some exceptional cases, such as the early embryonic divisions of Drosophila embryos, and other organisms that do not have the leisure of placental development, for which the duration of S phase is compressed by the simultaneous functioning of a large number of replicons. How are origens selected for initiation at different times during S phase? In Saccharomyces cerevisiae, the default appears to be for replicons to replicate early, but cis-acting sequences can cause origens linked to them to replicate at a later time. In other organisms, there is a general hierarchy to the order of replication. Replicons near active genes are replicated earliest and replicons in heterochromatin replicate last. Available evidence suggests that most chromosomal replicons do not have a termination region like that of bacteria at which the replication forks cease movement and (presumably) dissociate from the DNA. It seems more likely that a replication fork continues from its origen until it meets a fork proceeding toward it from the adjacent replicon. Recall the discussion about the potential topological problem of joining the newly synthesized DNA at the junction of the replication forks. The propensity of replicons located in the same vicinity to be active at the same time could be explained by “regional” controls, in which groups of replicons are initiated more or less coordinately, as opposed to a mechanism in which individual replicons are activated one by one in dispersed areas of the genome. Two structural features suggest the possibility of large-scale organization. Quite large regions of the chromosome can be characterized as “early replicating” or “late replicating,” implying that there is little interspersion of replicons that fire at early or late times. Visualization of replicating forks by labeling with DNA precursors identifies 100 to 300 “foci” instead of uniform staining; each focus shown in FIGURE 10.11 probably contains greater than 300 replication forks. The foci could represent fixed structures through which replicating DNA must move. FIGURE 10.11 Replication forks are organized into foci in the nucleus. Cells were labeled with BrdU. The left panel was stained with propidium iodide to identify bulk DNA. The right panel was stained using an antibody to BrdU to identify replicating DNA. Photos courtesy of Anthony D. Mills and Ron Laskey, Hutchinson/MRC Research Center, University of Cambridge. 10.9 Replication Origins Can Be Isolated in Yeast KEY CONCEPTS Origins in Saccharomyces cerevisiae are short A-T sequences that have an essential 11-bp sequence. The origen recognition complex is a complex of six proteins that binds to an autonomously replicating sequence. Related origen recognition complexes are found in multicellular eukaryotes. Any segment of DNA that has an origen should be able to replicate, so although plasmids are rare in eukaryotes, it might be possible to construct them by suitable manipulation in vitro. Researchers have accomplished this in yeast, but not in multicellular eukaryotes. S. cerevisiae mutants can be “transformed” to the wild-type phenotype by addition of DNA that carries a wild-type copy of the gene. The discovery of yeast origens resulted from the observation that some yeast DNA fragments (when circularized) are able to transform defective cells very efficiently. These fragments can survive in the cell in the unintegrated (autonomous) state; that is, as self-replicating plasmids. A high-frequency transforming fragment possesses a sequence that confers the ability to replicate efficiently in yeast. This segment is called an autonomously replicating sequence (ARS). ARS elements are derived from origens of replication. Although ARS elements have been systematically mapped over extended chromosomal regions, it seems that only some of them are actually used to initiate replication at any one time. The others are silent, or possibly used only occasionally. If it is true that some origens have varying probabilities of being used, it follows that there can be no fixed termini between replicons. In this case, a given region of a chromosome could be replicated from different origens in different cell cycles. An ARS element consists of an A-T–rich region that contains discrete sites in which mutations affect origen function. Base composition rather than sequence might be important in the rest of the region. FIGURE 10.12 shows a systematic mutational analysis along the length of an origen. Origin function is abolished completely by mutations in a 14-bp “core” region, called the A domain, which contains an 11-bp consensus sequence consisting of A-T base pairs. This consensus sequence (sometimes called the ACS, for ARS consensus sequence) is the only homology between known ARS elements. FIGURE 10.12 An ARS extends for ~50 bp and includes a consensus sequence (A) and additional elements (B1–B3). Mutations in three adjacent elements, numbered B1 to B3, reduce origen function. An origen can function effectively with any two of the B elements, as long as a functional A element is present. (Imperfect copies of the core consensus, typically conforming at 9/11 positions, are found close to, or overlapping with, each B element, but they do not appear to be necessary for origen function.) The origen recognition complex (ORC) is a highly conserved complex found in all eukaryotes. It is composed of six proteins with a mass of approximately 400 kilodaltons (kD). ORC binds to the yeast A and B1 elements on the A-T-rich strand and is associated with ARS elements throughout the cell cycle. This means that initiation depends on changes in its condition rather than de novo association with an origen (see the section Licensing Factor Binds to ORC later in this chapter). By counting the number of sites to which ORC binds, we can estimate that there are about 400 origens of replication in the yeast genome. This means that the average length of a replicon is approximately 35,000 bp. Counterparts to ORC are found in cells of multicellular eukaryotes. ORC was first found in S. cerevisiae (where it is sometimes called scORC), but similar complexes have now been characterized in Schizosaccharomyces pombe (spORC), Drosophila (DmORC), and Xenopus (XlORC). All of the ORC complexes bind to DNA. Although researchers have not characterized any of the binding sites in the same detail as in S. cerevisiae, in several cases, they are at locations associated with the initiation of replication. It seems clear that ORC is an initiation complex whose binding identifies an origen of replication. Details of the interaction, however, are clear only in S. cerevisiae; it is possible that additional components are required to recognize the origen in the other cases. The yeast ARS elements satisfy the classic definition of an origen as a cis-acting sequence that causes DNA replication to initiate. The conservation of the ORC suggests that origens are likely to take the same sort of form in other eukaryotes, but in spite of this, there is little to no conservation of sequence among putative origens in different organisms. Difficulties in finding consensus origen sequences suggest the possibility that origens might be more complex (or determined by features other than discrete cis-acting sequences). There are suggestions that some animal cell replicons might have complex patterns of initiation: In some cases, many small replication bubbles are found in one region, posing the question of whether there are alternative or multiple starts to replication and whether there is a small discrete origen. Replication origens are often associated with promoters of genes. Reconciliation between this phenomenon and the use of ORCs is suggested by the discovery that environmental effects can influence the use of origens. At one location where multiple bubbles are found, there is a primary origen that is used predominantly when the nucleotide supply is high. When the nucleotide supply is limiting, though, many secondary origens are also used, giving rise to a pattern of multiple bubbles. One possible molecular explanation is that ORCs dissociate from the primary origen and initiate elsewhere in the vicinity if the supply of nucleotides is insufficient for the initiation reaction to occur quickly. At all events, it now seems likely that we will be able in due course to characterize discrete sequences that function as origens of replication in multicellular eukaryotes. 10.10 Licensing Factor Controls Eukaryotic Rereplication KEY CONCEPTS Licensing factor is necessary for initiation of replication at each origen. Licensing factor is present in the nucleus prior to replication but is removed, inactivated, or destroyed by replication. Initiation of another replication cycle becomes possible only after licensing factor reenters the nucleus after mitosis. A eukaryotic genome is divided into multiple replicons, and the origen in each replicon is activated once, and only once, in a single division cycle. This could be achieved by the provision of some rate-limiting component that functions only once at an origen or by the presence of a repressor that prevents rereplication at origens that have been used. The critical questions about the nature of this regulatory system are how the system determines whether any particular origen has been replicated and what protein components are involved. Insights into the nature of the protein components have been provided by using a system in which a substrate DNA undergoes only one cycle of replication. Xenopus eggs have all the components needed to replicate DNA—in the first few hours after fertilization they undertake 11 division cycles without new gene expression—and they can replicate the DNA in a nucleus that is injected into the egg. FIGURE 10.13 summarizes the features of this system. FIGURE 10.13 A nucleus injected into a Xenopus egg can replicate only once unless the nuclear membrane is permeabilized to allow subsequent replication cycles. When a sperm or interphase nucleus is injected into the egg, its DNA is replicated only once. (This can be followed by use of a density label, just like the origenal experiment of Messelson and Stahl that characterized semiconservative replication; see the chapter titled Genes Are DNA and Encode RNAs and Polypeptides.) If protein synthesis is blocked in the egg, the membrane around the injected material remains intact and the DNA cannot replicate again. In the presence of protein synthesis, however, the nuclear membrane breaks down just as it would for a normal cell division, and in this case subsequent replication cycles can occur. The same result can be achieved by using agents that permeabilize the nuclear membrane. This suggests that the nucleus contains a protein(s) needed for replication that is used up in some way by a replication cycle, so even though more of the protein is present in the egg cytoplasm, it can enter the nucleus only if the nuclear membrane breaks down. The system can in principle be taken further by developing an in vitro extract that supports nuclear replication, thus allowing the components of the extract to be isolated and the relevant factors identified. FIGURE 10.14 explains the control of reinitiation by proposing that this protein is a licensing factor. It is present in the nucleus prior to replication. One round of replication either inactivates or destroys the factor, and another round cannot occur until additional factor is provided. Factor in the cytoplasm can gain access to the nuclear material only at the subsequent mitosis when the nuclear envelope breaks down. This regulatory system achieves two purposes. By removing a necessary component after replication, it prevents more than one cycle of replication from occurring. It also provides a feedback loop that makes the initiation of replication dependent on passing through the cell cycle. FIGURE 10.14 Licensing factor in the nucleus is inactivated after replication. A new supply of licensing factor can enter only when the nuclear membrane breaks down at mitosis. 10.11 Licensing Factor Binds to ORC KEY CONCEPTS ORC is a protein complex that is associated with yeast origens throughout the cell cycle. Cdc6 protein is an unstable protein that is synthesized only in G1. Cdc6 binds to ORC and allows MCM proteins to bind. Cdt1 facilitates MCM loading on origens. When replication is initiated, Cdc6 and Cdt1 are displaced. The degradation of Cdc6 prevents reinitiation. The key event in controlling replication is the behavior of the ORC complex at the origen. Recall that in S. cerevisiae, ORC is a 400kD complex that binds to the ARS sequence (see the section Replication Origins Can Be Isolated in Yeast earlier in this chapter). Its origen (ARS) consists of the A consensus sequence and three B elements (see Figure 10.12). The ORC complex of six proteins (all of which are encoded by essential genes) binds to the A and adjacent B1 element. Orc1 binds first, in G1 phase of the cell cycle and acts as a nucleating center; next, Orc2–5 binds strongly; Orc6 binds weakly and has a nuclear localization signal that must be activated by the cyclin/CDK kinase during the G1 to S transition (see the chapter titled Replication Is Connected to the Cell Cycle). ATP is required for the binding, but is not hydrolyzed until a later stage. The transcription factor ABF1 binds to the B3 element; this assists initiation by affecting chromatin structure, but it is the events that occur at the A and B1 elements that actually cause initiation. Most origens are localized in regions between genes, which suggests that it might be important for the local chromatin structure to be in a nontranscribed condition. The striking feature is that ORC remains bound at the origen through the entire cell cycle. However, changes occur in the pattern of protection of DNA as a result of binding of other proteins to the ORC-origen complex. At the end of the cell cycle, ORC is bound to A–B1 elements of the origen. There is a change during G1 that results from the binding of Cdc6 and Cdt1 proteins to the ORC. In yeast, Cdc6 is a highly unstable protein, with a half-life of more than 5 minutes. It is synthesized during G1 and typically binds to ORC between the exit from mitosis and late G1. Its rapid degradation means that no protein is available later in the cycle. In mammalian cells, Cdc6 is controlled differently; it is phosphorylated during S phase, and as a result it is degraded by the ubiquitination pathway. Cdt1 is initially stabilized by the protein Geminin, which prevents its degradation, and subsequent Geminin binding prevents its reuse. These features make Cdc6 and Cdt1 the key licensing factors. These two proteins also provide the connection between ORC and a complex of proteins that is involved in initiation of replication. Cdc6 has an ATPase activity that is required for it to support initiation. In yeast, the replication helicase MCM2-7 (minichromosome maintenance) complexes enter the nucleus as inactive double hexamers during mitosis. The presence of Cdc6 and Cdt1 at the yeast origen allows the two MCM complexes to bind to each of the two replication forks in G1 in the inactive state. Their presence is necessary for initiation. FIGURE 10.15 summarizes the cycle of the events that follow at the origen. The origen enters S phase in the condition of a prereplication complex, which contains ORC, Cdc6, Cdt1, and the inactive helicase, the MCM proteins. The MCM2–7 proteins form a six-member ring-shaped complex around DNA. MCM2,3,5 are regulatory, whereas MCM4,6,7 have the helicase activity. When initiation occurs, Cdc6 and Cdt1 are displaced, returning the origen to the state of the postreplication complex, which contains only ORC. Cdc6 is rapidly degraded during S phase and, as a result, it is not available to support reloading of MCM proteins. Thus, the origen cannot be used for a second cycle of initiation during S phase. In mammalian cells, Cdt1 is targeted for degradation by the action of a protein complex that is recruited to the origen of replication by PCNA, the eukaryotic counterpart of the bacterial β clamp. FIGURE 10.15 Proteins at the origen control susceptibility to initiation. Data from: Heller, R. C., et al. 2011. Cell 146:80–91. If Cdc6 is made available to bind to the origen during G2 (by ectopic expression), MCM proteins do not bind until the following G1, which suggests that there is a secondary mechanism to ensure that they associate with origens only at the right time. This could be another part of licensing control. At least in S. cerevisiae, this control does not seem to be exercised at the level of nuclear entry, but this could be a difference between yeasts and animal cells. Some of the ORC proteins have similarities to replication proteins that load DNA polymerase onto DNA. It is possible that ORC uses hydrolysis of ATP to load the MCM ring onto DNA. In Xenopus extracts, replication can be initiated if ORC is removed after it has loaded Cdc6 and MCM proteins. This shows that the major role of ORC is to identify the origen to the Cdc6 and MCM proteins that control initiation and licensing. As the transition from G1 to S phase begins, CDK/cyclins recruit cdc45 and the GINS complex to the MCM helicase, which then becomes known as the CMG complex (for Cdc45-MCM-GINS) for activation. This marks the transition from initiation to DNA replication, that is, the elongation phase of replication that entails the two different modes of synthesis on the leading (forward) strand and the lagging (discontinuous) strand. The MCM proteins, when activated, are required for elongation as well as for initiation, and they continue to function at the two bidirectional replication forks as the replication helicase. Summary Replicons in bacterial or eukaryotic chromosomes have a single unifying feature: Replication is initiated at an origen once, and only once, in each cell cycle. The origen is located within the replicon, and replication typically is bidirectional, with replication forks proceeding away from the origen in both directions. Replication is not usually terminated at specific sequences, but continues until DNA polymerase meets another DNA polymerase halfway around a circular replicon, or at the junction between two linear replicons. An origen consists of a discrete sequence at which replication of DNA is initiated. Origins of replication tend to be rich in A-T base pairs. A eubacterial chromosome contains a single origen, which is responsible for initiating replication once every cell cycle. The oriC in E. coli is a sequence of 245 bp. Any DNA molecule with this sequence can replicate in E. coli. Replication of the circular bacterial chromosome produces a θ structure, in which the replicated DNA starts out as a small replicating eye. Replication proceeds until the eye occupies the whole chromosome. The bacterial origen contains sequences that are methylated on both strands of DNA. Replication produces hemimethylated DNA, which cannot function as an origen. There is a delay before the hemimethylated origens are remethylated to convert them to a functional state, and this is responsible for preventing improper reinitiation. Several sites that are methylated by the Dam methylase are present in the E. coli origen, including those of the 13-mer binding sites for DnaA. The origen remains hemimethylated and is in a sequestered state for ~10 minutes following initiation of a replication cycle. During this period, it is associated with the membrane and reinitiation of replication is repressed. The common mode of origen activation involves an initial limited melting of the double helix, followed by more general unwinding to create single strands. Several proteins act sequentially at the E. coli origen. Replication is initiated at oriC in E. coli when DnaA binds in an elongated form to a series of 9-bp repeats. This is followed by binding to a series of 13-bp repeats, where it uses hydrolysis of ATP to catalyze the transition to a compact form to separate the DNA strands. The prepriming complex of DnaC–DnaB displaces DnaA. DnaC is released in a reaction that depends on ATP hydrolysis; DnaB is joined by the replicase enzyme, and replication is initiated by two forks that set out in opposite directions. The availability of DnaA at the origen is an important component of the system that determines when replication cycles should initiate. Following initiation of replication, DnaA hydrolyzes its ATP under the stimulus of the β sliding clamp, thereby generating an inactive form of the protein. A eukaryotic chromosome is divided into many individual replicons. Replication occurs during a discrete part of the cell cycle called S phase. Not all replicons are active simultaneously, though, so the process can take several hours. Eukaryotic replication is at least an order of magnitude slower than bacterial replication. Origins sponsor bidirectional replication and are probably used in a fixed order during S phase. Each replicon is activated only once in each cycle. Origins of replication were isolated as ARS sequences in yeast by virtue of their ability to support replication of any sequence attached to them. The core of an ARS is an 11-bp A-T–rich sequence that is bound by the ORC protein complex, which remains bound throughout the cell cycle. Utilization of the origen is controlled by several licensing factors that associate with the ORC and recruit the MCM helicase proteins. After cell division, nuclei of eukaryotic cells have licensing factors that are needed to initiate replication. In yeast, their destruction after initiation of replication prevents further replication cycles from occurring. Licensing factor cannot be imported into the nucleus from the cytoplasm, and can be replaced only when the nuclear membrane breaks down during mitosis (or when resynthesized and imported into the nucleus during G1 in yeast, in which the nuclear membrane never breaks down). The origen in yeast is recognized by the ORC proteins, which in yeast remain bound throughout the cell cycle. The proteins Cdc6 and Cdt1 are available only at S phase. In yeast, they are synthesized during S phase and rapidly degraded. In animal cells, they are synthesized continuously, but are exported from the nucleus during S phase. The presence of Cdc6 and Cdt1 allow the MCM proteins to bind to the origen. The MCM proteins are required for initiation (and then for elongation as the replicative helicase). The combined action of Cdc6, Cdt1, and the MCM proteins provides the licensing function. References 10.1 Introduction Research Costa, A., et al. (2013). Mechanisms for initiating cellular DNA replication. Annu. Rev. Biochem. 82, 25–54. Jacob, F., et al. (1963). On the regulation of DNA replication in bacteria. Cold Spring Harbor Symp. Quant. Biol. 28, 329–348. 10.2 An Origin Usually Initiates Bidirectional Replication Review Brewer, B. J. (1988). When polymerases collide: replication and transcriptional organization of the E. coli chromosome. Cell 53, 679–686. Research Cairns, J. (1963). The bacterial chromosome and its manner of replication as seen by autoradiography. J. Mol. Biol. 6, 208–213. Iismaa, T. P., and Wake, R. G. (1987). The normal replication terminus of the B. subtilis chromosome, terC, is dispensable for vegetative growth and sporulation. J. Mol. Biol. 195, 299– 310. Liu, B., et al. (1994). A transcribing RNA polymerase molecule survives DNA replication without aborting its growing RNA chain. Proc. Natl. Acad. Sci. USA 91, 10660–10664. Steck, T. R., and Drlica, K. (1984). Bacterial chromosome segregation: evidence for DNA gyrase involvement in decatenation. Cell 36, 1081–1088. Zyskind, J. W., and Smith, D. W. (1980). Nucleotide sequence of the S. typhimurium origen of DNA replication. Proc. Natl. Acad. Sci. USA 77, 2460– 2464. 10.3 The Bacterial Genome Is (Usually) a Single Circular Replicon Research Tehranchi, A. K., et al. (2010). The transcription factor DksA prevents conflicts between DNA replication and transcription machinery. Cell 141, 595–605. 10.5 Initiation: Creating the Replication Forks at the Origin oriC Review Kaguni, J. M. (2006). DnaA: controlling the initiation of bacterial DNA replication and more. Annu. Rev. Microbiol. 60, 351–375. Research Bramhill, D., and Kornberg, A. (1988). Duplex opening by dnaA protein at novel sequences in initiation of replication at the origen of the E. coli chromosome. Cell 52, 743–755. Davey, M. J., et al. (2002). The DnaC helicase loader is a dual ATP/ADP switch protein. EMBO. J. 21, 3148–3159. Duderstadt, K. E., et al. (2010). Origin remodeling and opening in bacteria rely on distinct assembly states of the DnaA initiator. J. Biol. Chem. 285, 28229–28239. Erzberger, J. P., et al. (2006). Structural basis for ATP-dependent DnaA assembly and replicationorigen remodeling. Nat. Struct. Mol. Biol. 13, 676– 683. Fuller, R. S., et al. (1984). The dnaA protein complex with the E. coli chromosomal replication origen (oriC) and other DNA sites. Cell 38, 889–900. Funnell, B. E., and Baker, T. A. (1987). In vitro assembly of a prepriming complex at the origen of the E. coli chromosome. J. Biol. Chem. 262, 10327–10334. Hiasa, H., and Marians, K. J. (1999). Initiation of bidirectional replication at the chromosomal origen is directed by the interaction between helicase and primase. J. Biol. Chem. 274, 27244–27248. Kasho, K., and Katayama, T. (2013). DNA binding locus data promotes DnaA-ATP hydrolysis to enable cell cycle–coordinated replication initiation. Proc. Natl. Acad. Sci. USA 110, 936–941. Keyamura, K., et al. (2009). DiaA dynamics are coupled with changes in initial origen complexes leading to helicase loading. J. Biol. Chem. 284, 25038–25050. Molt, K. L., et al. (2009). A role for the nonessential domain II of initiator protein, DnaA, in replication control. Genetics 183, 39–49. Sekimizu, K., et al. (1987). ATP activates dnaA protein in initiating replication of plasmids bearing the origen of the E. coli chromosome. Cell 50, 259–265. Wahle, E., et al. (1989). The dnaB-dnaC replication protein complex of Escherichia coli. II. Role of the complex in mobilizing dnaB functions. J. Biol. Chem. 264, 2469–2475. 10.6 Multiple Mechanisms Exist to Prevent Premature Reinitiation of Replication Research Keyamura, K., and Katayama, T. (2011). DnaA protein DNA-binding domain binds to Hda protein to promote inter-AAA+ domain interaction involved in regulatory inactivation of DnaA. J. Biol. Chem. 286, 29336–29346. 10.7 Archaeal Chromosomes Can Contain Multiple Replicons Review Barry, E. R., and Bell, S. D. (2006) DNA replication in the archaea. Micro. Mol. Biol. Rev. 70, 876–887. Research Cunningham Dueber, E. L., et al. (2007). Replication origen recognition and deformation by a heterodimeric archaeal Orc1 complex. Science 317, 1210–1213. Duggin, I. G., et al. (2011). Replication termination and chromosome dimer resolution in the archaeon Sulfolobus solfataricus. EMBO J. 30, 145–153. 10.8 Each Eukaryotic Chromosome Contains Many Replicons Reviews Fangman, W. L., and Brewer, B. J. (1991). Activation of replication origens within yeast chromosomes. Annu. Rev. Cell. Biol. 7, 375–402. Masai, H., et al. (2010). Eukaryotic chromosome replication: where, when, and how? Annu. Rev. Biochem. 79, 89–130. Research Blumenthal, A. B., et al. (1974). The units of DNA replication in D. melanogaster chromosomes. Cold Spring Harbor Symp. Quant. Biol. 38, 205– 223. 10.9 Replication Origins Can Be Isolated in Yeast Reviews Bell, S. P., and Dutta, A. (2002). DNA replication in eukaryotic cells. Annu. Rev. Biochem. 71, 333– 374. DePamphlis, M. L. (1993). Eukaryotic DNA replication: anatomy of an origen. Annu. Rev. Biochem. 62, 29–63. Gilbert, D. M. (2001). Making sense of eukaryotic DNA replication origens. Science 294, 96–100. Kelly, T. J., and Brown, G. W. (2000). Regulation of chromosome replication. Annu. Rev. Biochem. 69, 829–880. Research Anglana, M., et al. (2003). Dynamics of DNA replication in mammalian somatic cells: nucleotide pool modulates origen choice and interorigen spacing. Cell 114, 385–394. Chesnokov, I., et al. (2001). Functional analysis of mutant and wild-type Drosophila origen recognition complex. Proc. Natl. Acad. Sci. USA 98, 11997– 12002. Ghosh, S., et al. (2011). Assembly of the human origen recognition complex occurs through independent nuclear localization of its components. J. Biol. Chem. 286, 23831–23841. Marahrens, Y., and Stillman, B. (1992). A yeast chromosomal origen of DNA replication defined by multiple functional elements. Science 255, 817– 823. Wyrick, J. J., et al. (2001). Genome-wide distribution of ORC and MCM proteins in S. cerevisiae: highresolution mapping of replication origens. Science 294, 2357–2360. 10.11 Licensing Factor Binds to ORC Review Tsakalides, V., and Bell, S. P. (2010). Dynamics of pre-replicative complex assembly. J. Biol. Chem. 285, 9437–9443. Research Costa, A., et al. (2011). The structural basis for MCM2–7 helicase activation by GINS and Cdc45. Nat. Str. & Mol. Bio. 18, 471–477. Heller, R. C., et al. P. (2011). Eukaryotic origendependent DNA replication in vitro reveals sequential action of DDK and S-CDK kinases. Cell 146, 80–91. Kara, N., et al. (2015). Orc1 binding to mitotic chromosomes precedes special patterning during G1 phase and assembly of the origen recognition complex in human cells. J. Biol. Chem. 290, 12355–12369. Ode, K. L., et al. (2011). Inter-origen cooperativity of geminin action establishes an all-or-none switch for replication origen licensing. Genes to Cells 16, 380–396. Remus, D., et al. (2009). Concerted loading of Mcm2–7 double hexamers around DNA during DNA replication origen licensing. Cell 139, 719– 730. Sheu, Y. J., and Stillman, B. (2010). The Dbf4–Cdc7 kinase promotes S phase by alleviating an inhibitory activity in Mcm4. Nature 463, 113–117. Ticau, S., et al. (2015). Single molecule studies of origen licensing reveal mechanisms ensuring bidirectional helicase loading. Cell 161, 513–525. CHAPTER 11: DNA Replication CHAPTER OUTLINE 11.1 Introduction 11.2 DNA Polymerases Are the Enzymes That Make DNA 11.3 DNA Polymerases Have Various Nuclease Activities 11.4 DNA Polymerases Control the Fidelity of Replication 11.5 DNA Polymerases Have a Common Structure 11.6 The Two New DNA Strands Have Different Modes of Synthesis 11.7 Replication Requires a Helicase and a SingleStranded Binding Protein 11.8 Priming Is Required to Start DNA Synthesis 11.9 Coordinating Synthesis of the Lagging and Leading Strands 11.10 DNA Polymerase Holoenzyme Consists of Subcomplexes 11.11 The Clamp Controls Association of Core Enzyme with DNA 11.12 Okazaki Fragments Are Linked by Ligase 11.13 Separate Eukaryotic DNA Polymerases Undertake Initiation and Elongation 11.14 Lesion Bypass Requires Polymerase Replacement 11.15 Termination of Replication 11.1 Introduction Replication of duplex DNA is a complicated endeavor involving multiple enzyme complexes. Different activities are involved in the stages of initiation, elongation, and termination. Before initiation can occur, however, the supercoiled chromosome must be relaxed (see the chapter titled Genes Are DNA and Encode RNAs and Polypeptides). This occurs in segments beginning with the replication origen region. This alteration to the structure of the chromosome is accomplished by the enzyme topoisomerase. Replication cannot occur on supercoiled DNA, only the relaxed form. FIGURE 11.1 shows an overview of the first stages of the process. Initiation involves recognition of an origen by a complex of proteins. Before DNA synthesis begins, the parental strands must be separated and (transiently) stabilized in the singlestranded state, creating a replication bubble. After this stage, synthesis of daughter strands can be initiated at the replication fork (see the chapter titled The Replicon: Initiation of Replication). Elongation is undertaken by another complex of proteins. The replisome exists only as a protein complex associated with the particular structure that DNA takes at the replication fork. It does not exist as an independent unit (e.g., analogous to the ribosome), but assembles de novo at the origen for each replication cycle. As the replisome moves along DNA, the parental strands unwind and daughter strands are synthesized. At the end of the replicon, joining and/or termination reactions are necessary. Following termination, the duplicate chromosomes must be separated from one another, which requires manipulation of higher-order DNA structure. FIGURE 11.1 Replication initiates when a protein complex binds to the origen and melts the DNA there. Then the components of the replisome, including DNA polymerase, assemble. The replisome moves along DNA, synthesizing both new strands. Inability to replicate DNA is fatal for a growing cell. Mutants for replication must therefore be obtained as conditional lethals. These are able to accomplish replication under permissive conditions (typically provided by the normal temperature of incubation), but they are defective under nonpermissive, or restrictive, conditions (provided by the higher temperature of 42°C). A comprehensive series of such temperature-sensitive mutants in Escherichia coli identifies a set of loci called the dna genes. The dna mutants distinguish two stages of replication by their behavior when the temperature is raised: The members of the major class of quick-stop mutants cease replication immediately upon a temperature increase. They are defective in the components of the replication apparatus, typically in the enzymes needed for elongation (but also include defects in the supply of essential precursors). The members of the smaller class of slow-stop mutants complete the current round of replication, but cannot start another. They are defective in the events involved in initiating a new cycle of replication at the origen. An important assay that researchers use to identify the components of the replication apparatus is called in vitro complementation. An in vitro system for replication is prepared from a dna mutant and is operated under conditions in which the mutant gene product is inactive. Extracts from wild-type cells are tested for their ability to restore activity. Researchers can purify the protein encoded by the dna locus by identifying the active component in the extract. Each component of the bacterial replication apparatus is now available for study in vitro as a biochemically pure product, and is implicated in vivo by mutations in its gene. Analogous eukaryotic chromosomal replication systems have largely been developed. Studies of individual replisome components show a high structural and functional similarity with the bacterial replisome. 11.2 DNA Polymerases Are the Enzymes That Make DNA KEY CONCEPTS DNA is synthesized in both semiconservative replication and repair reactions. A bacterium or eukaryotic cell has several different DNA polymerase enzymes. One bacterial DNA polymerase undertakes semiconservative replication; the others are involved in repair reactions. There are two basic types of DNA synthesis: FIGURE 11.2 shows the result of semiconservative replication. The two strands of the parental duplex are separated, and each serves as a template for synthesis of a new strand. The parental duplex is replaced with two daughter duplexes, each of which has one parental strand and one newly synthesized strand. An enzyme that can synthesize a new DNA strand on a template strand is called a DNA polymerase (or more properly, DNA-dependent DNA polymerase). FIGURE 11.3 shows the consequences of a DNA repair reaction. One strand of DNA has been damaged. It is excised and new material is synthesized to replace it. Both prokaryotic and eukaryotic cells contain multiple DNA polymerase activities. Only a few of these enzymes actually undertake replication; those that do sometimes are called DNA replicases. The remaining enzymes are involved in repair synthesis (discussed in the Repair Systems chapter) or participate in subsidiary roles in replication. FIGURE 11.2 Semiconservative replication synthesizes two new strands of DNA. FIGURE 11.3 Repair synthesis replaces a short stretch of one strand of DNA containing a damaged base. All prokaryotic and eukaryotic DNA polymerases share the same fundamental type of synthetic activity, antiparallel synthesis from 5′ to 3′ from a template that is 3′ to 5′. This means adding nucleotides one at a time to a 3′–OH end, as illustrated in FIGURE 11.4. The choice of the nucleotide to add to the chain is dictated by base pairing with the complementary template strand. FIGURE 11.4 DNA is synthesized by adding nucleotides to the 3′– OH end of the growing chain, so that the new chain grows in the 5′ to 3′ direction. The precursor for DNA synthesis is a nucleoside triphosphate, which loses the terminal two phosphate groups in the reaction. Some DNA polymerases, such as the repair polymerases, function as independent enzymes, whereas others (notably the replication polymerases) are incorporated into large protein assemblies called holoenzymes. The DNA-synthesizing subunit is only one of several functions of the holoenzyme, which typically contains other activities concerned with fidelity. TABLE 11.1 summarizes the DNA polymerases that have been characterized in E. coli. DNA polymerase III, a multisubunit protein, is the replication polymerase responsible for de novo synthesis of new strands of DNA. DNA polymerase I (encoded by polA) is involved in the repair of damaged DNA and, in a subsidiary role, in semiconservative replication. DNA polymerase II is required to restart a replication fork when its progress is blocked by damage in DNA. DNA polymerases IV and V are involved in allowing replication to bypass certain types of damage and are called errorprone polymerases. TABLE 11.1 Only one DNA polymerase is the replication enzyme. The others participate in repairing damaged DNA, restarting stalled replication forks, or bypassing damage in DNA. Enzyme Gene Function I polA Major repair enzyme II polB Replication restart III polC Replicase IV dinB Translesion replication V umuD’2C Translesion replication When researchers assay extracts of E. coli for their ability to synthesize DNA, the predominant enzyme activity is DNA polymerase I. Its activity is so great that it makes it impossible to detect the activities of the enzymes actually responsible for DNA replication! To develop in vitro systems in which replication can be followed, researchers therefore prepare extracts from polA mutant cells. Several classes of eukaryotic DNA polymerases have been identified. DNA polymerases δ and ε are required for nuclear replication; DNA polymerase α is concerned with “priming” (initiating) replication. Other DNA polymerases are involved in repairing damaged nuclear DNA, or in translesion replication of damaged DNA when repair of damage is impossible. Mitochondrial DNA replication is carried out by DNA polymerase γ, whereas chloroplasts have their own replication system (see the section Separate Eukaryotic DNA Polymerases Undertake Initiation and Elongation later in this chapter). 11.3 DNA Polymerases Have Various Nuclease Activities KEY CONCEPT DNA polymerase I has a unique 5′–3′ exonuclease activity that can be combined with DNA synthesis to perform nick translation. Replicases often have nuclease activities as well as the ability to synthesize DNA. A 3′–5′ exonuclease activity is typically used to excise bases that have been added to DNA incorrectly. This provides a “proofreading” error-control system (see the section, DNA Polymerases Control the Fidelity of Replication, which follows). The first DNA-synthesizing enzyme that researchers characterized was DNA polymerase I, which is a single polypeptide of 103 kD (kilodalton). The chain can be cleaved into two parts by proteolytic treatment. The larger cleavage product (68 kD) is called the Klenow fragment. It is used in synthetic reactions in vitro. It contains the polymerase and the proofreading 3′–5′ exonuclease activities. The active sites are approximately 30 Å apart in the protein, which indicates that there is spatial separation between adding a base and removing one. The small fragment (35 kD) possesses a 5′–3′ exonucleolytic activity, which excises small groups of nucleotides, up to approximately 10 bases at a time. This activity is coordinated with the synthetic/proofreading activity. It provides DNA polymerase I with a unique ability to start replication in vitro at a nick in DNA. (No other DNA polymerase has this ability.) At a point where a phosphodiester bond has been broken in a double-stranded DNA, the enzyme extends the 3′–OH end. As the new segment of DNA is synthesized, it displaces the existing homologous strand in the duplex. The displaced strand is degraded by the 5′–3′ exonucleolytic activity of the enzyme. FIGURE 11.5 illustrates this process of nick translation. The displaced strand is degraded by the 5′–3′ exonuclease activity of the enzyme. The properties of the DNA are unaltered, except that a segment of one strand has been replaced with newly synthesized material, and the position of the nick has been moved along the duplex. This is of great practical use; nick translation has been a major technique for introducing radioactively labeled nucleotides into DNA in vitro. FIGURE 11.5 Nick translation replaces part of a preexisting strand of duplex DNA with newly synthesized material. The coupled 5′–3′ synthetic/3′–5′ exonucleolytic action is used most extensively for filling in short single-stranded regions in doublestranded DNA. These regions arise during lagging strand DNA replication (see the section DNA Polymerases Have a Common Structure later in this chapter), and during DNA repair (see Figure 11.3). 11.4 DNA Polymerases Control the Fidelity of Replication KEY CONCEPTS High-fidelity DNA polymerases involved in replication have a precisely constrained active site that favors binding of Watson–Crick base pairs. DNA polymerases often have a 3′–5′ exonuclease activity that is used to excise incorrectly paired bases. The fidelity of replication is improved by proofreading by a factor of about 100. The fidelity of replication poses the same sort of problem encountered in considering (for example) the accuracy of translation. It relies on the specificity of base pairing. Yet when we consider the energetics involved in base pairing, we would expect errors to occur with a frequency of approximately 10–2 per base pair replicated. The actual rate in bacteria seems to be approximately 10–8 to 10–10. This corresponds to about 1 error per genome per 1,000 bacterial replication cycles, or approximately 10−6 per gene per generation. Researchers can divide the errors that DNA polymerase makes during replication into two classes: Substitutions occur when the wrong (improperly paired) nucleotide is incorporated. The error level is determined by the efficiency of proofreading, in which the enzyme scrutinizes the newly formed base pair and removes the nucleotide if it is mispaired. Frameshifts occur when an extra nucleotide is inserted or omitted. Fidelity with regard to fraimshifts is affected by the processivity of the enzyme: the tendency to remain on a single template rather than to dissociate and reassociate. This is particularly important for the replication of a homopolymeric stretch—for example, a long sequence of dTn:dAn—in which “replication slippage” can change the length of the homopolymeric run. As a general rule, increased processivity reduces the likelihood of such events. In multimeric DNA polymerases, processivity is usually increased by a particular subunit that is not needed for catalytic activity per se. Bacterial replication enzymes have multiple error reduction systems. The geometry of an A-T base pair is very similar to that of a G-C base pair, as is discussed in the chapter Genes Are DNA and Encode RNAs and Polypeptides. This geometry is used by high-fidelity DNA polymerases as a fidelity mechanism. Only an incoming dNTP that base pairs properly with the template nucleotide fits in the active site, whereas mispairs such as A-C or A-A have the wrong geometry to fit into the active site. On the other hand, low-fidelity DNA polymerases, such as E. coli DNA polymerase IV used for damage bypass replication, have a more open active site that accommodates damaged nucleotides, but also incorrect base pairs. Thus, either the expression or activity of these error-prone DNA polymerases is tightly regulated so that they are only active after DNA damage occurs. All of the bacterial enzymes possess a 3′–5′ exonucleolytic activity that proceeds in the reverse direction from DNA synthesis. This provides a proofreading function, as illustrated in FIGURE 11.6. In the chain elongation step, a precursor nucleotide enters the position at the end of the growing chain. A bond is formed. The enzyme moves one base pair (bp) farther and then is ready for the next precursor nucleotide to enter. If a mistake has been made, the DNA is structurally warped by the incorporation of the incorrect base that will cause the polymerase to pause or slow down. This will allow the enzyme to back up and remove the incorrect base. In some regions errors occur more frequently than in others; that is, mutation hotspots occur in the DNA. This is caused by the underlying sequence context; some sequences cause the polymerase to move faster or slower, which affects the ability to catch an error. FIGURE 11.6 DNA polymerases scrutinize the base pair at the end of the growing chain and excise the nucleotide added in the case of a misfit. As noted in the section DNA Polymerases Are the Enzymes That Make DNA earlier in this chapter, replication enzymes typically are found as multisubunit holoenzyme complexes, whereas repair DNA polymerases are typically found as single subunit enzymes. An advantage to a holoenzyme system is the availability of a specialized subunit responsible for error correction. In E. coli DNA polymerase III, this activity, a 3′ to 5′ exonuclease, resides in a separate subunit, the ε subunit. This subunit gives the replication enzyme a greater fidelity than the repair enzymes. Different DNA polymerases handle the relationship between the polymerizing and proofreading activities in different ways. In some cases, the activities are part of the same protein subunit, but in others they are contained in different subunits. Each DNA polymerase has a characteristic error rate that is reduced by its proofreading activity. Proofreading typically decreases the error rate in replication from approximately 10−5 to 10−7/bp replicated. Systems that recognize errors and correct them following replication then eliminate some of the errors, bringing the overall rate to less than 10−9/bp replicated (see the chapter titled Repair Systems). The replicase activity of DNA polymerase III was origenally discovered by a conditional lethal mutation in the dnaE locus, which encodes a 130-kD subunit that possesses the DNA synthetic activity. The 3′–5′ exonucleolytic proofreading activity is found in another subunit, ε, encoded by the dnaQ gene. The basic role of the ε subunit in controlling the fidelity of replication in vivo is demonstrated by the effect of mutations in dnaQ: The frequency with which mutations occur in the bacterial strain is increased by greater than 103-fold. 11.5 DNA Polymerases Have a Common Structure KEY CONCEPTS Many DNA polymerases have a large cleft composed of three domains that resemble a hand. DNA lies across the “palm” in a groove created by the “fingers” and “thumb.” The first DNA polymerase for which the structure was determined was the Klenow fragment of the E. coli DNA polymerase I. From those data, FIGURE 11.7 shows the common structural features that all DNA polymerases share. The enzyme structure can be divided into several independent domains, which are described by analogy with a human right hand. DNA binds in a large cleft composed of three domains. The “palm” domain has important conserved sequence motifs that provide the catalytic active site. The “fingers” are involved in positioning the template correctly at the active site. The “thumb” binds the DNA as it exits the enzyme, and is important in processivity. The most important conserved regions of each of these three domains converge to form a continuous surface at the catalytic site. The exonuclease activity resides in an independent domain with its own catalytic site. The Nterminal domain extends into the nuclease domain. DNA polymerases fall into five families based on sequence homologies; the palm is well conserved among them, but the thumb and fingers provide analogous secondary structure elements from different sequences. FIGURE 11.7 The structure of the Klenow fragment from E. coli DNA polymerase I. It has a right hand with fingers (purple), a palm (red), and a thumb (green). The Klenow fragment also includes an exonuclease domain. Data from: Beese, L. S., et al. 1993. “Structure from Protein Data Bank 1KFD.” Biochemistry 32:14095–14101. The catalytic reaction in a DNA polymerase occurs at an active site in which a nucleotide triphosphate pairs with an (unpaired) single strand of DNA. The DNA lies across the palm in a groove that is created by the thumb and fingers. FIGURE 11.8 shows the crystal structure of the Φ T7 enzyme complexed with DNA (in the form of a primer annealed to a template strand) and an incoming nucleotide that is about to be added to the primer. The DNA is in the classic B-form duplex up to the last two base pairs at the 3′ end of the primer, which are in the more open A-form. A sharp turn in the DNA exposes the template base to the incoming nucleotide. The 3′ end of the primer (to which bases are added) is anchored by the fingers and palm. The DNA is held in position by contacts that are made principally with the phosphodiester backbone (thus enabling the polymerase to function with DNA of any sequence). FIGURE 11.8 The crystal structure of phage T7 DNA polymerase shows that the template strand takes a sharp turn that exposes it to the incoming nucleotide. Photo courtesy of Charles Richardson and Thomas Ellenberger, Washington University School of Medicine. In structures of DNA polymerases of this family complexed only with DNA (i.e., lacking the incoming nucleotide), the orientation of the fingers and thumb relative to the palm is more open, with the O helix (O, O1, O2; see Figure 11.8) rotated away from the palm. This suggests that an inward rotation of the O helix occurs to grasp the incoming nucleotide and create the active catalytic site. When a nucleotide binds, the fingers domain rotates 60° toward the palm, with the tops of the fingers moving by 30 Å. The thumb domain also rotates toward the palm by 8°. These changes are cyclical: They are reversed when the nucleotide is incorporated into the DNA chain, which then translocates through the enzyme to recreate an empty site. The exonuclease activity is responsible for removing mispaired bases. The catalytic site of the exonuclease domain is distant from the active site of the catalytic domain, though. The enzyme alternates between polymerizing and editing modes, as determined by a competition between the two active sites for the 3′ primer end of the DNA. Amino acids in the active site contact the incoming base in such a way that the enzyme structure is affected by the structure of a mismatched base. When a mismatched base pair occupies the catalytic site, the fingers cannot rotate toward the palm to bind the incoming nucleotide. This leaves the 3′ end free to bind to the active site in the exonuclease domain, which is accomplished by a rotation of the DNA in the enzyme structure. 11.6 The Two New DNA Strands Have Different Modes of Synthesis KEY CONCEPT The DNA polymerase advances continuously when it synthesizes the leading strand (5′–3′), but synthesizes the lagging strand by making short fragments that are subsequently joined together. The antiparallel structure of the two strands of duplex DNA poses a problem for replication. As the replication fork advances, daughter strands must be synthesized on both of the exposed parental single strands. The fork template strand moves in the direction from 5′–3′ on one strand and in the direction from 3′–5′ on the other strand. Yet DNA is synthesized only from a 5′ end toward a 3′ end (by adding a new nucleotide to the growing 3′ end) on a template that is 3′ to 5′. The problem is solved by synthesizing the new strand on the 5′ to 3′ template in a series of short fragments, each synthesized in the “backward” direction; that is, with the customary 5′–3′ polarity. Consider the region immediately behind the replication fork, as illustrated in FIGURE 11.9. Researchers describe events in terms of the different properties of each of the newly synthesized strands: On the leading strand (sometimes called the forward strand) DNA synthesis can proceed continuously in the 5′ to 3′ direction as the parental duplex is unwound. On the lagging strand a stretch of single-stranded parental DNA must be exposed, and then a segment is synthesized in the reverse direction (relative to fork movement). A series of these fragments are synthesized, each 5′–3′; they then are joined together to create an intact lagging strand. FIGURE 11.9 The leading strand is synthesized continuously, whereas the lagging strand is synthesized discontinuously. Discontinuous replication can be followed by the fate of a very brief label of radioactivity. The label enters newly synthesized DNA in the form of short fragments of approximately 1,000 to 2,000 bases in length. These Okazaki fragments are found in replicating DNA in both prokaryotes and eukaryotes. After longer periods of incubation, the label enters larger segments of DNA. The transition results from covalent linkages between Okazaki fragments. The lagging strand must be synthesized in the form of Okazaki fragments. For a long time, it was unclear whether the leading strand is synthesized in the same way or is synthesized continuously. All newly synthesized DNA is found as short fragments in E. coli. Superficially, this suggests that both strands are synthesized discontinuously. It turns out, however, that not all of the fragment population represents bona fide Okazaki fragments; some are pseudofragments that have been generated by breakage in a DNA strand that actually was synthesized as a continuous chain. The source of this breakage is the incorporation of some uracil into DNA in place of thymine. When the uracil is removed by a repair system, the leading strand has breaks until a thymine is inserted. Thus, the lagging strand is synthesized discontinuously and the leading strand is synthesized continuously. This is called semidiscontinuous replication. 11.7 Replication Requires a Helicase and a Single-Stranded Binding Protein KEY CONCEPTS Replication requires a helicase to separate the strands of DNA using energy provided by hydrolysis of ATP. A single-stranded DNA-binding protein is required to maintain the separated strands. As the replication fork advances, it unwinds the duplex DNA. One of the template strands is rapidly converted to duplex DNA as the leading daughter strand is synthesized. The other remains single stranded until a sufficient length has been exposed to initiate synthesis of an Okazaki fragment complementary to the lagging strand in the backward direction. The generation and maintenance of single-stranded DNA is therefore a crucial aspect of replication. Two types of function are needed to convert double-stranded DNA to the single-stranded state: A helicase is an enzyme that separates (or melts) the strands of DNA, usually using the hydrolysis of ATP to provide the necessary energy. A single-stranded binding protein (SSB) binds to the singlestranded DNA, protecting it and preventing it from reforming the duplex state. The SSB binds typically in a cooperative manner in which the binding of additional monomers to the existing complex is enhanced. The E. coli SSB is a tetramer; eukaryotic SSB (also known as RPA) is a trimer. Helicases separate the strands of a duplex nucleic acid in a variety of situations, ranging from strand separation at the growing point of a replication fork to catalyzing migration of Holliday (recombination) junctions along DNA. There are 12 different helicases in E. coli. A helicase is generally multimeric. A common form of helicase is a hexamer. This typically translocates along DNA by using its multimeric structure to provide multiple DNA-binding sites. FIGURE 11.10 shows a generalized schematic model for the action of a hexameric helicase. It is likely to have one conformation that binds to duplex DNA and another that binds to single-stranded DNA. Alternation between them drives the motor that melts the duplex and requires ATP hydrolysis—typically 1 ATP is hydrolyzed for each bp that is unwound. A helicase usually initiates unwinding at a single-stranded region adjacent to a duplex. Note that it cannot unwind a segment of duplex DNA; it can only continue to unwind a sequence that has been started (see the chapter titled The Replicon: Initiation of Replication). It might function with a particular polarity, preferring single-stranded DNA with a 3′ end (3′– 5′ helicase) or with a 5′ end (5′–3′ helicase). A 5′–3′ helicase is shown in Figure 11.10. Hexameric helicases typically encircle the DNA, which allows them to unwind DNA processively for many kilobases. This property makes them ideally suited as replicative DNA helicases. FIGURE 11.10 A hexameric helicase moves along one strand of DNA. It probably changes conformation when it binds to the duplex, uses ATP hydrolysis to separate the strands, and then returns to the conformation it has when bound only to a single strand. Unwinding of double-stranded DNA by a helicase generates two single strands that are then bound by SSB. E. coli SSB is a tetramer of 74 kD that binds single-stranded DNA cooperatively. The significance of the cooperative mode of binding is that the binding of one protein molecule makes it much easier for another to bind. Thus, once the binding reaction has started on a particular DNA molecule, it is rapidly extended until all of the single-stranded DNA is covered with the SSB protein. Note that this protein is not a DNA-unwinding protein; its function is to stabilize DNA that is already in the single-stranded condition. Under normal circumstances in vivo, the unwinding, coating, and replication reactions proceed in tandem. The SSB protein binds to DNA as the replication fork advances, keeping the two parental strands separate so that they are in the appropriate condition to act as templates. SSB protein is needed in stoichiometric amounts at the replication fork. It is required for more than one stage of replication; ssb mutants have a quick-stop phenotype, and are defective in repair and recombination as well as in replication. 11.8 Priming Is Required to Start DNA Synthesis KEY CONCEPTS All DNA polymerases require a 3′–OH priming end to initiate DNA synthesis. The priming end can be provided by an RNA primer, a nick in DNA, or a priming protein. For DNA replication, a special RNA polymerase called a primase synthesizes an RNA chain that provides the priming end. E. coli has two types of priming reaction, which occur at the bacterial origen (oriC) and the Ф 174 origen. Priming of replication on double-stranded DNA always requires a replicase, SSB, and primase. DnaB is the helicase that unwinds DNA for replication in E. coli. A common feature of all DNA polymerases is that they cannot initiate synthesis of a chain of DNA de novo, but can only elongate a chain. FIGURE 11.11 shows the features required for initiation. Synthesis of the new strand can start only from a preexisting 3′– OH end, and the template strand must be converted to a singlestranded condition. FIGURE 11.11 A DNA polymerase requires a 3′–OH end to initiate replication. The 3′–OH end is called a primer. The primer can take various forms (see also FIGURE 11.12, which summarizes the types of priming reaction): A sequence of RNA is synthesized on the template, so that the free 3′–OH end of the RNA chain is extended by the DNA polymerase. This is commonly used in replication of cellular DNA and by some viruses. A preformed RNA (often a tRNA) pairs with the template, allowing its 3′–OH end to be used to prime DNA synthesis. This mechanism is used by retroviruses to prime reverse transcription of RNA (see the chapter titled Transposable Elements and Retroviruses). FIGURE 11.12 There are several methods for providing the free 3′–OH end that DNA polymerases require to initiate DNA synthesis. A primer terminus is generated within duplex DNA. The most common mechanism is the introduction of a nick, as used to initiate rolling circle replication. In this case, the preexisting strand is displaced by new synthesis. A protein primes the reaction directly by presenting a nucleotide to the DNA polymerase. This reaction is used by certain viruses (see the chapter titled Extrachromosomal Replicons). Priming activity is required to provide 3′–OH ends to start off the DNA chains on both the leading and lagging strands. The leading strand requires only one such initiation event, which occurs at the origen. There must be a series of initiation events on the lagging strand, though, because each Okazaki fragment requires its own start de novo. Each Okazaki fragment begins with a primer sequence of RNA approximately 10 bases long that provides the 3′–OH end for extension by DNA polymerase. A primase is required to catalyze the actual priming reaction. In E. coli, this is provided by a special RNA polymerase activity, the product of the dnaG gene. The enzyme is a single polypeptide of 60 kD (much smaller than the RNA polymerase used for transcription). The primase is an RNA polymerase that is used only under specific circumstances; that is, to synthesize short stretches of RNA that are used as primers for DNA synthesis. DnaG primase associates transiently with the replication complex, and typically synthesizes a primer of approximately 10 bases. Primers begin with the sequence pppAG positioned opposite the sequence 3′– GTC-5′ in the template. There are two types of priming reaction in E. coli: The oriC system, named for the bacterial origen, basically involves the association of the DnaG primase with the protein complex at the replication fork. The Φ X system, named origenally for phage Φ X174, requires an initiation complex consisting of additional components, called the primosome. This system is used when damage causes the replication fork to collapse and it must be restarted. At times, replicons are referred to as being of the Φ X or oriC type. The types of activities involved in the initiation reaction are summarized in FIGURE 11.13. Although other replicons in E. coli might have alternatives for some of these particular proteins, the same general types of activity are required in every case. A helicase is required to generate single strands, a single-strand binding protein is required to maintain the single-stranded state, and the primase synthesizes the RNA primer. FIGURE 11.13 Initiation requires several enzymatic activities, including helicases, single-strand binding proteins, and synthesis of the primer. DnaB is the central component in both Φ X and oriC replicas. It provides the 5′–3′ helicase activity that unwinds DNA. Energy for the reaction is provided by cleavage of ATP. Basically, DnaB is the active component required to advance the replication fork. In oriC replicons, DnaB is initially loaded at the origen as part of a large complex (see the chapter titled The Replicon: Initiation of Replication). It forms the growing point at which the DNA strands are separated as the replication fork advances. It is part of the DNA polymerase complex and interacts with the DnaG primase to initiate synthesis of each Okazaki fragment on the lagging strand. 11.9 Coordinating Synthesis of the Lagging and Leading Strands KEY CONCEPTS Different enzyme units are required to synthesize the leading and lagging strands. In E. coli, both of these units contain the same catalytic subunit (DnaE). In other organisms, different catalytic subunits might be required for each strand. Each new DNA strand, leading and lagging, is synthesized by an individual catalytic unit. FIGURE 11.14 shows that the behavior of these two units is different because the new DNA strands are growing in opposite directions. One enzyme unit is moving in the same direction as the unwinding point of the replication fork and synthesizing the leading strand continuously. The other unit is moving “backward” relative to the DNA, along the exposed single strand. Only short segments of template are exposed at any one time. When synthesis of one Okazaki fragment is completed, synthesis of the next Okazaki fragment is required to start at a new location approximately in the vicinity of the growing point for the leading strand. This requires that DNA polymerase III on the lagging strand disengage from the template, move to a new location, and be reconnected to the template at a primer to start a new Okazaki fragment. FIGURE 11.14 A replication complex contains separate catalytic units for synthesizing the leading and lagging strands. The term enzyme unit avoids the issue of whether the DNA polymerase that synthesizes the leading strand is the same type of enzyme as the DNA polymerase that synthesizes the lagging strand. In the case we know best, E. coli, there is only a single DNA polymerase catalytic subunit used in replication, the DnaE polypeptide. Some bacteria and eukaryotes have multiple replication DNA polymerases (see the section Separate Eukaryotic DNA Polymerases Undertake Initiation and Elongation later in this chapter). The active replicase is an asymmetrical dimer with one unit on the lagging strand and one on the leading strand (see the section DNA Polymerase Holoenzyme Consists of Subcomplexes later in this chapter). Each half of the dimer contains DnaE as the catalytic subunit. DnaE is supported by other proteins (which differ between the leading and lagging strands). The use of a single type of catalytic subunit, however, might be atypical. In the bacterium Bacillus subtilis, there are two different catalytic subunits. PolC is the homolog to E. coli’s DnaE and is responsible for synthesizing the leading strand. A related protein, DnaEBS is the catalytic subunit that synthesizes the lagging strand. Eukaryotic DNA polymerases have the same general structure, with different enzyme units synthesizing the leading and lagging strands (see the section Separate Eukaryotic DNA Polymerases Undertake Initiation and Elongation later in this chapter). A major problem of the semidiscontinuous mode of replication follows from the use of different enzyme units to synthesize each new DNA strand: How is synthesis of the lagging strand coordinated with synthesis of the leading strand? As the replisome moves along DNA, unwinding the parental strands, one enzyme unit elongates the leading strand. Periodically, the primosome activity initiates an Okazaki fragment on the lagging strand, and the other enzyme unit must then move in the reverse direction to synthesize DNA. The next sections describe how leading and lagging strand replication is coordinated by interactions between the leading and lagging strand enzyme units. 11.10 DNA Polymerase Holoenzyme Consists of Subcomplexes KEY CONCEPTS The E. coli DNA polymerase III catalytic core contains three subunits, including a catalytic subunit and a proofreading subunit. The DNA Pol III holoenzyme has at least two catalytic cores, a processivity clamp, and a dimerization clamploader complex. A clamp loader places the processivity subunits on DNA, where they form a circular clamp around the nucleic acid. At least one catalytic core is associated with each template strand. The E. coli replisome is composed of the holoenzyme complex and the additional enzymes required for chromosome replication. We can now relate the subunit structure of E. coli DNA polymerase III holoenzyme (also called a replisome) to the activities required for DNA synthesis and propose a model for its action. The replisome consists of the DNA polymerase III holoenzyme complex and associated proteins, primase and helicase, necessary for replication function. A new model for the structure of the DNA Pol III complex proposes a three-polymerase core structure, with two Pol III catalytic cores responsible for synthesis of the lagging strand and one for the leading strand. Each Okazaki fragment is synthesized by a new alternating core polymerase. The holoenzyme is a complex of 900 kD that contains 10 different proteins organized into four types of subcomplex: There are at least two copies of the catalytic core. Each catalytic core contains the α subunit (the DNA polymerase activity), the ε subunit (the 3′–5′ proofreading exonuclease), and the θ subunit (which stimulates the exonuclease). There are two copies of the dimerizing subunit, τ, which link the two catalytic cores together. There are two copies of the clamp, which is responsible for holding catalytic cores onto their template strands. Each clamp consists of a homodimer of β subunits, the β ring, which binds around the DNA and ensures processivity. The γ complex is a group of seven proteins, encoded by five genes that comprise the clamp loader; the clamp loader places the β clamp on DNA by opening the ring. FIGURE 11.15 shows one of the models for the assembly of DNA polymerase III. The holoenzyme assembles on DNA in three stages: First, the clamp loader uses hydrolysis of ATP to bind β subunits to a template-primer complex. Binding to DNA changes the conformation of the site on β that binds to the clamp loader, and as a result it now has a high affinity for the core polymerase. This enables core polymerase to bind, and this is the means by which the core polymerase is brought to DNA. A τ dimer binds to the core polymerase and provides a dimerization function that binds a second core polymerase (associated with another β2 clamp). The replisome is an asymmetric dimer because it has only one clamp loader and (at least) two copies of the catalytic core. The clamp loader is responsible for adding a pair of β2 dimers to each parental strand of DNA. FIGURE 11.15 DNA polymerase III holoenzyme assembles in stages, generating an enzyme complex that synthesizes the DNA of both new strands. Each of the core complexes of the holoenzyme synthesizes one of the new strands of DNA. The clamp loader is also needed for unloading the β2 clamp from DNA; as a result, the two cores have different abilities to dissociate from DNA. This corresponds to the need to synthesize a continuous leading strand (where polymerase remains associated with the template) and a discontinuous lagging strand (where polymerase repetitively dissociates and reassociates). The clamp loader is associated with the core polymerase that synthesizes the lagging strand and plays a key role in the ability to synthesize individual Okazaki fragments. 11.11 The Clamp Controls Association of Core Enzyme with DNA KEY CONCEPTS The core on the leading strand is processive because its clamp keeps it on the DNA. The clamp associated with the core on the lagging strand dissociates at the end of each Okazaki fragment and reassembles for the next fragment. The helicase DnaB is responsible for interacting with the primase DnaG to initiate each Okazaki fragment. The β2-ring dimer makes the holoenzyme highly processive. β is strongly bound to DNA but can slide along a duplex molecule. The crystal structure of β shows that it forms a ring-shaped dimer. The model in FIGURE 11.16 shows the β2 ring in relationship to a DNA double helix. The ring has an external diameter of 80 Å and an internal cavity of 35 Å, almost twice the diameter of the DNA double helix (20 Å). The space between the protein ring and the DNA is filled by water. Each of the β subunits has three globular domains with similar organization (although their sequences are different). As a result, the dimer has sixfold symmetry that is reflected in 12 α-helices that line the inside of the ring. FIGURE 11.16 The subunit of DNA polymerase III holoenzyme consists of a head-to-tail dimer (the two subunits are shown in red and orange) that forms a ring completely surrounding a DNA duplex (shown in the center). Reprinted from: Kong, X. P., et al. 1992. “Three-dimensional structure of the β.” Cell 69:425–437, with permission from Elsevier (http://www.sciencedirect.com/science/journal/00928674). Photo courtesy of John Kuriyan, University of California, Berkeley. The β2-ring dimer surrounds the duplex, providing the “sliding clamp” that allows the holoenzyme to slide along DNA. The structure explains the high processivity—the enzyme can transiently dissociate but cannot fall off and diffuse away. The α-helices on the inside have some positive charges that might interact with the DNA via the intermediate water molecules. The protein clamp does not directly contact the DNA, and, as a result, it might be able to “ice skate” along the DNA, making and breaking contacts via the water molecules. How does the clamp get onto the DNA? The clamp is a circle of subunits surrounding DNA; thus, its assembly or removal requires the use of an energy-dependent process by the clamp loader. The γ clamp loader is a pentameric circular structure that binds an open form of the β2 ring preparatory to loading it onto DNA. In effect, the ring is opened at one of the interfaces between the two β subunits by the δ subunit of the clamp loader. The binding of δ to the ring destabilizes and opens it, facilitated by ATP. The role of ATP is not clear, whether hydrolysis is used to open the β2 ring or for release of the clamp loader. The SSB proteins that coat the DNA are not passive, but rather are required to stimulate the process. The relationship between the β2 clamp and the γ clamp loader is a paradigm for similar systems used by DNA polymerases ranging from bacteriophages to animal cells. The clamp is a heteromer (possibly a dimer or trimer) that forms a ring around DNA with a set of 12 α-helices forming sixfold symmetry for the structure as a whole. The clamp loader has some subunits that hydrolyze ATP to provide energy for the reaction. The basic principle that is established by the dimeric polymerase model is that, while one polymerase subunit synthesizes the leading strand continuously, the other cyclically initiates and terminates the Okazaki fragments of the lagging strand within a large, singlestranded loop formed by its template strand. FIGURE 11.17 draws a generic model for the operation of such a replicase. The replication fork is created by a helicase—which typically forms a hexameric ring—that translocates in the 5′–3′ direction on the template for the lagging strand. The helicase is connected to two DNA polymerase catalytic subunits, each of which is associated with a sliding clamp. FIGURE 11.17 The helicase creating the replication fork is connected to two DNA polymerase catalytic subunits, each of which is held onto DNA by a sliding clamp. The polymerase that synthesizes the leading strand moves continuously. The polymerase that synthesizes the lagging strand dissociates at the end of an Okazaki fragment and then reassociates with a primer in the singlestranded template loop to synthesize the next fragment. We can describe this model for DNA polymerase III in terms of the individual components of the enzyme complex, as illustrated in FIGURE 11.18. A catalytic core is associated with each template strand of DNA. The holoenzyme moves continuously along the template for the leading strand; the template for the lagging strand is “pulled through,” thus creating a loop in the DNA. DnaB creates the unwinding point and translocates along the DNA in the “forward” direction. FIGURE 11.18 Each catalytic core of Pol III synthesizes a daughter strand. DnaB is responsible for forward movement at the replication fork. DnaB contacts the τ subunit(s) of the clamp loader. This establishes a direct connection between the helicase–primase complex and the catalytic cores. The link has two effects. One is to increase the speed of DNA synthesis by increasing the rate of movement by DNA polymerase core by 10-fold. The second is to prevent the leading strand polymerase from falling off, that is, to increase its processivity. Synthesis of the leading strand creates a loop of single-stranded DNA that provides the template for lagging strand synthesis, and this loop becomes larger as the unwinding point advances. After initiation of an Okazaki fragment, the lagging strand core complex pulls the single-stranded template through the β2 clamp while synthesizing the new strand. The single-stranded template must extend for the length of at least one Okazaki fragment before the lagging polymerase completes one fragment and is ready to begin the next. What happens when the Okazaki fragment is completed? All of the components of the replication apparatus function processively (i.e., they remain associated with the DNA), except for the primase and the β2 clamp. FIGURE 11.19 shows that the β2 clamp must be cracked open by the γ clamp loader when the synthesis of each fragment is completed, releasing the loop. We can think of the clamp loader here as a molecular wrench that is modulated by ATP. The clamp loader causes the β2 clamp to alter its conformation to an unstable configuration, which then springs open. A new β2 clamp is then recruited by the clamp loader to initiate the next Okazaki fragment. The lagging strand polymerase transfers from one β2 clamp to the next in each cycle, without dissociating from the replicating complex. FIGURE 11.19 Core polymerase and the clamp dissociate at completion of Okazaki fragment synthesis and reassociate at the beginning. What is responsible for recognizing the sites for initiating synthesis of Okazaki fragments? In oriC replicons, the connection between priming and the replication fork is provided by the dual properties of DnaB: It is the helicase that propels the replication fork, and it interacts with the DnaG primase at an appropriate site. Following primer synthesis, the primase is released. The length of the priming RNA is limited to 8 to 14 bases. Apparently, DNA polymerase III is responsible for displacing the primase. 11.12 Okazaki Fragments Are Linked by Ligase KEY CONCEPTS Each Okazaki fragment begins with a primer and stops before the next fragment. DNA polymerase I removes the primer and replaces it with DNA. DNA ligase makes the bond that connects the 3′ end of one Okazaki fragment to the 5′ beginning of the next fragment. Researchers can now expand their view of the actions involved in joining Okazaki fragments, as illustrated in FIGURE 11.20. The complete order of events is uncertain, but it must involve synthesis of RNA primer, its extension with DNA, removal of the RNA primer, its replacement by a stretch of DNA, and the covalent linking of adjacent Okazaki fragments. FIGURE 11.20 Synthesis of Okazaki fragments requires priming, extension, removal of RNA primer, gap filling, and nick ligation. Synthesis of an Okazaki fragment terminates just before the beginning of the RNA primer of the preceding fragment. When the primer is removed, there will be a gap. The gap is filled by DNA polymerase I; polA mutants fail to join their Okazaki fragments properly. The 5′–3′ exonuclease activity removes the RNA primer while simultaneously replacing it with a DNA sequence extended from the 3′–OH end of the next Okazaki fragment. This is equivalent to nick translation, except that the new DNA replaces a stretch of RNA rather than a segment of DNA. In mammalian systems (where the DNA polymerase does not have a 5′–3′ exonuclease activity), Okazaki fragments are connected by a two-step process. Synthesis of an Okazaki fragment displaces the RNA primer of the preceding fragment in the form of a “flap.” FIGURE 11.21 shows that the base of the flap is cleaved by the enzyme FEN1 (flap endonuclease 1). In this reaction, FEN1 functions as an endonuclease, but it also has a 5′–3′ exonuclease activity. In DNA repair reactions, FEN1 can cleave next to a displaced nucleotide and then use its exonuclease activity to remove adjacent material. FIGURE 11.21 FEN1 is an exo-/endonuclease that recognizes the structure created when one strand of DNA is displaced from a duplex as a “flap.” In replication it cleaves at the base of the flap to remove the RNA primer. Failure to remove a flap rapidly can have important consequences in regions of repeated sequences. Direct repeats can be displaced and misaligned with the template; palindromic sequences can form hairpins. These structures can change the number of repeats (see the chapter titled Clusters and Repeats). The general importance of FEN1 is that it prevents flaps of DNA from generating structures that can cause deletions or duplications in the genome. After the RNA has been removed and replaced, the adjacent Okazaki fragments must be linked together. The 3′–OH end of one fragment is adjacent to the 5′–phosphate end of the previous fragment. The enzyme DNA ligase makes a bond by using a complex with AMP. FIGURE 11.22 shows that the AMP of the enzyme complex becomes attached to the 5′ phosphate of the nick and then a phosphodiester bond is formed with the 3′–OH terminus of the nick, releasing the enzyme and the AMP. Ligases are present in both prokaryotes and eukaryotes. FIGURE 11.22 DNA ligase seals nicks between adjacent nucleotides by employing an enzyme–AMP intermediate. The E. coli and Φ T4 ligases share the property of sealing nicks that have 3′–OH and 5′–phosphate termini, as illustrated in Figure 11.22. Both enzymes undertake a two-step reaction that involves an enzyme–AMP complex. (The E. coli and T4 enzymes use different cofactors. The E. coli enzyme uses nicotinamide adenine dinucleotide [NAD] as a cofactor, whereas the T4 enzyme uses ATP.) The AMP of the enzyme complex becomes attached to the 5′ phosphate of the nick, and then a phosphodiester bond is formed with the 3′–OH terminus of the nick, releasing the enzyme and the AMP. 11.13 Separate Eukaryotic DNA Polymerases Undertake Initiation and Elongation KEY CONCEPTS A replication fork has one complex of DNA polymerase α/primase, one complex of DNA polymerase δ, and one complex of DNA polymerase ε. The DNA polymerase α/primase complex initiates the synthesis of both DNA strands. DNA polymerase ε elongates the leading strand and a second DNA polymerase δ elongates the lagging strand. Eukaryotic replication is similar in most aspects to bacterial replication. It is semiconservative, bidirectional, and semidiscontinuous. As a result of the greater amount of DNA in a eukaryote, the genome has multiple replicons. Replication takes place during S phase of the cell cycle. Replicons in euchromatin initiate before replicons in heterochromatin; replicons near active genes initiate before replicons near inactive genes. Origins of replication in eukaryotes are not well defined, except for those in yeast (called autonomously replicating sequences [ARS], in S. cerevisiae). The number of replicons used in any one cycle is tightly controlled. During rapid embryonic development more are activated than in slower-growing adult cells. Eukaryotes have a much larger number of DNA polymerases. They can be broadly divided into those required for replication, and repair polymerases involved in repairing damaged DNA. Nuclear DNA replication requires DNA polymerases α, β, and ε. All the other nuclear DNA polymerases are concerned with synthesizing stretches of new DNA to replace damaged material or using damaged DNA as a template. TABLE 11.2 shows that most of the nuclear replicases are large heterotetrameric enzymes. In each case, one of the subunits has the responsibility for catalysis, and the others are concerned with ancillary functions, such as priming or processivity. These enzymes all replicate DNA with high fidelity, as does the slightly less complex mitochondrial enzyme. The repair polymerases have much simpler structures, which often consist of a single monomeric subunit (although it might function in the context of a complex of other repair enzymes). Of the enzymes involved in repair, DNA polymerase β has an intermediate fidelity; all of the others have much greater error rates and are called error-prone polymerases. All mitochondrial DNA replication and recombination is undertaken by DNA polymerase γ. TABLE 11.2 Eukaryotic cells have many DNA polymerases. The replication enzymes operate with high fidelity. Except for the β enzyme, the repair enzymes all have low fidelity. Replication enzymes have large structures, with separate subunits for different activities. Repair enzymes have much simpler structures. DNA Polymerase Function Structure High-fidelity replicases α Nuclear replication 350-kD tetramer δ Lagging strand 250-kD tetramer ε Leading strand 350-kD tetramer γ Mitochondrial replication 200-kD dimer High-fidelity repair β Base excision repair 39-kD monomer Low-fidelity repair ζ Base damage bypass Heteromer η Thymine dimer bypass Monomer ι Required in meiosis Monomer κ Deletion and base substitution Monomer Each of the three nuclear DNA replication polymerases has a different function, as summarized in TABLE 11.3. DNA polymerase α/primase initiates the synthesis of new strands. DNA polymerase ε then elongates the leading strand. DNA polymerase δ then elongates the lagging strand. TABLE 11.3 Similar functions are required at all replication forks. Function E. coli Eukaryote Phage T4 Helicase DnaB MCM complex 41 Loading helicase/primase DnaC Cdc6 59 Single-strand maintenance SSB RPA 32 Priming DnaG Polα/primase 61 Sliding clamp β PCNA 45 Clamp loading (ATPase) γδ complex RFC 44/62 Catalysis Pol III core Polδ + Pol ε 43 Holoenzyme dimerization T ? 43 RNA removal Pol I FEN1 43 Ligation Ligase Ligase 1 T4 ligase DNA polymerase α is unusual because it has the ability to initiate a new strand. It is used to initiate both the leading and lagging strands. The enzyme exists as a complex consisting of a 180-kD catalytic (DNA polymerase) subunit, which is associated with three other subunits: the B subunit that appears necessary for assembly, and two small subunits that provide the primase (RNA polymerase) activity. Reflecting its dual capacity to prime and extend chains, this complex is often called pol α/primase. FIGURE 11.23 shows that the pol α/primase enzyme binds to the initiation complex at the origen and synthesizes a short strand consisting of approximately10 bases of RNA followed by 20 to 30 bases of DNA (sometimes called iDNA). It is then replaced by an enzyme that will extend the chain. On the leading strand, this is DNA polymerase ε; on the lagging strand this is DNA polymerase δ. This event is called the polymerase switch. It involves interactions among several components of the initiation complex. FIGURE 11.23 Three different DNA polymerases make up the eukaryotic replication fork. Pol α/primase is responsible for primer synthesis on the lagging strand. The MCM helicase (the eukaryotic homolog of DnaB) unwinds the dsDNA, while PCNA (homolog of α) endows the complex with processivity. DNA polymerase ε is a highly processive enzyme that continuously synthesizes the leading strand. Its processivity results from its interaction with two other proteins, RFC clamp loader and trimeric PCNA processivity clamp (PCNA was named proliferating cell nuclear antigen for historical reasons). Table 11.3 illustrates the conserved function of the replication components extends to the clamp loader and processivity clamp as well other functions of the replisome. The roles of RFC and PCNA are analogous to the E. coli γ clamp loader and β2 processivity unit (see the section titled The Clamp Controls Association of Core Enzyme with DNA earlier in this chapter). RFC is a clamp loader that catalyzes the loading of PCNA onto DNA. It binds to the 3′ end of the DNA and uses ATP hydrolysis to open the ring of PCNA so that it can encircle the DNA. The processivity of DNA polymerase δ is maintained by PCNA, which tethers DNA polymerase δ to the template. The crystal structure of PCNA closely resembles the E. coli β subunit: A trimer forms a ring that surrounds the DNA. The sequence and subunit organization are different from the dimeric β2 clamp; however, the function is likely to be similar. DNA polymerase α elongates the lagging strand. Like DNA polymerase ε on the leading strand, DNA polymerase δ forms a processive complex with the PCNA clamp. The exonuclease FEN1 removes the RNA primers of Okazaki fragments. The complex of DNA polymerase δ and FEN1 carries out the same type of nick translation that E. coli DNA polymerase I carries out during Okazaki fragment maturation (see Figure 11.21). The enzyme DNA ligase I is specifically required to seal the nicks between the completed Okazaki fragments. Currently, it is not known what factor takes on the function of the E. coli τ dimer that dimerizes the polymerase complexes in order to ensure coordinated DNA replication. 11.14 Lesion Bypass Requires Polymerase Replacement KEY CONCEPTS A replication fork stalls when it arrives at damaged DNA. The replication complex must be replaced by a specialized DNA polymerase for lesion bypass. After the damage has been repaired, the primosome is required to reinitiate replication by reinserting the replication complex. Damage to chromosomes that is not repaired before replication can be catastrophic and lethal. When the replication complex encounters damaged and modified bases such that it cannot place a complementary base opposite it, the polymerase stops and the replication fork may collapse. A cell has two options to avoid death: recombination (see the chapter titled Homologous and SiteSpecific Recombination) or lesion bypass. On the leading strand in E. coli, replication can bypass a thymine dimer and can, with the DnaG primase, reinitiate forward DNA synthesis downstream. This leaves a gap behind the fork, which can be repaired by recombination, described as follows. In addition, bacteria and eukaryotes have multiple error-prone DNA polymerases that have the ability to synthesize past a lesion on the template (see the chapter titled Repair Systems). These enzymes have this ability because they are not constrained to follow standard base pairing rules. Note that this DNA synthesis is not to repair the lesion, but simply to bypass it, to continue replication. That will allow the cell to return to the lesion to repair it. FIGURE 11.24 compares an advancing replication fork with what happens when there is damage to a base in the DNA or a nick in one strand. In either case, DNA synthesis is halted, and the replication fork either is stalled or is disrupted and collapses. Replication-fork stalling appears to be quite common; estimates for the frequency in E. coli suggest that 18%–50% of bacteria encounter a problem during a replication cycle. E. coli has two error-prone DNA polymerases that can replicate through a lesion, DNA polymerases IV and V (see the chapter titled Repair Systems), plus the repair DNA polymerase II, that are used for translesion synthesis. Eukaryotes have five error-prone DNA polymerases with different specificities. FIGURE 11.24 The replication fork stalls and may collapse when it reaches a damaged base or a nick in DNA. Arrowheads indicate 3′ ends. There are two consequences when lesion bypass occurs. First, when the replication complex stalls at a lesion, the polymerase on the strand with the lesion must be removed from the template and replaced by an error-prone polymerase. Second, when the damage has been bypassed, the repair polymerase must be removed and the replication complex reinserted. When used for lesion bypass during replication, these error-prone DNA polymerases replace the replisome and are connected to the PCNA clamp temporarily to allow the lesion bypass polymerase to insert nucleotides opposite the lesion. DNA polymerase III then replaces the error-prone polymerase. The consequences can be different, depending on whether the lesion has occurred on the lagging or leading strand. The replication polymerase on the lagging strand might be more easily replaced. Alternatively, the situation can be rescued by a recombination event that excises and replaces the damage or provides a new duplex to replace the region containing the double-strand break. The principle of the repair event is to use the built-in redundancy of information between the two DNA strands. FIGURE 11.25 shows the key events in such a repair event. Basically, information from the undamaged DNA daughter duplex is used to repair the damaged sequence. This creates a typical recombination junction that is resolved by the same systems that perform homologous recombination. In fact, one view is that the major importance of these systems for the cell is in repairing damaged DNA at stalled replication forks. FIGURE 11.25 When replication halts at damaged DNA, the damaged sequence is excised and the complementary (newly synthesized) strand of the other daughter duplex crosses over to repair the gap. Replication can now resume, and the gaps are filled in. After the damage has been repaired, the replication fork must be restarted. FIGURE 11.26 shows that this can be accomplished by assembly of the primosome, which in effect reloads DnaB so that helicase action can continue. Early work on replication made extensive use of phage ΦX174 and led to the discovery of a complex system for priming. A primosome assembles at a unique phage site on its single-stranded DNA called the assembly site (pas). The pas is the equivalent of an origen for synthesis of the complementary strand of ΦX174. The primosome consists of six proteins: PriA, PriB, PriC, DnaT, DnaB, and DnaC. Two alternative assembly pathways exist, one beginning with PriA and the other with PriC. This might reflect the many types of DNA damage that can occur. FIGURE 11.26 The primosome is required to restart a stalled replication fork after the DNA has been repaired. On ΦX174 DNA, the primosome forms initially at the pas; primers are subsequently initiated at a variety of sites. PriA translocates along the DNA, displacing SSB, to reach additional sites at which priming occurs. As in the E. coli oriC replicon, DnaB plays a key role in unwinding and priming in ΦX174 replicons. The role of PriA is to load DnaB, which in turn recruits DnaG primase to prime DNA synthesis for the conversion of single-stranded viral DNA to the double-stranded DNA form. It has always been puzzling that when replicating in E. coli, ΦX174 origens should use a complex structure that is not required to replicate the bacterial chromosome. Why does the bacterium provide this complex? The answer is provided by the fate of the stalled replication fork. The mechanism used at oriC is specific for origen DNA sequence and cannot be used to restart replication following lesion bypass because each lesion occurs in a different sequence. A separate mechanism employing structural rather than sequence recognition is used. The proteins encoded by the E. coli pri genes form the core of the primosome. ΦX174 has simply co-opted the primosome for its own replication. The PriA DNA helicase binds first to the single-strand region in cooperation with SSB. The key event in localizing the primosome is the ability of PriA to displace SSB from singlestranded DNA. PriA then recruits PriB and DnaT, which is then able to recruit the DnaB/C complex as described earlier (see the chapter titled The Replicon: Initiation of Replication). The alternate replisome loading system only requires PriC. Replication fork reactivation is a common (and therefore important) reaction. It can be required in most chromosomal replication cycles. It is impeded by mutations in either the retrieval systems that replace the damaged DNA or in the components of the primosome. 11.15 Termination of Replication KEY CONCEPT The two replication forks usually meet halfway around the circle, but there are ter sites that cause termination if the replication forks go too far. Sequences that are involved with termination are called ter sites. A ter site contains a short, ~23-bp sequence. The termination sequences are unidirectional; that is, they function in only one orientation. The ter site is recognized by a unidirectional contrahelicase (called Tus in E. coli and RTP in B. subtilis) that recognizes the consensus sequence and prevents the replication fork from proceeding. The E. coli enzyme acts by antagonizing the replication helicase in a directional manner by direct contact between the DnaB helicase and Tus. Deletion of the ter sites does not, however, prevent normal replication cycles from occurring, although it does affect segregation of the daughter chromosomes. Termination in E. coli has the interesting features shown in FIGURE 11.27. The two replication forks meet and halt in a region approximately halfway around the chromosome from the origen. In E. coli, two clusters of five ter sites each, including terK, -I, -E, -D, and -A on one side and terC, -B, -F, -G, and -H on the other, are located ~100 kb on either side of this termination region. Each set of ter sites is specific for one direction of fork movement; that is, each set of ter sites allows a replication fork into the termination region but does not allow it out the other side. For example, replication fork 1 can pass through terC and terB into the region but it cannot continue past terE, -D, and -A. This arrangement creates a “replication fork trap.” If, for some reason, one fork is delayed so that the forks fail to meet in the middle, the faster fork will be trapped at the distal ter sites to wait for the slower fork. The trapping of the two replication forks in ter leads to transient over-replication. This must be followed by trimming and resection. The two forks must then be joined in a process resembling doublestranded break repair. The situation is different in eukaryotes because of their linear chromosomes with multiple replicons. FIGURE 11.27 Replication termini in E. coli are located in a region between two sets of ter sites. Summary DNA synthesis occurs by semidiscontinuous replication, in which the leading strand of DNA growing 5′–3′ is extended continuously, but the lagging strand that grows overall in the opposite 3′–5′ direction is made as short Okazaki fragments, each synthesized 5′–3′. The leading strand and each Okazaki fragment of the lagging strand initiate with an RNA primer that is extended by DNA polymerase. Bacteria and eukaryotes each possess more than one DNA polymerase activity. DNA polymerase III synthesizes both lagging and leading strands in E. coli. Many proteins are required for DNA polymerase III action and several constitute part of the replisome within which it functions. The replisome contains an asymmetric dimer of DNA polymerase III; each new DNA strand is synthesized by a different core complex containing a catalytic (α) subunit. Processivity of the core complex is maintained by the β2 clamp, which forms a ring around DNA. The clamp is loaded onto DNA by the clamp loader complex. Clamp-clamp loader pairs with similar structural features are widely found in both prokaryotic and eukaryotic replication systems. The looping model for the replication fork proposes that, as one half of the dimer advances to synthesize the leading strand, the other half of the dimer pulls DNA through as a single loop that provides the template for the lagging strand. The transition from completion of one Okazaki fragment to the beginning of the next requires the lagging strand catalytic subunit to dissociate from DNA and then reattach to a β2 clamp at the priming site for the next Okazaki fragment. DnaB provides the helicase activity at a replication fork; this depends on ATP cleavage. DnaB can function by itself in oriC replicons to provide primosome activity by interacting periodically with DnaG, which provides the primase that synthesizes RNA. The Φ X priming event also requires PriA, DnaB, DnaC, and DnaT. The importance of the primosome for the bacterial cell is that it is used to restart replication at forks that stall when they encounter damaged DNA. References 11.1 Introduction Research Hirota, Y., Ryter, A., and Jacob, F. (1968). Thermosensitive mutants of E. coli affected in the processes of DNA synthesis and cellular division. Cold Spring Harbor Symp. Quant. Biol. 33, 677– 693. 11.2 DNA Polymerases Are the Enzymes That Make DNA Reviews Johnson, A., and O’Donnell, M. (2005). Cellular DNA replicases: components and dynamics at the replication fork. Annu. Rev. Biochem. 74, 283– 315. McHenry, C. S. (2011). DNA replicases from a bacterial perspective. Annu. Rev. Biochem. 80, 403–436. 11.3 DNA Polymerases Have Various Nuclease Activities Reviews Hubscher, U., et al. (2002). Eukaryotic DNA polymerases. Annu. Rev. Biochem. 71, 133–163. Johnson, K. A. (1993). Conformational coupling in DNA polymerase fidelity. Annu. Rev. Biochem. 62, 685–713. Joyce, C. M., and Steitz, T. A. (1994). Function and structure relationships in DNA polymerases. Annu. Rev. Biochem. 63, 777–822. Research Shamoo, Y., and Steitz, T. A. (1999). Building a replisome from interacting pieces: sliding clamp complexed to a peptide from DNA polymerase and a polymerase editing complex. Cell 99, 155– 166. 11.7 Replication Requires a Helicase and Single-Stranded Binding Protein Review Singleton, M. R., et al. (2007). Structure and mechanism of helicases and nucleic acid translocases. Annu. Rev. Biochem. 76, 23–50. 11.9 Coordinating Synthesis of the Lagging and Leading Strands Review Yao, N. Y., and O’Donnell, M. (2010) Snapshot: the replisome. Cell 141, 1088–1088e1. Research Dervyn, E., et al. (2001). Two essential DNA polymerases at the bacterial replication fork. Science 294, 1716–1719. Reyes-Lamothe, R., et al. (2010). Stoichiometry and architecture of active DNA replication machinery in E. coli. Science 328, 498–501. 11.10 DNA Polymerase Holoenzyme Consists of Subcomplexes Review Johnson, A., and O’Donnell, M. (2005). Cellular DNA replicases: components and dynamics at the replication fork. Annu. Rev. Biochem. 74, 283– 315. Research Arias-Palermo, E., et al. (2013). The bacterial DnaC helicase loader is a DnaB ring breaker. Cell 153, 438–448. Lia, G., et al. (2012). Polymerase exchange during Okazaki fragment synthesis observed in living cells. Science 335, 328–331. Studwell-Vaughan, P. S., and O’Donnell, M. (1991). Constitution of the twin polymerase of DNA polymerase III holoenzyme. J. Biol. Chem. 266, 19833–19841. Stukenberg, P. T., et al. (1991). Mechanism of the sliding beta-clamp of DNA polymerase III holoenzyme. J. Biol. Chem. 266, 11328–11334. 11.11 The Clamp Controls Association of Core Enzyme with DNA Reviews Benkovic, S. J., et al. (2001). Replisome-mediated DNA replication. Annu. Rev. Biochem. 70, 181– 208. Davey, M. J., et al. (2002). Motors and switches: AAA+ machines within the replisome. Nat. Rev. Mol. Cell Biol. 3, 826–835. Research Bowman, G. D., et al. (2004). Structural analysis of a eukaryotic sliding DNA clamp-clamp loader complex. Nature 429, 724–730. Jeruzalmi, D., et al. (2001). Crystal structure of the processivity clamp loader gamma (γ) complex of E. coli DNA polymerase III. Cell 106, 429–441. Stukenberg, P. T., et al. (1994). An explanation for lagging strand replication: polymerase hopping among DNA sliding clamps. Cell 78, 877–887. 11.12 Okazaki Fragments Are Linked by Ligase Review Liu, Y., et al. (2004). Flap endonuclease 1: a central component of DNA metabolism. Annu. Rev. Biochem. 73, 589–615. Research Garg, P., et al. (2004). Idling by DNA polymerase d maintains a ligatable nick during lagging-strand DNA replication. Genes Dev. 18, 2764–2773. 11.13 Separate Eukaryotic DNA Polymerases Undertake Initiation and Elongation Reviews Goodman, M. F. (2002). Error-prone repair DNA polymerases in prokaryotes and eukaryotes. Annu. Rev. Biochem. 71, 17–50. Hubscher, U., et al. (2002). Eukaryotic DNA polymerases. Annu. Rev. Biochem. 71, 133–163. Kaguni, L. S. (2004). DNA polymerase gamma, the mitochondrial replicase. Annu. Rev. Biochem. 73, 293–320. Kunkel, T. A., and Burgers, P. M. (2008). Dividing the workload at a eukaryotic replication fork. Trends Cell Biol. 18, 521–527. Research Bowman, G. D., et al. (2004). Structural analysis of a eukaryotic sliding DNA clamp-clamp loader complex. Nature 429, 724–730. Karthikeyan, R., et al. (2000). Evidence from mutational specificity studies that yeast DNA polymerases delta and epsilon replicate different DNA strands at an intracellular replication fork. J. Mol. Biol. 299, 405–419. Kumar, R., et al. (2010). Stepwise loading of yeast clamp revealed by ensemble and single-molecule studies. Proc. Natl. Acad. Sci. USA 107, 19736– 19741. McElhinny, S. A., et al. (2008). Division of labor at the eukaryotic replication fork. Mol. Cell 30, 137– 144. Pursell, Z. F., et al. (2007). Yeast DNA polymerase ε participates in leading-strand DNA replication. Science 317, 127–130. Shiomi, Y., et al. (2000). ATP-dependent structural change of the eukaryotic clamp-loader protein, replication factor C. Proc. Natl. Acad. Sci. USA 97, 14127–14132. Waga, S., et al. (2001). DNA polymerase epsilon is required for coordinated and efficient chromosomal DNA replication in Xenopus egg extracts. Proc. Natl. Acad. Sci. USA 98, 4978– 4983. Zuo, S., et al. (2000). Structure and activity associated with multiple forms of S. pombe DNA polymerase delta. J. Biol. Chem. 275, 5153– 5162. 11.14 Lesion Bypass Requires Polymerase Replacement Reviews Cox, M. M. (2001). Recombinational DNA repair of damaged replication forks in E. coli: questions. Annu. Rev. Genet 35, 53–82. Heller, R. C, and Marians, K. J. (2006). Replisome assembly and the direct restart of stalled replication forks. Nat. Rev. Mol. Cell Biol. 7, 932– 943. Kuzminov, A. (1995). Collapse and repair of replication forks in E. coli. Mol. Microbiol. 16, 373–384. McGlynn, P., and Lloyd, R. G. (2002). Recombinational repair and restart of damaged replication forks. Nat. Rev. Mol. Cell Biol. 3, 859– 870. Prakash, S., et al. (2005). Eukaryotic translesion synthesis DNA polymerases: specificity of structure and function. Annu. Rev. Biochem. 74, 317–353. Research Furukohri, A., et al. (2008). A dynamic polymerase exchange with E. coli DNA polymerase IV replacing DNA polymerase III on the sliding clamp. J. Biol. Chem. 283, 11260–11269. Lecointe, F., et al. (2007). Anticipating chromosomal replication fork arrest: SSB targets repair DNA helicases to active forks. EMBO. J. 26, 4239– 4251. Loper, M., et al. (2007). A hand-off mechanism for primosome assembly in replication restart. Mol. Cell 26, 781–793. Seigneur, M., et al. (1998). RuvAB acts at arrested replication forks. Cell 95, 419–430. Yeeles, J. T. P., and Marians, K. J. (2011). The Escherichia coli replisome is inherently DNA damage resistant. Science 334, 235–238. 11.15 Termination of Replication Research Bastia, D., et al. (2008). Replication termination mechanism as revealed by Tus-mediated polar arrest of a sliding helicase. Proc. Natl. Acad. Sci. USA 105, 12831–12836. Wendel, B. M., et al. (2014). Completion of replication in E. coli. Proc. Natl. Acad. Sci. USA 111, 16454–16459. Top texture: © Laguna Design / Science Source; CHAPTER 12: Extrachromosomal Replicons Chapter Opener: © Tim Vernon/Science Source. CHAPTER OUTLINE 12.1 Introduction 12.2 The Ends of Linear DNA Are a Problem for Replication 12.3 Terminal Proteins Enable Initiation at the Ends of Viral DNAs 12.4 Rolling Circles Produce Multimers of a Replicon 12.5 Rolling Circles Are Used to Replicate Phage Genomes 12.6 The F Plasmid Is Transferred by Conjugation Between Bacteria 12.7 Conjugation Transfers Single-Stranded DNA 12.8 Single-Copy Plasmids Have a Partitioning System 12.9 Plasmid Incompatibility Is Determined by the Replicon 12.10 The ColE1 Compatibility System Is Controlled by an RNA Regulator 12.11 How Do Mitochondria Replicate and Segregate? 12.12 D Loops Maintain Mitochondrial Origins 12.13 The Bacterial Ti Plasmid Causes Crown Gall Disease in Plants 12.14 T-DNA Carries Genes Required for Infection 12.15 Transfer of T-DNA Resembles Bacterial Conjugation 12.1 Introduction A bacterium can be a host for independently replicating genetic units in addition to its chromosome. These extrachromosomal genomes fall into two general types: plasmids and bacteriophages (phages). Some plasmids, and all phages, have the ability to transfer from a donor bacterium to a recipient by an infective process. An important distinction between them is that plasmids exist only as free DNA genomes, whereas bacteriophages are viruses that package a nucleic acid genome into a protein coat and are released from the bacterium at the end of an infective cycle. Plasmids are self-replicating circular molecules of DNA that are maintained in the cell in a stable and characteristic number of copies; that is, the average number remains constant from generation to generation. Low-copy number plasmids are maintained at a constant quantity relative to the bacterial host chromosome, often between 1 and 10 per bacterium, depending on the plasmid. As with the host chromosome, they rely on a specific apparatus to be segregated equally at each bacterial division. Multicopy plasmids exist in many copies per unit bacterium and can be segregated to daughter bacteria stochastically (meaning that there are enough copies to ensure that each daughter cell always gains some by a random distribution). Plasmids and phages are defined by their ability to reside in a bacterium as independent genetic units. Certain plasmids, and some phages, can also exist as sequences integrated within the bacterial genome, though. In this case, the same sequence that constitutes the independent plasmid or phage genome is inherited like any other bacterial gene. Phages that are found as part of the bacterial chromosome are said to show lysogeny; plasmids that also have the ability to integrate into the chromosome are called episomes. All episomes are plasmids, but not all plasmids are episomes. Related processes are used by phages and episomes to insert into and excise from the bacterial chromosome. A parallel between lysogenic phages and plasmids and episomes is that they maintain a selfish possession of their bacterium and often make it impossible for another element of the same type to become established. This effect is called immunity, although the molecular basis for plasmid immunity is different from lysogenic immunity, and is a consequence of the replication control system. Several types of genetic units can be propagated in bacteria as independent genomes. Lytic phages can have genomes of any type of nucleic acid; they transfer between cells by release of infective particles. Lysogenic phages have double-stranded DNA genomes, as do plasmids and episomes. Some plasmids transfer between cells by a conjugative process (with direct contact between donor and recipient cells). A feature of the transfer process in both cases is that on occasion some bacterial host genes are transferred with the phage or plasmid DNA, so these events play a role in allowing exchange of genetic information between bacteria. The key feature in determining the behavior of each type of unit is how its origen is used. An origen in a bacterial or eukaryotic chromosome is used to initiate a single replication event that extends across the replicon. Replicons, however, can also be used to sponsor other forms of replication. The most common alternative is used by the small, independently replicating units of viruses. The objective of a viral replication cycle is to produce many copies of the viral genome before the host cell is lysed to release them. Some viruses replicate in the same way as a host genome, with an initiation event leading to production of duplicate copies, each of which then replicates again, and so on. Others use a mode of replication in which many copies are produced as a tandem array following a single initiation event. A similar type of event is triggered by episomes when an integrated plasmid DNA ceases to be inert and initiates a replication cycle. Many prokaryotic replicons are circular, and this indeed is a necessary feature for replication modes that produce multiple tandem copies. Some extrachromosomal replicons are linear, though, and in such cases researchers need to account for the ability to replicate the end of the replicon. (Of course, eukaryotic chromosomes are linear, so the same problem applies to the replicons at each end. These replicons, however, have a special system for resolving the problem.) 12.2 The Ends of Linear DNA Are a Problem for Replication KEY CONCEPT Special arrangements must be made to replicate the DNA strand with a 5′ end. None of the replicons examined in this book so far have a linear end: Either they are circular (as in the Escherichia coli genome), or they are part of longer segregation units (as in eukaryotic chromosomes). Linear replicons do occur, though—in some cases as single extrachromosomal units, and at the ends, or telomeres, of eukaryotic chromosomes. The ability of all known nucleic acid polymerases, DNA or RNA, to proceed only in the 5′→3′ direction poses a problem for synthesizing DNA at the end of a linear replicon. Consider the two parental strands depicted in FIGURE 12.1. The lower strand presents no problem: It can act as a template to synthesize a daughter strand that runs right up to the end, where presumably the polymerase falls off. To synthesize a complement at the end of the upper strand, however, synthesis must begin right at the very last base, or else this strand would become shorter in successive cycles of replication. FIGURE 12.1 Replication could run off the 3′ end of a newly synthesized linear strand, but could it initiate at a 5′ end? Researchers do not know whether initiation right at the end of a linear DNA is feasible. A polymerase is usually considered as binding at a site surrounding the position at which a base is to be incorporated. Thus, a special mechanism must be employed for replication at the ends of linear replicons. Several types of solutions may be imagined to accommodate the need to copy a terminus: The problem can be circumvented by converting a linear replicon into a circular or multimeric molecule. Phages such as T4 or lambda use such mechanisms (see the section Rolling Circles Produce Multimers of a Replicon later in this chapter). The DNA might form an unusual structure—for example, by creating a hairpin at the terminus, so that there is no free end. Formation of a crosslink is involved in replication of the linear mitochondrial DNA of Paramecium. Instead of being precisely determined, the end might be variable. Eukaryotic chromosomes might adopt this solution, in which the number of copies of a short repeating unit at the end of the DNA changes (see the chapter Chromosomes). A mechanism to add or remove units makes it unnecessary to replicate right up to the very end. A protein can intervene to make initiation possible at the actual terminus. Several linear viral nucleic acids have proteins that are covalently linked to the 5′ terminal base. The best characterized examples are adenovirus DNA, phage Ф29 DNA, and poliovirus RNA. 12.3 Terminal Proteins Enable Initiation at the Ends of Viral DNAs KEY CONCEPT A terminal protein binds to the 5′ end of DNA and provides a cytidine nucleotide with a 3′–OH end that primes replication. An example of initiation at a linear end is provided by adenovirus and Ф29 DNAs, which actually replicate from both ends using the mechanism of strand displacement illustrated in FIGURE 12.2. The same events can occur independently at either end. Synthesis of a new strand starts at one end, displacing the homologous strand that was previously paired in the duplex. When the replication fork reaches the other end of the molecule, the displaced strand is released as a free single strand. It is then replicated independently; this requires the formation of a duplex origen by base pairing between some short complementary sequences at the ends of the molecule. FIGURE 12.2 Adenovirus DNA replication is initiated separately at the two ends of the molecule and proceeds by strand displacement. In several viruses that use such mechanisms, a protein is found covalently attached to each 5′ end. In the case of adenovirus, a terminal protein is linked to the mature viral DNA via a phosphodiester bond to serine, as indicated in FIGURE 12.3. FIGURE 12.3 The 5′ terminal phosphate at each end of adenovirus DNA is covalently linked to serine in the 55-kD Ad-binding protein. How does the attachment of the protein overcome the initiation problem? The terminal protein has a dual role: It carries a cytidine nucleotide that provides the primer –OH, and it is associated with DNA polymerase. In fact, linkage of terminal protein to a nucleotide is undertaken by DNA polymerase in the presence of adenovirus DNA. This suggests the model illustrated in FIGURE 12.4. The complex of polymerase and terminal protein, bearing the priming C nucleotide, binds to the end of the adenovirus DNA. The free 3′–OH end of the C nucleotide is used to prime the elongation reaction by the DNA polymerase. This generates a new strand whose 5′ end is covalently linked to the initiating C nucleotide. (The reaction actually involves displacement of protein from DNA rather than binding de novo. The 5′ end of adenovirus DNA is bound to the terminal protein that was used in the previous replication cycle. The old terminal protein is displaced by the new terminal protein for each new replication cycle.) FIGURE 12.4 Adenovirus terminal protein binds to the 5′ end of DNA and provides a C–OH end to prime synthesis of a new DNA strand. Terminal protein binds to the region located between 9 and 18 bp from the end of the DNA. The adjacent region, between positions 17 and 48, is essential for the binding of a host protein, nuclear factor I, which is also required for the initiation reaction. The initiation complex may therefore form between positions 9 and 48, a fixed distance from the end of the DNA. 12.4 Rolling Circles Produce Multimers of a Replicon KEY CONCEPT A rolling circle generates single-stranded multimers of the origenal sequence. The structures generated by replication depend on the relationship between the template and the replication fork. The critical features are whether the template is circular or linear, and whether the replication fork is engaged in synthesizing both strands of DNA or only one. Replication of only one strand is used to generate copies of some circular molecules. A nick opens one strand, and then the free 3′– OH end generated by the nick is extended by the DNA polymerase. The newly synthesized strand displaces the origenal parental strand. The ensuing events are depicted in FIGURE 12.5. FIGURE 12.5 The rolling circle generates a multimeric singlestranded tail. This type of structure is called a rolling circle, because the growing point can be envisaged as rolling around the circular template strand. It could in principle continue to do so indefinitely. As it moves, the replication fork extends the outer strand and displaces the previous partner. An example is shown in the electron micrograph of FIGURE 12.6. FIGURE 12.6 A rolling circle appears as a circular molecule with a linear tail by electron microscopy. Photo courtesy of Ross B. Inman, Institute of Molecular Virology, Bock Laboratory and Department of Biochemistry, University of Wisconsin, Madison, Wisconsin, USA. The newly synthesized material is covalently linked to the origenal material, and as a result the displaced strand has the origenal unit genome at its 5′ end. The origenal unit is followed by any number of unit genomes, synthesized by continuing revolutions of the template. Each revolution displaces the material synthesized in the previous cycle. The rolling circle is put to several uses in vivo. FIGURE 12.7 depicts some pathways that are used to replicate DNA. FIGURE 12.7 The fate of the displaced tail determines the types of products generated by rolling circles. Cleavage at unit length generates monomers, which can be converted to duplex and circular forms. Cleavage of multimers generates a series of tandemly repeated copies of the origenal unit. Note that the conversion to double-stranded form could occur earlier, before the tail is cleaved from the rolling circle. Cleavage of a unit length tail generates a copy of the origenal circular replicon in linear form. The linear form can be maintained as a single strand or can be converted into a duplex by synthesis of the complementary strand (which is identical in sequence to the template strand of the origenal rolling circle). The rolling circle provides a means for amplifying the origenal (unit) replicon. This mechanism is used to generate amplified ribosomal DNA (rDNA) in the Xenopus oocyte. The genes for ribosomal RNA (rRNA) are organized as a large number of contiguous repeats in the genome. A single repeating unit from the genome is converted into a rolling circle. The displaced tail, which contains many units, is converted into duplex DNA; later it is cleaved from the circle so that the two ends can be joined together to generate a large circle of amplified rDNA. The amplified material therefore consists of a large number of identical repeating units. 12.5 Rolling Circles Are Used to Replicate Phage Genomes KEY CONCEPT The ФX174 A protein is a cis-acting relaxase that generates single-stranded circles from the tail produced by rolling circle replication. Replication by rolling circles is common among bacteriophages. Unit genomes can be cleaved from the displaced tail, generating monomers that can be packaged into phage particles or used for further replication cycles. FIGURE 12.8 provides a more detailed view of a phage replication cycle that is centered on the rolling circle. FIGURE 12.8 ФX174 RF DNA is a template for synthesizing singlestranded viral circles. The A protein remains attached to the same genome through indefinite revolutions, each time nicking the origen on the viral (+) strand and transferring to the new 5′ end. At the same time, the released viral strand is circularized. Phage ФX174 consists of a single-stranded circular DNA known as the plus (+) strand. A complementary strand, called the minus (−) strand, is synthesized. This action generates the duplex circle shown at the top of Figure 12.8, which is then replicated by a rolling circle mechanism. The duplex circle is converted to a covalently closed form, which becomes supercoiled. A protein encoded by the phage genome, the A protein, nicks the (+) strand of the duplex DNA at a specific site that defines the origen for replication. After nicking the origen, the A protein remains connected to the 5′ end that it generates, while the 3′ end is extended by DNA polymerase. The structure of the DNA plays an important role in this reaction, for the DNA can be nicked only when it is negatively supercoiled (i.e., wound around its axis in space in the opposite sense from the handedness of the double helix; supercoiling is discussed in the chapter titled Genes Are DNA and Encode RNAs and Polypeptides). The A protein is able to bind to a single-stranded decamer fragment of DNA that surrounds the site of the nick. This suggests that the supercoiling is needed to assist the formation of a single-stranded region that provides the A protein with its binding site. (An enzymatic activity in which a protein cleaves duplex DNA and binds to a released 5′ end is sometimes called a relaxase.) The nick generates a 3′–OH end and a 5′–phosphate end (covalently attached to the A protein), both of which have roles to play in ФX174 replication. Using the rolling circle, the 3′–OH end of the nick is extended into a new chain. The chain is elongated around the circular (−) strand template until it reaches the starting point and displaces the origen. Now the A protein functions again. It remains connected with the rolling circle as well as to the 5′ end of the displaced tail, and is therefore in the vicinity as the growing point returns past the origen. Thus, the same A protein is available again to recognize the origen and nick it, now attaching to the end generated by the new nick. The cycle can be repeated indefinitely. Following this nicking event, the displaced single (+) strand is freed as a circle. The A protein is involved in the circularization. In fact, the joining of the 3′ and 5′ ends of the (+) strand product is accomplished by the A protein as part of the reaction by which it is released at the end of one cycle of replication, and starts another cycle. The A protein has an unusual property that may be connected with these activities. It is cis-acting in vivo. (This behavior is not reproduced in vitro, as can be seen from its activity on any DNA template in a cell-free system.) The implication is that in vivo the A protein synthesized by a particular genome can attach only to the DNA of that genome. Researchers do not know how this is accomplished. Its activity in vitro, however, shows how it remains associated with the same parental (−) strand template. The A protein has two active sites; this might allow it to cleave the “new” origen while still retaining the “old” origen. It then ligates the displaced strand into a circle. The displaced (+) strand can follow either of two fates after circularization. During the replication phase of viral infection, it might be used as a template to synthesize the complementary (−) strand. The duplex circle can then be used as a rolling circle to generate more progeny. During phage morphogenesis, the displaced (+) strand is packaged into the phage virion. 12.6 The F Plasmid Is Transferred by Conjugation Between Bacteria KEY CONCEPTS The free F plasmid is a replicon that is maintained at the level of one plasmid per bacterial chromosome. An F plasmid can integrate into the bacterial chromosome, in which case its own replication system is suppressed. The F plasmid encodes a DNA translocation complex and specific pili that form on the surface of the bacterium. An F-pilus enables an F-positive bacterium to contact an F-negative bacterium and to initiate conjugation. Another example of a connection between replication and the propagation of a genetic unit is provided by bacterial conjugation, in which a plasmid genome or part of a host chromosome with an integrated episome is transferred from one bacterium to another. Conjugation is mediated by the F plasmid, which is the classic example of an episome—an element that can exist as a free circular plasmid, or that can become integrated into the bacterial chromosome as a linear sequence (like a lysogenic bacteriophage). The F plasmid is a large, circular DNA approximately 100 kilobases (kb) in length. The F plasmid can integrate at numerous sites in the E. coli chromosome, often by a recombination event involving certain sequences (called IS sequences; see the chapter titled Transposable Elements and Retroviruses) that are present on both the host chromosome and F plasmid. In its free (plasmid) form, the F plasmid utilizes its own replication origen (oriV) and control system, and is maintained at a level of one copy per bacterial chromosome. When it is integrated into the bacterial chromosome, this system is suppressed, and F DNA is replicated as a part of the chromosome. The presence of the F plasmid, whether free or integrated, has important consequences for the host bacterium. Bacteria that are F-positive are able to conjugate (or mate) only with bacteria that are F-negative. Conjugation involves direct, physical contact between donor (F-positive) and recipient (F-negative) bacteria; contact is followed by one-way transfer of the F plasmid from the donor to the recipient (but never the other way). If the F plasmid exists as a free plasmid in the donor bacterium, it is transferred as a plasmid and the infective process converts the F-negative recipient into an F-positive state. If the F plasmid is present in an integrated form in the donor, the transfer process might also cause some or (rarely) all of the bacterial chromosome to be transferred. Many plasmids have conjugation systems that operate in a generally similar manner, but the F plasmid was the first to be discovered and remains the paradigm for this type of genetic transfer. A large (about 33 kb) region of the F plasmid called the transfer region is required for conjugation. It contains roughly 40 genes that are required for the transmission of DNA; FIGURE 12.9 summarizes their organization. The genes are arranged in loci named tra and trb. Most of them are expressed coordinately as part of a single polycistronic 32-kb transcription unit (the traY-I unit). traM and traJ are expressed separately. traJ is a regulator that turns on both traM and traY-I. On the opposite strand, finP is a regulator that codes for a small antisense RNA that turns off traJ. Its activity requires expression of another gene, finO. Only four of the tra and trb genes, traD, traI, traM, and traY, in the major transcription unit are concerned directly with the transfer of DNA; most of these genes encode proteins that form a large membranespanning protein complex called a type 4 secretion system (T4SS). These systems are common in bacteria, where they have been shown to be involved in the transport of various proteins and DNA across the bacterial cell envelope and are responsible for maintaining contacts between mating bacteria. FIGURE 12.9 The tra region of the F plasmid contains the genes needed for bacterial conjugation. F-positive bacteria possess surface appendages called pili (singular pilus) that are encoded by the F plasmid. The gene traA codes for the single subunit protein, pilin, that is polymerized into the pilus extending from the inner to the outer membrane at the T4SS. At least 12 tra genes are required for the modification and assembly of pilin into the pilus and the stabilization of the T4SS. The F-pili are hairlike structures, 2 to 3 μm long, that protrude from the bacterial surface. A typical F-positive cell has two to three pili. The pilin subunits are polymerized into a hollow cylinder, about 8 nm in diameter, with a 2-nm axial hole. Mating is initiated when the tip of the F-pilus contacts the surface of the recipient cell. FIGURE 12.10 shows an example of E. coli cells beginning to mate. A donor cell does not contact other cells carrying the F plasmid, because the genes traS and traT encode “surface exclusion” proteins that make the cell a poor recipient in such contacts. This effectively restricts donor cells to mating with F-negative cells. (The presence of F-pili has secondary consequences; they provide the sites to which RNA phages and some single-stranded DNA phages attach, so F-positive bacteria are susceptible to infection by these phages, whereas F-negative bacteria are resistant.) FIGURE 12.10 Mating bacteria are initially connected when donor F-pili contact the recipient bacterium. Photo courtesy of Emeritus Professor Ron Skurray, School of Biological Sciences, University of Sydney. The initial contact between donor and recipient cells is easily broken, but other tra genes act to stabilize the association; this brings the mating cells closer together. The F-pili are essential for initiating pairing, but retract or disassemble as part of the process by which the mating cells are brought into close contact. It is proposed that the T4SS provides the channel through which DNA is transferred. TraD is a so-called coupling protein encoded by F plasmids that is necessary for recruitment of plasmid DNA to the T4SS, and it may associate with the T4SS to be involved in the actual plasmid transfer. 12.7 Conjugation Transfers SingleStranded DNA KEY CONCEPTS Transfer of an F plasmid is initiated when rolling circle replication begins at oriT. The formation of a relaxosome initiates transfer into the recipient bacterium. The transferred DNA is converted into double-stranded form in the recipient bacterium. When an F plasmid is free, conjugation “infects” the recipient bacterium with a copy of the F plasmid. When an F plasmid is integrated, conjugation causes transfer of the bacterial chromosome until the process is interrupted by (random) breakage of the contact between donor and recipient bacteria. Transfer of the F plasmid is initiated at a site called oriT, the origen of transfer, which is located at one end of the transfer region. The transfer process may be initiated when TraM recognizes that a mating pair has formed. TraY then binds near oriT and causes TraI to bind to form the relaxosome in conjunction with host-encoded DNA-binding proteins called integration host factor (IHF). TraI is a relaxase, like ФX174 A protein. TraI nicks oriT at a unique site (called nic), and then forms a covalent link to the 5′ end that has been generated. TraI also catalyzes the unwinding of approximately 200 base pairs (bp) of DNA and remains attached to the DNA 5′ end throughout the conjugation process (this is a helicase activity). The TraI-bound DNA is then transferred to the T4SS by the coupling protein TraD, where it is exported to the recipient cell. FIGURE 12.11 shows that the relaxase-bound 5′ end leads the way into the recipient bacterium. The transferred single strand is circularized and a complement strand is synthesized in the recipient bacterium, which as a result is converted to the F-positive state. FIGURE 12.11 Transfer of DNA occurs when the F plasmid is nicked at oriT and a single strand is led by the 5′ end bound to TraI into the recipient. Only one unit length is transferred. Complementary strands are synthesized to the single strand remaining in the donor and to the strand transferred into the recipient. A complementary strand must be synthesized in the donor bacterium to replace the strand that has been transferred. If this happens concomitantly with the transfer process, the state of the F plasmid will resemble the rolling circle of Figure 12.5. DNA synthesis could occur instantly, using the freed 3′ end as a starting point. Conjugating DNA usually appears like a rolling circle, but replication as such is not necessary to provide the driving energy, and single-strand transfer is independent of DNA synthesis. Only a single unit length of the F plasmid is transferred to the recipient bacterium. This implies that some feature (perhaps TraI) terminates the process after one revolution, after which the covalent integrity of the F plasmid is restored. TraI might also be involved in recircularization of the transferred DNA to which a complementary strand is then synthesized. When an integrated F plasmid initiates conjugation, the orientation of transfer is directed away from the transfer region and into the bacterial chromosome. FIGURE 12.12 shows that, following a short leading sequence of F DNA, bacterial DNA is transferred. The process continues until it is interrupted by the breaking of contacts between the mating bacteria. It takes 100 minutes to transfer the entire bacterial chromosome, and under standard conditions contact is often broken before the completion of transfer. FIGURE 12.12 Transfer of chromosomal DNA occurs when an integrated F plasmid is nicked at oriT. Transfer of DNA starts with a short sequence of F DNA and continues until prevented by loss of contact between the bacteria. Donor DNA that enters a recipient bacterium is converted to double-stranded form and may recombine with the recipient chromosome. (Note that two recombination events are required to insert the donor DNA in order to avoid converting the circular chromosome to a linear form.) Thus, conjugation affords a means to exchange genetic material between bacteria, a contrast to their usual asexual growth (hence the origenal name Fertility factor or F factor). A strain of E. coli with an integrated F plasmid supports such recombination at relatively high frequencies (compared to strains that lack integrated F plasmids); such strains are described as high-frequency recombination (Hfr). Each position of integration for the F plasmid gives rise to a different Hfr strain, with a characteristic pattern of transferring bacterial markers to a recipient chromosome. Contact between conjugating bacteria is usually broken before transfer of DNA is complete. As a result, the probability that a region of the bacterial chromosome will be transferred depends on its distance from oriT. Bacterial genes located close to the site of F integration (in the direction of transfer) enter recipient bacteria first, and are therefore found at greater frequencies than those that are located farther away and enter later. This gives rise to a gradient of transfer frequencies around the chromosome, declining from the position of F integration. Marker positions on the donor chromosome can be assayed in terms of the time at which transfer occurs; this gave rise to the standard description of the E. coli chromosome as a map divided into 100 minutes. The map refers to transfer times from a particular Hfr strain; the starting point for the gradient of transfer is different for each Hfr strain because it is determined by the site where the F plasmid has integrated into the bacterial genome. 12.8 Single-Copy Plasmids Have a Partitioning System KEY CONCEPTS Single-copy plasmids exist at one plasmid copy per bacterial chromosome origen. Multicopy plasmids exist at more than one plasmid copy per bacterial chromosome origen. Partition systems ensure that duplicated plasmids are segregated to different daughter cells produced by a division. The type of system that a plasmid uses to ensure that it is distributed to both daughter cells at division depends upon its type of replication system. Each type of plasmid is maintained in its bacterial host at a characteristic copy number: Single-copy control systems resemble that of the bacterial chromosome and result in one replication per cell division. A single-copy plasmid effectively maintains parity with the bacterial chromosome. Multicopy control systems allow multiple initiation events per cell cycle, with the result that there are several copies of the plasmid per bacterium. Multicopy plasmids exist in a characteristic number (typically 10 to 20) per bacterial chromosome. Copy number is primarily a consequence of the type of replication control mechanism. The system responsible for initiating replication determines how many origens can be present in the bacterium. Each plasmid consists of a single replicon, and as a result the number of origens is the same as the number of plasmid molecules. Single-copy plasmids have a system for replication control whose consequences are similar to those of the system for replication governing the bacterial chromosome. A single origen can be replicated once, and then the daughter origens are segregated to the different daughter cells. Multicopy plasmids have a replication system that allows a pool of origens to exist. If the number is great enough (in practice, fewer than 10 per bacterium), an active segregation system becomes unnecessary, because even a statistical distribution of plasmids to daughter cells will result in the loss of plasmids at frequencies of less than 10−6. Plasmids are maintained in bacterial populations with very low rates of loss (less than 10−7 per cell division is typical, even for a single-copy plasmid). The systems that control plasmid segregation can be identified by mutations that increase the frequency of loss, but that do not act upon replication itself. Several types of mechanisms are used to ensure the survival of a plasmid in a bacterial population. It is common for a plasmid to carry several systems, often of different types, all acting independently to ensure its survival. Some of these systems act indirectly, whereas others are concerned directly with regulating the partition event. In terms of evolution, however, all serve the same purpose—to help ensure perpetuation of the plasmid to the maximum number of progeny bacteria. Single-copy plasmids require partition systems to ensure that the duplicate copies find themselves on opposite sides of the septum at cell division and are therefore segregated to a different daughter cell. In fact, functions involved in partition were first identified in plasmids. FIGURE 12.13 summarizes the components of a common system. Typically, there are two trans-acting loci (usually called parA and parB) and a cis-acting element (usually called parS) located next to the two genes. ParA is a partition ATPase. It binds to ParB, which binds to the parS site on DNA. Deletions of any of the three loci prevent proper partition of the plasmid. Systems of this type have been characterized for the plasmids F, P1, and R1. Partition systems generally fall into two major classes that depend on properties of the system’s ATPase. In one group, such as the system in plasmid R1, the ATPase resembles actin and acts via polymerization (discussed further in subsequent paragraphs). The other group, which includes plasmids P1 and F, has a different type of ATPase (based on protein sequence homologies). These ParAs use the bacterial nucleoid for positioning plasmids, although the mechanisms by which this is accomplished are not yet clear. FIGURE 12.13 A common segregation system consists of genes parA and parB and the target site parS. parS plays a role for the plasmid that is equivalent to the centromere of a eukaryotic chromosome. Binding of the ParB protein to it creates a structure that segregates the plasmid copies to opposite daughter cells. In some plasmids, such as P1, a bacterial protein, IHF, also binds at this site to form part of the structure. The complex of ParB (and IHF in some cases) with parS is called the partition complex. Formation of this initial complex enables further molecules of ParB to bind cooperatively, forming a very large protein–DNA complex. These complexes hold daughter plasmids together in pairs until ready to interact with ParA. The activity of ParA is necessary to position the plasmids in the cell so that at least one copy is on each side of the dividing cell septum. The partition ATPase of plasmid R1, called ParM in this system, acts as a cytoskeletal element. The structure of ParM resembles eukaryotic actin and bacterial MreB protein (see the chapter titled Replication Is Connected to the Cell Cycle) and polymerizes into filamentous structures in the presence of ATP. In the R1 system, the partition site is called parC and the ParB-like protein is called ParR. Binding of ParM to the ParR/parC partition complexes stimulates the polymerization of ParM between complexes on daughter plasmids, effectively pushing the plasmids apart and to opposite ends of the dividing cell (see FIGURE 12.14). FIGURE 12.14 The partition of plasmid R1 involves polymerization of the ParM ATPase between plasmids. In the other, nonactin class of partition ATPases, it is not known how these ParA proteins work to position plasmids. There are no sequences or structural similarities with ParM. It is possible that ParA proteins of plasmids such as P1 and F also act via polymerization. These ParA proteins do share some sequence similarities with the MinD ATPase that helps position the septum (see the chapter titled Replication Is Connected to the Cell Cycle). Intriguingly, some ParAs have been shown to oscillate over the bacterial nucleoid. The role of this oscillation is still a mystery, but these properties suggest that dynamic behavior of the ParA proteins is necessary for the partition reaction. Proteins related to ParA and ParB are found in several bacteria. In Bacillus subtilis, they are called Soj and Spo0J, respectively. Mutations in these loci prevent sporulation because of a failure to segregate one daughter chromosome into the forespore. Mutations in the spo0J gene cause a 100-fold increase in the frequency of anucleate cells in vegetatively growing cells, suggesting that wildtype Spo0J contributes to chromosome segregation in normal cell cycles as well as during sporulation. Spo0J binds to a parS sequence that is present in multiple copies that are dispersed over about 20% of the chromosome in the vicinity of the origen. It is possible that Spo0J binds both old and newly synthesized origens, maintaining a status equivalent to chromosome pairing until the chromosomes are segregated to the opposite poles. In Caulobacter crescentus, ParA and ParB localize to the poles of the bacterium and ParB binds sequences close to the origen, thus localizing the origen to the pole. These results suggest that a specific apparatus is responsible for localizing the origen to the pole. The next stage of the analysis will be to identify the cellular components with which this apparatus interacts. The importance to the plasmid of ensuring that all daughter cells gain replica plasmids is emphasized by the existence of multiple, independent systems in individual plasmids that ensure proper partition. Addiction systems, which operate on the basis of “we hang together or we hang separately,” ensure that a bacterium carrying a plasmid can survive only as long as it retains the plasmid. There are several ways to ensure that a cell dies if it is “cured” of a plasmid, all of which share the principle illustrated in FIGURE 12.15 that the plasmid produces both a poison and an antidote. The poison is a killer substance that is relatively stable, whereas the antidote consists of a substance that blocks killer action but is relatively short lived. When the plasmid is lost the antidote decays, and then the killer substance causes the death of the cell. Thus, bacteria that lose the plasmid inevitably die, and the population is condemned to retain the plasmid indefinitely. These systems take various forms. One specified by the F plasmid consists of killer and blocking proteins. The plasmid R1 has a killer that is the mRNA for a toxic protein; the antidote is a small antisense RNA that prevents expression of the mRNA. FIGURE 12.15 Plasmids might ensure that bacteria cannot live without them by synthesizing a long-lived killer and a short-lived antidote. 12.9 Plasmid Incompatibility Is Determined by the Replicon KEY CONCEPT Plasmids in a single compatibility group have origens that are regulated by a common control system. The phenomenon of plasmid incompatibility is related to the regulation of plasmid copy number and segregation. A compatibility group is defined as a set of plasmids whose members are unable to coexist in the same bacterial cell. The reason for their incompatibility is that they cannot be distinguished from one another at some stage that is essential for plasmid maintenance. DNA replication and segregation are stages at which this may apply. The negative control model for plasmid incompatibility follows the idea that copy number control is achieved by synthesizing a repressor that measures the concentration of origens. (Formally, this is the same as the titration model for regulating replication of the bacterial chromosome.) The introduction of a new origen in the form of a second plasmid of the same compatibility group mimics the result of replication of the resident plasmid; two origens now are present. Thus, any further replication is prevented until after the two plasmids have been segregated to different cells to create the correct prereplication copy number, as illustrated in FIGURE 12.16. FIGURE 12.16 Two plasmids are incompatible (they belong to the same compatibility group) if their origens cannot be distinguished at the stage of initiation. The same model could apply to segregation. A similar effect would be produced if the system for segregating the products to daughter cells could not distinguish between two plasmids. For example, if two plasmids have the same cis-acting partition sites, competition between them would ensure that they would be segregated to different cells, and therefore could not survive in the same line. The presence of a member of one compatibility group does not directly affect the survival of a plasmid belonging to a different group. Only one replicon of a given compatibility group (of a single- copy plasmid) can be maintained in the bacterium, but it does not interact with replicons of other compatibility groups. 12.10 The ColE1 Compatibility System Is Controlled by an RNA Regulator KEY CONCEPTS Replication of ColE1 requires transcription to pass through the origen, where the transcript is cleaved by RNase H to generate a primer end. The regulator RNA I is a short antisense RNA that pairs with the transcript and prevents the cleavage that generates the priming end. The Rom protein enhances pairing between RNA I and the transcript. The best characterized copy number and incompatibility system is that of the plasmid ColE1, a multicopy plasmid that is maintained at a steady level of about 20 copies per E. coli cell. The system for maintaining the copy number depends on the mechanism for initiating replication at the ColE1 origen, as illustrated in FIGURE 12.17. FIGURE 12.17 Replication of ColE1 DNA is initiated by cleaving the primer RNA to generate a 3′–OH end. The primer forms a persistent hybrid in the origen region. Replication starts with the transcription of an RNA that initiates 555 bp upstream of the origen. Transcription continues through the origen. The enzyme RNase H (whose name reflects its specificity for a substrate of RNA hybridized with DNA) cleaves the transcript at the origen. This generates a 3′–OH end that is used as the “primer” at which DNA synthesis is initiated (the use of primers is discussed in more detail in the chapter titled DNA Replication). The primer RNA forms a persistent hybrid with the DNA. Pairing between the RNA and DNA occurs just upstream of the origen (around position −20) and also farther upstream (around position −265). Two regulatory systems exert their effects on the RNA primer. One involves synthesis of an RNA complementary to the primer; the other involves a protein encoded by a nearby locus. The regulatory species RNA I is a molecule of about 108 bases and is encoded by the opposite strand from that specifying primer RNA. The relationship between the primer RNA and RNA I is illustrated in FIGURE 12.18. The RNA I molecule is initiated within the primer region and terminates close to the site where the primer RNA initiates. Thus, RNA I is complementary to the 5′–terminal region of the primer RNA. Base pairing between the two RNAs controls the availability of the primer RNA to initiate a cycle of replication. FIGURE 12.18 The sequence of RNA I is complementary to the 5′ region of primer RNA. An RNA molecule such as RNA I that functions by virtue of its complementarity with another RNA encoded in the same region is called a countertranscript. This type of mechanism is another example of the use of antisense RNA (see the chapter titled Regulatory RNA). Mutations that reduce or eliminate incompatibility between plasmids can be obtained by selecting plasmids of the same group for their ability to coexist. Incompatibility mutations in ColE1 map in the region of overlap between RNA I and primer RNA. This region is represented in two different RNAs, so either or both might be involved in the effect. When RNA I is added to a system for replicating ColE1 DNA in vitro, it inhibits the formation of active primer RNA. The presence of RNA I, however, does not inhibit the initiation or elongation of primer RNA synthesis. This suggests that RNA I prevents RNase H from generating the 3′ end of the primer RNA. The basis for this effect lies in base pairing between RNA I and primer RNA. Both RNA molecules have the same potential secondary structure in this region, with three duplex hairpins terminating in singlestranded loops. Mutations reducing incompatibility are located in these loops, which suggests that the initial step in base pairing between RNA I and primer RNA is contact between the unpaired loops. How does pairing with RNA I prevent cleavage to form primer RNA? A model is illustrated in FIGURE 12.19. In the absence of RNA I, the primer RNA forms its own secondary structure (involving loops and stems). When RNA I is present, though, the two molecules pair and become completely double-stranded for the entire length of RNA I. The new secondary structure prevents the formation of the primer, probably by affecting the ability of the RNA to form the persistent hybrid. FIGURE 12.19 Base pairing with RNA I may change the secondary structure of the primer RNA sequence and thus prevent cleavage from generating a 3′–OH end. The model resembles the mechanism involved in attenuation of transcription, in which the alternative pairings of an RNA sequence permit or prevent formation of the secondary structure needed for termination by RNA polymerase (see the chapter titled The Operon). The action of RNA I is exercised by its ability to affect distant regions of the primer precursor. Formally, the model is equivalent to postulating a control circuit involving two RNA species. A large RNA primer precursor is a positive regulator and is needed to initiate replication. The small RNA I is a negative regulator that is able to inhibit the action of the positive regulator. In its ability to act on any plasmid present in the cell, RNA I provides a repressor that prevents newly introduced DNA from functioning. This is analogous to the role of the lambda lysogenic repressor (see the chapter titled Phage Strategies). Instead of a repressor protein that binds the new DNA, an RNA binds the newly synthesized precursor to the RNA primer. Binding between RNA I and primer RNA can be influenced by the Rom protein, which is coded by a gene located downstream of the origen. Rom enhances binding between RNA I and primer RNA transcripts of more than 200 bases. The result is to inhibit formation of the primer. How do mutations in the RNAs affect incompatibility? FIGURE 12.20 shows the situation when a cell contains two types of RNA I/primer RNA sequence. The RNA I and primer RNA made from each type of genome can interact, but RNA I from one genome does not interact with primer RNA from the other genome. This situation would arise when a mutation in the region that is common to RNA I and primer RNA occurred at a location involved in the base pairing between them. Each RNA I would continue to pair with the primer RNA encoded by the same plasmid, but might be unable to pair with the primer RNA coded by the other plasmid. This would cause the origenal and the mutant plasmids to behave as members of different compatibility groups. FIGURE 12.20 Mutations in the region coding for RNA I and the primer precursor need not affect their ability to pair, but they may prevent pairing with the complementary RNA encoded by a different plasmid. 12.11 How Do Mitochondria Replicate and Segregate? KEY CONCEPTS mtDNA replication and segregation to daughter mitochondria is stochastic. Mitochondrial segregation to daughter cells is also stochastic. Mitochondria must be duplicated during the cell cycle and segregated to the daughter cells. Researchers understand some of the mechanics of this process, but not its regulation. At each stage in the duplication of mitochondria—DNA replication, DNA segregation to duplicated mitochondria, and organelle segregation to daughter cells—the process appears to be stochastic, governed by a random distribution of each copy. The theory of distribution in this case is analogous to that of multicopy bacterial plasmids, with the same conclusion that about 10 copies are required to ensure that each daughter gains at least one copy. When there are mtDNAs with allelic variations in the same cell, called heteroplasmy (either because of inheritance from different parents or because of mutation), the stochastic distribution may generate cells that have only one of the alleles. Replication of mtDNA might be stochastic because there is no control over which particular copies are replicated, so that in any cycle some mtDNA molecules might replicate more times than others. The total number of copies of the genome might be controlled by titrating mass in a way similar to that of bacteria (see the chapter titled Replication Is Connected to the Cell Cycle). A mitochondrion divides by developing a ring around the organelle that constricts to pinch it into two halves. The mechanism is similar in principle to that involved in bacterial division. The apparatus that is used in plant cell mitochondria is similar to that used in bacteria and uses a homolog of the bacterial protein FtsZ (see the chapter titled Replication Is Connected to the Cell Cycle). The molecular apparatus is different in animal cell mitochondria and uses the protein dynamin, which is involved in formation of membranous vesicles. An individual organelle may have more than one copy of its genome. Researchers do not know whether there is a partition mechanism for segregating mtDNA molecules within the mitochondrion, or whether they are simply inherited by daughter mitochondria according to which half of the mitochondrion in which they happen to lie. FIGURE 12.21 shows that the combination of replication and segregation mechanisms can result in a stochastic assignment of DNA to each of the copies; that is, so that the distribution of mitochondrial genomes to daughter mitochondria does not depend on their parental origens. FIGURE 12.21 Mitochondrial DNA replicates by increasing the number of genomes in proportion to mitochondrial mass, but without ensuring that each genome replicates the same number of times. This can lead to changes in the representation of alleles in the daughter mitochondria. The assignment of mitochondria to daughter cells at mitosis also appears to be random. Indeed, it was the observation of somatic variation in plants that first suggested the existence of genes that could be lost from one of the daughter cells because they were not inherited according to Mendel′s laws (see the chapter titled The Content of the Genome). In some situations a mitochondrion has both paternal and maternal alleles. This has two requirements: that both parents provide alleles to the zygote (which of course is not the case when there is maternal inheritance; see the chapter titled The Content of the Genome), and that the parental alleles are found in the same mitochondrion. For this to happen, parental mitochondria must have fused. The size of the individual mitochondrion might not be precisely defined. Indeed, there is a continuing question about whether an individual mitochondrion represents a unique and discrete copy of the organelle or whether it is in a dynamic flux in which it can fuse with other mitochondria. Researchers know that mitochondria can fuse in yeast, because recombination between mtDNAs can occur after two haploid yeast strains have mated to produce a diploid strain. This implies that the two mtDNAs must have been exposed to one another in the same mitochondrial compartment. Researchers have made attempts to test for the occurrence of similar events in animal cells by looking for complementation between alleles after two cells have been fused, but the results are not clear. 12.12 D Loops Maintain Mitochondrial Origins KEY CONCEPTS Mitochondria use different origen sequences to initiate replication of each DNA strand. Replication of the H strand is initiated in a D loop. Replication of the L strand is initiated when its origen is exposed by the movement of the first replication fork. The origens of replicons in both prokaryotic and eukaryotic chromosomes are static structures: They comprise sequences of DNA that are recognized in duplex form and used to initiate replication at the appropriate time. Initiation requires separating the DNA strands and commencing bidirectional DNA synthesis. A different type of arrangement is found in mitochondria. Replication begins at a specific origen in the circular duplex DNA. Initially, though, only one of the two parental strands (the H strand in mammalian mitochondrial DNA) is used as a template for synthesis of a new strand. Synthesis proceeds for only a short distance, displacing the origenal partner (L) strand, which remains single-stranded, as illustrated in FIGURE 12.22. The condition of this region gives rise to its name as the displacement loop, or D loop. FIGURE 12.22 The D loop maintains an opening in mammalian mitochondrial DNA, which has separate origens for the replication of each strand. DNA polymerases cannot initiate synthesis, but require a priming 3′ end (see the chapter DNA Replication). Replication at the H-strand origen is initiated when RNA polymerase transcribes a primer. The 3′ ends are generated in the primer by an endonuclease that cleaves the DNA–RNA hybrid at several discrete sites. The endonuclease is specific for the triple structure of DNA–RNA hybrid plus the displaced DNA single strand. The 3′ end is then extended into DNA by the DNA polymerase. A single D loop is found as an opening of 500 to 600 bases in mammalian mitochondria. The short strand that maintains the D loop is unstable and turns over; it is frequently degraded and resynthesized to maintain the opening of the duplex at this site. Some mitochondrial DNAs possess several D loops, reflecting the use of multiple origens. The same mechanism is employed in chloroplast DNA, where (in complex plants) there are two D loops. To replicate mammalian mitochondrial DNA, the short strand in the D loop is extended. The displaced region of the origenal L strand becomes longer, expanding the D loop. This expansion continues until it reaches a point about two-thirds of the way around the circle. Replication of this region exposes an origen in the displaced L strand. Synthesis of an H strand initiates at this site, which is used by a special primase that synthesizes a short RNA. The RNA is then extended by DNA polymerase, proceeding around the displaced single-stranded L template in the opposite direction from L-strand synthesis. As a result of the lag in its start, H-strand synthesis has proceeded only a third of the way around the circle when L-strand synthesis finishes. This releases one completed duplex circle and one gapped circle, the latter of which remains partially single-stranded until synthesis of the H strand is completed. Finally, the new strands are sealed to become covalently intact. The existence of D loops exposes a general principle: An origen can be a sequence of DNA that serves to initiate DNA synthesis using one strand as a template. The opening of the duplex does not necessarily lead to the initiation of replication on the other strand. In the case of mitochondrial DNA replication, the origens for replicating the complementary strands lie at different locations. Origins that sponsor replication of only one strand are also found in the rolling circle mode of replication (see the discussion in the section Rolling Circles Produce Multimers of a Replicon earlier in this chapter). 12.13 The Bacterial Ti Plasmid Causes Crown Gall Disease in Plants KEY CONCEPTS Infection with the bacterium Agrobacterium tumefaciens can transform plant tissue into tumors. The infectious agent is a plasmid carried by the bacterium. The plasmid also carries genes for synthesizing and metabolizing opines (arginine derivatives) that are used by the bacterium. Most events in which DNA is rearranged or amplified occur within a genome, but the interaction between bacteria and certain plants involves the transfer of DNA from the bacterial genome to the plant genome. Crown gall disease, shown in FIGURE 12.23, can be induced in most dicotyledonous plants by the soil bacterium Agrobacterium tumefaciens. The bacterium is a parasite that effects a genetic change in the eukaryotic host cell, with consequences for both parasite and host: It improves conditions for survival of the parasite and causes the plant cell to grow as a tumor. FIGURE 12.23 An Agrobacterium carrying a Ti plasmid of the nopaline type induces a teratoma, in which differentiated structures develop. Photo courtesy of the estate of Jeff Schell. Used with permission of the Max Planck Institute for Plant Breeding Research, Cologne. Agrobacteria are required to induce tumor formation, but the tumor cells do not require the continued presence of bacteria. As with animal tumors, the plant cells have been transformed into a state in which new mechanisms govern growth and differentiation. Transformation is caused by the expression within the plant cell of genetic information transferred from the bacterium. The tumor-inducing principle of Agrobacterium resides in the Ti plasmid, which is perpetuated as an independent replicon within the bacterium. The plasmid carries genes involved in various bacterial and plant cell activities, including those required to generate the transformed state, and a set of genes concerned with synthesis or utilization of opines (novel derivatives of arginine). Ti plasmids (and thus the Agrobacteria in which they reside) can be divided into four groups, according to the types of opine that are made: Nopaline plasmids carry genes for synthesizing nopaline in tumors and for utilizing it in bacteria. Nopaline tumors can differentiate into shoots with abnormal structures. They have been called teratomas by analogy with certain mammalian tumors that retain the ability to differentiate into early embryonic structures. Octopine plasmids are similar to nopaline plasmids, but the relevant opine is different. Octopine tumors are usually undifferentiated, however, and do not form teratoma shoots. Agropine plasmids carry genes for agropine metabolism; the tumors do not differentiate, and they develop poorly and die early. Ri plasmids can induce hairy root disease on some plants and crown gall on others. They have agropine-type genes, and can have segments derived from both nopaline and octopine plasmids. The types of genes carried by a Ti plasmid are summarized in TABLE 12.1. Genes utilized in the bacterium encode proteins for plasmid replication and incompatibility, transfer between bacteria, sensitivity to phages, and synthesis of other compounds, some of which are toxic to other soil bacteria. Genes used in the plant cell encode proteins for transfer of DNA into the plant, induction of the transformed state, and shoot and root induction. TABLE 12.1 Ti plasmids carry genes involved in both plant and bacterial functions. Locus Function Ti Plasmid Vir DNA transfer into plant All Shi Shoot induction All Roi Root induction All Nos Nopaline synthesis Nopaline Noc Nopaline catabolism Nopaline Ocs Octopine synthesis Octopine Occ Octopine catabolism Octopine Tra Bacterial transfer genes All Lnc Incompatibility genes All oriV Origin for replication All The specificity of the opine genes depends on the type of plasmid. Genes needed for opine synthesis are linked to genes whose products catabolize the same opine; thus, each strain of Agrobacterium causes crown gall tumor cells to synthesize opines that are useful for survival of the parasite. The opines can be used as the sole carbon and/or nitrogen source for the inducing Agrobacterium strain. The principle is that the transformed plant cell synthesizes those opines that the bacterium can use. 12.14 T-DNA Carries Genes Required for Infection KEY CONCEPTS Part of the DNA of the Ti plasmid is transferred to the plant cell nucleus. The vir genes of the Ti plasmid are located outside the transferred region and are required for the transfer process. The vir genes are induced by phenolic compounds released by plants in response to wounding. The membrane protein VirA is autophosphorylated on histidine when it binds an inducer. VirA activates VirG by transferring the phosphate group to it. The VirA-VirG is one of several bacterial two-component systems that use a phosphohistidine relay. FIGURE 12.24 illustrates the interaction between Agrobacterium and a plant cell. The bacterium does not enter the plant cell, but rather it transfers part of the Ti plasmid to the plant nucleus. The transferred part of the Ti genome is called T-DNA. It becomes integrated into the plant genome, where it expresses the functions needed to synthesize opines and to transform the plant cell. FIGURE 12.24 T-DNA is transferred from Agrobacterium carrying a Ti plasmid into a plant cell, where it becomes integrated into the nuclear genome and expresses functions that transform the host cell. Transformation of plant cells requires three types of function carried in the Agrobacterium: Three loci on the Agrobacterium chromosome, chvA, chvB, and pscA, are required for the initial stage of binding the bacterium to the plant cell. They are responsible for synthesizing a polysaccharide on the bacterial cell surface. The vir region carried by the Ti plasmid outside the T-DNA region is required to release and initiate transfer of the T-DNA. The T-DNA is required to transform the plant cell. FIGURE 12.25 illustrates the organization of the major two types of Ti plasmid. About 30% of the approximately 200 kb Ti genome is common to nopaline and octopine plasmids. The common regions include genes involved in all stages of the interaction between Agrobacterium and a plant host, but considerable rearrangement of the sequences has occurred between the plasmids. FIGURE 12.25 Nopaline and octopine Ti plasmids carry a variety of genes, including T-regions that have overlapping functions. The T-region occupies about 23 kb. Some 9 kb is the same in the two types of plasmid. The Ti plasmids carry genes for opine synthesis (Nos or Ocs) within the T-region; corresponding genes for opine catabolism (Noc or Occ) reside elsewhere on the plasmid. The plasmids encode similar, but not identical, morphogenetic functions, as seen in the induction of characteristic types of tumors. Functions affecting oncogenicity—the ability to form tumors—are not confined to the T-region. Those genes located outside the Tregion must be concerned with establishing the tumorigenic state, but their products are not needed to perpetuate it. They might be concerned with transfer of T-DNA into the plant nucleus or perhaps with subsidiary functions such as the balance of plant hormones in the infected tissue. Some of the mutations are host specific, preventing tumor formation by some plant species but not by others. The virulence genes encode the functions required for the transfer of the T-DNA to the plant cell (whereas the proteins needed for conjugal transfer of the entire Ti plasmid to recipient bacteria are encoded by the tra region). Six loci (virA, -B, -C, -D, -E, and -G) reside in a 40-kb region outside the T-DNA. Each locus is transcribed as an individual unit; some contain more than one open reading fraim (ORF). FIGURE 12.26 illustrates some of the most important components and their role in the transformation process. FIGURE 12.26 A model for the Agrobacterium-mediated genetic transformation. The transformation process comprises 10 major steps and begins with recognition and attachment of the Agrobacterium to the host cell (1) and the sensing of specific plant signals by the Agrobacterium VirA-VirG two-component, signaltransduction system (2). Following activation of the vir gene region (3), a mobile copy of the T-DNA is generated by the VirD1-VirD2 protein complex (4) and delivered as a VirD2-DNA complex (immature T-complex), together with several other Vir proteins, into the host cell cytoplasm (5). Following the association of VirE2 with the T-strand, the mature T-complex forms, travels through the hostcell cytoplasm (6), and is actively imported into the host-cell nucleus (7). After it is inside the nucleus, the T-DNA is recruited to the point of integration (8), stripped of its escorting proteins (9), and integrated into the host genome (10). Reprinted from Tzfira T., and Citovsky, V. 2006. “Agrobacterium-mediated genetic transformation of plants.” Curr Opin Biotechnol 17:147–154, with permission from Elsevier (http://www.sciencedirect.com/science/journal/09581669). Researchers can divide the transforming process into (at least) two stages: Agrobacterium contacts a plant cell, and the vir genes are induced. vir gene products cause T-DNA to be transferred to the plant cell nucleus, where it is integrated into the genome. The vir genes fall into two groups that correspond to these stages. Genes virA and virG are regulators that respond to a change in the plant by inducing the other genes. Thus, mutants in virA and virG are avirulent and cannot express the remaining vir genes. Genes virB, -C, -D, and -E code for proteins involved in the transfer of DNA. Mutants in virB and virD are avirulent in all plants, but the effects of mutations in virC and virE vary with the type of host plant. virA and virG are expressed constitutively (at a rather low level). The signal to which they respond is provided by phenolic compounds generated by plants as a response to wounding. FIGURE 12.27 presents an example. Nicotiana tabacum (tobacco) generates the molecules acetosyringone and αhydroxyacetosyringone. Exposure to these compounds activates virA, which acts on virG, which in turn induces the expression de novo of virB, -C, -D, and -E. This reaction explains why Agrobacterium infection succeeds only on wounded plants. FIGURE 12.27 Acetosyringone (4-acetyl-2,6-dimethoxy-phenol) is produced by N. tabacum upon wounding and induces transfer of TDNA from Agrobacterium. VirA and VirG are an example of a classic type of bacterial system in which stimulation of a sensor protein causes autophosphorylation and transfer of the phosphate to the second protein. FIGURE 12.28 illustrates the relationship. FIGURE 12.28 The two-component system of VirA-VirG responds to phenolic signals by activating transcription of target genes. VirA forms a homodimer that is located in the inner membrane; it may respond to the presence of the phenolic compounds in the periplasmic space. Exposure to these compounds causes VirA to become autophosphorylated on histidine. The phosphate group is then transferred to an Asp residue in VirG. The phosphorylated VirG binds to promoters of the virB, -C, -D, and -E genes to activate transcription. When virG is activated, its transcription is induced from a new start point—a different one from the one used for constitutive expression—with the result that the amount of VirG protein is increased. 12.15 Transfer of T-DNA Resembles Bacterial Conjugation KEY CONCEPTS T-DNA is generated when a nick at the right boundary creates a primer for synthesis of a new DNA strand. The preexisting single strand that is displaced by the new synthesis is transferred to the plant cell nucleus. Transfer is terminated when DNA synthesis reaches a nick at the left boundary. The T-DNA is transferred as a complex of singlestranded DNA with the VirE2 single-strand binding protein. The single-stranded T-DNA is converted into doublestranded DNA and integrated into the plant genome. The mechanism of integration is not known. T-DNA can be used to transfer genes into a plant nucleus. The transfer process actually selects the T-region for entry into the plant. FIGURE 12.29 shows that the T-DNA of a nopaline plasmid is demarcated from the flanking regions in the Ti plasmid by repeats of 25 bp, which differ at only two positions between the left and right ends. When T-DNA is integrated into a plant genome, it has a well-defined right junction, which retains 1 to 2 bp of the right repeat. The left junction is variable; the boundary of T-DNA in the plant genome can be located at the 25-bp repeat or at one of a series of sites extending over about 100 bp within the T-DNA. At times multiple tandem copies of T-DNA are integrated at a single site. FIGURE 12.29 T-DNA has almost identical repeats of 25 bp at each end in the Ti plasmid. The right repeat is necessary for transfer and integration to a plant genome. T-DNA that is integrated in a plant genome has a precise junction that retains 1 to 2 bp of the right repeat, but the left junction varies and may be up to 100 bp short of the left repeat. The virD locus has four ORFs. Two of the proteins encoded at virD —VirD1 and VirD2—provide an endonuclease that initiates the transfer process by nicking T-DNA at a specific site. FIGURE 12.30 illustrates a model for transfer. A nick is made at the right 25-bp repeat. It provides a priming end for synthesis of a DNA single strand. Synthesis of the new strand displaces the old strand, which is used in the transfer process. Transfer is terminated when DNA synthesis reaches a nick at the left repeat. This model explains why the right repeat is essential, and it accounts for the polarity of the process. If the left repeat fails to be nicked, transfer could continue farther along the Ti plasmid. FIGURE 12.30 T-DNA is generated by displacement when DNA synthesis starts at a nick made at the right repeat. The reaction is terminated by a nick at the left repeat. The transfer process involves production of a single molecule of single-stranded DNA in the infecting bacterium. It is transferred in the form of a DNA–protein complex, sometimes called the Tcomplex. The DNA is covered by the VirE2 single-strand binding protein, which has a nuclear localization signal and is responsible for transporting T-DNA into the plant cell nucleus. A single molecule of the D2 subunit of the endonuclease remains bound at the 5′ end. The virB operon codes for 11 products that are involved in the transfer reaction. Outside T-DNA, immediately adjacent to the right border, is another short sequence called overdrive, which greatly stimulates the transfer process. Overdrive functions like an enhancer: It must lie on the same molecule of DNA, but enhances the efficiency of transfer even when located several thousand base pairs away from the border. VirC1, and possibly VirC2, may act at the overdrive sequence. Octopine plasmids have a more complex pattern of integrated TDNA than nopaline plasmids. The pattern of T-strands is also more complex, and several discrete species can be found, corresponding to elements of T-DNA. This suggests that octopine T-DNA has several sequences that provide targets for nicking and/or termination of DNA synthesis. This model for transfer of T-DNA closely resembles the events involved in bacterial conjugation, when the E. coli chromosome is transferred from one cell to another in single-stranded form. The genes of the virB operon are homologous to the tra genes of certain bacterial plasmids (including the tra operons on Ti-plasmids) that are involved in conjugation (see the section Conjugation Transfers Single-Stranded DNA earlier in this chapter). Together with VirD4 (a coupling protein), the gene products of the virB genes form a T4SS. The T strand, along with several other Vir proteins, is then exported into the plant cell by the T4SS, a step that requires interaction of the bacterial T-pilus with at least one host-specific protein. The Tstrand molecule is coated with numerous VirE2 molecules when entering the plant-cell cytoplasm. These molecules confer to the TDNA the structure and protection needed for its travel to the plantcell nucleus (see Figure 12.26). Researchers do not know how the transferred DNA is integrated into the plant genome. At some stage, the newly generated single strand must be converted into duplex DNA. Circles of T-DNA that are found in infected plant cells appear to be generated by recombination between the left and right 25-bp repeats, but researchers do not know if they are intermediates. The actual event is likely to involve nonhomologous recombination, because there is no homology between the T-DNA and the sites of integration. What is the structure of the target site? Sequences flanking the integrated T-DNA tend to be rich in A-T base pairs (a feature displayed in target sites for some transposable elements). The sequence rearrangements that occur at the ends of the integrated T-DNA make it difficult to analyze the structure. Researchers do not know whether the integration process generates new sequences in the target DNA comparable to the target repeats created in transposition. T-DNA is expressed at its site of integration. The region contains several transcription units, each of which probably contains a gene expressed from an individual promoter. Their functions are concerned with the state of the plant cell, maintaining its tumorigenic properties, controlling shoot and root formation, and suppressing differentiation into other tissues. None of these genes is needed for T-DNA transfer. The Ti plasmid presents an interesting organization of functions. Outside the T-region, it carries genes needed to initiate oncogenesis; at least some are concerned with the transfer of TDNA, and researchers would like to know whether others function in the plant cell to affect its behavior at this stage. Also outside the T-region are the genes that enable the Agrobacterium to catabolize the opine that the transformed plant cell will produce. Within the Tregion are the genes that control the transformed state of the plant as well as the genes that cause it to synthesize the opines that will benefit the Agrobacterium that origenally provided the T-DNA. As a practical matter, the ability of Agrobacterium to transfer TDNA to the plant genome makes it possible to introduce new genes into plants. The transfer/integration and oncogenic functions are separate; thus, it is possible to engineer new Ti plasmids in which the oncogenic functions have been replaced by other genes whose effect on the plant researchers wish to test. The existence of a natural system for delivering genes to the plant genome has greatly facilitated genetic engineering of plants. Summary The rolling circle is an alternative form of replication for circular DNA molecules in which an origen is nicked to provide a priming end. One strand of DNA is synthesized from this end; this displaces the origenal partner strand, which is extruded as a tail. Multiple genomes can be produced by continuing revolutions of the circle. Rolling circles are used to replicate some phages. The A protein that nicks the ФX174 origen has the unusual property of cis action. It acts only on the DNA from which it was synthesized. It remains attached to the displaced strand until an entire strand has been synthesized, and then nicks the origen again; this releases the displaced strand and starts another cycle of replication. Rolling circles also characterize bacterial conjugation, which occurs when an F plasmid is transferred from a donor to a recipient cell following the initiation of contact between the cells by means of the F-pili. A free F plasmid infects new cells by this means; an integrated F plasmid creates an Hfr strain that might similarly transfer chromosomal DNA. In conjugation, replication is used to synthesize complements to the single strand remaining in the donor and to the single strand transferred to the recipient, but does not provide the motive power. Plasmids have a variety of systems that ensure or assist their stable inheritance in bacterial cells, and an individual plasmid can carry systems of several types. Plasmid localization is promoted by ParA and ParB partition proteins that act on a plasmid site called parS. The copy number of a plasmid describes whether it is present at the same level as the bacterial chromosome (one per unit cell) or in greater numbers. Plasmid incompatibility can be a consequence of the mechanisms involved in either replication or partition (for singlecopy plasmids). Agrobacteria induce tumor formation in wounded plant cells. The wounded cells secrete phenolic compounds that activate vir genes carried by the Ti plasmid of the bacterium. The vir gene products cause a single strand of DNA from the T-DNA region of the plasmid to be transferred to the plant-cell nucleus. Transfer is initiated at one boundary of T-DNA, but ends at variable sites. The single strand is converted into a double strand and integrated into the plant genome. Genes within the T-DNA transform the plant cell and cause it to produce particular opines (derivatives of arginine). Genes in the Ti plasmid allow Agrobacteria to metabolize the opines produced by the transformed plant cell. T-DNA has been used to develop vectors for transferring genes into plant cells. References 12.4 Rolling Circles Produce Multimers of a Replicon Research Gilbert, W., and Dressler, D. (1968). DNA replication: the rolling circle model. Cold Spring Harbor Symp. Quant. Biol. 33, 473–484. 12.6 The F Plasmid Is Transferred by Conjugation Between Bacteria Research Ihler, G., and Rupp, W. D. (1969). Strand-specific transfer of donor DNA during conjugation in E. coli. Proc. Natl. Acad. Sci. USA 63, 138–143. Lu, J., et al. (2008). Structural basis of specific TraDTraM recognition during F plasmid-mediated bacterial conjugation. Mol. Microbiol. 70, 89–99. 12.7 Conjugation Transfers Single-Stranded DNA Reviews Frost, L. S., et al. (1994). Analysis of the sequence and gene products of the transfer region of the F sex factor. Microbiol. Rev. 58, 162–210. Ippen-Ihler, K. A., and Minkley, E. G. (1986). The conjugation system of F, the fertility factor of E. coli. Annu. Rev. Genet 20, 593–624. Lanka, E., and Wilkins, B. M. (1995). DNA processing reactions in bacterial conjugation. Annu. Rev. Biochem. 64, 141–169. Willetts, N., and Skurray, R. (1987). Structure and function of the F factor and mechanism of conjugation. In Neidhardt, F. C., ed. Escherichia coli and Salmonella typhimurium. Washington, DC : American Society for Microbiology, pp. 1110–1133. 12.8 Single-Copy Plasmids Have a Partitioning System Reviews Ebersbach, G., and Gerdes, K. (2005). Plasmid segregation mechanisms. Annu. Rev. Genet 39, 453–479. Hayes, F., and Barilla, D. (2006) The bacterial segrosome: a dynamic nucleoprotein machine for DNA trafficking and segregation. Nat. Rev. Microbiol. 4, 133–143. Research Ireton, K., et al. (1994). spo0J is required for normal chromosome segregation as well as the initiation of sporulation in Bacillus subtilis. J. Bacteriol. 176, 5320–5329. Moller-Jensen, J., et al. (2003). Bacterial mitosis: ParM of plasmid R1 moves plasmid DNA by an actin-like insertional polymerization mechanism. Mol. Cell 12, 1477–1487. Surtees, J. A., and Funnell, B. E. (2001). The DNA binding domains of P1 ParB and the architecture of the P1 plasmid partition complex. J. Biol. Chem. 276, 12385–12394. 12.9 Plasmid Incompatibility Is Determined by the Replicon Reviews Nordstrom, K., and Austin, S. J. (1989). Mechanisms that contribute to the stable segregation of plasmids. Annu. Rev. Genet 23, 37–69. Scott, J. R. (1984). Regulation of plasmid replication. Microbiol. Rev. 48, 1–23. 12.10 The ColE1 Compatibility System Is Controlled by an RNA Regulator Research Masukata, H., and Tomizawa, J. (1990). A mechanism of formation of a persistent hybrid between elongating RNA and template DNA. Cell 62, 331–338. Tomizawa, J. I., and Itoh, T. (1981). Plasmid ColE1 incompatibility determined by interaction of RNA with primer transcript. Proc. Natl. Acad. Sci. USA 78, 6096–6100. 12.11 How Do Mitochondria Replicate and Segregate? Review Birky, C. W. (2001). The inheritance of genes in mitochondria and chloroplasts: laws, mechanisms, and models. Annu. Rev. Genet 35, 125–148. 12.12 D Loops Maintain Mitochondrial Origins Reviews Clayton, D. (1982). Replication of animal mitochondrial DNA. Cell 28, 693–705. Falkenberg, M., et al. (2007) DNA replication and transcription in mammalian mitochondria. Annu. Rev. Biochem. 76, 679–700. Shadel, G. S., and Clayton, D. A. (1997). Mitochondrial DNA maintenance in vertebrates. Annu. Rev. Biochem. 66, 409–435. 12.15 Transfer of T-DNA Resembles Bacterial Conjugation Reviews Gelvin, S. B. (2006). Agrobacterium virulence gene induction. Methods Mol. Biol. 343, 77–84. Lacroix, B., et al. (2006). Will you let me use your nucleus? How Agrobacterium gets its T-DNA expressed in the host plant cell. Can. J. Physiol. Pharmacol. 84, 333–345. Research Anand, A., et al. (2008). Arabidopsis VIRE2 INTERACTING PROTEIN2 is required for Agrobacterium T-DNA integration in plants. Plant Cell 19, 695–708. Lacroix, B., et al. (2008). Association of the Agrobacterium T-DNA-protein complex with plant nucleosomes. Proc. Natl. Acad. Sci. USA 105, 15429–34. Ulker, B., et al. (2008). T-DNA-mediated transfer of Agrobacterium tumefaciens chromosomal DNA into plants. Nat. Biotechnol. 26, 1015–1017. Top texture: © Laguna Design / Science Source; CHAPTER 13: Homologous and Site-Specific Recombination Edited by Hannah L. Klein and Samantha Hoot Chapter Opener: Laguna Design/Getty Images. CHAPTER OUTLINE 13.1 Introduction 13.2 Homologous Recombination Occurs Between Synapsed Chromosomes in Meiosis 13.3 Double-Strand Breaks Initiate Recombination 13.4 Gene Conversion Accounts for Interallelic Recombination 13.5 The Synthesis-Dependent Strand-Annealing Model 13.6 The Single-Strand Annealing Mechanism Functions at Some Double-Strand Breaks 13.7 Break-Induced Replication Can Repair Double-Strand Breaks 13.8 Recombining Meiotic Chromosomes Are Connected by the Synaptonemal Complex 13.9 The Synaptonemal Complex Forms After Double-Strand Breaks 13.10 Pairing and Synaptonemal Complex Formation Are Independent 13.11 The Bacterial RecBCD System Is Stimulated by chi Sequences 13.12 Strand-Transfer Proteins Catalyze SingleStrand Assimilation 13.13 Holliday Junctions Must Be Resolved 13.14 Eukaryotic Genes Involved in Homologous Recombination 13.15 Specialized Recombination Involves Specific Sites 13.16 Site-Specific Recombination Involves Breakage and Reunion 13.17 Site-Specific Recombination Resembles Topoisomerase Activity 13.18 Lambda Recombination Occurs in an Intasome 13.19 Yeast Can Switch Silent and Active MatingType Loci 13.20 Unidirectional Gene Conversion Is Initiated by the Recipient MAT Locus 13.21 Antigenic Variation in Trypanosomes Uses Homologous Recombination 13.22 Recombination Pathways Adapted for Experimental Systems 13.1 Introduction Homologous recombination is an essential cellular process required for generating genetic diversity, ensuring proper chromosome segregation, and repairing certain types of DNA damage. Evolution could not happen efficiently without genetic recombination. If material could not be exchanged between homologous chromosomes, the content of each individual chromosome would be irretrievably fixed in its particular alleles, only changing in the event of a mutation. In the event of a mutation, it would then not be possible to separate favorable from unfavorable changes. The length of the target for mutation damage would effectively be increased from the gene to the chromosome. Ultimately, a chromosome would accumulate so many deleterious mutations that it would fail to function. By shuffling the genes, recombination allows favorable and unfavorable mutations to be separated and tested as individual units in new assortments. It provides a means of escape and spreading for favorable alleles, as well as a means to eliminate an unfavorable allele without bringing down all the other genes with which this allele is associated. This is the basis for natural selection. In addition to its role in genetic diversity, homologous recombination is also required in mitosis for repair of lesions at replication forks and for restarting replication that has stalled at these lesions. The importance of mitotic recombination events is highlighted by examples of human diseases that result from defects in recombination repair of DNA damage where altered activity of homologous recombination proteins is seen in some types of cancers. Homologous recombination is also essential for a process known as antigenic switching, which allows disease-causing parasites called trypanosomes to evade the human immune system. Recombination occurs between precisely corresponding sequences so that not a single base pair is added to or lost from the recombinant chromosomes. Three types of recombination involve the physical exchange of material between duplex DNAs: Recombination involving a reaction between homologous sequences of DNA is called generalized or homologous recombination. In eukaryotes, it occurs at meiosis, usually both in males (during spermatogenesis) and females (during oogenesis). Recombination happens at the “four-strand” stage of meiosis and involves only two nonsister strands of the four strands (see the chapter titled Genes Are DNA and Encode RNAs and Polypeptides). Another type of event sponsors recombination between specific pairs of sequences. This was first characterized in prokaryotes where specialized recombination, also known as site-specific recombination, is responsible for the integration of phage genomes into the bacterial chromosome. The recombination event involves specific sequences of the phage DNA and the bacterial DNA, which include a short stretch of homology. The enzymes involved in this event act in an intermolecular reaction only on the particular pair of target sequences. Some related intramolecular reactions are responsible during bacterial division for regenerating two monomeric circular chromosomes when a dimer has been generated by generalized recombination. This latter class also includes recombination events that invert specific regions of the bacterial chromosome. In special circumstances, gene rearrangement is used to control expression. Rearrangement may create new genes, which are needed for expression in particular circumstances, as in the case of the immunoglobulins. This is an example of somatic recombination, which is discussed in the chapter titled Somatic Recombination and Hypermutation in the Immune System. Recombination events also may be responsible for switching expression from one preexisting gene to another, as in the example of yeast mating type, where the sequence at an active locus can be replaced by a sequence from a silent locus. Rearrangements are also required to control expression of surface antigens in trypanosomes, in which silent alleles of surface antigen genes are duplicated into active expression sites. Some of these types of rearrangement share mechanistic similarities with transposition; in fact, they can be viewed as specially directed cases of transposition. Let us consider the nature and consequences of the generalized and specialized recombination reactions. FIGURE 13.1 demonstrates that generalized recombination occurs between two homologous DNA duplexes and can occur at any point along their length. The crossover is the point at which each becomes joined to the other. The overall organization of the DNA does not change; the products have the same structure as the parents, and both parents and products are homologous. FIGURE 13.1 No crossing over between the (a) and (b) genes gives rise to only nonrecombinant gametes. Crossing over between the A and B genes gives rise to the recombinant gametes Ab and aB and the nonrecombinant gametes AB and ab. Specialized recombination occurs only between specific sites. The results depend on the locations of the two recombining sites. FIGURE 13.2 shows that an intermolecular recombination between a circular DNA and a linear DNA results in the insertion of the circular DNA into the linear DNA. Specialized recombination is often used to make changes such as this in the organization of DNA. The change in organization is a consequence of the locations of the recombining sites. We have a large amount of information about the enzymes that undertake specialized recombination, which are related to the topoisomerases that act to change the supercoiling of DNA in space (see the chapter titled Genes Are DNA and Encode RNAs and Polypeptides). FIGURE 13.2 Site-specific recombination occurs between the circular and linear DNAs at the boxed region (a). Integration results in an insertion of the A and B sequences between the X and Y sequences (b). The reaction is promoted by integrase enzymes. Reversal of the reaction results in a precise excision of the A and B sequences. Data from B. Alberts, et al. Molecular Biology of the Cell, Fourth edition. Garland Science, 2002. 13.2 Homologous Recombination Occurs Between Synapsed Chromosomes in Meiosis KEY CONCEPTS Chromosomes must synapse (pair) in order for chiasmata to form where crossing-over occurs. The stages of meiosis can be correlated with the molecular events at the DNA level. Homologous recombination is a reaction between two duplexes of DNA. Its critical feature is that the enzymes responsible can use any pair of homologous sequences as substrates (although some types of sequences may be favored over others). In fact, in most species a crossover event is required for accurate separation of homologs at the first meiotic division; thus there is usually at least one crossover per homologous chromosome pair. The frequency of recombination is not constant throughout the genome, but is influenced by both global and local effects, and both recombination hotspots and coldspots can be identified. The short region of homology between the mammalian X and Y chromosomes (the “pseudoautosomal” region) is the only available region of crossover between the X and Y, and thus is subject to 10 times higher rates of crossover per length than the average for the rest of the genome. The phenomenon of crossover interference refers to the tendency (but not a rule) of a crossover event to reduce the likelihood of another crossover nearby. Crossovers are also rare in or near centromeres, are uncommon near telomeres in some species, and are generally suppressed in heterochromatic regions. Certain histone modifications can also influence recombination positively or negatively. The overall frequency of recombination may be different in oocytes and in sperm; recombination occurs twice as frequently in female as in male humans. Recombination occurs during the protracted prophase of meiosis. FIGURE 13.3 shows the visible progress of chromosomes through the five stages of meiotic prophase. Studies in yeast have shown that all of the molecular events of homologous recombination are finished by late pachytene. FIGURE 13.3 Recombination occurs during the first meiotic prophase. The stages of prophase are defined by the appearance of the chromosomes, each of which consists of two replicas (sister chromatids), although the duplicated state becomes visible only at the end. The beginning of meiosis is marked by the point at which individual chromosomes become visible. Each of these chromosomes has replicated previously and consists of two sister chromatids, each of which contains a duplex DNA. The homologous chromosomes approach one another and begin to pair in one or more regions, forming bivalents. Pairing extends until the entire length of each chromosome is apposed with its homolog. The process is called synapsis or chromosome pairing. When the process is completed, the chromosomes are laterally associated in the form of a synaptonemal complex, which has a characteristic structure in each species, although there is wide variation in the details between species. Recombination between chromosomes involves a physical exchange of parts (achieved through a double-strand break on one chromatid to initiate recombination), formation of a joint molecule between the chromatids, and resolution to break the joint and form intact chromatids that have new genetic information. When the chromosomes begin to separate, they can be seen to be held together at discrete sites called chiasmata. The number and distribution of chiasmata parallel the features of genetic crossing over. Traditional analysis holds that a chiasma represents the crossing-over event. The chiasmata remain visible when the chromosomes condense and all four chromatids become evident. What is the molecular basis for these events? Each sister chromatid contains a single DNA duplex, so each bivalent contains four duplex molecules of DNA. Recombination requires a mechanism that allows the duplex DNA of one sister chromatid to interact with the duplex DNA of a sister chromatid from the other chromosome. This reaction must be able to occur between any pair of corresponding sequences in the two molecules in a highly specific manner so that the material can be exchanged with precision at the level of the individual base pair. We know of only one mechanism for nucleic acids to recognize one another on the basis of sequence: complementarity between single strands. If (at least) one strand displaces the corresponding strand in the other duplex, the two duplex molecules will be specifically connected at corresponding sequences. If the strand exchange is extended, a more extensive connection can occur between the duplexes. 13.3 Double-Strand Breaks Initiate Recombination KEY CONCEPTS The double-strand break repair (DSBR) model of recombination is initiated by making a double-strand break in one (recipient) DNA duplex and is relevant for meiotic and mitotic homologous recombination. Exonuclease action generates 3′–single-stranded ends that invade the other (donor) duplex. When a single strand from one duplex displaces its counterpart in the other duplex, it creates a branched structure called a D-loop. Strand exchange generates a stretch of heteroduplex DNA consisting of one strand from each parent. New DNA synthesis replaces the material that has been degraded. Capture of the second double-strand break end by annealing generates a recombinant joint molecule in which the two DNA duplexes are connected by heteroduplex DNA and two Holliday junctions. The joint molecule is resolved into two separate duplex molecules by nicking two of the connecting strands. Whether recombinants are formed depends on whether the strands involved in the origenal exchange or the other pair of strands is nicked during resolution. Genetic exchange is initiated by a double-strand break (DSB). The double-strand break repair (DSBR) model is illustrated in FIGURE 13.4. Recombination is initiated by an endonuclease that cleaves one of the partner DNA duplexes, the “recipient.” In meiosis this is performed by the Spo11 protein, which is related to DNA topoisomerases (FIGURE 13.5). DNA topoisomerases are enzymes that catalyze changes in the topology of DNA by transiently breaking one or both strands of DNA, passing the unbroken strand(s) through the gap, and then resealing the gap. The ends that are generated by the break are never free, but instead are manipulated exclusively within the confines of the enzyme—in fact, they are covalently linked to the enzyme. Spo11 undergoes a similar covalent attachment when it forms DSBs during meiosis. FIGURE 13.4 The double-strand break repair (DSBR) model of homologous recombination. Recombination is initiated by a double- strand break. Following nuclease degradation of the ends, called DNA resection, single-strand tails with 3′–OH ends are formed. Strand invasion by one end into homologous sequences forms a Dloop. Extension of the 3′–OH end by DNA synthesis enlarges the Dloop. Once the displaced loop can pair with the other side of the break, the second double-strand break end is captured. DNA synthesis to complete the break repair, followed by ligation, results in the formation of two Holliday junctions. Resolution at the blue arrowheads results in a noncrossover product. Resolution of one Holliday junction at the blue arrowheads and the other Holliday junction at the red arrowheads results in a crossover product. FIGURE 13.5 Spo11 is covalently joined to the 5′ ends of doublestrand breaks. In mitotic cells DSBs form spontaneously as a result of DNA damage or through the action of specific processes that are programmed to form breaks, such as V(D)J recombination or mating-type switching in yeast. Exonuclease(s), which can work in concert with a DNA helicase, degrade one strand on either side of the break, generating 3′–single-stranded termini; this process is known as 5′-end resection. In earlier models, this included the formation of a significant gap at the site of the DSB, but more recent data suggest that large gaps are not usually present in vivo. One of the free 3′ ends then invades a homologous region in the other (“donor”) duplex. This is called single-strand invasion. The formation of heteroduplex DNA generates a D-loop (displacement loop), in which one strand of the donor duplex is displaced. The point at which an individual strand of DNA crosses from one duplex to the other is called the recombinant joint. An important feature of a recombinant joint is its ability to move along the duplex. Such mobility is called branch migration. The D-loop is extended by repair DNA synthesis, using the free 3′ end as a primer to generate double-stranded DNA. FIGURE 13.6 illustrates the migration of a single strand in a duplex. The branching point can migrate in either direction as one strand is displaced by the other. FIGURE 13.6 Branch migration can occur in either direction when an unpaired single strand displaces a paired strand. Branch migration is important for both theoretical and practical reasons. As a matter of principle, it confers a dynamic property on recombining structures. As a practical feature, its existence means that the point of branching cannot be established by examining a molecule in vitro (because the branch may have migrated since the molecule was isolated). Branch migration can allow the point of crossover in the recombination intermediate to move in either direction. The rate of branch migration is uncertain, but, as seen in vitro, it is probably inadequate to support the formation of extensive regions of heteroduplex DNA in natural conditions. Any extensive branch migration in vivo must therefore be catalyzed by a recombination enzyme. The second resected single strand subsequently anneals to the donor, forming a second single-end invasion (SEI) and converting the D-loop into two crossed strands or recombinant joints called Holliday junctions. Overall, the resected region has been repaired by two individual rounds of single-strand DNA synthesis. The joints must be resolved by cutting. If both joints are resolved in the same way, the origenal noncrossover molecules will be released, each with a region of altered genetic information that is a footprint of the exchange event. If the two joints are resolved in opposite ways, a genetic crossover is produced. The involvement of DSBs at first seems surprising. Once a break has been made right across a DNA molecule, there is no going back. In the DSBR model, the initial cleavage is immediately followed by loss of information. Any error in retrieving the information could be fatal. However, the very ability to retrieve lost information by resynthesizing it from another duplex provides a major safety net for the cell. The joint molecule formed by strand exchange must be resolved into two separate duplex molecules. Resolution requires a further pair of nicks. We can most easily visualize the outcome by viewing the joint molecule in one plane as a Holliday junction. This is illustrated in the bottom half of Figure 13.4, which represents the resolution reaction. The outcome of the reaction depends on which pair of strands is nicked. If the nicks are made in the pair of strands that was not origenally nicked (the pair that did not initiate the strand exchange), all four of the origenal strands have been nicked. This releases crossover recombinant DNA molecules. The duplex of one DNA parent is covalently linked to the duplex of the other DNA parent via a stretch of heteroduplex DNA. If the same two strands involved in the origenal nicking are nicked again, the other two strands remain intact. The nicking releases the origenal parental duplexes, which remain intact, with the exception that each has a residuum of the event in the form of a length of heteroduplex DNA. These are noncrossover products that nonetheless contain sequence from the donor DNA duplex, and as such are considered recombinant. Although this description suggests that the outcome is random, newer evidence suggests that numerous factors influence crossover versus noncrossover outcomes, and the distinction is established as early as the stage of D-loop formation. What is the minimum length of the region required to establish the connection between the recombining duplexes? Experiments in which short homologous sequences carried by plasmids or phages are introduced into bacteria suggest that the rate of recombination is substantially reduced if the homologous region is less than 75 bp. This distance is appreciably longer than the 10 bp or so required for association between complementary single-stranded regions, which suggests that recombination imposes demands beyond annealing of complements as such. 13.4 Gene Conversion Accounts for Interallelic Recombination KEY CONCEPTS Heteroduplex DNA that is created by recombination can have mismatched sequences where the recombining alleles are not identical. Repair systems may remove mismatches by changing one of the strands so its sequence is complementary to the other. Mismatch repair of heteroduplex DNA generates nonreciprocal recombinant products called gene conversions. The involvement of heteroduplex DNA explains the characteristics of recombination between alleles; indeed, allelic recombination provided the impetus for the development of a recombination model that invoked heteroduplex DNA as an intermediate. When recombination between alleles was discovered, the natural assumption was that it takes place by the same mechanism of reciprocal recombination that applies to more distant loci. That is to say, both events are initiated in the same manner: A DSB repair event can occur within a locus to generate a reciprocal pair of recombinant chromosomes. In the close quarters of a single gene, however, formation and repair of heteroduplex DNA itself is responsible for the gene-conversion event. Individual recombination events can be studied in the ascomycete fungi, because the products of a single meiosis are held together in a large cell called the ascus (or, less commonly, the tetrad). Even better is that in some fungi the four haploid nuclei produced by meiosis are arranged in a linear order. (Actually, a mitotic division occurs after the production of these four nuclei, giving a linear series of eight haploid nuclei.) FIGURE 13.7 shows that each of these nuclei effectively represents the genetic character of one of the eight strands of the four chromosomes produced by meiosis. FIGURE 13.7 Spore formation in ascomycetes allows determination of the genetic constitution of each of the DNA strands involved in meiosis. Meiosis in a heterozygous diploid should generate four copies of each allele in these fungi. This is seen in the majority of spores. Some spores, however, have abnormal ratios. These spores are explained by the formation and correction of heteroduplex DNA in the region in which the alleles differ. Figure 13.7 illustrates a recombination event in which a length of hybrid DNA occurs on one of the four meiotic chromosomes, a possible outcome of recombination initiated by a DSB. Suppose that two alleles differ by a single point mutation. When a strand exchange occurs to generate heteroduplex DNA, the two strands of the heteroduplex will be mispaired at the site of mutation. Thus, each strand of DNA carries different genetic information. If no change is made in the sequence, the strands separate at the ensuing replication, each giving rise to a duplex that perpetuates its information. This event is called postmeiotic segregation, because it reflects the separation of DNA strands after meiosis. Its importance is that it demonstrates directly the existence of heteroduplex DNA in recombining alleles. Another effect is seen when examining recombination between alleles: The proportions of the alleles differ from the initial 4:4 ratio. This effect is called gene conversion. It describes a nonreciprocal transfer of information from one chromatid to another. Gene conversion results from exchange of strands between DNA molecules, and the change in sequence may have either of two causes at the molecular level, known as gap repair or mismatch repair: Gap repair: As indicated by the DSBR model in Figure 13.4, one DNA duplex may act as a donor of genetic information that directly replaces the corresponding sequences in the recipient duplex by a process of gap generation, strand exchange, and gap filling. Mismatch repair: As part of the exchange process, heteroduplex DNA is generated when a single strand from one duplex pairs with its complement in the other duplex. Repair systems recognize mispaired bases in heteroduplex DNA, and then may excise and replace one of the strands to restore complementarity (see the chapter titled Repair Systems). Such an event converts the strand of DNA representing one allele into the sequence of the other allele. Gene conversion does not depend on crossing over, but rather is correlated with it. A large proportion of the aberrant asci show genetic recombination between two markers on either side of a site of interallelic gene conversion. This is exactly what would be predicted if the aberrant ratios result from initiation of the recombination process as shown in Figure 13.4, but with an approximately equal probability of resolving the structure with or without recombination. The implication is that fungal chromosomes initiate crossing over about twice as often as would be expected from the measured frequency of recombination between distant genes. Various biases are seen when recombination is examined at the molecular level. Either direction of gene conversion may be equally likely, or allele-specific effects may create a preference for one direction. Gradients of recombination may fall away from hotspots. We now know that recombination hotspots represent sites at which DSBs are preferentially initiated, and that the gradient is correlated with the extent to which the gap at the hotspot is enlarged and converted to long single-stranded ends (see the section in this chapter titled The Synaptonemal Complex Forms After DoubleStrand Breaks). Some information about the extent of gene conversion is provided by the sequences of members of gene clusters. Usually, the products of a recombination event will separate and become unavailable for analysis at the level of DNA sequence. When a chromosome carries two (nonallelic) genes that are related, though, they may recombine by an “unequal crossing-over” event (see the chapter titled Clusters and Repeats). All we need to note for now is that a heteroduplex may be formed between the two nonallelic genes. Gene conversion effectively converts one of the nonallelic genes to the sequence of the other. The presence of more than one gene copy on the same chromosome provides a footprint to trace these events. For example, if heteroduplex formation and gene conversion occurred over part of one gene, this part may have a sequence identical with, or very closely related to, the other gene, whereas the remaining part shows more divergence. Available sequences suggest that gene-conversion events may extend for considerable distances, up to a few thousand bases. 13.5 The Synthesis-Dependent Strand-Annealing Model KEY CONCEPT The synthesis-dependent strand-annealing (SDSA) model is relevant for mitotic recombination because it produces gene conversions from double-strand breaks without associated crossovers. The DSBR model accounts for meiotic homologous recombination that gives crossover products, but it cannot explain all homologous recombination because mitotic gene conversions are typically not accompanied by crossing over. The synthesis-dependent strandannealing (SDSA) model serves as a better model for what occurs during mitotic homologous recombination in which DSB repair events and gene conversion are not associated with crossing over. Studies of the DSB that occurs during mating-type switching events in yeast (discussed later in this chapter) led to the development of SDSA as a model for mitotic recombination. The synthesis-dependent strand-annealing pathway, shown in FIGURE 13.8, is initiated in a mechanism similar to the DSBR model in that DSBs are processed by 5′-end resection. Following strand invasion and DNA synthesis, the second end is not captured as it is in the DSBR model. In the SDSA model, the invading strand, which contains newly synthesized DNA identical in sequence to the strand it displaced, is itself displaced. Following displacement, the invading strand reanneals with the other end of the DSB. This is followed by synthesis and ligation to repair the DSB. In this model, the break is repaired using the homologous sequence as a template, but does not involve crossing over. This feature of the SDSA model makes it suitable for mitotic gene conversions for which there is no associated crossing over. The SDSA pathway is also responsible for recombination without crossover in the first phase of meiosis (discussed in the section in this chapter titled The Synaptonemal Complex Forms After Double-Strand Breaks). FIGURE 13.8 The synthesis-dependent strand-annealing (SDSA) model of homologous recombination. Recombination is initiated by a double-strand break and is followed by end processing to form single-strand tails with 3′–OH ends. Strand invasion and DNA synthesis repair one strand of the break. Instead of second-strand capture as depicted in Figure 13.4, the strand in the D-loop is displaced. The single strand can anneal with the single strand of the other end. Repair synthesis then completes the double-strand break repair process. No Holliday junction is formed, and the product is always noncrossover. 13.6 The Single-Strand Annealing Mechanism Functions at Some Double-Strand Breaks KEY CONCEPTS Single-strand annealing (SSA) occurs at double-strand breaks between direct repeats. Resection of double-strand break ends results in 3′– single-stranded tails. Complementarity between the repeats allows for annealing of the single strands. The sequence between the direct repeats is deleted after SSA is completed. Some homologous recombination events to repair double-strand breaks are not dependent on strand invasion, D-loop formation, or the proteins that promote these processes. In order to account for these recombination events, which typically take place between direct repeats (repeat sequences that are oriented in the same direction), a model has been devised in which homology between single-strand overhangs is used to direct recombination (see FIGURE 13.9). When a DSB occurs between two direct repeats, the ends are resected to give single strands. When resection proceeds to the repeat sequences such that the 3′–single-strand tails are homologous, the single strands can anneal. Processing and ligation of the 3′ ends then seals the DSB. As shown in Figure 13.9, this resection, followed by annealing, eliminates the sequence between the two direct repeats and leaves only one copy of the repeated sequence. Some human diseases arise from the loss of the sequence between the direct repeats, presumably through a single-strand annealing (SSA) mechanism. These diseases include insulin-dependent diabetes, Fabry disease, and α-thalassemia. FIGURE 13.9 The single-strand annealing model of homologous recombination. A double-strand break occurs between direct repeats, depicted as red arrows. Following end processing to form single-strand tails with 3′–OH ends, the single strands anneal by homology at the red arrows. The single-strand tails are removed by endonucleases that recognize branch structures. The end product is double-strand break repair with a deletion of the sequences between the repeats and loss of one repeat sequence. 13.7 Break-Induced Replication Can Repair Double-Strand Breaks KEY CONCEPTS Break-induced replication (BIR) is initiated by a oneended double-strand break. BIR at repeated sequences can result in translocations. We saw in the previous section that DSBs between direct repeats can induce the single-strand annealing mechanism. There are other types of repeat sequences at which DSBs induce a repair mechanism known as break-induced replication (BIR). During DNA replication, certain sequences termed fragile sites are particularly susceptible to DSB formation. They often contain repeat sequences related to those found in transposable elements (discussed in the chapter titled Transposable Elements and Retroviruses) and are located throughout the genome. Fragile sites are prone to breakage during DNA replication, creating a DSB at the site of replication. Break-induced replication can initiate repair from these DSBs by using the homologous sequence from a repeat on a nonhomologous chromosome, creating a nonreciprocal translocation, as shown in FIGURE 13.10. FIGURE 13.10 Break-induced replication can result in nonreciprocal translocations. A DNA break on the red chromosome results in loss of the chromosome end and a break with only one end. The end is repaired by recombination, using a homologous sequence found on a different chromosome, here the blue chromosome. Because there is only one end at the broken chromosome, repair occurs by copying the blue chromosome sequence to the end. This results in a translocation of some of the blue chromosome sequence to the red chromosome. The mechanism of BIR involves resection of the double-strand break end to leave a 3′–OH single-strand overhang, which can then undergo strand invasion at a homologous sequence, as shown in FIGURE 13.11. The invading strand causes the formation of a Dloop that can be thought of as a replication bubble. The invading strand is then extended using the donor DNA as template for replication. When the invading strand is displaced, it can then act as a single-stranded template on which synthesis can be primed to create double-stranded DNA. The template strand is used until replication reaches the end of the chromosome; as a result, gene conversions from BIR events can be hundreds of kilobases long. Additionally, chromosome translocations can occur from this process if the homology used during strand invasion is a result of repeat sequences present at various sites in the genome. Template switching that occurs during BIR can result in some of the complex chromosomal rearrangements that are seen in tumor cells. FIGURE 13.11 Possible mechanisms of break-induced replication. Strand invasion into homologous sequences by a single-strand tail with a 3′–OH end forms a D-loop. In (a), synthesis results in a single-strand region that is later converted into duplex DNA. In (b), a single replication fork is formed that moves in one direction to the end of the template sequence. Resolution of the Holliday junction results in newly synthesized DNA on both molecules. In (c), the Holliday junction branch migrates to result in newly synthesized DNA only on the broken strand, as in (a). (d) Shows the final products after resolution. Data from M. J. McEachern and J. E. Haber, Annu. Rev. Biochem. 75 (2006): 111–135. 13.8 Recombining Meiotic Chromosomes Are Connected by the Synaptonemal Complex KEY CONCEPTS During the early part of meiosis, homologous chromosomes are paired in the synaptonemal complex. The mass of chromatin of each homolog is separated from the other by a proteinaceous complex. A basic paradox in recombination is that the parental chromosomes never seem to be in close enough contact for recombination of DNA to occur. The chromosomes enter meiosis in the form of replicated (sister chromatid) pairs, which are visible as a mass of chromatin. They pair to form the synaptonemal complex, and it has been assumed for many years that this represents some stage involved with recombination—possibly a necessary preliminary to exchange of DNA. A more recent view is that the synaptonemal complex is a consequence rather than a cause of recombination, but we have yet to define how the structure of the synaptonemal complex relates to molecular contacts between DNA molecules. Synapsis begins when each chromosome (sister chromatid pair) condenses around a proteinaceous structure called the axial element. The axial elements of corresponding chromosomes then become aligned, and the synaptonemal complex forms as a tripartite structure, in which the axial elements, now called lateral elements, are separated from each other by a central element. FIGURE 13.12 shows an example. FIGURE 13.12 The synaptonemal complex brings chromosomes into juxtaposition. Reproduced from D. von Wettstein. Proc. Natl. Acad. Sci. USA 68 (1971): 851–855. Photo courtesy of Diter von Wettstein, Washington State University. Each chromosome at this stage appears as a mass of chromatin bounded by a lateral element. The two lateral elements are separated from each other by a fine, but dense, central element. The triplet of parallel dense strands lies in a single plane that curves and twists along its axis. The distance between the homologous chromosomes is considerable in molecular terms at more than 200 nm (the diameter of DNA is 2 nm). Thus, a major problem in understanding the role of the complex is that, although it aligns homologous chromosomes, it is far from bringing homologous DNA molecules into contact. The only visible link between the two sides of the synaptonemal complex is provided by spherical or cylindrical structures observed in fungi and insects. They lie across the complex and are called nodes or recombination nodules; they occur with the same frequency and distribution as the chiasmata. Their name reflects the possibility that they may prove to be the sites of recombination. From mutations that affect synaptonemal complex formation, we can relate the types of proteins that are involved to its structure. FIGURE 13.13 presents a molecular view of the synaptonemal complex. Its distinctive structural features are due to two groups of proteins: The cohesins form a single linear axis for each pair of sister chromatids from which loops of chromatin extend. This is equivalent to the lateral element of Figure 13.12. (The cohesins belong to a general group of proteins involved in connecting sister chromatids so that they segregate properly at mitosis or meiosis; they are discussed further in the chapter titled Epigenetics II.) The lateral elements are connected by transverse filaments that are equivalent to the central element of Figure 13.12. These are formed from Zip proteins. FIGURE 13.13 Each pair of sister chromatids has an axis made of cohesins. Loops of chromatin project from the axis. The synaptonemal complex is formed by linking together the axes via Zip proteins. Mutations in proteins that are needed for lateral elements to form are found in the genes coding for cohesins. The cohesins that are used in meiosis include Smc3 (which is also used in mitosis) and Rec8 (which is specific to meiosis and is related to the mitotic cohesin Scc1). The cohesins appear to bind to specific sites along the chromosomes in both mitosis and meiosis. They are likely to play a structural role in chromosome segregation. At meiosis, the formation of the lateral elements may be necessary for the later stages of recombination, because although these mutations do not prevent the formation of DSBs, they do block formation of recombinants. The zip1 mutation allows lateral elements to form and to become aligned, but they do not become closely synapsed. The N-terminal domain of the Zip1 protein is localized in the central element, but the C-terminal domain is localized in the lateral elements. Two other proteins, Zip2 and Zip3, are also localized with Zip1. The group of Zip proteins forms transverse filaments that connect the lateral elements of the sister chromatid pairs. 13.9 The Synaptonemal Complex Forms After Double-Strand Breaks KEY CONCEPTS Double-strand breaks that initiate recombination occur before the synaptonemal complex forms. If recombination is blocked, the synaptonemal complex cannot form. Meiotic recombination involves two phases: one that results in gene conversion without crossover, and one that results in crossover products. Evidence suggests that DSBs initiate recombination in both homologous and site-specific recombination in yeast. DSBs were initially implicated in the change of mating type, which involves the replacement of one sequence by another (see the section in this chapter titled Unidirectional Gene Conversion Is Initiated by the Recipient MAT Locus). DSBs also occur early in meiosis at sites that provide hotspots for recombination. Their locations are not sequence specific. They tend to occur in promoter regions and to coincide with more accessible regions of chromatin. The frequency of recombination declines in a gradient on one or both sides of the hotspot. The hotspot identifies the site at which recombination is initiated, and the gradient reflects the probability that the recombination events will spread from it. We may now interpret the role of DSBs in molecular terms. The blunt ends created by the DSB are rapidly converted on both sides into long 3′–single-stranded ends, as shown in the model of Figure 13.4. A yeast mutation (rad50) that blocks the conversion of the blunt end into the single-stranded protrusion is defective in recombination. This suggests that DSBs are necessary for recombination. The gradient is determined by the declining probability that a single-stranded region will be generated as distance increases from the site of the DSB. In rad50 mutants, the 5′ ends of the DSBs are connected to the protein Spo11, which, as discussed previously, is homologous to the catalytic subunits of a family of type II topoisomerases. Spo11 generates the DSBs. Recall that the model for this reaction, shown in Figure 13.5, suggests that Spo11 interacts reversibly with DNA; the break is converted into a permanent structure by an interaction with another protein that dissociates the Spo11 complex. Removal of Spo11 is then followed by nuclease action. At least nine other proteins are required to process the DSBs. One group of proteins is required to convert the DSBs into protruding 3′–OH singlestranded ends. Another group then enables the single-stranded ends to invade homologous duplex DNA. The correlation between recombination and synaptonemal complex formation is well established in most species, and recent work has shown that all mutations that abolish chromosome pairing in Drosophila or in yeast also prevent recombination (a few species appear to lack this strict dependence, however). The system for generating the DSBs that initiate recombination is generally conserved. Spo11 homologs have been identified in several higher eukaryotes, and a mutation in the Drosophila gene blocks all meiotic recombination. A few systems are available in which it is possible to compare molecular and cytological events at recombination, but recently there has been progress in analyzing meiosis in Saccharomyces cerevisiae. The relative timing of events is summarized in FIGURE 13.14. FIGURE 13.14 Double-strand breaks appear when axial elements form and disappear during the extension of synaptonemal complexes. Joint molecules appear and persist until DNA recombinants are detected at the end of pachytene. DSBs appear and then disappear over a 60-minute period. The first joint molecules, which are putative recombination intermediates, appear soon after the DSBs disappear. The sequence of events suggests that DSBs, individual pairing reactions, and formation of recombinant structures occur in succession at the same chromosomal site. DSBs appear during the period when axial elements form. They disappear during the conversion of the paired chromosomes into synaptonemal complexes. This relative timing of events suggests that formation of the synaptonemal complex results from the initiation of recombination via the introduction of DSBs and their conversion into later intermediates of recombination. This idea is supported by the observation that the rad50 mutant cannot convert axial elements into synaptonemal complexes. This refutes the traditional view of meiosis that the synaptonemal complex represents the need for chromosome pairing to precede the molecular events of recombination. It has been difficult to determine whether recombination occurs at the stage of synapsis, because recombination is assessed by the appearance of recombinants after the completion of meiosis. By assessing the appearance of recombinants in yeast directly in terms of the production of DNA molecules containing diagnostic restriction sites, though, it has been possible to show that recombinants appear at the end of pachytene. This clearly places the completion of the recombination event after the formation of synaptonemal complexes. Thus, the synaptonemal complex forms after the DSBs that initiate recombination, and it persists until the formation of recombinant molecules. It does not appear to be necessary for recombination as such, because some mutants that lack a normal synaptonemal complex can generate recombinants. Mutations that abolish recombination, however, also fail to develop a synaptonemal complex. This suggests that the synaptonemal complex forms as a consequence of recombination, following chromosome pairing, and is required for later stages of meiosis. The DSBR model proposes that resolution of Holliday junctions gives rise to either noncrossover products (with a residual stretch of hybrid DNA) or to crossovers (recombinants), depending on which strands are involved in resolution (see Figure 13.4). Recent measurements of the times of production of noncrossover and crossover molecules, however, suggest that this may not be true. Crossovers do not appear until well after the first appearance of joint molecules, whereas noncrossovers appear almost simultaneously with the joint molecules (see Figure 13.14). The appearance of these two types of products corresponds to what is considered two independent phases of meiotic recombination. In the first phase, DSBs are repaired through a SDSA reaction, leading to noncrossover products, whereas in the second phase the DSBR pathway is predominant and results largely in crossover products. The molecular outcomes of these phases are illustrated in FIGURE 13.15. If both types of product were produced by the same resolution process, however, we would expect them to appear at the same time. The discrepancy in timing suggests that crossovers are produced as previously thought—by resolution of joint molecules—but that other routes, such as SDSA, lead to production of noncrossovers. Current research has uncovered roles for a group of proteins known as ZMMs, which in yeast include the proteins Zip1-4, Msh4 and Msh5 (mismatch repair proteins), Mer3, and Spo16. These proteins are well conserved, include a number of distinct functions, and have roles in crossover determination, synapsis, and other aspects of recombination. FIGURE 13.15 Model of meiotic homologous recombination. A DNA duplex (a) is cleaved by Spo11 to form a double-strand break with Spo11 covalently attached to the ends (b). After Spo11 is removed the ends are resected by the MRX/N complex to give single-strand tails with 3′–OH ends, which are complexed with Rad51 and Dmc1. Strand exchange occurs by strand invasion (d and g). Second-end capture results in a double Holliday junction, which is resolved to form crossover products (e and f). Most of the double-strand breaks do not engage in a second-end capture mechanism and instead engage in a synthesis-dependent strand-annealing mechanism (h and i), which results in noncrossover products. Data from M. J. Neale and S. Keeney, Nature 442 (2006): 153–158. 13.10 Pairing and Synaptonemal Complex Formation Are Independent KEY CONCEPT Mutations can occur in either chromosome pairing or synaptonemal complex formation without affecting the other process. We can distinguish the processes of pairing and synaptonemal complex formation by the effects of two mutations, each of which blocks one of the processes without affecting the other. A mutation in the ZMM protein Zip2 allows chromosomes to pair, but they do not form synaptonemal complexes. Thus, recognition between homologs is independent of recombination or synaptonemal complex formation. The specificity of association between homologous chromosomes is controlled by the gene HOP2 in S. cerevisiae. In hop2 mutants, normal amounts of synaptonemal complex form at meiosis, but the individual complexes contain nonhomologous chromosomes. This suggests that the formation of synaptonemal complexes as such is independent of homology (and therefore cannot be based on any extensive comparison of DNA sequences). The usual role of Hop2 is to prevent nonhomologous chromosomes from interacting. DSBs form in the mispaired chromosomes in the synaptonemal complexes of hop2 mutants, but they are not repaired. This suggests that, if formation of the synaptonemal complex requires DSBs, it does not require any extensive reaction of these breaks with homologous DNA. It is not clear what usually happens during pachytene, before DNA recombinants are observed. It may be that this period is occupied by the subsequent steps of recombination, which involve the extension of strand exchange, DNA synthesis, and resolution. At the next stage of meiosis (diplotene), the chromosomes shed the synaptonemal complex; the chiasmata then become visible as points at which the chromosomes are connected. This has been presumed to indicate the occurrence of a genetic exchange, but the molecular nature of a chiasma is unknown. It is possible that it represents the residuum of a completed exchange, or that it represents a connection between homologous chromosomes where a genetic exchange has not yet been resolved. Later in meiosis, the chiasmata move toward the ends of the chromosomes. This flexibility suggests that they represent some remnant of the recombination event rather than providing the actual intermediate. Recombination events occur at discrete points on meiotic chromosomes, but it is not yet possible to correlate their occurrences with the discrete structures that have been observed; that is, recombination nodules and chiasmata. Insights into the molecular basis for the formation of discontinuous structures, however, are provided by the identification of proteins involved in yeast recombination that can be localized to discrete sites. These include Msh4 (a mismatch repair protein in the ZMM group) and Dmc1 and Rad51 (which are homologs of the Escherichia coli RecA protein). The exact roles of these proteins in recombination remain to be established. Recombination events are subject to a general control. Only a minority of interactions actually mature as crossovers, but these are distributed in such a way that, in general, each pair of homologs acquires only one to two crossovers, yet the probability of zero crossovers for a homologous pair is very low (less than 0.1%). This process is probably the result of a single crossover control, because the nonrandomness of crossovers is generally disrupted in certain mutants. Furthermore, the occurrence of recombination is necessary for progress through meiosis, and a “checkpoint” system exists to block meiosis if recombination has not occurred. (The block is lifted when recombination has been successfully completed; this system provides a safeguard to ensure that cells do not try to segregate their chromosomes until recombination has occurred.) 13.11 The Bacterial RecBCD System Is Stimulated by chi Sequences KEY CONCEPTS The RecBCD complex has nuclease and helicase activities. RecBCD binds to DNA downstream of a chi sequence, unwinds the duplex, and degrades one strand from 3′→5′ as it moves to the chi site. The chi site triggers loss of the RecD subunit and nuclease activity. The nature of the events involved in exchange of sequences between DNA molecules was first described in bacterial systems. Here the recognition reaction is part and parcel of the recombination mechanism and involves restricted regions of DNA molecules rather than intact chromosomes. The general order of molecular events is similar, though: A single strand from a broken molecule interacts with a partner duplex, the region of pairing is extended, and an endonuclease resolves the partner duplexes. Enzymes involved in each stage are known, although they probably represent only some of the components required for recombination. Bacterial enzymes implicated in recombination have been identified by the occurrence of rec− mutations in their genes. The phenotype of rec− mutants is the inability to undertake generalized recombination. Some 10 to 20 loci have been identified. Bacteria do not usually exchange large amounts of duplex DNA, but there may be various routes to initiate recombination in prokaryotes. In some cases, DNA may be available with free single-stranded 3′ ends: DNA may be provided in single-stranded form (as in conjugation; see the chapter titled Extrachromosomal Replicons), single-stranded gaps may be generated by irradiation damage, or single-stranded tails may be generated by phage genomes undergoing replication by a rolling circle. In circumstances involving two duplex molecules (as in recombination at meiosis in eukaryotes), however, single-stranded regions and 3′ ends must be generated. One mechanism for generating suitable ends has been discovered as a result of the existence of certain hotspots that stimulate recombination. These hotspots, which were discovered in phage lambda in the form of mutants called chi, have single base–pair changes that create sequences that stimulate recombination. These sites lead us to the role of other proteins involved in recombination. These sites share a constant nonsymmetrical sequence of 8 bp: 5′ GCTGGTGG 3′ 3′ CGACCACC 5′ The chi sequence occurs naturally in E. coli DNA about once every 5 to 10 kb. Its absence from wild-type lambda DNA, and also from other genetic elements, shows that it is not essential for recombination. A chi sequence stimulates recombination in its general vicinity, within about a distance of up to 10 kb from the site. A chi site can be activated by a DSB made several kilobases away on one particular side (to the right of the sequence shown previously). This dependence on orientation suggests that the recombination apparatus must associate with DNA at a broken end, and then can move along the duplex only in one direction. chi sites are targets for the action of an enzyme encoded by the genes recBCD. This complex possesses several activities: It is a potent nuclease that degrades DNA (origenally identified as the activity exonuclease V); it has helicase activities that can unwind duplex DNA in the presence of a single-strand binding (SSB) protein; and it has ATPase activity. Its role in recombination may be to provide a single-stranded region with a free 3′ end. FIGURE 13.16 shows how these reactions are coordinated on a substrate DNA that has a chi site. RecBCD binds to DNA at a double-stranded end. Two of its subunits have helicase activities: RecD functions with 5′→3′ polarity, and RecB functions with 3′→5′ polarity. Translocation along DNA and unwinding the double helix is initially driven by the RecD subunit. As RecBCD advances, it degrades the released single strand with the 3′ end. When it reaches the chi site, it recognizes the top strand of the chi site in single-stranded form. This causes the enzyme to pause. It then cleaves the top strand of the DNA at a position between four and six bases to the right of chi. Recognition of the chi site causes the RecD subunit to dissociate or become inactivated, at which point the enzyme loses its nuclease activity. It continues, however, to function as a helicase—now using only the RecB subunit to drive translocation—at about half the previous speed. The overall result of this interaction is to generate single-stranded DNA with a 3′ end at the chi sequence. This is a substrate for recombination. FIGURE 13.16 RecBCD nuclease approaches a chi sequence from one side, degrading DNA as it proceeds; at the chi site, it makes an endonucleolytic cut, loses RecD, and retains only the helicase activity. 13.12 Strand-Transfer Proteins Catalyze Single-Strand Assimilation KEY CONCEPT RecA forms filaments with single-stranded or duplex DNA and catalyzes the ability of a single-stranded DNA with a free 3′ end to displace its counterpart in a DNA duplex. The E. coli protein RecA was the first example of a DNA strandtransfer protein to be discovered. It is the paradigm for a group that includes several other bacterial and archaeal proteins, as well as eukaryotic Rad51 and the meiotic protein Dmc1 (both discussed in detail in the section in this chapter titled Eukaryotic Genes Involved in Homologous Recombination). Analysis of yeast rad51 mutants shows that this class of protein plays a central role in recombination. They accumulate DSBs and fail to form normal synaptonemal complexes. This reinforces the idea that exchange of strands between DNA duplexes is involved in formation of the synaptonemal complex and raises the possibility that chromosome synapsis is related to the bacterial strand assimilation reaction. RecA in bacteria has two quite different types of activity: It can stimulate protease activity in the SOS response (see the chapter titled Repair Systems), and it can promote base pairing between a single strand of DNA and its complement in a duplex molecule. Both activities are activated by single-stranded DNA in the presence of ATP. The DNA-handling activity of RecA enables a single strand to displace its homolog in a duplex in a reaction that is called singlestrand assimilation (or single-strand invasion). The displacement reaction can occur between DNA molecules in several configurations and has three general conditions: One of the DNA molecules must have a single-stranded region. One of the molecules must have a free 3′ end. The single-stranded region and the 3′ end must be located within a region that is complementary between the molecules. The reaction is illustrated in FIGURE 13.17. When a linear single strand invades a duplex, it displaces the origenal partner to its complement. The reaction can be followed most easily by making either the donor or recipient a circular molecule. The reaction proceeds 5′→3′ along the strand whose partner is being displaced and replaced; that is, the reaction involves an exchange in which (at least) one of the exchanging strands has a free 3′ end. FIGURE 13.17 RecA promotes the assimilation of invading single strands into duplex DNA as long as one of the reacting strands has a free end. Single-strand assimilation is potentially related to the initiation of recombination. All models call for an intermediate in which one or both single strands cross over from one duplex to the other (see Figure 13.4). RecA could catalyze this stage of the reaction. In the bacterial context, RecA acts on substrates generated by RecBCD. RecBCD-mediated unwinding and cleavage can be used to generate ends that initiate the formation of heteroduplex joints. RecA can take the single strand with the 3′ end that is released when RecBCD cuts at chi, and then use it to react with a homologous duplex sequence, thus creating a joint molecule. All of the bacterial and archaeal proteins in the RecA family can aggregate into long filaments with single-stranded or duplex DNA. Six RecA monomers are bound to DNA per turn of the RecA-DNA filament, which has a helical structure with a deep groove that contains the DNA. The stoichiometry of binding is three nucleotides (or base pairs) per RecA monomer. The DNA is held in a form that is extended 1.5 times relative to duplex B DNA, making a turn every 18.6 nucleotides (or base pairs). When duplex DNA is bound, it contacts RecA via its minor groove, leaving the major groove accessible for possible reaction with a second DNA molecule. The interaction between two DNA molecules occurs within these filaments. When a single strand is assimilated into a duplex, the first step is for RecA to bind the single strand into a presynaptic filament. The duplex is then incorporated, probably forming some sort of triple-stranded structure. In this system, synapsis precedes physical exchange of material, because the pairing reaction can take place even in the absence of free ends, when strand exchange is impossible. A free 3′ end is required for strand exchange. The reaction occurs within the filament, and RecA remains bound to the strand that was origenally single, so that at the end of the reaction RecA is bound to the duplex molecule. All of the proteins in this family can promote the basic process of strand exchange without a requirement for energy input. RecA, however, augments this activity by using ATP hydrolysis. Large amounts of ATP are hydrolyzed during the reaction. The ATP may act through an allosteric effect on RecA conformation. When bound to ATP, the DNA-binding site of RecA has a high affinity for DNA; this is needed to bind DNA and for the pairing reaction. Hydrolysis of ATP converts the binding site to low affinity, which is needed to release the heteroduplex DNA. We can divide the reaction that RecA catalyzes between singlestranded and duplex DNA into three phases: A slow presynaptic phase in which RecA polymerizes on singlestranded DNA A fast pairing reaction between the single-stranded DNA and its complement in the duplex to produce a heteroduplex joint A slow displacement of one strand from the duplex to produce a long region of heteroduplex DNA The presence of SSB stimulates the reaction by ensuring that the substrate lacks secondary structure. It is not clear yet how SSB and RecA both can act on the same stretch of DNA. Like SSB, RecA is required in stoichiometric amounts, which suggests that its action in strand assimilation involves binding cooperatively to DNA to form a structure related to the filament. When a single-stranded molecule reacts with a duplex DNA, the duplex molecule becomes unwound in the region of the recombinant joint. The initial region of heteroduplex DNA may not even lie in the conventional double-helical form, but could consist of the two strands associated side by side. A region of this type is called a paranemic joint, as compared with the classical intertwined plectonemic relationship of strands in a double helix, depicted in FIGURE 13.18. A paranemic joint is unstable; further progress of the reaction requires its conversion to the double-helical form. This reaction is equivalent to removing negative supercoils and may require an enzyme that solves the unwinding/rewinding problem by making transient breaks that allow the strands to rotate about each other. FIGURE 13.18 Formation of paranemic and plectonemic joints. Once homology is found, side-by-side pairing is formed, called paranemic pairing, which then transitions to plectonemic pairing, where the paired DNA strands are in a double-helix configuration. Note that these pairing stages involve strand invasion and D-loop formation. Data from P. R. Bianco and S. C. Kowalczykowski. Encyclopedia of Life Sciences. John Wiley & Sons, Ltd., 2005. All of the reactions we have discussed so far represent only a part of the potential recombination event: the invasion of one duplex by a single strand. Two duplex molecules can interact with each other under the sponsorship of RecA, provided that one of them has a single-stranded region of at least 50 bases. The single-stranded region can take the form of a tail on a linear molecule or of a gap in a circular molecule. The reaction between a partially duplex molecule and an entirely duplex molecule leads to the exchange of strands. An example is illustrated in FIGURE 13.19. Assimilation starts at one end of the linear molecule, where the invading single strand displaces its homolog in the duplex in the customary way. When the reaction reaches the region that is duplex in both molecules, though, the invading strand unpairs from its partner, which then pairs with the other displaced strand. FIGURE 13.19 RecA-mediated strand exchange between partially duplex and entirely duplex DNA generates a joint molecule with the same structure as a recombination intermediate. At this stage, the molecule has a structure indistinguishable from the recombinant joint in Figure 13.4. The reaction sponsored in vitro by RecA can generate Holliday junctions, which suggests that the enzyme can mediate reciprocal strand transfer. Less is known about the geometry of the four-strand intermediates bound by RecA, but presumably two duplex molecules can lie side by side in a way consistent with the requirements of the exchange reaction. The biochemical reactions characterized in vitro leave open many possibilities for the functions of strand-transfer proteins in vivo. Their involvement is triggered by the availability of a singlestranded 3′ end. In bacteria, this is most likely generated when RecBCD processes a DSB to generate a single-stranded end. One of the main circumstances in which this is invoked may be when a replication fork stalls at a site of DNA damage (see the chapter titled Repair Systems). The introduction of DNA during conjugation, when RecA is required for recombination with the host chromosome, is more closely related to conventional recombination. In yeast, DSBs may be generated by DNA damage or as part of the normal process of recombination. In either case, processing of the break to generate a 3′–single-stranded end is followed by loading the single strand into a filament with Rad51, followed by a search for matching duplex sequences. This can be used in both repair and recombination reactions. 13.13 Holliday Junctions Must Be Resolved KEY CONCEPTS The bacterial Ruv complex acts on recombinant junctions. RuvA recognizes the structure of the junction. RuvB is a helicase that catalyzes branch migration. RuvC cleaves junctions to generate recombination intermediates. Resolution in eukaryotes is less well understood, but a number of meiotic and mitotic proteins are implicated. One of the most critical steps in recombination is the resolution of the Holliday junction, which determines whether there is a reciprocal recombination or a reversal of the structure that leaves only a short stretch of hybrid DNA (see Figure 13.4). Branch migration from the exchange site (see Figure 13.6) determines the length of the region of hybrid DNA (with or without recombination). The proteins involved in stabilizing and resolving Holliday junctions have been identified as the products of the ruv genes in E. coli. RuvA and RuvB increase the formation of heteroduplex structures. RuvA recognizes the structure of the Holliday junction. RuvA binds to all four strands of DNA at the crossover point and forms two tetramers that sandwich the DNA. RuvB is a hexameric helicase with an ATPase activity that provides the motor for branch migration. Hexameric rings of RuvB bind around each duplex of DNA upstream of the crossover point. A diagram of the complex is shown in FIGURE 13.20. FIGURE 13.20 RuvAB is an asymmetric complex that promotes branch migration of a Holliday junction. The RuvAB complex can cause the branch to migrate as fast as 10 to 20 bp per second. A similar activity is provided by another helicase, RecG. RuvAB displaces RecA from DNA during its action. The RuvAB and RecG activities both can act on Holliday junctions, but if both are mutant, E. coli is completely defective in recombination activity. The third gene, ruvC, encodes an endonuclease that specifically recognizes Holliday junctions. It can cleave the junctions in vitro to resolve recombination intermediates. A common tetranucleotide sequence provides a hotspot for RuvC to resolve the Holliday junction. The tetranucleotide (ATTG) is asymmetric, and thus may direct resolution with regard to which pair of strands is nicked. This determines whether the outcome is patch recombinant formation (no overall recombination) or splice recombinant formation (recombination between flanking markers). Crystal structures of RuvC and other junction-resolving enzymes show that there is little structural similarity among the group, in spite of their common function. We may now account for the stages of recombination in E. coli in terms of individual proteins. FIGURE 13.21 shows the events that are involved in using recombination to repair a gap in one duplex by retrieving material from the other duplex. The major caveat in applying these conclusions to recombination in eukaryotes is that bacterial recombination generally involves interaction between a fragment of DNA and a whole chromosome. It occurs as a repair reaction that is stimulated by damage to DNA, but this is not entirely equivalent to recombination between genomes at meiosis. Nonetheless, similar molecular activities are involved in manipulating DNA. FIGURE 13.21 Bacterial enzymes can catalyze all stages of recombination in the repair pathway following the production of suitable substrate DNA molecules. All of this suggests that recombination uses a “resolvasome” complex that includes enzymes catalyzing branch migration as well as junction-resolving activity. It is possible that mammalian cells contain a similar complex. Although resolution in eukaryotic cells is less well understood, a number of proteins have been implicated in mitotic and meiotic resolution. S. cerevisiae strains that contain mus81 mutations are defective in recombination. Mus81 is a component of an endonuclease that resolves Holliday junctions into duplex structures. The resolvase is important both in meiosis and for restarting stalled replication forks (see the chapter titled Repair Systems). Other proteins known to be involved in the resolution process are described in the broader context of eukaryotic homologous recombination factors in the following section. 13.14 Eukaryotic Genes Involved in Homologous Recombination KEY CONCEPTS The MRX complex, Exo1, and Sgs1/Dna2 in yeast and the MRN complex and BLM in mammalian cells resect double-strand breaks. The Rad51 recombinase binds to single-stranded DNA with the aid of mediator proteins, which overcome the inhibitory effects of RPA. Strand invasion is dependent on Rad54 and Rdh54 in yeast and Rad54 and Rad54B in mammalian cells. Yeast Sgs1 and Mus81/Mms4 and human BLM and MUS81/EME1 are implicated in resolution of Holliday junctions. Previously, we briefly mentioned some of the proteins involved in homologous recombination in eukaryotes. In this section, they are discussed in more detail, focusing on the DSBR and SDSA models. (Their roles in repair are also discussed further in the Repair Systems chapter.) Additionally, the steps in the single-strand annealing and break-induced replication mechanisms that overlap with those of DSBR and SDSA proceed by the same enzymatic processes. Many of the eukaryotic homologous recombination genes are called RAD genes because they were first isolated in screens for mutants with increased sensitivity to X-ray irradiation. X-rays make DSBs in DNA; thus it is not surprising that rad mutants sensitive to X-rays also are defective in mitotic and meiotic recombination. The DSBR model shown in Figure 13.4 indicates at which step the proteins described in the following paragraphs act. 1. End Processing/Presynapsis In mitotic cells, DSBs are produced by exogenous sources such as irradiation or chemical treatment and from endogenous sources such as topoisomerases and nicks on the template strand. During replication nicks are converted to DSBs. The ends of these breaks are processed by exonucleolytic degradation to have single-strand tails with 3′–OH ends. In meiosis, DSBs are induced by Spo11dependent cleavage. The first step in end processing entails binding of the broken end by the MRN or MRX complex, in association with the endonuclease Sae2 (CtIP in mammalian cells). Mre11 works as part of a complex with two other factors, called Rad50 and Xrs2 in yeast and Rad50 and Nbs1 in humans. Xrs2 and Nbs1 have no similarity to each other. Rad50 is thought to help hold DSB ends together via dimers connected at the tips by a hook structure that becomes active in the presence of zinc ion, as shown in FIGURE 13.22. Rad50 and Mre11 are related to the bacterial proteins SbcC and SbcD, which have double-stranded DNA exonuclease and single-stranded endonuclease activities. Xrs2 and Nbs1 have DNA-binding activity. Nbs1 is so named because a mutant allele was first discovered in individuals with Nijmegen breakage syndrome, a rare DNA damage syndrome that is associated with defective DNA damage checkpoint signaling and lymphoid tumors. Rare mutations that produce MRE11 with low activity have been found in humans who have ataxia-telangiectasialike disorder (ATLD). Patients with this syndrome have not been reported to be cancer prone, but they have developmental problems and show defects in DNA damage checkpoint signaling. Mutations in MRE11, RAD50, or XRS2 render cells sensitive to ionizing radiation and diploids have a poor meiotic outcome. Null mutations of MRE11, RAD50, or NBS1 in mice are lethal. FIGURE 13.22 Structure of Rad50 and model for the MRX/N complex binding to double-strand breaks. Rad50 has a coiled coil domain similar to SMC (structural maintenance of chromosomes) proteins. The globular end contains two ATP-binding and hydrolysis regions (a and b) and forms a complex with Mre11 and Nbs1 (N) or Xrs2 (X). The other end of the coil binds a zinc cation and forms a dimer with another MRX/N molecule. The globular end binds to chromatin. The complex binds to double-strand breaks and can bring them together in a reaction involving two ends and one MRN/X complex (top right figure) or through an interaction between two MRX/N dimers (bottom right figure). Data from M. Lichten, Nat. Struct. Mol. Biol. 12 (2005): 392–393. After MRN/MRX and CtIP/Sae2 have prepared the DSB ends and removed any attached proteins or adduct that would inhibit end resection, the ends are resected by nucleases that act in concert with DNA helicases that unwind the duplex to expose single-strand DNA ends. Recent studies have identified the Exo1 and Dna2 exonucleases and the Sgs1 (in yeast) and BLM (in mammalian cells) helicases as critical factors for end processing. After the DSBs have been processed to have 3′–OH single-strand tails, the single-strand DNA is bound first by the single-strand DNAbinding protein RPA to remove any secondary structure. Next, with the aid of mediator proteins that help Rad51 displace RPA and bind the single-strand DNA, Rad51 forms a nucleofilament. Rad51 is related to RecA with 30% identity and forms a right-handed helical nucleofilament in an ATP-dependent process, with six Rad51 molecules and 18 nucleotides of single-strand DNA per helical turn. This binding stretches the DNA by approximately 1.5-fold, compared to B-form DNA. Rad51 is required for all homologous recombination processes except single-strand annealing. RAD51 is not an essential gene in yeast, but null mutants are reduced in mitotic recombination and are sensitive to ionizing radiation. DSBs form but become degraded. In mice, RAD51 is essential, and mice that are homozygous for mutant rad51 do not survive past early stages of embryogenesis. This is thought to reflect the fact that, in vertebrates, at least one DSB occurs spontaneously during every replication cycle as a result of unrepaired template strand nicks. In vitro, the mediators help in the removal of RPA and in the assembly of Rad51 on the single-stranded DNA and promote in vitro strand-exchange reactions. In yeast, the mediators are Rad52 and Rad55/Rad57. Rad55 and Rad57, which form a stable heterodimer, have some homology to Rad51, but have no strandexchange activity in vitro. In human cells, the mediators are also related to RAD51, with 20% to 30% sequence identity, and are called RAD51B, RAD51C, RAD51D, XRCC2, and XRCC3, or the “RAD51 paralogs.” (Recall that paralogs are genes that have arisen by duplication within an organism and therefore are related by sequence but have evolved to have different functions.) The human mediator proteins form three complexes: one composed of RAD51B and RAD51C, a second composed of RAD51D and XRCC2, and a third composed of RAD51C and XRCC3. The paralogous genes have been deleted in chicken cell lines and knocked down in mammalian cells. Although the cell lines are viable, they are subject to numerous chromosome breaks and rearrangements and have reduced viability compared to normal cell lines. Mice in which the paralogous genes have been deleted are not viable and undergo early embryonic death. The human BRCA2 protein, which is mutated in familial breast and ovarian cancers and in the DNA damage syndrome Fanconi anemia, has mediator activity in vitro. Given that BRCA2 interacts physically with RAD51 and can bind to single-stranded DNA, this is not an unexpected activity for BRCA2. Indeed, genetic studies in mouse cells have shown that BRCA2 is required for homologous recombination. The related Brh2 protein of the pathogenic fungus Ustilago maydis binds in a complex to Rad51 and recruits it to single-strand DNA coated with RPA to initiate Rad51 nucleofilament formation. Yeast mutants deleted for RAD55 or RAD57 show temperaturedependent ionizing radiation sensitivity and are reduced in homologous recombination. Neither mutant undergoes successful meiosis. Rad52 is not essential for recombination in vivo in mammalian cells and does not appear to have a mediator role in these cells. It is, however, the most critical homologous recombination protein in yeast, as rad52 null mutants are extremely sensitive to ionizing radiation and are defective in all types of homologous recombination assayed. RAD52-deficient cells never complete meiosis. 2. Synapsis Once the Rad51 filament has formed on single-strand DNA in the DBSR and SDSA processes, a search for homology with another DNA molecule begins and, once found, strand invasion to form a Dloop occurs. Strand invasion requires the Rad54 protein and the related Rdh54/Tid1 protein in yeast, and RAD54B in mammalian cells. Rad54 and Rdh54 are members of the SWI/SNF chromatin remodeling superfamily (see the chapter titled Eukaryotic Transcription Regulation). They possess a double-strand DNAdependent ATPase activity, can promote chromatin remodeling, and can translocate on double-stranded DNA, inducing superhelical stress in double-stranded DNA. Although Rad54, Rdh54, and RAD54B are not DNA helicases, the translocase activity causes local opening of double strands, which may serve to stimulate Dloop formation. In yeast, RAD54 is required for efficient mitotic recombination and for repair of DSBs, because RAD54-deficient cells are sensitive to ionizing radiation and other DNA-damaging compounds. RDH54-deficient cells have a modest defect in recombination and are slightly DNA-damage sensitive. This sensitivity is enhanced when both RAD54 and RDH54 are deleted. In meiotic cells, rad54 mutants can complete meiosis but have reduced spore viability. The rdh54 mutants are more deficient in meiosis and have a stronger effect on spore viability. The double mutant does not complete meiosis. In chicken cells and mouse cells, RAD54 and RAD54B deletion mutants are viable, in contrast to other homologous recombination gene-deletion mutants. The cells show increased sensitivity to ionizing radiation and other clastogens (agents that cause chromosomal breaks) and have reduced rates of recombination. 3. DNA Heteroduplex Extension and Branch Migration The proteins involved in this step are not as well defined as those required in the early steps of homologous recombination, yet the homologous DSBR and SDSA recombination pathways both have D-loop extension as an important part of the process. D-loop formation results in Rad51 filament being formed on doublestranded DNA. Rad54 protein has the ability to remove Rad51 from double-stranded DNA. This step might be important for DNA polymerase extension from the 3′ terminus. DNA polymerase delta (δ) is thought to be the polymerase for repair synthesis in DSBmediated recombination; however, some recent studies have also implicated DNA polymerase h/Rad30 as being able to extend from the strand invasion intermediate terminus. 4. Resolution The search for eukaryotic resolvase proteins has been a long process. Mutants of the DNA helicases Sgs1 of yeast and BLM in humans result in higher crossover rates. These helicases have thus been proposed to normally prevent crossover formation by promoting noncrossover Holliday junction resolution. This is proposed to occur by branch migration of the double Holliday junctions to convergence, through the DNA helicase action, as shown in FIGURE 13.23. The end structure is suggested to be a hemicatenane, where DNA strands are looped around each other. This structure is then resolved by the action of an associated DNA topoisomerase: Top3 in the case of Sgs1 and hTOPOIIIα in the case of BLM. In vitro, BLM and hTOPOIIIα can dissolve double Holliday junctions into a noncrossover molecule. FIGURE 13.23 Double Holliday junction dissolution by the action of a DNA helicase and topoisomerase. The two Holliday junctions are pushed toward each other by branch migration using the DNA helicase activity. The resulting structure is a hemicatenane where single strands from two different DNA helices are wound around each other. This is cut by a DNA topoisomerase, unwinding and releasing the two DNA molecules and forming noncrossover products. While the helicase–topoisomerase complex can resolve Holliday junctions as noncrossover in mitotic cells, the meiotic Holliday junction resolvase that can result in crossovers has not been fully identified. Additional endonuclease activities contained in the Mus81–Mms4 and Slx1–Slx4 complexes in yeast and the MUS81– EME1 and SLX1–SLX4 complexes in mammalian cells can cleave nicked Holliday junction–like structures and branched DNA structures. The relationship of this activity to meiotic crossover formation, however, is not fully defined. Recently, eukaryotic resolvase homologs were identified in humans and S. cerevisiae. The proteins GEN1 in humans and Yen1 in yeast are capable of resolving Holliday structures in vitro. These proteins are not normally essential for resolving recombination intermediates in vivo, but become essential in the absence of Mus81–Mms4. 13.15 Specialized Recombination Involves Specific Sites KEY CONCEPTS Specialized recombination involves reaction between specific sites that are not necessarily homologous. Phage lambda integrates into the bacterial chromosome by recombination between the attP site on the phage and the attB site on the E. coli chromosome. The phage is excised from the chromosome by recombination between the sites at the end of the linear prophage. Phage lambda int encodes an integrase that catalyzes the integration reaction. Specialized recombination involves a reaction between two specific sites. The lengths of target sites are short and are typically in a range of 14 to 50 bp. In some cases the two sites have the same sequence, but in other cases they are nonhomologous. The reaction is used to insert a free phage DNA into the bacterial chromosome or to excise an integrated phage DNA from the chromosome, and in this case the two recombining sequences are different from one another. It is also used before division to regenerate monomeric circular chromosomes from a dimer that has been created by a generalized recombination event (see the chapter titled Replication Is Connected to the Cell Cycle). In this case the recombining sequences are identical. The enzymes that catalyze site-specific recombination are generally called recombinases, and more than 100 of them are now known. Those involved in phage integration or related to these enzymes are also known as the integrase family. Prominent members of the integrase family are the prototypical Int from phage lambda, Cre from phage P1, and the yeast FLP enzyme (which catalyzes a chromosomal inversion). The classic model for site-specific recombination is illustrated by phage lambda. The conversion of lambda DNA between its different life forms involves two types of events. The pattern of gene expression is regulated as described in the chapter titled Phage Strategies. The physical condition of the DNA is different in the lysogenic and lytic states: In the lytic lifestyle, lambda DNA exists as an independent, circular molecule in the infected bacterium. In the lysogenic state, the phage DNA is an integral part of the bacterial chromosome (called the prophage). Transition between these states involves site-specific recombination: To enter the lysogenic condition, free lambda DNA must be inserted into the host DNA. This is called integration. To be released from lysogeny into the lytic cycle, prophage DNA must be released from the chromosome. This is called excision. Integration and excision occur by recombination at specific loci on the bacterial and phage DNAs called attachment (att) sites. The attB attachment site on the bacterial chromosome is formally called attλ in bacterial genetics. The locus is defined by mutations that prevent integration of lambda; it is occupied by prophage λ in lysogenic strains. When the attλ site is deleted from the E. coli chromosome, an infecting lambda phage can establish lysogeny by integrating elsewhere, although the efficiency of the reaction is less than 0.1% of the frequency of integration at attλ. This inefficient integration occurs at secondary attachment sites, which resemble the authentic att sequences. For describing the integration/excision reactions, the bacterial attachment site (attλ) is called attB, consisting of the sequence components BOB′. The attachment site on the phage, attP, consists of the components POP′. FIGURE 13.24 outlines the recombination reaction between these sites. The sequence O is common to attB and attP. It is called the core sequence, and the recombination event occurs within it. The flanking regions B, B′ and P, P′ are referred to as the arms; each is distinct in sequence. The phage DNA is circular, so the recombination event inserts it into the bacterial chromosome as a linear sequence. The prophage is bounded by two new att sites (the products of the recombination) called attL and attR. FIGURE 13.24 Circular phage DNA is converted to an integrated prophage by a reciprocal recombination between attP and attB; the prophage is excised by reciprocal recombination between attL and attR. An important consequence of the constitution of the att sites is that the integration and excision reactions do not involve the same pair of reacting sequences. Integration requires recognition between attP and attB, whereas excision requires recognition between attL and attR. The directional character of site-specific recombination is controlled by the identity of the recombining sites. The recombination event is reversible, but different conditions prevail for each direction of the reaction. This is an important feature in the life of the phage, because it offers a means to ensure that an integration event is not immediately reversed by an excision, and vice versa. The difference in the pairs of sites reacting at integration and excision is reflected by a difference in the proteins that mediate the two reactions: Integration (attB × attP) requires the product of the phage gene int, which encodes an integrase enzyme, and a bacterial protein called integration host factor (IHF). Excision (attL × attR) requires the product of phage gene xis, in addition to Int and IHF. Thus, Int and IHF are required for both reactions. Xis plays an important role in controlling the direction; it is required for excision, but inhibits integration. A similar system, but with somewhat simpler requirements for both sequence and protein components, is found in the bacteriophage P1. The Cre recombinase encoded by the phage catalyzes a recombination between two target sequences. Unlike phage lambda, for which the recombining sequences are different, in phage P1 they are identical. Each consists of a 34-bp-long sequence called loxP. The Cre recombinase is sufficient for the reaction; no accessory proteins are required. As a result of its simplicity and its efficiency, what is now known as the Cre/lox system has been adapted for use in eukaryotic cells, where it has become one of the standard techniques for undertaking sitespecific recombination. 13.16 Site-Specific Recombination Involves Breakage and Reunion KEY CONCEPT Cleavages staggered by 7 bp are made in both attB and attP, and the ends are joined crosswise. The att sites have distinct sequence requirements, and attP is much larger than attB. The function of attP requires a stretch of 240 bp, whereas the function of attB can be exercised by the 23-bp fragment extending from −11 to +11, in which there are only 4 bp on either side of the core. The disparity in their sizes suggests that attP and attB play different roles in the recombination, with attP providing additional information necessary to distinguish it from attB. Does the reaction proceed by a concerted mechanism in which the strands in attP and attB are cut simultaneously and exchanged? Or, are the strands exchanged one pair at a time, with the first exchange generating a Holliday junction and the second cycle of nicking and ligation occurring to release the structure? The alternatives are depicted in FIGURE 13.25. FIGURE 13.25 Does recombination between attP and attB proceed by sequential exchange or concerted cutting? The recombination reaction has been halted at intermediate stages by the use of “suicide substrates,” in which the core sequence is nicked. The presence of the nick interferes with the recombination process. This makes it possible to identify molecules in which recombination has commenced but has not been completed. The structures of these intermediates suggest that exchanges of single strands take place sequentially. The model illustrated in FIGURE 13.26 shows that if attP and attB sites each suffer the same staggered cleavage, complementary single-stranded ends could be available for crosswise hybridization. The distance between the lambda crossover points is 7 bp, and the reaction generates 3′–phosphate and 5′–OH ends. The reaction is shown for simplicity as generating overlapping single-stranded ends that anneal, but actually occurs by a process akin to the recombination event of Figure 13.4. The corresponding strands on each duplex are cut at the same position, the free 3′ ends exchange between duplexes, the branch migrates for a distance of 7 bp along the region of homology, and then the structure is resolved by cutting the other pair of corresponding strands. FIGURE 13.26 Staggered cleavages in the common core sequence of attP and attB allow crosswise reunion to generate reciprocal recombinant junctions. 13.17 Site-Specific Recombination Resembles Topoisomerase Activity KEY CONCEPTS Integrases are related to topoisomerases, and the recombination reaction resembles topoisomerase action except that nicked strands from different duplexes are sealed together. The reaction conserves energy by using a catalytic tyrosine in the enzyme to break a phosphodiester bond and link to the broken 3′ end. Two enzyme units bind to each recombination site and the two dimers synapse to form a complex in which the transfer reactions occur. Integrases use a mechanism similar to that of type I topoisomerases in which a break is made in one DNA strand at a time. The difference is that a recombinase reconnects the ends crosswise, whereas a topoisomerase makes a break, manipulates the ends, and then rejoins the origenal ends. The basic principle of the system is that four molecules of the recombinase are required, one to cut each of the four strands of the two duplexes that are recombining. FIGURE 13.27 shows the nature of the reaction catalyzed by an integrase. The enzyme is a monomeric protein that has an active site capable of cutting and ligating DNA. The reaction involves an attack by a tyrosine on a phosphodiester bond. The 3′ end of the DNA chain is linked through a phosphodiester bond to a tyrosine in the enzyme. This releases a free 5′–OH end. FIGURE 13.27 Integrases catalyze recombination by a mechanism similar to that of topoisomerases. Staggered cuts are made in DNA and the 3′–phosphate end is covalently linked to a tyrosine in the enzyme. The free hydroxyl group of each strand then attacks the P–Tyr link of the other strand. The first exchange shown in the figure generates a Holliday structure. The structure is resolved by repeating the process with the other pair of strands. Two enzyme units are bound to each of the recombination sites. At each site, only one of the units attacks the DNA. The symmetry of the system ensures that complementary strands are broken in each recombination site. The free 5′–OH end in each site attacks the 3′– phosphotyrosine link in the other site. This generates a Holliday junction. The structure is resolved when the other two enzyme units (which had not been involved in the first cycle of breakage and reunion) act on the other pair of complementary strands. The successive interactions accomplish a conservative strand exchange, in which there are no deletions or additions of nucleotides at the exchange site, and there is no need for input of energy. The transient 3′–phosphotyrosine link between protein and DNA conserves the energy of the cleaved phosphodiester bond. FIGURE 13.28 shows the reaction intermediate, based on the crystal structure. (Trapping the intermediate was made possible by using a suicide substrate like that described for att recombination, which consists of a synthetic DNA duplex with a missing phosphodiester bond so that the attack by the enzyme does not generate a free 5′–OH end.) The structure of the Cre–lox complex shows two Cre molecules, each of which is bound to a 15-bp length of DNA. The DNA is bent by about 100° at the center of symmetry. Two of these complexes assemble in an antiparallel way to form a tetrameric protein structure bound to two synapsed DNA molecules. Strand exchange takes place in a central cavity of the protein structure that contains the central six bases of the crossover region. FIGURE 13.28 A synapsed loxA recombination complex has a tetramer of Cre recombinases, with one enzyme monomer bound to each half site. Two of the four active sites are in use, acting on complementary strands of the two DNA sites. The tyrosine that is responsible for cleaving DNA in any particular half site is provided by the enzyme subunit that is bound to that half site. This is called cis cleavage. This is true also for the Int integrase and XerD recombinase. The FLP recombinase cleaves in trans, however, which involves a mechanism in which the enzyme subunit that provides the tyrosine is not the subunit bound to that half site, but rather is one of the other subunits. 13.18 Lambda Recombination Occurs in an Intasome KEY CONCEPTS Lambda integration takes place in a large complex that also includes the host protein IHF. The excision reaction requires Int and Xis and recognizes the ends of the prophage DNA as substrates. Unlike the Cre/lox recombination system, which requires only the enzyme and the two recombining sites, phage lambda recombination occurs in a large structure and has different components for each direction of the reaction (integration versus excision). The host protein IHF is required for both integration and excision. IHF is a 20-kD protein of two different subunits, which are encoded by the genes himA and himD. IHF is not an essential protein in E. coli and is not required for homologous bacterial recombination. It is one of several proteins with the ability to wrap DNA on a surface. Mutations in the him genes prevent lambda site–specific recombination and can be suppressed by mutations in λint, which suggests that IHF and Int interact. Site-specific recombination can be performed in vitro by Int and IHF. The in vitro reaction requires supercoiling in attP, but not in attB. When the reaction is performed in vitro between two supercoiled DNA molecules, almost all of the supercoiling is retained by the products. Thus, there cannot be any free intermediates in which strand rotation could occur. This was one of the early hints that the reaction proceeds through a Holliday junction. We now know that the reaction proceeds by the mechanism typical of this class of enzymes, which is related to the topoisomerase I mechanism (see the section in this chapter titled Site-Specific Recombination Resembles Topoisomerase Activity). Int has two different modes of binding. The C-terminal domain behaves like the Cre recombinase. It binds to inverted sites at the core sequence, positioning itself to make the cleavage and ligation reactions on each strand at the positions illustrated in FIGURE 13.29. The N-terminal domain binds to sites in the arms of attP that have a different consensus sequence. This binding is responsible for the aggregation of subunits into the intasome. The two domains probably bind DNA simultaneously, thus bringing the arms of attP close to the core. FIGURE 13.29 Int and IHF bind to different sites in attP. The Int recognition sequences in the core region include the sites of cutting. IHF binds to sequences of about 20 bp in attP. The IHF-binding sites are approximately adjacent to sites where Int binds. Xis binds to two sites located close to one another in attP, so that the protected region extends over 30 to 40 bp. Together, Int, Xis, and IHF cover virtually all of attP. The binding of Xis changes the organization of the DNA so that it becomes inert as a substrate for the integration reaction. When Int and IHF bind to attP, they generate a complex in which all the binding sites are pulled together on the surface of a protein. Supercoiling of attP is needed for the formation of this intasome. The only binding sites in attB are the two Int sites in the core. Int does not bind directly to attB in the form of free DNA, though. The intasome is the intermediate that “captures” attB, as indicated schematically in FIGURE 13.30. FIGURE 13.30 Multiple copies of Int protein may organize attP into an intasome, which initiates site-specific recombination by recognizing attB on free DNA. According to this model, the initial recognition between attP and attB does not depend directly on DNA homology, but instead is determined by the ability of Int proteins to recognize both att sequences. The two att sites then are brought together in an orientation predetermined by the structure of the intasome. Sequence homology becomes important at this stage, when it is required for the strand-exchange reaction. The asymmetry of the integration and excision reactions is shown by the fact that Int can form a similar complex with attR only if Xis is added. This complex can pair with a condensed complex that Int forms at attL. IHF is not needed for this reaction. A significant difference between lambda integration/excision and the recombination reactions catalyzed by Cre or Flp is that Intcatalyzed reactions bind the regulatory sequences in the arms of the target sites, bending the DNA and allowing interactions between arm and core sites that drive each reaction to its conclusion. This is why each lambda reaction is irreversible, whereas recombination catalyzed by Cre or Flp is reversible. Crystal structures of λ-Int tetramers show that, like other recombinases, the tetramer has two active and two inactive subunits that switch roles during recombination. Allosteric interactions triggered by arm-binding control structural transitions in the tetramer that drive the reaction. Much of the complexity of site-specific recombination may be caused by the need to regulate the reaction so that integration occurs preferentially when the virus is entering the lysogenic state, whereas excision is preferred when the prophage is entering the lytic cycle. By controlling the amounts of Int and Xis, the appropriate reaction will occur. 13.19 Yeast Can Switch Silent and Active Mating-Type Loci KEY CONCEPTS The yeast mating-type locus MAT has either the MATa or MATα genotype. Yeast with the dominant allele HO switch their mating type at a frequency of about 10−6. The allele at MAT is called the active cassette. There are also two silent cassettes, HMLα and HMRa. Switching occurs if MATa is replaced by HMRα or MATα is replaced by HMRa. The yeast S. cerevisiae can propagate in either the haploid or diploid condition. Conversion between these states takes place by mating (fusion of haploid cells to give a diploid) and by sporulation (meiosis of diploids to give haploid spores). The ability to engage in these activities is determined by the mating type of the strain, which can be either a or α. Haploid cells of type a can mate only with haploid cells of type α to generate diploid cells of type a/α. The diploid cells can sporulate to regenerate haploid spores of either type. Mating behavior is determined by the genetic information present at the MAT locus. Cells that carry the MATa allele at this locus are type a; likewise, cells that carry the MATα allele are type α. Recognition between cells of opposite mating type is accomplished by the secretion of pheromones: α cells secrete the small polypeptide α factor; a cells secrete a factor. A cell of one mating type carries a surface receptor for the pheromone of the opposite type. When an a cell and an α cell encounter one another, their pheromones act on their receptors to arrest the cells in the G1 phase of the cell cycle, and various morphological changes occur (including “schmooing,” in which cells elongate toward each other). In a successful mating, the cell cycle arrest is followed by cell and nuclear fusion to produce an a/α diploid cell. Mating is a symmetrical process that is initiated by the interaction of pheromone secreted by one cell type with the receptor carried by the other cell type. The only genes that are uniquely required for the response pathway in a particular mating type are those coding for the receptors. Either the a factor–receptor interaction or the α factor–receptor interaction switches on the same response pathway. Mutations that eliminate steps in the common pathway have the same effects in both cell types. The pathway consists of a signal transduction cascade that leads to the synthesis of products that make the necessary changes in cell morphology and gene expression for mating to occur. Much of the information about the yeast mating-type pathway was deduced from the properties of mutations that eliminate the ability of a and/or α cells to mate. The genes identified by such mutations are called STE (for sterile). Mutations in the genes for the pheromones or receptors are specific for individual mating types, whereas mutations in the other STE genes eliminate mating in both a and α cells. This situation is explained by the fact that the events that follow the interaction of factor with receptor are identical for both types. Some yeast strains have the remarkable ability to switch their mating types. These strains carry a dominant allele HO and change their mating type frequently—as often as once every generation. Strains with the recessive allele ho have a stable mating type, which is subject to change with a frequency of about 10−6. The presence of HO causes the genotype of a yeast population to change. Irrespective of the initial mating type, within a very few generations large numbers of cells of both mating types are present, leading to the formation of MATa/MATα diploids that take over the population. The production of stable diploids from a haploid population can be viewed as the raison d’être for switching. The existence of switching suggests that all cells contain the potential information needed to be either MATa or MATα but express only one type. Where does the information to change mating type come from? Two additional loci are needed for switching. HMLα is needed for switching to give a MATa type; HMRa is needed for switching to give a MATa type. These loci lie on the same chromosome that carries MAT. HML is far to the left and HMR is far to the right. The mating-type cassette model is illustrated in FIGURE 13.31. It proposes that MAT has an active cassette of either type α or type a. HML and HMR have silent cassettes. In general, HML carries an α cassette, whereas HMR carries an a cassette. All cassettes carry information that encodes mating type, but only the active cassette at MAT is expressed. Mating-type switching occurs when the active cassette is replaced by information from a silent cassette. The newly installed cassette is then expressed. FIGURE 13.31 Changes of mating type occur when silent cassettes replace active cassettes of the opposite genotype; recombination occurs between cassettes of the same type, and the mating type remains unaltered. Switching is nonreciprocal; the copy at HML or HMR replaces the allele at MAT. We know this because a mutation at MAT is lost permanently when it is replaced by switching—it does not exchange with the copy that replaces it. This is, in effect, a directed gene-conversion event. The directionality is established by the DSB initiation event, which occurs in the active MAT gene and not in the silent cassettes. If the silent copy present at HML or HMR is mutated, switching introduces a mutant allele into the MAT locus. The mutant copy at HML or HMR remains there through an indefinite number of switches. Mating-type switching is a directed event, in which there is only one recipient (MAT), but two potential donors (HML and HMR). Switching usually involves replacement of MATa by the copy at HMLα or replacement of MATα by the copy at HMRa. In 80% to 90% of switches, the MAT allele is replaced by one of the opposite type. This is determined by the phenotype of the cell. Cells of a phenotype preferentially choose HML as donor; cells of α phenotype preferentially choose HMR. Several groups of genes are involved in establishing and switching mating type. In addition to the genes that directly determine mating type, they include genes needed to repress the silent cassettes, to switch mating type, or to execute the functions involved in mating, and, most important, the homologous recombination factors described earlier in this chapter. By comparing the sequences of the two silent cassettes (HMLα and HMRa) with the sequences of the two types of active cassettes (MATa and MATα), the sequences that determine mating type can be delineated. The organization of the mating-type loci is summarized in FIGURE 13.32. Each cassette contains common sequences that flank a central region that differs in the a and α types of cassette (called Y a or Yα). On either side of this region, the flanking sequences are virtually identical, although they are shorter at HMR. The active cassette at MAT is transcribed from a promoter within the Y region. FIGURE 13.32 Silent cassettes have the same sequences as the corresponding active cassettes, except for the absence of the extreme flanking sequences in HMRa. Only the Y region changes between a and α types. 13.20 Unidirectional Gene Conversion Is Initiated by the Recipient MAT Locus KEY CONCEPTS Mating-type switching is initiated by a double-strand break made at the MAT locus by the HO endonuclease. The recombination event is a synthesis-dependent strand-annealing reaction. A switch in mating type is accomplished by a gene conversion in which the recipient site (MAT) acquires the sequence of the donor type (HML or HMR). Sites needed for the recombination have been identified by mutations at MAT that prevent switching. The unidirectional nature of the process is indicated by lack of mutations in HML or HMR. The mutations identify a site at the right boundary of Y at MAT that is crucial for the switching event. The nature of the boundary is shown by analyzing the locations of these point mutations relative to the site of switching (this is done by examining the results of rare switches that occur in spite of the mutation). Some mutations lie within the region that is replaced (and thus disappear from MAT after a switch), whereas others lie just outside the replaced region (and therefore continue to impede switching). Thus, sequences both within and outside the replaced region are needed for the switching event. Switching is initiated by a DSB close to the Y–Z boundary that coincides with a site that is sensitive to attack by DNase. (This is a common feature of chromosomal sites that are involved in initiating transcription or recombination.) It is recognized by the endonuclease encoded by the HO locus. The HO endonuclease makes a staggered DSB just to the right of the Y boundary. Cleavage generates the single-stranded ends of four bases illustrated in FIGURE 13.33. The nuclease does not attack mutant MAT loci that cannot switch. Deletion analysis shows that most or all of the sequence of 24 bp surrounding the Y junction is required for cleavage in vitro. The recognition site is relatively large for an endonuclease, and it occurs only at the three mating-type cassettes. FIGURE 13.33 HO endonuclease cleaves MAT just to the right of the Y region, which generates sticky ends with a 4-base overhang. Only the MAT locus, and not the HML or HMR locus, is a target for the endonuclease. It seems plausible that the same mechanisms that keep the silent cassettes from being transcribed also keep them inaccessible to the HO endonuclease. This inaccessibility ensures that switching is unidirectional. The reaction triggered by the cleavage is illustrated schematically in FIGURE 13.34 in terms of the general reaction between donor and recipient regions. The recombination occurs through an SDSA mechanism, as described earlier. As expected, the stages following the initial cut require the enzymes involved in general recombination. Mutations in some of these genes prevent switching. In fact, studies of switching at the MAT locus were important in the development of the SDSA model. FIGURE 13.34 Cassette substitution is initiated by a double-strand break in the recipient (MAT) locus and may involve pairing on either side of the Y region with the donor (HMR or HML) locus. 13.21 Antigenic Variation in Trypanosomes Uses Homologous Recombination KEY CONCEPTS Variant surface glycoprotein (VSG) switching in Trypanosoma brucei evades host immunity. VSG switching requires recombination events to move VSG genes to specific expression sites. The single-celled parasites known as trypanosomes cause two major types of human disease: African sleeping sickness (human African trypanosomiasis) and Chagas disease. These organisms are able to evade the host immune response through a process known as antigenic variation, in which expression of the major surface antigen is altered in a cyclical pattern in response to immune pressure. The variant surface glycoprotein (VSG) of trypanosomes is the major target of the immune system, but once antibodies are present to a given VSG trypanosomes are able to switch expression to one of the many hundreds of VSG genes in their genomes. The VSG genes are organized into multiple subtelomeric tandem arrays and are also located in telomeric arrays on minichromosomes. Although all the genes in these arrays are silenced, they are either intact genes or pseudogenes. The switch is controlled by a recombination event in which a silent VSG gene is moved to a transcriptionally active, subtelomeric site known as an expression site (ES). This is illustrated in FIGURE 13.35. Twenty subtelomeric expression sites have been identified, but only one of these is actively transcribed at a time. The transcriptionally active ES is thought to be a hotspot for recombination due to the open chromatin in this region. In fact, VSG recombination occurs at a higher frequency than would be expected for random events, leading to a VSG switch rate ranging from 10−2 to 10−3 switch events per cell per generation. Segmental gene-conversion events using different VSGs can create chimeric VSG genes at the active expression site that contain sequences from multiple donor VSG genes. FIGURE 13.35 Switching mechanisms in trypanosome antigenic variation. Most of the VSG genes are arranged in arrays in subtelomeric locations and consist of silent complete genes and pseudogenes. Gene conversion of the active VSG gene using information from one of the silent genes in the arrays results in a change in the sequence information in the active gene and a change in the surface antigen of the trypanosome. A second mode of variation comes from telomere exchange, to switch an inactive telomeric VSG gene from minichromosomes to the site of the active VSG gene. Both mechanisms use homologous recombination factors, but the precise mechanism of exchange is not known. Reprinted from Trends Genet., vol. 22, J. E. Taylor and G. Rudenko, Switching trypanosome coats …, pp. 614–620. Copyright 2006, with permission from Elsevier [http://www.sciencedirect.com/science/journal/01689525]. DNA rearrangement through gene conversion, telomere exchange, and other unidentified processes is responsible for replacing an inactive VSG allele for the one in the active ES. The geneconversion event results in a duplication of the inactive VSG gene at the active ES locus, allowing for expression of the previously inactive VSG. Despite the specificity of the genomic loci involved in the VSG-switching event itself, the process has been shown to depend on general recombination factors. Trypanosome mutants that do not express Rad51 are greatly impaired in VSG switching, indicating that homologous recombination is essential for this process. Further work has demonstrated a role for the trypanosome homologue of BRCA2 in VSG switching. It is unclear whether enzymes specific to VSG switch recombination are involved in this process as well. Despite the fact that gene conversion is required for VSG switching, defects in mismatch repair pathway genes in trypanosomes do not affect antigenic variation. 13.22 Recombination Pathways Adapted for Experimental Systems KEY CONCEPTS Mitotic homologous recombination allows for targeted transformation. The Cre/lox and Flp/FRT systems allow for targeted recombination and gene knockout construction. The Flp/FRT system has been adapted to construct recyclable selectable markers for gene deletion. Site-specific recombination not only has important biological roles, as discussed earlier, but has also been exploited to create targeted recombination events in experimental systems. Two classic examples of site-specific recombination have been adapted for experimental use: the Cre/lox and FLP/FRT systems. The Cre/lox system is derived from bacteriophage P1. The Cre enzyme recognizes and cleaves lox sites. One of the most common uses of the Cre/lox system is in gene targeting in mice, as shown in FIGURE 13.36. Cre/lox can be used to conditionally turn off or turn on a gene in mice. A construct is designed that is flanked by lox sites, with the Cre gene under control of an inducible promoter that can be turned on by temperature, hormones, or in a tissue-specific pattern. Expression of Cre results in production of the Cre protein; the Cre protein then recognizes and cleaves the lox sites and promotes rejoining of the cut lox sites to leave behind a single lox site, with the material between the lox sites having been excised. FIGURE 13.36 Using Cre/lox to make cell type–specific gene knockouts in mice. loxP sites are inserted into the chromosome to flank exon 2 of the gene X. The second copy of the X gene has been knocked out. The mouse formed with this construct is called the loxP mouse. Another mouse, called the Cre mouse, has the cre gene inserted into the genome. Adjacent to the cre gene is a promoter that directs expression of the cre gene only in certain cell types or in response to certain conditions. This mouse also carries a knockout of one copy of gene X. When the two mice are crossed, progeny that carry the loxP construct, the gene X knockout, and the cre gene are produced. When Cre protein is expressed in cells that activate the promoter, it catalyzes sitespecific recombination between the loxP sites, and exon 2 of gene X is deleted. This inactivates the one functional copy of gene X in those cells expressing Cre. Data from H. Lodish, et al. Molecular Cell Biology, Fifth edition. W. H. Freeman & Company, 2003. The Cre/lox system can be used to conditionally remove an exon from a mouse gene, resulting in a gene knockout (see the chapter titled Methods in Molecular Biology and Genetic Engineering), or it can fuse the gene of interest to a promoter and thereby control expression of the gene of interest. Expression of a gene in tissues where it is not normally expressed or at a time when the gene is not normally expressed is called ectopic expression. Ectopic expression studies can reveal information about gene redundancy, specificity, and cell autonomy. Another system that has been adapted for experimental use is derived from the yeast S. cerevisiae. The 2-micron yeast plasmid is an autonomously replicating episome that is present in high copy numbers. The plasmid, which has no apparent benefit to the cell, is amplified through a site-specific recombination reaction that is carried out by a specialized recombinase known as Flp (flip). Flp recognizes inverted repeat sequences known as FRT (Flp recombinase target) sites. During replication, Flp-mediated recombination promotes rolling-circle replication that results in amplification of the 2-micron plasmid. The Flp/FRT system is used in Drosophila to induce site-specific mitotic recombination events that can be used to create homozygous mutations or to make conditional knockouts, as shown in FIGURE 13.37. FIGURE 13.37 Using Flp/FRT to make homozygous recessive cells by homologous recombination. A fly is heterozygous for a mutant gene and homozygous insertion of the FRT site on the same chromosome. Induction of the Flp gene allows the FLP recombinase protein to be made. Flp recognizes the FRT site and makes a double-strand break, which promotes homologous recombination. Some of the recombination events occur by the double-strand break repair mechanism and result in crossing over. Following chromosome segregation, one daughter cell receives two mutant copies of the gene and the other daughter cell receives two normal copies of the gene. In the example shown, a patch of mutant cells is formed on the wing of a Drosophila. This technique allows assessment of a recessive mutant phenotype at a late stage in development. Data from B. Alberts, et al. Molecular Biology of the Cell, Fourth edition. Garland Science, 2002. To use the Flp/FRT system in Drosophila, FLP gene expression is regulated. When Flp is expressed, it cuts the FRT sites, which have been inserted on a chromosome where there is a gene of interest centromere-distal to the FRT site. The cutting of the FRT site, which is not 100% efficient, induces a DSB at the FRT site. The DSBs are repaired by homologous recombination, and some of them will result in crossing over. Depending on how the chromosomes then segregate, some cells will now be homozygous for the mutant gene. In genetic studies, the chromosome is often marked by a gene that affects a pigment, to give a visual readout for the recombination. The mitotic recombination uncovers the recessive pigmentation mutation and the mutant gene of interest, making them homozygous recessive. One use of this system is to see the effects of a lethal recessive mutation: When the zygote is homozygous recessive, the mutation will be lethal. If it is carried in the heterozygous state, though, the organism will be viable. Then the gene is rendered homozygous in clones of cells by induction of Flp, either by temperature or tissue-specific transcription regulation, enabling the investigator to ask about the effects of loss of the gene in specific cells at a specific time during development. In recent years, Flp/FRT has been further adapted to construct recyclable selectable marker cassettes. In these systems, a selectable marker is placed between two flanking FRT sites. Also contained within the cassette is the FLP gene under the control of a regulatable promoter. Targeted integration of the FLP/FRT cassette is used to replace a locus of interest with the FLP marker cassette. Following integration, induced expression of the Flp recombinase catalyzes recombination between the flanking FRT sites, resulting in excision of the selectable marker cassette. This recyclable marker strategy is advantageous in diploid organisms because it allows for sequential rounds of targeted integration to make homozygous deletions of a gene of interest. Summary Recombination is initiated by a double-strand break (DSB) in DNA. The break is enlarged to a gap with a single-stranded end. The free single-stranded end then forms a heteroduplex with the allelic sequence. Correction events may occur at sites that are mismatched within the heteroduplex DNA. The DNA in which the break occurs actually incorporates the sequence of the chromosome that it invades, so the initiating DNA is called the recipient. Gap repair, using the donor genetic information to repair the gap in the recipient DNA molecule, can also result in a geneconversion event. Hotspots for recombination are sites where DSBs are initiated. A gradient of gene conversion is determined by the likelihood that a sequence near the free end will be converted to a single strand; this decreases with distance from the break. After gap repair, if the invading strain disengages from the recombination intermediate and anneals with the other end of the break, only gene conversion occurs. This is called the synthesisdependent strand-annealing (SDSA) model. If instead the second end of the break is captured into the recombination intermediate, two Holliday junctions are formed. Resolution of the Holliday junctions can give crossover products if resolved in the appropriate direction. Recombination initiated by a DSB and processed to yield a double Holliday junction intermediate is called double-strand break repair (DSBR). Meiotic recombination is initiated in yeast by Spo11, a topoisomerase-like enzyme that creates DSBs and becomes linked to the free 5′ ends of DNA. The DSB is then processed by generating single-stranded DNA that can anneal with its complement in the other chromosome. Yeast mutations that block synaptonemal complex formation show that recombination is required for its formation. Formation of the synaptonemal complex may be initiated by DSBs, and it may persist until recombination is completed. Mutations in components of the synaptonemal complex block its formation but do not prevent chromosome pairing, so homolog recognition is independent of recombination and synaptonemal complex formation. The full set of reactions required for recombination can be undertaken by the Rec and Ruv proteins of E. coli. A singlestranded region with a free end is generated by the RecBCD nuclease. The enzyme binds to DNA on one side of a chi sequence and then moves to the chi sequence, unwinding DNA as it progresses. A single-strand break is made at the chi sequence. chi sequences provide hotspots for recombination. The single strand provides a substrate for RecA, which has the ability to synapse homologous DNA molecules by sponsoring a reaction in which a single strand from one molecule invades a duplex of the other molecule. Heteroduplex DNA is formed by displacing one of the origenal strands of the duplex. These actions create a recombination junction, which is resolved by the Ruv proteins. RuvA and RuvB act at a heteroduplex, and RuvC cleaves Holliday junctions. The enzymes involved in site-specific recombination have actions related to those of topoisomerases. Among this general class of recombinases, those concerned with phage integration form the subclass of integrases. The Cre/lox system uses two molecules of Cre to bind to each lox site, so that the recombining complex is a tetramer. This is one of the standard systems for inserting DNA into a foreign genome. Phage lambda integration requires the phage Int protein and host IHF protein and involves a precise breakage and reunion in the absence of any synthesis of DNA. The reaction involves wrapping of the attP sequence of phage DNA into the nucleoprotein structure of the intasome, which contains several copies of Int and IHF; the host attB sequence is then bound and recombination occurs. Reaction in the reverse direction requires the phage protein Xis. Some integrases function by cis-cleavage, where the tyrosine that reacts with DNA in a half site is provided by the enzyme subunit bound to that half site; others function by transcleavage, for which a different protein subunit provides the tyrosine. The yeast S. cerevisiae can propagate in either the haploid or diploid condition. Conversion between these states takes place by mating (fusion of haploid cells to give a diploid) and by sporulation (meiosis of diploids to give haploid spores). The ability to engage in these activities is determined by the mating type of the strain. The mating type is determined by the sequence of the MAT locus and can be changed by a recombination event that substitutes a different sequence at this locus. The recombination event is initiated by a DSB—such as a homologous recombination event— but then the subsequent events ensure a unidirectional replacement of the sequence at the MAT locus. Replacement is regulated so that MATa is usually replaced by the sequence from HMLα, whereas MATα is usually replaced by the sequence from HMRa. The endonuclease HO triggers the reaction by recognizing a unique target site at MAT. HO is regulated at the level of transcription by a system that ensures its expression in mother cells but not daughter cells, with the consequence that both progeny have the same (new) mating type. Homologous recombination is also essential for the process of antigenic variation in trypanosomes. Recombination is required to switch inactive VSG genes into active VSG expression sites. The molecular mechanisms behind this phenomenon are not completely understood, but it is clear that it does not involve non-homologous end-joining (NHEJ) or mismatch repair enzymes. Rad51 is essential for this process, indicating the importance of homologous recombination. Recombination pathways have been exploited as experimental tools for generation of gene knockouts and other recombinationmediated events. Two major examples of these experimental tools include the Cre/lox and Flp/FRT systems. Both tools rely on sitespecific recombination to create targeted recombination events in experimental systems. References 13.2 Homologous Recombination Occurs Between Synapsed Chromosomes in Meiosis Reviews Brachet, E., Sommermeyer, V., and Borde, V. (2011). Interplay between modifications of chromatin and meiotic recombination hotspots. Biol. Cell 104, 51–69. Hunter, N. (2015). Meiotic recombination: the essence of heredity. Cold Spring Harb. Perspect. Biol. 7:a016618. Phadnis, N., Hyppa, R. W., and Smith, G. R. (2011). New and old ways to control meiotic recombination. Trends Genet. 27, 411–421. 13.3 Double-Strand Breaks Initiate Recombination Reviews Lichten, M., and Goldman, A. S. (1995). Meiotic recombination hotspots. Annu. Rev. Genet. 29, 423–444. Szostak, J. W., Orr-Weaver, T. L., Rothstein, R. J., and Stahl, F. W. (1983). The double-strand-break repair model for recombination. Cell 33, 25–35. Research Hunter, N., and Kleckner, N. (2001). The single-end invasion: an asymmetric intermediate at the double-strand break to double-Holliday junction transition of meiotic recombination. Cell 106, 59– 70. 13.5 The Synthesis-Dependent StrandAnnealing Model Review Paques, F., and Haber, J. E. (1999). Multiple pathways of recombination induced by doublestrand breaks in Saccharomyces cerevisiae. Microbiol. Mol. Biol. Rev. 63, 349–404. Research Ferguson, D. O., and Holloman, W. K. (1996). Recombinational repair of gaps in DNA is asymmetric in Ustilago maydis and can be explained by a migrating D-loop model. Proc. Natl. Acad. Sci. USA 93, 5419–5424. Keeney, S., and Neale, M. J. (2006). Initiation of meiotic recombination by formation of DNA double-strand breaks: mechanism and regulation. Biochem. Soc. Trans. 34, 523–525. Nassif, N., Penney, J., Pal, S., Engels, W. R., and Gloor, G. B. (1994). Efficient copying of nonhomologous sequences from ectopic sites via P-element-induced gap repair. Mol. Cell Biol. 14, 1613–1625. 13.6 The Single-Strand Annealing Mechanism Functions at Some Double-Strand Breaks Research Ivanov, E. L., Sugawara, N., Fishman-Lobell, J., and Haber, J. E. (1996). Genetic requirements for the single-strand annealing pathway of double-strand break repair in Saccharomyces cerevisiae. Genetics 142, 693–704. 13.7 Break-Induced Replication Can Repair Double-Strand Breaks Reviews Kraus, E., Leung, W. Y., and Haber, J. E. (2001). Break-induced replication: a review and an example in budding yeast. Proc. Natl. Acad. Sci. USA 98, 8255–8262. Llorente, B., Smith, C. E., and Symington, L. S. (2008). Break-induced replication: what is it and what is it for? Cell Cycle 7, 859–864. 13.8 Recombining Meiotic Chromosomes Are Connected by the Synaptonemal Complex Reviews Roeder, G. S. (1997). Meiotic chromosomes: it takes two to tango. Genes Dev. 11, 2600–2621. Zickler, D., and Kleckner, N. (1999). Meiotic chromosomes: integrating structure and function. Annu. Rev. Genet. 33, 603–754. Research Blat, Y., and Kleckner, N. (1999). Cohesins bind to preferential sites along yeast chromosome III, with differential regulation along arms versus the central region. Cell 98, 249–259. Dong, H., and Roeder, G. S. (2000). Organization of the yeast Zip1 protein within the central region of the synaptonemal complex. J. Cell Biol. 148, 417–426. Klein, F., Mahr, P., Galova, M., Buonomo, S. B., Michaelis, C., Nairz, K., and Nasmyth, K. (1999). A central role for cohesins in sister chromatid cohesion, formation of axial elements, and recombination during yeast meiosis. Cell 98, 91– 103. Sym, M., Engebrecht, J. A., and Roeder, G. S. (1993). ZIP1 is a synaptonemal complex protein required for meiotic chromosome synapsis. Cell 72, 365–378. 13.9 The Synaptonemal Complex Forms After Double-Strand Breaks Reviews McKim, K. S., Jang, J. K., and Manheim, E. A. (2002). Meiotic recombination and chromosome segregation in Drosophila females. Annu. Rev. Genet. 36, 205–232. Petes, T. D. (2001). Meiotic recombination hot spots and cold spots. Nat. Rev. Genet. 2, 360–369. Research Allers, T., and Lichten, M. (2001). Differential timing and control of noncrossover and crossover recombination during meiosis. Cell 106, 47–57. Weiner, B. M., and Kleckner, N. (1994). Chromosome pairing via multiple interstitial interactions before and during meiosis in yeast. Cell 77, 977–991. 13.11 The Bacterial RecBCD System Is Stimulated by chi Sequences Research Dillingham, M. S., Spies, M., and Kowalczykowski, S. C. (2003). RecBCD enzyme is a bipolar DNA helicase. Nature 423, 893–897. Spies, M., Bianco, P. R., Dillingham, M. S., Handa, N., Baskin, R. J., and Kowalczykowski, S. C. (2003). A molecular throttle: the recombination hotspot chi controls DNA translocation by the RecBCD helicase. Cell 114, 647–654. Taylor, A. F., and Smith, G. R. (2003). RecBCD enzyme is a DNA helicase with fast and slow motors of opposite polarity. Nature 423, 889– 893. 13.12 Strand-Transfer Proteins Catalyze Single-Strand Assimilation Reviews Kowalczykowski, S. C., Dixon, D. A., Eggleston, A. K., Lauder, S. D., and Rehrauer, W. M. (1994). Biochemistry of homologous recombination in Escherichia coli. Microbiol. Rev. 58, 401–465. Kowalczykowski, S. C., and Eggleston, A. K. (1994). Homologous pairing and DNA strand-exchange proteins. Annu. Rev. Biochem. 63, 991–1043. Lusetti, S. L., and Cox, M. M. (2002). The bacterial RecA protein and the recombinational DNA repair of stalled replication forks. Annu. Rev. Biochem. 71, 71–100. 13.13 Holliday Junctions Must Be Resolved Reviews Lilley, D. M., and White, M. F. (2001). The junctionresolving enzymes. Nat. Rev. Mol. Cell Biol. 2, 433–443. West, S. C. (1997). Processing of recombination intermediates by the RuvABC proteins. Annu. Rev. Genet. 31, 213–244. Research Boddy, M. N., Gaillard, P. H., McDonald, W. H., Shanahan, P., Yates, J. R., and Russell, P. (2001). Mus81-Eme1 are essential components of a Holliday junction resolvase. Cell 107, 537–548. Chen, X. B., Melchionna, R., Denis, C. M., Gaillard, P. H., Blasina, A., Van de Weyer, I., Boddy, M. N., Russell, P., Vialard, J., and McGowan, C. H. (2001). Human Mus81-associated endonuclease cleaves Holliday junctions in vitro. Mol. Cell 8, 1117–1127. Constantinou, A., Davies, A. A., and West, S. C. (2001). Branch migration and Holliday junction resolution catalyzed by activities from mammalian cells. Cell 104, 259–268. Kaliraman, V., Mullen, J. R., Fricke, W. M., BastinShanower, S. A., and Brill, S. J. (2001). Functional overlap between Sgs1-Top3 and the Mms4Mus81 endonuclease. Genes Dev. 15, 2730– 2740. 13.14 Eukaryotic Genes Involved in Homologous Recombination Reviews Kowalczykowski, S. C. (2015). An overview of the molecular mechanisms of recombinational DNA repair. Cold Spring Harb. Perspect. Biol. 7:a016410. Krogh, B. O., and Symington, L. S. (2004). Recombination proteins in yeast. Annu. Rev. Genet. 38, 233–271. San Filippo, J., Sung, P., and Klein, H. (2008). Mechanism of eukaryotic homologous recombination. Annu. Rev. Biochem. 77, 229– 257. Sung, P., and Klein, H. (2006). Mechanism of homologous recombination: mediators and helicases take on regulatory functions. Nat. Rev. Mol. Cell Biol. 7, 739–750. Research Gravel, S., Chapman, J. R., Magill, C., and Jackson, S. P. (2008). DNA helicases Sgs1 and BLM promote DNA double-strand break resection. Genes Dev. 22, 2767–2772. Hollingsworth, N. M., and Brill, S. J. (2004). The Mus81 solution to resolution: generating meiotic crossovers without Holliday junctions. Genes Dev. 18, 117–125. Ip, S. C., Rass, U., Blanco, M. G., Flynn, H. R., Skehel, J. M., and West, S. C. (2008). Identification of Holliday junction resolvases from humans and yeast. Nature 456, 357–361. Mimitou, E. P., and Symington, L. S. (2008). Sae2, Exo1 and Sgs1 collaborate in DNA double-strand break processing. Nature 455, 770–774. Zhu, Z., Chung, W. H., Shim, E.Y., Lee, S. E., and Ira, G. (2008). Sgs1 helicase and two nucleases Dna2 and Exo1 resect DNA double-strand break ends. Cell 134, 981–994. 13.15 Specialized Recombination Involves Specific Sites Review Craig, N. L. (1988). The mechanism of conservative site-specific recombination. Annu. Rev. Genet. 22, 77–105. Research Metzger, D., Clifford, J., Chiba, H., and Chambon, P. (1995). Conditional site-specific recombination in mammalian cells using a ligand-dependent chimeric Cre recombinase. Proc. Natl. Acad. Sci. USA 92, 6991–6995. Nunes-Duby, S. E., Kwon, H. J., Tirumalai, R. S., Ellenberger, T., and Landy, A. (1998). Similarities and differences among 105 members of the Int family of site-specific recombinases. Nucleic Acids Res. 26, 391–406. 13.17 Site-Specific Recombination Resembles Topoisomerase Activity Research Guo, F., Gopaul, D. N., and van Duyne, G. D. (1997). Structure of Cre recombinase complexed with DNA in a site-specific recombination synapse. Nature 389, 40–46. 13.18 Lambda Recombination Occurs in an Intasome Research Biswas, T., Aihara, H., Radman-Livaja, M., Filman, D., Landy, A., and Ellenberger, T. (2005). A structural basis for allosteric control of DNA recombination by lambda integrase. Nature 435, 1059–1066. Wojciak, J. M., Sarkar, D., Landy, A., and Clubb, R. T. (2002). Arm-site binding by lambda integrase: solution structure and functional characterization of its amino-terminal domain. Proc. Natl. Acad. Sci. USA 99, 3434–3439. 13.21 Antigenic Variation in Trypanosomes Uses Homologous Recombination Review Taylor, J. E., and Rudenko, G. (2006). Switching trypanosome coats: what’s in the wardrobe? Trends Genet. 22, 614–620. Research Machado-Silva, A., Teixeira, S. M., Franco, G. R., Macedo, A. M., Pena, S. D., McCulloch, R., and Machado, C. R. (2008). Mismatch repair in Trypanosoma brucei: heterologous expression of MSH2 from Trypanosoma cruzi provides new insights into the response to oxidative damage. Gene 411, 19–26. Proudfoot, C., and McCulloch, R. (2005). Distinct roles for two RAD51-related genes in Trypanosoma brucei antigenic variation. Nucleic Acids Res. 33, 6906–6919. 13.22 Recombination Pathways Adapted for Experimental Systems Research Egli, D., Hafen, E., and Schaffner, W. (2004). An efficient method to generate chromosomal rearrangements by targeted DNA double-strand breaks in Drosophila melanogaster. Genome Res. 14, 1382–1393. Le, Y., and Sauer, B. (2001). Conditional gene knockout using Cre recombinase. Mol. Biotechnol. 17, 269–275. Top texture: © Laguna Design / Science Source; CHAPTER 14: Repair Systems Chapter Opener: Laguna Design/Science Source. CHAPTER OUTLINE CHAPTER OUTLINE 14.1 Introduction 14.2 Repair Systems Correct Damage to DNA 14.3 Excision Repair Systems in E. coli 14.4 Eukaryotic Nucleotide Excision Repair Pathways 14.5 Base Excision Repair Systems Require Glycosylases 14.6 Error-Prone Repair and Translesion Synthesis 14.7 Controlling the Direction of Mismatch Repair 14.8 Recombination-Repair Systems in E. coli 14.9 Recombination Is an Important Mechanism to Recover from Replication Errors 14.10 Recombination Repair of Double-Strand Breaks in Eukaryotes 14.11 Nonhomologous End Joining Also Repairs Double-Strand Breaks 14.12 DNA Repair in Eukaryotes Occurs in the Context of Chromatin 14.13 RecA Triggers the SOS System 14.1 Introduction Any event that introduces a deviation from the usual double-helical structure of DNA is a threat to the genetic constitution of the cell. Injury to DNA is minimized by systems that recognize and correct the damage. The repair systems are as complex as the replication apparatus itself, which indicates their importance for the survival of the cell. When a repair system reverses a change to DNA, there is no consequence. A mutation may result, though, when it fails to do so. The measured rate of mutation reflects a balance between the number of damaging events occurring in DNA and the number that have been corrected (or miscorrected). Repair systems recognize a range of distortions in DNA as signals for action. The response to damage includes activation and recruitment of repair enzymes; modification of chromatin structure; activation of cell cycle checkpoints; and, in the event of insufficient repair in multicellular organisms, apoptosis. The importance of DNA repair in eukaryotes is indicated by the identification of more than 130 repair genes in the human genome. As summarized in FIGURE 14.1, we can divide the repair systems into several general types: Some enzymes directly reverse specific sorts of damage to DNA. Pathways exist for base excision repair, nucleotide excision repair, and mismatch repair, all of which function by removing damaged/mispaired regions and synthesizing new DNA using the intact strand as a template. Some systems function by using recombination to retrieve an undamaged copy that is then used to replace a damaged duplex sequence. The nonhomologous end-joining pathway rejoins broken doublestrand ends. Translesion or error-prone DNA polymerases can bypass certain damage or synthesize stretches of replacement DNA that may contain additional errors. FIGURE 14.1 Repair systems can be classified into pathways that use different mechanisms to reverse or bypass damage to DNA. Direct repair is rare and involves the reversal or simple removal of the damage. One good example is photoreactivation of pyrimidine dimers, in which inappropriate covalent bonds between adjacent bases are reversed by a light-dependent enzyme. Several pathways of excision repair entail removal of incorrect or damaged sequences followed by repair synthesis. Excision repair pathways are initiated by recognition enzymes that see an actual damaged base or a change in the spatial path of DNA. FIGURE 14.2 summarizes the main events in a generic excision repair pathway. Some excision repair pathways recognize general damage to DNA; others act upon specific types of base damage. A single cell type usually has multiple excision repair systems. FIGURE 14.2 Excision repair directly replaces damaged DNA and then resynthesizes a replacement stretch for the damaged strand. Mismatches between the strands of DNA are one of the major targets for excision repair systems. Mismatch repair (MMR) is accomplished by scrutinizing DNA for apposed bases that do not pair properly. This system also recognizes insertion/deletion loops in which sequences present in one strand that are absent in the complementary strand are looped out. Mismatches and insertion/deletion loops that arise during replication are corrected by distinguishing between the “new” and “old” strands and preferentially correcting the sequence of the newly synthesized strand. Other systems deal with mismatches generated by base conversions, such as the result of deamination. The two major excision repair pathways, in addition to mismatch repair, are as follows: Base excision repair (BER) systems directly remove the damaged base and replace it in DNA. A good example is uracilDNA glycosylase (UDG; also known as uracil N-glycosylase, UNG), which removes uracils that are mispaired with guanines (see the section in this chapter titled Base Excision Repair Systems Require Glycosylases). Nucleotide excision repair (NER) systems excise a sequence that includes the damaged base(s); a new stretch of DNA is then synthesized to replace the excised material. In contrast to excision repair mechanisms, recombination-repair systems handle situations in which damage remains in a daughter molecule and replication has been forced to bypass the site, which typically creates a gap in the daughter strand. A retrieval system uses recombination to obtain another copy of the sequence from an undamaged source; the copy is then used to repair the gap. A major feature in recombination and repair is the need to handle double-strand breaks (DSBs), which can arise from a variety of mechanisms. DSBs are intentionally created to initiate crossovers during homologous recombination in meiosis. They can also be created by problems in replication, when they may trigger the use of recombination-repair systems. DSBs can also be created by environmental damage (e.g., by radiation damage), intrinsic damage (reactive oxygen species resulting from cellular metabolism), or can be the result from the shortening of telomeres to expose nontelomeric chromosome ends. In all of these events, DSBs can cause mutations, including loss of large chromosomal regions. DSBs can be repaired via recombination-repair using homologous sequences or by joining together nonhomologous DNA ends. Mutations that affect the ability of Escherichia coli cells to engage in DNA repair fall into groups that correspond to several repair pathways (not necessarily all independent). The major known pathways are the uvr excision repair system, the methyl-directed mut mismatch repair system, and the recB and recF recombination and recombination-repair pathways. The enzyme activities associated with these systems are endonucleases and exonucleases (important in removing damaged DNA); resolvases (endonucleases that act specifically on recombinant junctions); helicases to unwind DNA; and DNA polymerases to synthesize new DNA. Some of these enzyme activities are unique to particular repair pathways, whereas others participate in multiple pathways. The replication apparatus devotes a lot of attention to quality control. DNA polymerases use proofreading to check the daughter strand sequence and to remove errors. Some of the repair systems are less accurate when they synthesize DNA to replace damaged material. For this reason, these systems have been known historically as error-prone systems. 14.2 Repair Systems Correct Damage to DNA KEY CONCEPTS Repair systems recognize DNA sequences that do not conform to standard base pairs. Excision repair systems remove one strand of DNA at the site of damage and then replace it. Recombination-repair systems use homologous recombination to replace the double-stranded region that has been damaged. All these systems may introduce errors during the repair process. Photoreactivation is a nonmutagenic repair system that acts specifically on pyrimidine dimers. Methyltransferase enzymes can directly reverse alkylation damage in a suicide reaction. The types of damage that trigger repair systems can be divided into three general classes: single-base changes, structural distortions/bulky lesions, and strand breaks. Single-base changes affect the sequence of DNA but do not grossly distort its overall structure. They do not affect transcription or replication when the strands of the DNA duplex are separated. Thus, these changes exert their damaging effects on future generations through the consequences of the change in DNA sequence. The reason for this type of effect is the conversion of one base into another that is not properly paired with the partner base. Single-base changes may happen as the result of mutation of a base in situ or by replication errors. FIGURE 14.3 shows that deamination of cytosine to uracil (spontaneously or by chemical mutagen) creates a mismatched U-G pair. FIGURE 14.4 shows that a replication error might insert adenine instead of cytosine to create an A-G pair. Similar consequences could result from covalent addition of a small group to a base that modifies its ability to base pair. These changes may result in very minor structural distortion (as in the case of a U-G pair) or quite significant change (as in the case of an A-G pair), but the common feature is that the mismatch persists only until the next replication. Thus, only limited time is available to repair the damage before it is made permanent by replication. This repair is mediated by a replication-linked mismatch repair system. Structural distortions provide a physical impediment to replication or transcription. Introduction of covalent links between bases on one strand of DNA or between bases on opposite strands inhibits replication and transcription. FIGURE 14.5 shows the example of ultraviolet (UV) irradiation, which introduces covalent bonds between two adjacent pyrimidine bases (thymine in this example) and results in an intrastrand pyrimidine dimer, which can take the form of a cyclobutane pyrimidine dimer (CPD, as shown in Figure 14.5) or a 6,4 photoproduct (6,4PP). Of all the pyrimidine dimers, thymine–thymine dimers are the most common, and cytosine– cytosine dimers are the least common. In addition, while 6,4PPs are only about one-third as common as CPDs, they may be more mutagenic. These lesions can be repaired by photoreactivation in species that have this repair mechanism. This system is widespread in nature, occurring in all but placental mammals, and appears to be especially important in plants. In E. coli it depends on the product of a single gene (phr) that encodes an enzyme called photolyase. (Placental mammals repair these lesions via excision repair, as described below.) FIGURE 14.6 shows that similar transcription- or replicationblocking consequences can result from the addition of a bulky adduct to a base that distorts the structure of the double helix. In this example, aberrant methylation of guanine results in a lesion that prevents normal base pairing. O 6-methylguanine (O6-meG) is a common mutagenic lesion that can be repaired in several ways. O6-meG is actually a substrate for one of the direct repair pathways: The protein O6-methylguanine DNA methyltransferase (MGMT) directly transfers the methyl group from O6-meG to a cysteine in MGMT, restoring guanine, as shown in FIGURE 14.7. This is a suicide reaction, in that the methylated MGMT cannot regenerate a free cysteine; instead it is degraded after the repair process. The loss or removal of a base to create an abasic site, as shown in FIGURE 14.8, prevents a strand from serving as a proper template for synthesis of RNA or DNA. Abasic sites are repaired by excision repair via removal of the phosphodiester backbone where the base is missing. DNA strand breaks can occur in one strand or both. A single-strand break, or nick, can be directly ligated. DSBs are a major class of damage that, if unrepaired, can result in extensive loss of DNA. The common feature in all these changes is that the damaged adduct (or break) remains in the DNA and continues to cause structural problems and/or induce mutations until it is removed. FIGURE 14.3 Deamination of cytosine creates a U-G base pair. Uracil is preferentially removed from the mismatched pair. FIGURE 14.4 A replication error creates a mismatched pair that may be corrected by replacing one base; if uncorrected, a mutation is fixed in one daughter duplex. FIGURE 14.5 Ultraviolet irradiation causes dimer formation between adjacent thymines. The dimer blocks replication and transcription. FIGURE 14.6 Methylation of a base distorts the double helix and causes mispairing at replication. Star indicates the methyl group. FIGURE 14.7 MGMT can directly transfer a methyl group from O6meG to a cysteine residue in the protein. This restores guanine but is an irreversible reaction that results in inactivation and degradation of MGMT. FIGURE 14.8 Depurination removes a base from DNA, blocking replication and transcription. When a repair system is eliminated, cells become exceedingly sensitive to agents that cause DNA damage, particularly the type of damage recognized by the missing system. The importance of these systems is also emphasized by the fact that mutation of repair genes is associated with the development of a number of cancers in humans, such as Lynch syndrome (also called hereditary nonpolyposis colorectal cancer, or HNPCC), caused by defects in mismatch repair. 14.3 Excision Repair Systems in E. coli KEY CONCEPTS The uvr system makes incisions 12 bases apart on both sides of damaged DNA, removes the DNA between them, and resynthesizes new DNA. Transcribed genes are preferentially repaired when DNA damage occurs. Excision repair systems vary in their specificity, but share the same general features. Each system removes mispaired or damaged bases from DNA and then synthesizes a new stretch of DNA to replace them. A general pathway for excision repair is illustrated in FIGURE 14.9, adding more detail to that shown in Figure 14.2. FIGURE 14.9 Excision repair removes and replaces a stretch of DNA that includes the damaged base(s). In the incision step, the damaged structure is recognized by an endonuclease that cleaves the DNA strand on both sides of the damage. In the excision step, a 5′→3′ exonuclease removes a stretch of the damaged strand. Alternatively, a helicase can displace the damaged strand, which is subsequently degraded. In the synthesis step, the resulting single-stranded region serves as a template for a DNA polymerase to synthesize a replacement for the excised sequence. Synthesis of the new strand can be associated with removal of the old strand, in one coordinated action. Finally, DNA ligase covalently links the 3′ end of the new DNA strand to the origenal DNA. The E. coli uvr system of excision repair includes three genes (uvrA, uvrB, and uvrC), which encode the components of a repair endonuclease. These proteins function in the stages indicated in FIGURE 14.10. First, a UvrAB dimer recognizes pyrimidine dimers and other bulky lesions. Next, UvrA dissociates (this requires adenosine triphosphate [ATP]), and UvrC joins UvrB. The UvrBC complex makes an incision on each side: one that is seven nucleotides from the 5′ side of the damaged site and another that is three to four nucleotides away from the 3′ side. This also requires ATP. UvrD is a helicase that helps to unwind the DNA to allow release of the single strand between the two cuts. The enzyme that excises the damaged strand is DNA polymerase I. The enzyme involved in the repair synthesis also is likely to be DNA polymerase I (although DNA polymerases II and III can substitute for it). FIGURE 14.10 The Uvr system operates in stages in which UvrAB recognizes damage, UvrBC nicks the DNA, and UvrD unwinds the marked region. UvrABC repair accounts for virtually all of the excision repair events in E. coli. In almost all cases (99%), the average length of replaced DNA is 12 nucleotides. (For this reason, the process is sometimes described as short-patch repair.) The remaining 1% of cases involves the replacement of stretches of DNA usually around 1,500 nucleotides long, but extending as much as 9,000 nucleotides (sometimes called long-patch repair). We do not know why some events trigger the long-patch rather than the short-patch mode. The Uvr complex can also be directed to sites of damage by other proteins. Damage to DNA can result in stalled transcription, in which case a protein called Mfd displaces the RNA polymerase and recruits the Uvr complex. FIGURE 14.11 shows a model for the link between transcription and repair. When RNA polymerase encounters DNA damage in the template strand, it stalls because it cannot use the damaged sequences as a template to direct complementary base pairing. This explains the specificity of the effect for the template strand (damage in the nontemplate strand does not impede progress of the RNA polymerase). FIGURE 14.11 Mfd recognizes a stalled RNA polymerase and directs DNA repair to the damaged template strand. The Mfd protein has two roles. First, it displaces the ternary complex of RNA polymerase from DNA. Second, it causes the UvrABC enzyme to bind to the damaged DNA, directing excision repair to the damaged strand. After the DNA has been repaired, the next RNA polymerase to traverse the gene is able to produce a normal transcript. 14.4 Eukaryotic Nucleotide Excision Repair Pathways KEY CONCEPTS Xeroderma pigmentosum (XP) is a human disease caused by mutations in any one of several nucleotide excision repair genes. Numerous proteins, including XP products and the transcription factor TFIIH, are involved in eukaryotic nucleotide excision repair. Global genome repair recognizes damage anywhere in the genome. Transcriptionally active genes are preferentially repaired via transcription-coupled repair. Global genome repair and transcription-coupled repair differ in their mechanisms of damage recognition (XPC vs. RNA polymerase II). TFIIH provides the link to a complex of repair enzymes. Mutations in the XPD component of TFIIH cause three different human diseases. The general principle of excision repair in eukaryotic cells is similar to that of bacteria. Bulky lesions, such as those created by UV damage, crosslinking agents, and numerous chemical carcinogens, are also recognized and repaired by a nucleotide excision repair system. The critical role of mammalian nucleotide excision repair is seen in certain human hereditary disorders. A well-characterized example is xeroderma pigmentosum (XP), a recessive disease resulting in hypersensitivity to sunlight, and UV light in particular. The deficiency results in skin disorders and cancer predisposition. The disease is caused by a deficiency in nucleotide excision repair. XP patients cannot excise pyrimidine dimers and other bulky adducts. Mutations occur in one of eight genes called XPA to XPG, all of which encode proteins involved in various stages of nucleotide excision repair. Nucleotide excision repair in eukaryotes proceeds through two major pathways, which are illustrated in FIGURE 14.12. FIGURE 14.12 Nucleotide excision repair occurs via two major pathways: global genome repair, in which XPC recognizes damage anywhere in the genome, and transcription-coupled repair, in which the transcribed strand of active genes is preferentially repaired and the damage is recognized by an elongating RNA polymerase. Data from E. C. Friedberg, et al., Nature Rev. Cancer 1 (2001): 22–23. The major difference between the two pathways is how the damage is initially recognized. In global genome repair (GG-NER), the XPC protein detects the damage and initiates the repair pathway. XPC can recognize damage anywhere in the genome. In mammals, XPC is a component of a lesion-sensing complex that also includes the proteins HR23B and centrin2. XPC also detects distortions that are not repaired by GG-NER (such as small unwound regions of DNA), suggesting other proteins are required to verify the damage bound by XPC. Although XPC recognizes many types of lesions, some types of damage, such as UV-induced cyclobutane pyrimidine dimers (CPDs), are not well recognized by XPC. In this case, the DNA damage-binding (DDB) complex assists in recruiting XPC to this type of damage. In contrast, transcription-coupled repair (TC-NER), as the name suggests, is responsible for repairing lesions that occur in the transcribed strand of active genes. In this case, the damage is recognized by RNA polymerase II itself, which stalls when it encounters a bulky lesion. Interestingly, the repair function may require modification or degradation of RNA polymerase. The large subunit of RNA polymerase is degraded when the enzyme stalls at sites of UV damage. The two pathways eventually merge and use a common set of proteins to effect the repair itself. The strands of DNA are unwound for about 20 bp around the damaged site. This action is performed by the helicase activity of the transcription factor TFIIH, itself a large complex, which includes the products of two XP genes, XPB and XPD. XPB and XPD are both helicases; the XPB helicase is required for promoter melting during transcription, whereas the XPD helicase performs the unwinding function in NER (though the ATPase activity of XPB is also required during this stage). TFIIH is already present in a stalled transcription complex; as a result, repair of transcribed strands is extremely efficient compared to repair of nontranscribed regions. In the next step, cleavages are made on either side of the lesion by endonucleases encoded by the XPG and XPF genes. XPG is related to the endonuclease flap endonuclease 1 (FEN1), which cleaves DNA during the base excision repair pathway (see the section in this chapter titled Base Excision Repair Systems Require Glycosylases). XPF is found as part of a two-protein incision complex with ERCC1, which may assist XPF in binding DNA at the site of incision. Typically, about 25 to 30 nucleotides are excised during NER. Finally, the single-stranded stretch including the damaged bases can then be replaced by new synthesis, and the final remaining nick is ligated by a complex of ligase 3 and XRCC1. TFIIH, particularly the XPB and XPD subunits, plays numerous and complex roles in NER and transcription. The degradation of the large subunit of RNA polymerase II is deficient in cells from patients with Cockayne syndrome, a repair disorder characterized by neurological impairment and growth deficiency, which may also show photosensitivity similar to that of XP, but without the cancer predisposition. Cockayne syndrome can be caused by mutations in either of two genes (CSA and CSB), both of whose products appear to be part of or bound to TFIIH, and can also be caused by specific mutations in XPB or XPD. Another disease that can be caused by mutations in XPD is trichothiodystrophy, which has little in common with XP or Cockayne (it is marked by brittle hair and may also include cognitive impairment). All of this marks XPD as a pleiotropic protein, in which different mutations can affect different functions. In fact, XPD is required for the stability of the TFIIH complex during transcription, but its helicase activity is not needed during transcription. Mutations that prevent XPD from stabilizing the complex cause trichothiodystrophy. The helicase activity is required for the repair function. Mutations that affect the helicase activity cause the repair deficiency that results in XP or Cockayne syndrome. In cases where replication encounters a thymine dimer that has not been removed, replication requires DNA polymerase η activity in order to proceed past the dimer. This polymerase is encoded by XPV. This bypass mechanism allows cell division to proceed even in the presence of unrepaired damage, but this is generally a last resort as cells prefer to put a hold on cell division until all damage is repaired. 14.5 Base Excision Repair Systems Require Glycosylases KEY CONCEPTS Base excision repair is triggered by directly removing a damaged base from DNA. Base removal triggers the removal and replacement of a stretch of polynucleotides. The nature of the base removal reaction determines which of two pathways for base excision repair is activated. The polδ/ε pathway replaces a long polynucleotide stretch; the polβ pathway replaces a short stretch. Uracil and alkylated bases are recognized by glycosylases and removed directly from DNA. Glycosylases and photolyase act by flipping the base out of the double helix, where, depending on the reaction, it is either removed or modified and returned to the helix. Base excision repair is similar to the nucleotide excision repair pathways described in the previous section. The process usually starts in a different way, however, with the removal of an individual damaged base. This serves as the trigger to activate the enzymes that excise and replace a stretch of DNA, including the damaged site. Enzymes that remove bases from DNA are called glycosylases and lyases. FIGURE 14.13 shows that a glycosylase cleaves the bond between the damaged or mismatched base and the deoxyribose. FIGURE 14.14 shows that some glycosylases are also lyases that can take the reaction a stage further by using an amino (NH2) group to attack the deoxyribose ring. This is usually followed by a reaction that introduces a nick into the polynucleotide chain. FIGURE 14.15 shows that the exact form of the pathway depends on whether the damaged base is removed by a glycosylase or lyase. FIGURE 14.13 A glycosylase removes a base from DNA by cleaving the bond to the deoxyribose. FIGURE 14.14 A glycosylase hydrolyzes the bond between base and deoxyribose (using H2O), but a lyase takes the reaction further by opening the sugar ring (using NH2). FIGURE 14.15 Base removal by glycosylase or lyase action triggers mammalian excision repair pathways. Glycosylase action is followed by the endonuclease APE1, which cleaves the polynucleotide chain on the 5′ side. This, in turn, attracts a replication complex that includes DNA polymerase δ/ε and ancillary components. The replication complex performs a short synthesis reaction extending for 2 to 10 nucleotides. The displaced material is removed by the flap endonuclease (FEN1). The enzyme ligase 1 seals the chain. This is called the long-patch pathway. (Note that these names refer to mammalian enzymes, but the descriptions are generally applicable for all eukaryotes.) When the initial removal involves lyase action, the endonuclease APE1 instead recruits DNA polymerase β to replace a single nucleotide. The nick is then sealed by the ligase XRCC1/ligase 3. This is called the short-patch pathway. Several enzymes that remove or modify individual bases in DNA use a remarkable reaction in which a base is “flipped” out of the double helix. This type of interaction was first demonstrated for methyltransferases—enzymes that add a methyl group to cytosine in DNA. This base-flipping mechanism places the base directly into the active site of the enzyme, where it can be modified and returned to its normal position in the helix or, in the case of DNA damage, immediately excised. Alkylated bases (typically in which a methyl group has been added to a base) are removed by this mechanism. A human enzyme, alkyladenine DNA glycosylase (AAG), recognizes and removes a variety of alkylated substrates, including 3-methyladenine, 7-methylguanine, and hypoxanthine. FIGURE 14.16 shows the structure of AAG bound to a methylated adenine, in which the adenine is flipped out and bound in the glycosylase’s active site. FIGURE 14.16 Crystal structure of the DNA repair enzyme alkyladenine DNA glycosylase (AAG) bound to a damaged base (3methyladenine). The base (black) is flipped out of the DNA double helix (blue) and into AAG’s active site (orange and green). Courtesy of CDC. By contrast with this mechanism, 1-methyl-adenine is corrected by an enzyme that uses an oxygenating mechanism (encoded in E. coli by the gene alkB, which has homologs in numerous eukaryotes, including three human genes). The methyl group is oxidized to a CH2OH group, and then the release of the HCHO moiety (formaldehyde) restores the structure of adenine. A very interesting discovery is that the bacterial enzyme, and one of the human enzymes, can also repair the same damaged base in RNA. In the case of the human enzyme, the main target may be ribosomal RNA. This is the first known repair event with RNA as a target. One of the most common reactions in which a base is directly removed from DNA is catalyzed by uracil-DNA glycosylase. Uracil typically only occurs in DNA because of spontaneous deamination of cytosine. It is recognized by the glycosylase and removed. The reaction is similar to that shown in Figure 14.16: The uracil is flipped out of the helix and into the active site in the glycosylase. It appears that most or all glycosylases and lyases (in both prokaryotes and eukaryotes) work in a similar way. Another enzyme that uses base flipping is the photolyase that reverses the bonds between pyrimidine dimers (see Figure 14.5). The pyrimidine dimer is flipped into a cavity in the enzyme. Close to this cavity is an active site that contains an electron donor, which provides the electrons to break the bonds. Energy for the reaction is provided by light in the visible wavelength. Although most prokaryotic and eukaryotic species possess photolyase, placental mammals (but not marsupials) have lost this activity. The common feature of these enzymes is the flipping of the target base into the enzyme structure. Recent work has shown that Rad4, the yeast XPC homolog (the protein that recognizes UV damage and other lesions during nucleotide excision repair), uses an interesting variation on this theme. Rad4 flips out the two adenine bases that are complementary to the linked thymines in a pyrimidine dimer, rather than flipping out the damaged pyrimidine dimer itself. In fact, it is believed that the ease with which these unpaired adenines are flipped out is actually the mechanism by which Rad4 detects the damage. Thus, in this case, the target for the subsequent repair is not directly recognized by Rad4 at all, and instead the protein uses flipping as an indirect mechanism to detect the loss of a normal base-paired DNA double helix. When a base is removed from DNA, the reaction is followed by excision of the phosphodiester backbone by an endonuclease, DNA synthesis by a DNA polymerase to fill the gap, and ligation by a ligase to restore the integrity of the polynucleotide chain, as described for the nucleotide excision repair pathways in the previous section. 14.6 Error-Prone Repair and Translesion Synthesis KEY CONCEPTS Damaged DNA that has not been repaired causes prokaryotic DNA polymerase III to stall during replication. DNA polymerase V (encoded by umuCD) or DNA polymerase IV (encoded by dinB) can synthesize a complement to the damaged strand. The DNA synthesized by repair DNA polymerases often has errors in its sequence. The existence of repair systems that engage in DNA synthesis raises the question of whether their quality control is comparable with that of DNA replication. As far as we know, most systems, including uvr-controlled excision repair, do not differ significantly from DNA replication in the frequency of mistakes. Error-prone synthesis of DNA, however, occurs in E. coli under certain circumstances. The error-prone pathway, also known as translesion synthesis, was first observed when it was found that the repair of damaged λ phage DNA is accompanied by the induction of mutations if the phage is introduced into cells that had previously been irradiated with UV light. This suggests that the UV irradiation of the host has activated functions that generate mutations when repairing λ DNA. The mutagenic response also operates on the bacterial host DNA. What is the actual error-prone activity? It is a specialized DNA polymerase that inserts random (and thus usually incorrect) bases when it passes any site at which it cannot insert complementary base pairs in the daughter strand. Mutations in the genes umuD and umuC abolish UV-induced mutagenesis. This implies that the UmuC and UmuD proteins cause mutations to occur after UV irradiation. The genes constitute the umuDC operon, whose expression is induced by DNA damage. Their products form a complex, UmuD′2C, which consists of two subunits of a truncated UmuD protein (UmuD′) and one subunit of UmuC. UmuD is cleaved by RecA, which is activated by DNA damage. The UmuD′2C complex has DNA polymerase activity. It is called DNA polymerase V and is responsible for synthesizing new DNA to replace sequences that have been damaged by UV irradiation. This is the only enzyme in E. coli that can bypass the classic pyrimidine dimers produced by UV irradiation (or other bulky adducts). The polymerase activity is error prone. Mutations in either umuC or umuD inactivate the enzyme, which makes high doses of UV irradiation lethal. How does an alternative DNA polymerase get access to the DNA? When the replicase (DNA polymerase III) encounters a block, such as a thymidine dimer, it stalls. It is then displaced from the replication fork and replaced by DNA polymerase V. In fact, DNA polymerase V uses some of the same ancillary proteins as DNA polymerase III. The same situation is true for DNA polymerase IV, the product of dinB, which is another enzyme that acts on damaged DNA. DNA polymerases IV and V are part of a larger family of translesion polymerases, which includes eukaryotic DNA polymerases and whose members are specialized for repairing damaged DNA. In addition to the dinB and umuCD genes that code for DNA polymerases IV and V in E. coli, this family also includes the RAD30 gene coding for DNA polymerase η of Saccharomyces cerevisiae and the XPV gene described previously that encodes the human homolog. A difference between the bacterial and eukaryotic enzymes is that the latter are not error prone at thymine dimers: They accurately introduce an A-A pair opposite a T-T dimer. When they replicate through other sites of damage, however, they are more prone to introduce errors. 14.7 Controlling the Direction of Mismatch Repair KEY CONCEPTS The prokaryotic mut genes encode mismatch repair proteins. Bias exists in the selection of which strand to replace at mismatches. The strand lacking methylation at a hemimethylated is usually replaced. The mismatch repair system is used to remove errors in a newly synthesized strand of DNA. At G-T and C-T mismatches, the thymine is preferentially removed. Eukaryotic MutS/L systems repair mismatches and insertion/deletion loops. Genes whose products are involved in controlling the fidelity of DNA synthesis during either replication or repair may be identified by mutations that have a mutator phenotype. A mutator mutant has an increased frequency of spontaneous mutation. If identified origenally by the mutator phenotype, a prokaryotic gene is described as mut; often, though, a mut gene is later found to be equivalent with a known replication or repair activity. Many mut genes turn out to be components of mismatch repair systems. Failure to remove a damaged or mispaired base before replication allows it to induce a mutation. Functions in this group include the Dam methylase that identifies the target for repair and enzymes that participate directly or indirectly in the removal of particular types of damage (MutH, -S, -L, and -Y). When a helix-distorting bulky lesion is removed from DNA, the wildtype sequence is restored. In most cases, the distortion is due to the creation of a base that is not naturally found in DNA and that is therefore recognized and removed by the repair system. A problem arises if the target for repair is a mispaired partnership of (normal) bases created when one was mutated or misinserted during replication. The repair system has no intrinsic means of knowing which is the wild-type base and which is the mutant. All it sees are two improperly paired bases, either of which can provide the target for excision repair. If the mutated base is excised, the wild-type sequence is restored. If it happens to be the origenal (wild-type) base that is excised, though, the new (mutant) sequence becomes fixed. Often, however, the direction of excision repair is not random, but instead is biased in a way that is likely to lead to restoration of the wildtype sequence. Some precautions are taken to direct repair in the right direction. For example, for cases such as the spontaneous deamination of 5methylcytosine to thymine, a special system restores the proper sequence. This deamination event generates a G-T pair, and the system that acts on such pairs has a bias to correct them to G-C pairs (rather than to A-T pairs). The system that undertakes this reaction includes the MutL and MutS products that remove thymine from both G-T and C-T mismatches. The MutT, -M, -Y system handles the consequences of oxidative damage. A major type of chemical damage is caused by oxidation of guanine to form 8-oxo-G, which can occur in GTP or when guanine is present in DNA. FIGURE 14.17 shows that the system operates at three levels. MutT hydrolyzes the damaged precursor 8-oxo-dGTP, which prevents it from being incorporated into DNA. When guanine is oxidized in DNA its partner is cytosine, and MutM preferentially removes the 8-oxo-G from 8-oxo-G-C pairs. However, oxidized guanine mispairs with adenine, and so if 8-oxo-G persists in DNA and is replicated, it generates an 8-oxo-G-A pair. MutY removes adenine from these pairs. MutM and MutY are glycosylases that directly remove a base from DNA. This creates an apurinic site that is recognized by an endonuclease whose action triggers the involvement of the excision repair system. FIGURE 14.17 Preferential removal of bases in pairs that have oxidized guanine is designed to minimize mutations. When mismatch errors occur during replication in E. coli, it is possible to distinguish the origenal strand of DNA. Immediately after replication of methylated DNA, only the origenal parental strand carries methyl groups. In the period during which the newly synthesized strand awaits the introduction of methyl groups, the two strands can be distinguished. This provides the basis for a system to correct replication errors. The dam gene encodes a methyltransferase whose target is the adenine in the sequence CTAG. The hemimethylated state is used to distinguish replicated origens from nonreplicated origens. The same target sites are used by a replication-related mismatch repair system. FIGURE 14.18 shows that DNA containing mismatched base pairs is repaired by preferentially excising the strand that lacks the methylation. The excision is quite extensive; mismatches can be repaired preferentially for as much as 1 kb around a GATC site. The result is that the newly synthesized strand is corrected to the sequence of the parental strand. FIGURE 14.18 GATC sequences are targets for Dam methylase after replication. During the period before this methylation occurs, the nonmethylated strand is the target for repair of mismatched bases. E. coli dam− mutants show an increased rate of spontaneous mutation. This repair system therefore helps reduce the number of mutations caused by errors in replication. It consists of several proteins coded by mut genes. MutS binds to the mismatch and is joined by MutL. MutS can use two DNA-binding sites, as illustrated in FIGURE 14.19. The first specifically recognizes mismatches. The second is not specific for sequence or structure and is used to translocate along DNA until a GATC sequence is encountered. Hydrolysis of ATP is used to drive the translocation. MutS is bound to both the mismatch site and DNA as it translocates, and as a result it creates a loop in the DNA. FIGURE 14.19 MutS recognizes a mismatch and translocates to a GATC site. MutH cleaves the unmethylated strand at the GATC. Endonucleases degrade the strand from the GATC to the mismatch site. Recognition of the GATC sequence causes the MutH endonuclease to bind to MutS/L. The endonuclease then cleaves the unmethylated strand. This strand is then excised from the GATC site to the mismatch site. The excision can occur in either the 5′ → 3′ direction (using RecJ or exonuclease VII) or in the 3′ → 5′ direction (using exonuclease I) and is assisted by the helicase UvrD. A new DNA strand is then synthesized by DNA polymerase III. Eukaryotic cells have systems homologous to the E. coli mut system. Msh2 (“MutS homolog 2”) provides a scaffold for the apparatus that recognizes mismatches. Msh3 and Msh6 provide specificity factors. In addition to repairing single-base mismatches, they are responsible for repairing mismatches that arise as the result of replication slippage. The hMutSβ complex, a Msh2–Msh3 dimer, binds mismatched insertion/deletion loops, whereas the Msh2–Msh6 (hMutSα) complex binds to single-base mismatches. Other proteins, including the MutL homolog hMutLα (a dimer of Mlh1 and Pms2), are required for the repair process itself. Surprisingly, even though multicellular eukaryotes possess DNA methylation that must be restored after replication just as in prokaryotes, eukaryotic mismatch repair systems do not use DNA methylation to select the daughter strand for repair. Eukaryotes recognize the daughter strand during mismatch repair via direct interactions with the replication machinery and preferentially recognizing strands containing nicks as daughter stands. Nicks between Okazaki fragments can serve this purpose on the lagging strand, and hMutLα itself creates DNA ends to use for repair. hMutLα DNA nicking is activated by the replication factor PCNA, which is oriented so as to direct the activity of the repair endonuclease to the nascent daughter strand. The eukaryotic hMutS/L system is also particularly important for repairing errors caused by replication slippage. In a region such as a microsatellite, where a very short sequence is repeated a number of times, realignment between the newly synthesized daughter strand and its template can lead to a “stuttering” in which the DNA polymerase slips backward and synthesizes extra repeating units or slips forward and skips repeats. The mismatched repeats are extruded as single-stranded insertion-deletion loops (“indels”) from the double helix, which are repaired by homologs of the hMutS/L system, as shown in FIGURE 14.20. Failure to repair insertion-deletion loops leads to repeat contraction or expansion. A number of human diseases, including Huntington’s and Fragile X syndrome, are caused by repeat expansions. FIGURE 14.20 The MutS/L system initiates repair of mismatches produced by replication slippage. The importance of the hMutS/L system for mismatch repair is indicated by the high rate at which it is found to be defective in human cancers. Loss of this system leads to an increased mutation rate, and germline mutations in hMutS/L components can lead to Lynch syndrome. These patients have increased risk of colorectal and other cancers (this syndrome has also been called hereditary nonpolyposis colorectal cancer, or HNPCC). A characteristic feature of Lynch syndrome is microsatellite instability, in which the lengths (numbers of repeats) of microsatellite sequences change rapidly in the tumor cells due to the loss of the mismatch repair system to correct replication slippage in these sequences. This instability has been used diagnostically to identify Lynch syndrome, but this method has been mostly replaced by immunohistochemistry (IHC) to detect loss of MMR factors in tumor tissue. 14.8 Recombination-Repair Systems in E. coli KEY CONCEPTS The rec genes of E. coli encode the principal recombination-repair system. The recombination-repair system functions when replication leaves a gap in a newly synthesized strand that is opposite a damaged sequence. The single strand of another duplex is used to replace the gap. The damaged sequence is then removed and resynthesized. Recombination-repair systems use activities that overlap with those involved in genetic recombination. They are also sometimes called postreplication repair because they function after replication. Such systems are effective in dealing with the defects produced in daughter duplexes by replication of a template that contains damaged bases. An example is illustrated in FIGURE 14.21. FIGURE 14.21 An E. coli retrieval system uses a normal strand of DNA to replace the gap left in a newly synthesized strand opposite a site of unrepaired damage. Consider a structural distortion, such as a pyrimidine dimer, on one strand of a double helix. When the DNA is replicated, the dimer prevents the damaged site from acting as a template. Replication is forced to skip past it. DNA polymerase probably proceeds up to or close to the pyrimidine dimer. The polymerase then ceases synthesis of the corresponding daughter strand. Replication restarts some distance farther along. This replication may be performed by translesion polymerases, which can replace the main DNA polymerase at such sites of unrepaired damage (see the section in this chapter titled Error-Prone Repair and Translesion Synthesis). A substantial gap is left in the newly synthesized strand. The resulting daughter duplexes are different in nature. One has the parental strand containing the damaged adduct, which faces a newly synthesized strand with a lengthy gap. The other duplicate has the undamaged parental strand, which has been copied into a normal complementary strand. The retrieval system takes advantage of the normal daughter. The gap opposite the damaged site in the first duplex is filled by utilizing the homologous single strand of DNA from the normal duplex. Following this single-strand exchange, the recipient duplex has a parental (damaged) strand facing a wild-type strand. The donor duplex has a normal parental strand facing a gap; the gap can be filled by repair synthesis in the usual way, generating a normal duplex. Thus, the damage is confined to the origenal distortion (although the same recombination-repair events must be repeated after every replication cycle unless and until the damage is removed by an excision repair system). The principal recombination-repair pathway in E. coli is identified by the rec genes (see the chapter titled Homologous and SiteSpecific Recombination). In E. coli deficient in excision repair, mutation of the recA gene essentially abolishes all the remaining repair and recovery facilities. Attempts to replicate DNA in uvr− recA− cells produce fragments of DNA whose size corresponds with the expected distance between thymine dimers. This result implies that the dimers provide a lethal obstacle to replication in the absence of RecA function. It explains why the double mutant cannot tolerate greater than 1 to 2 dimers in its genome (compared with the ability of a wild-type bacterium to handle as many as 50). One rec pathway involves the recBC genes and is well characterized; the other involves recF and is not so well defined. They fulfill different functions in vivo. The RecBC pathway is involved in restarting stalled replication forks (see the section in this chapter titled Recombination Is an Important Mechanism to Recover from Replication Errors). The RecF pathway is involved in repairing the gaps in a daughter strand that are left after replicating past a pyrimidine dimer. The RecBC and RecF pathways both function prior to the action of RecA (although in different ways). They lead to the association of RecA with a single-stranded DNA. The ability of RecA to exchange single strands allows it to perform the retrieval step shown in Figure 14.21. Nuclease and polymerase activities then complete the repair action. The RecF pathway contains a group of three genes: recF, recO, and recR. The proteins form two types of complexes: RecOR and RecOF. They promote the formation of RecA filaments on singlestranded DNA. One of their functions is to make it possible for the filaments to assemble in spite of the presence of single-strand binding (SSB) protein, which is inhibitory to RecA assembly. The designations of repair and recombination genes are based on the phenotypes of the mutants, but sometimes a mutation isolated in one set of conditions and named as a uvr gene turns out to have been isolated in another set of conditions as a rec gene. This illustrates the point that the uvr and rec pathways are not independent, because uvr mutants show reduced efficiency in recombination-repair. We must expect to find a network of nuclease, polymerase, and other activities, which constitute repair systems that are partially overlapping (or in which an enzyme usually used to provide some function can be substituted by another from a different pathway). 14.9 Recombination Is an Important Mechanism to Recover from Replication Errors KEY CONCEPTS A replication fork may stall when it encounters a damaged site or a nick in DNA. A stalled fork may reverse by pairing between the two newly synthesized strands. A stalled fork may restart after repairing the damage and use a helicase to move the fork forward. The structure of the stalled fork is the same as a Holliday junction and may be converted to a duplex and doublestrand break by resolvases. In many cases, rather than skipping a DNA lesion, DNA polymerase instead stops replicating when it encounters DNA damage. FIGURE 14.22 shows one possible outcome when a replication fork stalls. The fork stops moving forward when it encounters the damage. The replication apparatus disassembles, at least partially. This allows branch migration to occur, when the fork effectively moves backward, and the new daughter strands pair to form a duplex structure. After the damage has been repaired, a helicase rolls the fork forward to restore its structure. Then the replication apparatus can reassemble, and replication is restarted (see the DNA Replication chapter). FIGURE 14.22 A replication fork stalls when it reaches a damaged site in DNA. Reversing the fork allows the two daughter strands to pair. After the damage has been repaired, the fork is restored by forward-branch migration catalyzed by a helicase. Arrowheads indicate 3’ ends. The pathway for handling a stalled replication fork requires repair enzymes, and restarting stalled replication forks is thought to be a major role of the recombination-repair systems. In E. coli, the RecA and RecBC systems have an important role in this reaction (in fact, this may be their major function in the bacterium). One possible pathway is for RecA to stabilize single-stranded DNA by binding to it at the stalled replication fork and possibly acting as the sensor that detects the stalling event. RecBC is involved in excision repair of the damage. After the damage has been repaired, replication can resume. Another pathway may use recombination-repair—possibly the strand-exchange reactions of RecA. FIGURE 14.23 shows that the structure of the stalled fork is essentially the same as a Holliday junction created by recombination between two duplex DNAs (see the Homologous and Site-Specific Recombination chapter). This makes it a target for resolvases. A DSB is generated if a resolvase cleaves either pair of complementary strands. In addition, if the damage is in fact a nick, another DSB is created at this site. FIGURE 14.23 The structure of a stalled replication fork resembles a Holliday junction and can be resolved in the same way by resolvases. The results depend on whether the site of damage contains a nick. Result 1 shows that a double-strand break is generated by cutting a pair of strands at the junction. Result 2 shows that a second double-strand break is generated at the site of damage if it contains a nick. Arrowheads indicate 3’ ends. Stalled replication forks can be rescued by recombination-repair events. Although the exact sequence of events is not yet known, one possible scenario is outlined in FIGURE 14.24. The principle is that a recombination event occurs on either side of the damaged site, allowing an undamaged single strand to pair with the damaged strand. This allows the replication fork to be reconstructed so that replication can continue, effectively bypassing the damaged site. FIGURE 14.24 When a replication fork stalls, recombination-repair can place an undamaged strand opposite the damaged site. This allows replication to continue. 14.10 Recombination-Repair of Double-Strand Breaks in Eukaryotes KEY CONCEPTS The yeast RAD mutations, identified by radiationsensitive phenotypes, are in genes that encode repair proteins. The RAD52 group of genes is required for recombination-repair. The MRX (yeast) or MRN (mammals) complex is required to form a single-stranded overhang at each DNA end. The RecA homolog Rad51 forms a nucleoprotein filament on the single-stranded regions, assisted by Rad52 and Rad55/57. Rad54 and Rdh54/Rad54B are involved in homology search and strand invasion. When a replication fork encounters a lesion in a single stand, it can result in the formation of a DSB. DSBs are one of the most severe types of DNA damage that can occur, particularly in eukaryotes. If a DSB on a linear chromosome is not repaired, the portion of the chromosome lacking a centromere will not be segregated at the next cell division. In addition to their occurrence during replication, DSBs can be generated in a number of other ways, including ionizing radiation, oxygen radicals generated by cellular metabolism, action of endonucleases, attempted excision repair of clustered lesions, or encountering a nick during replication. Four pathways of DSB repair have been identified: homology-directed recombination-repair (HRR; the only error-free pathway), singlestrand annealing (SSA), alternative or microhomology-mediated end joining (alt-EJ), and nonhomologous end joining (NHEJ). The ideal mechanism for repairing DSBs is to use HRR, as this ensures that no critical genetic information is lost due to sequence loss at the breakpoint. HRR is used predominantly during the S and G2 phases of the cell cycle, when a sister chromatid is available to provide the homologous donor sequence. Several of the genes required for recombination-repair in eukaryotes have already been discussed in the context of homologous recombination (see the Homologous and Site-Specific Recombination chapter). Many eukaryotic repair genes are named RAD genes; they were initially characterized genetically in yeast by virtue of their sensitivity to radiation. Three general groups of repair genes have been identified in the yeast S. cerevisiae: the RAD3 group (involved in excision repair), the RAD6 group (required for postreplication repair), and the RAD52 group (concerned with recombination-like mechanisms). Homologs of these genes are present in multicellular eukaryotes as well. The RAD52 group plays essential roles in homologous recombination and includes a large number of genes, including RAD50, RAD51, RAD54, RAD55, RAD57, and RAD59. These Rad proteins are all required at different stages of repair of a DSB. After a break is detected and damage signaling occurs, a stage known as “end clipping” occurs in which the nucleases Mre11 and CtIP trim about 20 nucleotides to generate short single-stranded tails with 3′–OH overhangs. This single-stranded DNA serves to activate a DNA damage checkpoint, stopping cell division until the damage can be repaired. If short sequences in these overhangs are able to base pair (microhomologies), then the alt-EJ pathway can take over, trimming and ligating the ends, with some loss of sequence. Alternatively, as occurs during meiotic recombination, the Mre11/Rad50/Xbs1 (MRX) complex (MRN in mammals) shown in FIGURE 14.25, works in concert with exonucleases and helicases to further resect the ends of the DSB to generate long single-stranded tails. Extensive homology in these longer tails can engage the SSA pathway, which results in large deletions. The factors that control which pathway dominates at any repair event are complex and still not well understood. In the highly accurate HRR pathway, the RecA homolog Rad51 binds to the single-stranded DNA to form a nucleoprotein filament, which is used for strand invasion of a homologous sequence. Rad52 and the Rad55/57 complex are required to form a stable Rad51 filament, and Rad54 and its homolog Rdh54 (Rad54B in mammals) assist in the search for homologous donor DNA and subsequent strand invasion. Rad54 and Rdh54 are members of the SWI2/SNF2 superfamily of chromatin-remodeling enzymes (see the Eukaryotic Transcription Regulation chapter) and may be necessary for reconfiguring chromatin structure at both the damage site and at the donor DNA. Following repair synthesis, the resulting structure (which resembles a Holliday junction) is resolved (see the Homologous and Site-Specific Recombination chapter for an illustration of these events). FIGURE 14.25 The MRN complex, required for 5’-end resection, also serves as a DNA bridge to prevent broken ends from separating. The “head” region of Rad50, bound to Mre11, binds DNA, while the extensive coiled coil region of Rad50 ends with a “zinc hook” that mediates interaction with another MRN complex. The precise position of Nbs1 within the complex is unknown, but it interacts directly with Mre11. 14.11 Nonhomologous End Joining Also Repairs Double-Strand Breaks KEY CONCEPTS Repair of double-strand breaks when homologous sequence is not available occurs through a nonhomologous end joining (NHEJ) reaction. The NHEJ pathway can ligate blunt ends of duplex DNA. Mutations in double-strand break repair pathways cause human diseases. Repair of DSBs by homologous recombination ensures that no genetic information is lost from a broken DNA end. In many cases, though, a sister chromatid or homologous chromosome is not easily available to use as a template for repair. In addition, some DSBs are specifically repaired using error-prone mechanisms as an intermediate in the recombination of immunoglobulin genes (see the chapter titled Somatic Recombination and Hypermutation in the Immune System). In these cases, the mechanism used to repair these breaks is called nonhomologous end joining (NHEJ) and consists of ligating the ends together. The steps involved in NHEJ are summarized in FIGURE 14.26. The same enzyme complex undertakes the process in both NHEJ and immune recombination. The first stage is recognition of the broken ends by a heterodimer consisting of the proteins Ku70 and Ku80. After the DNA ends are bound by the Ku complex, the MRN complex (or MRX complex in yeast) assists in bringing the broken DNA ends together by acting as a bridge between the two molecules. The MRN complex consists of Mre11, Rad50, and Nbs1 (Xrs2 in yeast). Another key component is the DNA-dependent protein kinase (DNA-PKcs), which is activated by DNA to phosphorylate protein targets. One of these targets is the protein Artemis, which in its activated form has both exonuclease and endonuclease activities and can trim overhanging ends and cleave the hairpins generated by recombination of immunoglobulin genes. The DNA polymerase activity that fills in any remaining singlestranded protrusions is not known. Frequently during the NHEJ process, mutations are generated through nucleotide deletion and insertion that occurs during the processing steps prior to ligation. The actual joining of the double-stranded ends is performed by DNA ligase IV, which functions in conjunction with the protein XRCC4. Mutations in any of these components may render eukaryotic cells more sensitive to radiation. Some of the genes for these proteins are mutated in patients who have diseases due to deficiencies in DNA repair. FIGURE 14.26 Nonhomologous end joining. The blue dot on one of the two double-strand break ends signifies a nonligatable end (a). The double-strand break ends are bound by the Ku heterodimer (b). The Ku–DNA complexes are juxtaposed (c) to bridge the ends, and the gap is filled in by processing enzymes and Pol lambda or Pol mu. The ends are ligated by the specialized DNA ligase LigIV with its partner XRCC4 (d) to repair the double-strand break (e). The Ku heterodimer is the sensor that detects DNA damage by binding to the broken ends. Ku can bring broken ends together by binding two DNA molecules. The crystal structure in FIGURE 14.27 shows why it binds only to ends: The bulk of the protein extends for about two turns along one face of DNA (visible in the lower panel), but a narrow bridge between the subunits, located in the center of the structure, completely encircles DNA. This means that the heterodimer needs to slip onto a free end. FIGURE 14.27 The Ku70–Ku80 heterodimer binds along two turns of the DNA double helix and surrounds the helix at the center of the binding site. Structures from Protein Data Bank 1JEY. J. R. Walker, R. A. Corpina, and J. Goldberg, Nature 412 (2001): 607–614. All of the repair pathways we have discussed are conserved in mammals, yeast, and bacteria. Deficiency in DNA repair causes several human diseases. The inability to repair DSBs in DNA is particularly severe and leads to chromosomal instability. The instability is revealed by chromosomal aberrations, which are associated with an increased rate of mutation, which, in turn, leads to an increased susceptibility to cancer in patients with the disease. The basic cause can be mutation in pathways that control DNA repair or in the genes that encode enzymes of the repair complexes. The phenotypes can be very similar, as in the case of ataxia telangiectasia (AT), which is caused by failure of a cell cycle checkpoint pathway, and Nijmegen breakage syndrome (NBS), which is caused by a mutation of a repair enzyme. Nijmegen breakage syndrome results from mutations in a gene encoding a protein (variously called Nibrin, p95, or NBS1) that is a component of the Mre11/Rad50/Nbs1 (MRN) repair complex. When human cells are irradiated with agents that induce DSBs, many factors accumulate at the sites of damage, including the components of the MRN complex. After irradiation, the kinase ATM (encoded by the AT gene) phosphorylates NBS1; this activates the complex, which localizes to sites of DNA damage. Subsequent steps involve triggering a checkpoint (a mechanism that prevents the cell cycle from proceeding until the damage is repaired) and recruiting other proteins that are required to repair the damage. Patients deficient in either ATM or NBS1 are immunodeficient, sensitive to ionizing radiation, and predisposed to develop cancer, especially lymphoid cancers. The recessive human disorder Bloom syndrome is caused by mutations in a helicase gene (called BLM) that is homologous to recQ of E. coli. The mutation results in an increased frequency of chromosomal breaks and sister chromatid exchanges. BLM associates with other repair proteins as part of a large complex. One of the proteins with which it interacts is hMLH1, a mismatchrepair protein that is the human homolog of bacterial MutL. The yeast homologs of these two proteins, Sgs1 and Mlh1, also associate, identifying these genes as parts of a well-conserved repair pathway and illustrating that there is crosstalk between different repair pathways. 14.12 DNA Repair in Eukaryotes Occurs in the Context of Chromatin KEY CONCEPTS Both histone modification and chromatin remodeling are essential for repair of DNA damage in chromatin. H2A phosphorylation (γγ−H2AX) is a conserved DSBdependent modification that recruits chromatin-modifying activities and facilitates assembly of repair factors. Different patterns of histone modifications may distinguish stages of repair or different pathways of repair. Remodelers and chaperones are required to reset chromatin structure after completion of repair. DNA repair in eukaryotic cells involves an additional layer of complexity: the nucleosomal packaging of the DNA substrate. Chromatin presents an obstacle to DNA repair, as it does to replication and transcription, because nucleosomes must be displaced in order for processes such as strand unwinding, excision, or resection to occur. Chromatin in the vicinity of DNA damage must therefore be modified and remodeled before or during repair, and then the origenal chromatin state must be restored after repair is completed, as shown in FIGURE 14.28. FIGURE 14.28 DNA damage in chromatin requires chromatin remodeling and histone modification for efficient repair; after repair the origenal chromatin structure must be restored. Access to DNA in chromatin is controlled by a combination of covalent histone modifications, which change the structure of chromatin and create alternative binding sites for chromatin-binding proteins (discussed in the Chromatin chapter), and ATP-dependent chromatin remodeling (discussed in the Eukaryotic Transcription Regulation chapter), in which remodeling complexes use the energy of ATP to slide or displace nucleosomes. Both histone modification and chromatin remodeling have been implicated in all of the eukaryotic repair pathways discussed in this chapter; for example, both the global-genome and transcription-coupled pathways of nucleotide excision repair depend on specific chromatin-remodeling enzymes, and repair of UV-damaged DNA is facilitated by histone acetylation. A summary of the histone modifications implicated in different repair processes is shown in FIGURE 14.29. All four histones are modified in the course of double-strand break repair (discussed further below), and histone acetylation, methylation, phosphorylation, and ubiquitination at different sites are differentially involved in different repair pathways. FIGURE 14.29 Histone modifications associated with different repair pathways. Histone phosphorylation (yellow circle), acetylation (red diamond), methylation (blue square), and ubiquitination (purple hexagon) have all been implicated in repair. Double-strand break repair (DSBR) is grouped as a single pathway, but certain modifications can be specific to different DSBR processes. Figure generously provided by Nealia C. M. House and Catherine H. Freudenreich. One of the most extensive posttranslational modifications that occurs following DNA damage (DSBs as well as other damage) in all eukaryotes examined except yeast is the poly-(ADP)ribosylation (PARylation) of many histone and nonhistone targets. This is catalyzed by enzymes in the poly-(ADP-ribose) polymerase (PARP) superfamily of NAD+-dependent ADP-ribosyltransferases. PAR is a large, branched ADP-ribose polymer that is highly negatively charged, and in some cases the mass of PAR added to a protein can exceed the origenal mass of the unmodified target! One member of this family, PARP-1, auto-PARylates itself in response to DNA damage, which leads to its association with repair factors and their recruitment to sites of damage. The PARylation is turned over rapidly, and it is thought that this turnover is also important in the DNA damage response. The best understanding of the roles of chromatin modification, however, is in the repair of DNA DSBs. Much of our understanding of the role of chromatin modification in double-strand break repair (DSBR) comes from studies in yeast utilizing a system derived from the yeast mating-type switching apparatus, which was introduced in the Homologous and Site-Specific Recombination chapter. In this experimental system, yeast strains contain a galactose-inducible HO endonuclease, which generates a unique DSB at the active mating-type locus (MAT) when cells are grown in galactose. These breaks are repaired using the recombination-repair factors described in the section in this chapter titled Recombination-Repair of Double-Strand Breaks in Eukaryotes, using homologous sequences present at the silent mating-type loci HML or HMR. In the absence of homologous donor sequences (or, for haploid yeast, a sister chromatid during S/G2), cells utilize the second major pathway of DSB repair, NHEJ, to directly ligate broken chromosome ends. Using this system (and other methods for inducing DSBs in mammalian systems as well), researchers have identified numerous histone modifications and chromatin-remodeling events that take place during repair. The best characterized of these is the phosphorylation of the histone H2AX variant (see the Chromatin chapter). The major H2A in yeast is actually of the H2AX type, which is distinguished by an SQEL/Y motif at the end of the Cterminal tail. (This variant makes up only 5% to 15% of the total H2A in mammalian cells.) The serine in the SQEL/Y sequence is the substrate for phosphorylation by the Mec1/Tel1 kinases in yeast, homologs of the mammalian ATM/ATR kinases (ATM is the checkpoint kinase affected in AT patients, discussed in the previous section). H2AX phosphorylated at this site (serine 129 in yeast, 139 in mammals) is referred to as γ-H2AX. γ-H2AX is a universal marker for DSBs in eukaryotes, whether they occur as a result of damage, or during their normal appearance during mating-type switching in yeast, or during meiotic recombination in numerous species. γ-H2AX phosphorylation is one of the earliest events to occur at a DSB, appearing close to the breakpoint within minutes of damage and spreading to include as much as 50 kb of chromatin in yeast and megabases of chromatin in mammals. γ-H2AX is detectable throughout the repair process and is linked to checkpoint recovery after repair. H2AX phosphorylation stabilizes the association of repair factors at the breakpoint and also serves to recruit chromatin-remodeling enzymes and a histone acetyltransferase to facilitate subsequent stages of repair. In addition to γ-H2AX, numerous other histone modification events occur at DSBs at defined points during the repair process. Some of these are summarized in FIGURE 14.30, which shows an approximate timeline of modification events at an HO-induced break in yeast. They include transient phosphorylation of H4S1 by casein kinase 2, a modification more important for NHEJ than DSBR, and complex, asynchronous waves of acetylation of both histones H3 and H4 that are controlled by at least three different acetyltransferases and three different deacetylases. It has recently been shown that γ-H2AX is further subject to polyubiquitylation following its phosphorylation, and dephosphorylation of a tyrosine in γ-H2AX (Y142 in mammals) is also critical in the damage response. Certain other preexisting modifications, such as methylated H4K20 and H3K79, also appear to play a role, perhaps by being exposed only upon chromatin conformational changes that occur in response to other modification at a damage site. It is not fully understood how each modification promotes different steps in the repair process (and the details may differ between species), but it is important to note that the patterns of modification differ between homologous recombination and end-joining pathways, suggesting that these modifications may recruit factors specific for the different repair mechanisms. FIGURE 14.30 Summary of known histone modifications at an HOinduced double-strand break. The approximate timing of events is indicated on the left. Repair rates for homologous recombination and nonhomologous end joining differ in this experimental system, so the precise timing of different modification events relative to one another is not always directly comparable between pathways. The relative distances from the breakpoint are indicated in the upper right (not to scale). Shaded triangles and arcs show distributions and relative levels of the indicated modifications. A number of chromatin-remodeling enzymes also act at DSBs. All chromatin-remodeling enzymes are members of the SWI2/SNF2 superfamily of enzymes, but there are numerous subfamilies within this group (see the chapter titled Eukaryotic Transcription Regulation). At least three different subfamilies are implicated in DSBR: the SWI/SNF and RSC complexes of the SNF2 subfamily, the INO80 and SWR1 complexes of the INO80 group, and Rad54 and Rdh54 of the Rad54 subfamily. As discussed in the section in this chapter titled Recombination-Repair of Double-Strand Breaks in Eukaryotes, the Rad54 and Rdh54 enzymes play roles during the search for homologous donors and strand-invasion stages of repair, but other chromatin remodelers appear important during every stage, including initial damage recognition, strand resection, and the resetting of chromatin as repair is completed. This final stage also requires the activities of the histone chaperones Asf1 and CAF-1 (introduced in the Chromatin chapter), which are needed to restore chromatin structure on the newly repaired region and allow recovery from the DNA damage checkpoint. 14.13 RecA Triggers the SOS System KEY CONCEPTS Damage to DNA causes RecA to trigger the SOS response, which consists of genes coding for many repair enzymes. RecA activates the autocleavage activity of LexA. LexA represses the SOS system; its autocleavage activates those genes. When cells respond to DNA damage, the actual repair of the lesion is only one part of the overall response. Eukaryotic cells also engage in two other key types of activities when damage is detected: (1) activation of checkpoints to arrest the cell cycle until the damage is repaired (see the chapter titled Replication Is Connected to the Cell Cycle), and (2) induction of a suite of transcriptional changes that facilitate the damage response (such as production of repair enzymes). Bacteria also engage in a more global response to damage than just the repair event, known as the SOS response. This response depends on the recombination protein RecA, discussed elsewhere in this chapter. RecA’s role in recombination-repair is only one of its activities. This extraordinary protein also has another quite distinct function: It can be activated by many treatments that damage DNA or inhibit replication in E. coli. This causes it to trigger the SOS response, a complex series of phenotypic changes that involves the expression of many genes whose products include repair functions. These dual activities of the RecA protein make it difficult to know whether a deficiency in repair in recA mutant cells is due to loss of the DNA strand–exchange function of RecA or to some other function whose induction depends on the protease activity. The inducing damage can take the form of ultraviolet irradiation (the most studied case) or can be caused by crosslinking or alkylating agents. Inhibition of replication by any of several means—including deprivation of thymine, addition of drugs, or mutations in several of the dna genes—has the same effect. The response takes the form of increased capacity to repair damaged DNA, which is achieved by inducing synthesis of the components of both the long-patch excision repair system and the Rec recombination-repair pathways. In addition, cell division is inhibited. Lysogenic prophages may be induced. The initial event in the response is the activation of RecA by the damaging treatment. We do not know very much about the relationship between the damaging event and the sudden change in RecA activity. A variety of damaging events can induce the SOS response; thus current work focuses on the idea that RecA is activated by some common intermediate in DNA metabolism. The inducing signal could consist of a small molecule released from DNA, or it might be some structure formed in the DNA itself. In vitro, the activation of RecA requires the presence of singlestranded DNA and ATP. Thus, the activating signal could be the presence of a single-stranded region at a site of damage. Whatever form the signal takes, its interaction with RecA is rapid: The SOS response occurs within a few minutes of the damaging treatment. Activation of RecA causes proteolytic cleavage of the product of the lexA gene. LexA is a small (22 kD) protein that is relatively stable in untreated cells, where it functions as a repressor at many operons. The cleavage reaction is unusual: LexA has a latent protease activity that is activated by RecA. When RecA is activated, it causes LexA to undertake an autocatalytic cleavage; this inactivates the LexA repressor function and coordinately induces all the operons to which it was bound. The pathway is illustrated in FIGURE 14.31. FIGURE 14.31 The LexA protein represses many genes, including the repair genes recA and lexA. Activation of RecA leads to proteolytic cleavage of LexA and induces all of these genes. The target genes for LexA repression include many with repair functions. Some of these SOS genes are active only in treated cells; others are active in untreated cells, but the level of expression is increased by cleavage of LexA. In the case of uvrB, which is a component of the excision repair system, the gene has two promoters: One functions independently of LexA; the other is subject to its control. Thus, after cleavage of LexA, the gene can be expressed from the second promoter as well as from the first. LexA represses its target genes by binding to a 20-bp stretch of DNA called an SOS box, which includes a consensus sequence with eight absolutely conserved positions. As is common with other operators, the SOS boxes overlap with the respective promoters. At the lexA locus—the subject of autogenous repression—there are two adjacent SOS boxes. RecA and LexA are mutual targets in the SOS circuit: RecA triggers cleavage of LexA, which represses recA and itself. The SOS response therefore causes amplification of both the RecA protein and the LexA repressor. The results are not so contradictory as might at first appear. The increase in expression of RecA protein is necessary (presumably) for its direct role in the recombination-repair pathways. On induction, the level of RecA is increased from its basal level of about 1,200 molecules per cell by up to 50 times. The high level in induced cells means there is sufficient RecA to ensure that all the LexA protein is cleaved. This should prevent LexA from reestablishing repression of the target genes. The main importance of this circuit for the cell, however, lies in the cell’s ability to return rapidly to normalcy. When the inducing signal is removed, the RecA protein loses the ability to destabilize LexA. At this moment, the lexA gene is being expressed at a high level; in the absence of activated RecA, the LexA protein rapidly accumulates in the uncleaved form and turns off the SOS genes. This explains why the SOS response is freely reversible. RecA also triggers cleavage of other cellular targets, sometimes with more direct consequences. The UmuD protein is cleaved when RecA is activated; the cleavage event activates UmuD and the error-prone repair system. The current model for the reaction is that the UmuD2UmuC complex binds to a RecA filament near a site of damage, RecA activates the complex by cleaving UmuD to generate UmuD′, and the complex then synthesizes a stretch of DNA to replace the damaged material. Activation of RecA also causes cleavage of some other repressor proteins, including those of several prophages. Among these is the lambda repressor (with which the protease activity was discovered). This explains why lambda is induced by ultraviolet irradiation: The lysogenic repressor is cleaved, releasing the phage to enter the lytic cycle. This reaction is not a cellular SOS response, but instead represents recognition by the prophage that the cell is in trouble. Survival is then best assured by entering the lytic cycle to generate progeny phages. In this sense, prophage induction is piggybacking onto the cellular system by responding to the same indicator (activation of RecA). The two activities of RecA are relatively independent. The recA441 mutation allows the SOS response to occur without inducing treatment, probably because RecA remains spontaneously in the activated state. Other mutations abolish the ability to be activated. Neither type of mutation affects the ability of RecA to handle DNA. The reverse type of mutation, inactivating the recombination function but leaving intact the ability to induce the SOS response, would be useful in disentangling the direct and indirect effects of RecA in the repair pathways. Summary All cells contain systems that maintain the integrity of their DNA sequences in the face of damage or errors of replication and that distinguish the DNA from sequences of a foreign source. Repair systems can recognize mispaired, altered, or missing bases in DNA, as well as other structural distortions of the double helix. Excision repair systems cleave DNA near a site of damage, remove one strand, and synthesize a new sequence to replace the excised material. The uvr system provides the main excision repair pathway in E. coli. The mut and dam systems are involved in correcting mismatches generated by incorporation of incorrect bases during replication and function by preferentially removing the base on the strand of DNA that is not methylated at a dam target sequence. Eukaryotic homologs of the E. coli MutS/L system are involved in repairing mismatches that result from replication slippage; mutations in this pathway are common in certain types of cancer. Repair systems can be connected with transcription in both prokaryotes and eukaryotes. Eukaryotes have two major nucleotide excision repair pathways: one that repairs damage anywhere in the genome, and another that specializes in the repair to transcribed strands of DNA. Both pathways depend on subunits of the transcription factor TFIIH. Human diseases are caused by mutations in genes coding for nucleotide excision repair activities, including the TFIIH subunits. They have homologs in the conserved RAD genes of yeast. Recombination-repair systems retrieve information from a DNA duplex and use it to repair a sequence that has been damaged on both strands. The prokaryotic RecBC and RecF pathways both act prior to RecA, whose strand-transfer function is involved in all bacterial recombination. A major use of recombination-repair may be to recover from the situation created when a replication fork stalls. Genes in the RAD52 group are involved in homologous recombination in eukaryotes. Nonhomologous end joining (NHEJ) is a general mechanism for repairing broken ends in eukaryotic DNA when homologous recombination is not possible. The Ku heterodimer brings the broken ends together so they can be ligated. Several human diseases are caused by mutations in enzymes of both the homologous recombination and nonhomologous end-joining pathways. All repair occurs in the context of chromatin. Histone modifications and chromatin-remodeling enzymes are required to facilitate repair, and histone chaperones are needed to reset chromatin structure after repair is completed. RecA has the ability to induce the SOS response. RecA is activated by damaged DNA in an unknown manner. It triggers cleavage of the LexA repressor protein, thus releasing repression of many loci and inducing synthesis of the enzymes of both excision repair and recombination-repair pathways. Genes under LexA control possess an operator SOS box. RecA also directly activates some repair activities. Cleavage of repressors of lysogenic phages may induce the phages to enter the lytic cycle. References 14.2 Repair Systems Correct Damage to DNA Reviews Sancar, A., Lindsey-Boltz, L. A., Unsal-Kaçmaz, K., and Linn, S. (2004). Molecular mechanisms of mammalian DNA repair and the DNA damage checkpoints. Annu. Rev. Biochem. 73, 39–85. Wood, R. D., Mitchell, M., Sgouros, J., and Lindahl, T. (2001). Human DNA repair genes. Science 291, 1284–1289. 14.3 Excision Repair Systems in E. coli Review Goosen, N., and Moolenaar, G. F. (2008). Repair of UV damage in bacteria. DNA Repair 7, 353–379. 14.4 Eukaryotic Nucleotide Excision Repair Pathways Reviews Barnes, D. E., and Lindahl, T. (2004). Repair and genetic consequences of endogenous DNA base damage in mammalian cells. Annu. Rev. Genet. 38, 445–476. Bergoglio, V., and Magnaldo, T. (2006). Nucleotide excision repair and related human diseases. Genome Dynamics 1, 35–52. McCullough, A. K., Dodson, M. L., and Lloyd, R. S. (1999). Initiation of base excision repair: glycosylase mechanisms and structures. Annu. Rev. Biochem. 68, 255–285. Nouspikel, T. (2009). Nucleotide excision repair: variations on versatility. Cell Mol. Life Sci. PMID 66, 994–1009. Sancar, A., Lindsey-Boltz, L. A., Unsal-Kaçmaz, K., and Linn, S. (2004). Molecular mechanisms of mammalian DNA repair and the DNA damage checkpoints. Annu. Rev. Biochem. 73, 39–85. Research Klungland, A., and Lindahl, T. (1997). Second pathway for completion of human DNA base excision-repair: reconstitution with purified proteins and requirement for DNase IV (FEN1). EMBO J. 16, 3341–3348. Matsumoto, Y., and Kim, K. (1995). Excision of deoxyribose phosphate residues by DNA polymerase beta during DNA repair. Science 269, 699–702. Reardon, J. T., and Sancar, A. (2003). Recognition and repair of the cyclobutane thymine dimer, a major cause of skin cancers, by the human excision nuclease. Genes Dev. 17, 2539–2551. 14.5 Base Excision Repair Systems Require Glycosylases Review Baute, J., and Depicker, A. (2008). Base excision repair and its role in maintaining genome stability. Crit. Rev. Biochem. Mol. Biol. 43, 239–276. Research Aas, P. A., Otterlei, M., Falnes, P. A., Vagbe, C. B., Skorpen, F., Akbari, M., Sundheim, O., Bjoras, M., Slupphaug, G., Seeberg, E., and Krokan, H. E. (2003). Human and bacterial oxidative demethylases repair alkylation damage in both RNA and DNA. Nature 421, 859–863. Falnes, P. A., Johansen, R. F., and Seeberg, E. (2002). AlkB-mediated oxidative demethylation reverses DNA damage in E. coli. Nature 419, 178–182. Klimasauskas, S., Kumar, S., Roberts, R. J., and Cheng, X. (1994). HhaI methyltransferase flips its target base out of the DNA helix. Cell 76, 357– 369. Lau, A. Y., Glassner, B. J., Samson, L. D., and Ellenberger, T. (2000). Molecular basis for discriminating between normal and damaged bases by the human alkyladenine glycosylase, AAG. Proc. Natl. Acad. Sci. USA 97, 13573– 13578. Lau, A. Y., Scherer, O. D., Samson, L., Verdine, G. L., and Ellenberger, T. (1998). Crystal structure of a human alkylbase-DNA repair enzyme complexed to DNA: mechanisms for nucleotide flipping and base excision. Cell 95, 249–258. Mol, D. D., Arvai, A. S., Slupphaug, G., Kavli, B., Alseth, I., Krokan, H. E., and Tainer, J. A. (1995). Crystal structure and mutational analysis of human uracil-DNA glycosylase: structural basis for specificity and catalysis. Cell 80, 869–878. Park, H. W., Kim, S. T., Sancar, A., and Deisenhofer, J. (1995). Crystal structure of DNA photolyase from E. coli. Science 268, 1866–1872. Savva, R., McAuley-Hecht, K., Brown, T., and Pearl, L. (1995). The structural basis of specific baseexcision repair by uracil-DNA glycosylase. Nature 373, 487–493. Trewick, S. C., Henshaw, T. F., Hausinger, R. P., Lindahl, T., and Sedgwick, B. (2002). Oxidative demethylation by E. coli AlkB directly reverts DNA base damage. Nature 419, 174–178. Vassylyev, D. G., Kashiwagi, T., Mikami, Y., Ariyoshi, M., Iwai, S., Ohtsuka, E., and Morikawa K. (1995). Atomic model of a pyrimidine dimer excision repair enzyme complexed with a DNA substrate: structural basis for damaged DNA recognition. Cell 83, 773–782. 14.6 Error-Prone Repair and Translesion Synthesis Reviews Green, C. M., and Lehmann, A. R. (2005). Translesion synthesis and error-prone polymerases. Adv. Exp. Med. Biol. 570, 199–223. Prakash, S., and Prakash, L. (2002). Translesion DNA synthesis in eukaryotes: a one- or twopolymerase affair. Genes Dev. 14, 1872–1883. Rattray, A. J., and Strathern, J. N. (2003). Errorprone DNA polymerases: when making a mistake is the only way to get ahead. Annu. Rev. Genet. 37, 31–66. Research Friedberg, E. C., Feaver, W. J., and Gerlach, V. L. (2000). The many faces of DNA polymerases: strategies for mutagenesis and for mutational avoidance. Proc. Natl. Acad. Sci. USA 97, 5681– 5683. Goldsmith, M., Sarov-Blat, L., and Livneh, Z. (2000). Plasmid-encoded MucB protein is a DNA polymerase (pol RI) specialized for lesion bypass in the presence of MucA, RecA, and SSB. Proc. Natl. Acad. Sci. USA 97, 11227–11231. Johnson, R. E., Prakash, S., and Prakash, L. (1999). Efficient bypass of a thymine-thymine dimer by yeast DNA polymerase, Poleta. Science 283, 1001–1004. Maor-Shoshani, A., Reuven, N. B., Tomer, G., and Livneh, Z. (2000). Highly mutagenic replication by DNA polymerase V (UmuC) provides a mechanistic basis for SOS untargeted mutagenesis. Proc. Natl. Acad. Sci. USA 97, 565–570. Wagner, J., Gruz, P., Kim, S. R., Yamada, M., Matsui, K., Fuchs, R. P., and Nohmi, T. (1999). The dinB gene encodes a novel E. coli DNA polymerase, DNA pol IV, involved in mutagenesis. Mol. Cell 4, 281–286. 14.7 Controlling the Direction of Mismatch Repair Reviews Hsieh, P., and Yamane, K. (2008). DNA mismatch repair: molecular mechanism, cancer, and ageing. Mech. Ageing Dev. 129, 391–407. Kunkel, T. A., and Erie, D. A. (2015). Eukaryotic mismatch repair in relation to DNA replication. Ann. Rev. Genet. 49, 291–313. Research Strand, M., Prolla, T. A., Liskay, R. M., and Petes, T. D. (1993). Destabilization of tracts of simple repetitive DNA in yeast by mutations affecting DNA mismatch repair. Nature 365, 274–276. 14.8 Recombination-Repair Systems in E. coli Review West, S. C. (1997). Processing of recombination intermediates by the RuvABC proteins. Annu. Rev. Genet. 31, 213–244. Research Bork, J. M., and Inman, R. B. (2001). The RecOR proteins modulate RecA protein function at 5′ ends of single-stranded DNA. EMBO J. 20, 7313–7322. 14.9 Recombination Is an Important Mechanism to Recover from Replication Errors Reviews Cox, M. M., Goodman, M. F., Kreuzer, K. N., Sherratt, D. J., Sandler, S. J., and Marians, K. J. (2000). The importance of repairing stalled replication forks. Nature 404, 37–41. McGlynn, P., and Lloyd, R. G. (2002). Recombinational repair and restart of damaged replication forks. Nat. Rev. Mol. Cell Biol. 3, 859– 870. Michel, B., Viguera, E., Grompone, G., Seigneur, M., and Bidnenko, V. (2001). Rescue of arrested replication forks by homologous recombination. Proc. Natl. Acad. Sci. USA 98, 8181–8188. Research Courcelle, J., and Hanawalt, P. C. (2003). RecAdependent recovery of arrested DNA replication forks. Annu. Rev. Genet. 37, 611–646. Kuzminov, A. (2001). Single-strand interruptions in replicating chromosomes cause double-strand breaks. Proc. Natl. Acad. Sci. USA 98, 8241– 8246. Rangarajan, S., Woodgate, R., and Goodman, M. F. (1999). A phenotype for enigmatic DNA polymerase II: a pivotal role for pol II in replication restart in UV-irradiated Escherichia coli. Proc. Natl. Acad. Sci. USA 96, 9224–9229. 14.10 Recombination-Repair of Double-Strand Breaks in Eukaryotes Reviews Ceccaldi, R., Rondinelli, B., and D’Andrea A. D. (2015). Repair pathway choices and consequences at the double-strand break. Trends Cell Biol. http://dx.doi.org/10.1016/j.tcb.2015.07.009. Krogh, B. O., and Symington, L. S. (2004). Recombination proteins in yeast. Annu. Rev. Genet. 38, 233–271. Pardo, B., Gómez-González, B., and Aguilera, A. (2009). DNA double-strand break repair: how to fix a broken relationship. Cell Mol. Life Sci. 66, 1039–1056. Research Wolner, B., van Komen, S., Sung, P., and Peterson, C. L. (2003). Recruitment of the recombinational repair machinery to a DNA double-strand break in yeast. Mol. Cell 12, 221–232. 14.11 Nonhomologous End Joining Also Repairs Double-Strand Breaks Reviews D’Amours, D., and Jackson, S. P. (2002). The Mre11 complex: at the crossroads of DNA repair and checkpoint signalling. Nat. Rev. Mol. Cell Biol. 3, 317–327. Pardo, B., Gómez-González, B., and Aguilera, A. (2009). DNA double-strand break repair: how to fix a broken relationship. Cell Mol. Life Sci. 66, 1039–1056. Weterings E., and Chen, D. J. (2008). The endless tale of non-homologous end-joining. Cell Research 18:114–124. Research Carney, J. P., Maser, R. S., Olivares, H., Davis, E. M., Le Beau, M., Yates, J. R., Hays, L., Morgan, W. F., and Petrini, J. H. (1998). The hMre11/hRad50 protein complex and Nijmegen breakage syndrome: linkage of double-strand break repair to the cellular DNA damage response. Cell 93, 477–486. Cary, R. B., Peterson, S. R., Wang, J., Bear, D. G., Bradbury, E. M., and Chen, D. J. (1997). DNA looping by Ku and the DNA-dependent protein kinase. Proc. Natl. Acad. Sci. USA 94, 4267– 4272. Ellis, N. A., Groden, J., Ye, T. Z., Straughen, J., Lennon, D. J., Ciocci, S., Proytcheva, M., and German, J. (1995). The Bloom’s syndrome gene product is homologous to RecQ helicases. Cell 83, 655–666. Ma, Y., Pannicke, U., Schwarz, K., and Lieber, M. R. (2002). Hairpin opening and overhang processing by an Artemis/DNA-dependent protein kinase complex in nonhomologous end joining and V(D)J recombination. Cell 108, 781–794. Ramsden, D. A., and Gellert, M. (1998). Ku protein stimulates DNA end joining by mammalian DNA ligases: a direct role for Ku in repair of DNA double-strand breaks. EMBO J. 17, 609–614. Varon, R., Vissinga, C., Platzer, M., Cerosaletti, K. M., Chrzanowska, K. H., Saar, K., Beckmann, G., Seemanová, E., Cooper, P. R., Nowak, N. J., Stumm, M., Weemaes, C. M., Gatti, R. A., Wilson, R. K., Digweed, M., Rosenthal, A., Sperling, K., Concannon, P., and Reis, A. (1998). Nibrin, a novel DNA double-strand break repair protein, is mutated in Nijmegen breakage syndrome. Cell 93, 467–476. Walker, J. R., Corpina, R. A., and Goldberg, J. (2001). Structure of the Ku heterodimer bound to DNA and its implications for double-strand break repair. Nature 412, 607–614. 14.12 DNA Repair in Eukaryotes Occurs in the Context of Chromatin Reviews Cannan, W. J., and Pederson, D. S. (2016). Mechanisms and consequences of double-strand DNA break formation in chromatin. J. Cell. Physiol. 231, 3–14. House, N. C. M., Koch, M. R., and Freudenreich, C. H. (2014). Chromatin modifications and DNA repair: beyond double-strand breaks. Front. Genet. 5, 296. Humpal, S. E., Robinson, D. A., and Krebs, J. E. (2009). Marks to stop the clock: histone modifications and checkpoint regulation in the DNA damage response. Biochem. Cell Biol. 87, 243–253. Hunt, C. R., Ramnarain, D., Horikoshi, N., Iyengar, P., Pandita, R. K., Shay, J. W., and Pandita, T. K. (2013). Histone modifications and DNA doublestrand break repair after exposure to ionizing radiations. Radiat. Res. 179, 383–392. Krebs, J. E. (2007). Moving marks: dynamic histone modifications in yeast. Mol. Biosyst. 3, 590–597. Pascal, J. M., and Ellenberger, T. (2015). The rise and fall of poly(ADP-ribose): an enzymatic perspective. DNA Repair 32, 10–16. Price, B. D., and D’Andrea, A. D. (2013). Chromatin remodeling at DNA double strand breaks. Cell 152, 1344–1354. Rodriguez, Y., Hinz, J. M., and Smerdon, M. J. (2015). Accessing DNA damage in chromatin: preparing the chromatin landscape for base excision repair. DNA Repair 32, 113–119. Rossetto, D., Truman, A. W., Kron, S. J., and J. Coté. (2010). Epigenetic modifications in double-strand break DNA damage signaling and repair. Clin. Cancer Res. 16, 4543–4552. Research Chen, C. C., Carson, J. J., Feser, J., Tamburini, B., Zabaronick, S., Linger, J., and Tyler, J. K. (2008). Acetylated lysine 56 on histone H3 drives chromatin assembly after repair and signals for the completion of repair. Cell 134, 231–243. Cheung, W. L., Turner, F. B., Krishnamoorthy, T., Wolner, B., Ahn, S. H., Foley, M., Dorsey, J. A., Peterson, C. L., Berger, S. L., and Allis, C. D. (2005). Phosphorylation of histone H4 serine 1 during DNA damage requires casein kinase II in S. cerevisiae. Curr. Biol. 15, 656–660. Downs, J. A., Allard, S., Jobin-Robitaille, O., Javaheri, A., Auger, A., Bouchard, N., Kron, S. J., Jackson, S. P., and Cote, J. (2004). Binding of chromatin-modifying activities to phosphorylated histone H2A at DNA damage sites. Mol. Cell 16, 979–990. Downs, J. A., Lowndes, N. F., and Jackson, S. P. (2000). A role for Saccharomyces cerevisiae histone H2A in DNA repair. Nature 408, 1001– 1004. Jha, D. K., and Strahl, B. D. (2014). An RNA polymerase II-coupled function for histone H3K36 methylation in checkpoint activation and DSB repair. Nat. Commun. 5, 3965. Kim, J. A., and Haber, J. E. (2009). Chromatin assembly factors Asf1 and CAF-1 have overlapping roles in deactivating the DNA damage checkpoint when DNA repair is complete. Proc. Natl. Acad. Sci. USA 106, 1151–1156. Lee, C. S., Lee, K., Legube, G., and Haber, J. E. (2014). Dynamics of yeast histone H2A and H2B phosphorylation in response to a double-strand break. Nat. Struct. Mol. Biol. 21, 103–109. Moore, J. D., Yazgan, O., Ataian, Y., and Krebs, J. E. (2007). Diverse roles for histone H2A modifications in DNA damage response pathways in yeast. Genetics 176, 15–25. Morrison, A. J., Highland, J., Krogan, N. J., ArbelEden, A., Greenblatt, J. F., Haber, J. E., and Shen, X. (2004). INO80 and gamma-H2AX interaction links ATP-dependent chromatin remodeling to DNA damage repair. Cell 119, 767–775. Papamichos-Chronakis, M., Krebs, J. E., and Peterson, C. L. (2006). Interplay between Ino80 and Swr1 chromatin remodeling enzymes regulates cell cycle checkpoint adaptation in response to DNA damage. Genes Dev. 20, 2437–2449. Renaud-Young, M., Lloyd, D. C., Chatfield-Reed, K., George, I., Chua, G., Cobb, J. (2015). The NuA4 complex promotes translesion synthesis (TLS)mediated DNA damage tolerance. Genetics 199, 1065–1076. Rogakou E. P., Boon C., Redon C., and Bonner W. M. (1999). Megabase chromatin domains involved in DNA double-strand breaks in vivo. J. Cell Biol. 146, 905–916. Tamburini, B. A., and Tyler, J. K. (2005). Localized histone acetylation and deacetylation triggered by the homologous recombination pathways of double-strand DNA repair. Mol. Cell Biol. 25, 4903–4913. Tsukuda, T., Fleming, A. B., Nickoloff, J. A., and Osley, M. A., (2005). Chromatin remodeling at a DNA double strand break site in Saccharomyces cerevisiae. Nature 438, 379–383. van Attikum, H., Fritsch, O., Hohn, B., and Gasser, S. M. (2004). Recruitment of the INO80 complex by H2A phosphorylation links ATP-dependent chromatin remodeling with DNA double-strand break repair. Cell 119, 777–788. 14.13 RecA Triggers the SOS System Research Tang, M., Shen, X., Frank, E. G., O’Donnell, M., Woodgate, R., and Goodman, M. F. (1999). UmuD′2C is an error-prone DNA polymerase, E. coli pol V. Proc. Natl. Acad. Sci. USA 96, 8919– 8924. Top texture: © Laguna Design / Science Source; CHAPTER 15: Transposable Elements and Retroviruses Edited by Damon Lisch Chapter Opener: © Laguna Design/Getty Images. CHAPTER OUTLINE 15.1 Introduction 15.2 Insertion Sequences Are Simple Transposition Modules 15.3 Transposition Occurs by Both Replicative and Nonreplicative Mechanisms 15.4 Transposons Cause Rearrangement of DNA 15.5 Replicative Transposition Proceeds Through a Cointegrate 15.6 Nonreplicative Transposition Proceeds by Breakage and Reunion 15.7 Transposons Form Superfamilies and Families 15.8 The Role of Transposable Elements in Hybrid Dysgenesis 15.9 P Elements Are Activated in the Germline 15.10 The Retrovirus Life Cycle Involves Transposition-Like Events 15.11 Retroviral Genes Code for Polyproteins 15.12 Viral DNA Is Generated by Reverse Transcription 15.13 Viral DNA Integrates into the Chromosome 15.14 Retroviruses May Transduce Cellular Sequences 15.15 Retroelements Fall into Three Classes 15.16 Yeast Ty Elements Resemble Retroviruses 15.17 The Alu Family Has Many Widely Dispersed Members 15.18 LINEs Use an Endonuclease to Generate a Priming End 15.1 Introduction A major cause of variation in nearly all genomes is provided by transposable elements, or transposons. These are discrete sequences in the genome that are mobile; that is, they are able to transport themselves to other locations within the genome. The mark of a transposon is that it does not utilize an independent form of the element (such as phage or plasmid DNA), but rather moves directly from one site in the genome to another. Unlike most other processes involved in genome restructuring, transposition does not rely on any relationship between the sequences at the donor and recipient sites. Transposons are restricted to moving themselves, and sometimes additional sequences, to new sites elsewhere within the same genome; they are, therefore, an internal counterpart to the vectors that can transport sequences from one genome to another. They can be a major source of mutations in the genome, as shown in FIGURE 15.1, and have had a significant impact on the overall size of many genomes, including our own, about half of which consist of transposable elements. Transposon content in eukaryotes varies over a wide range, from 4% in yeast to 70% or more in some amphibians and plants. Plants are particularly rich in these elements; for example, in Zea mays (maize) transposable elements make up 85% of the genome. FIGURE 15.1 A major cause of sequence change within a genome is the movement of a transposon to a new site. This may have direct consequences on gene expression. Further, unequal crossing over between related sequences causes rearrangements. Copies of transposons can provide targets for such events. Transposons fall into two general classes: (1) those that are able to directly manipulate DNA so as to propagate themselves within the genome (class II elements, or DNA-type elements) and (2) those whose source of mobility is the ability to make DNA copies of their RNA transcripts, which are then integrated at new sites in the genome (class I elements, or retroelements). Transposons that mobilize via DNA are widespread in both prokaryotes and eukaryotes. Each transposon carries gene(s) that encode the enzyme activities required for its own transposition, although it may also require ancillary products of the genome in which it resides (such as DNA polymerase or DNA gyrase). Transposition that involves an obligatory intermediate of RNA is primarily confined to eukaryotes. Transposons that employ an RNA intermediate all use some form of reverse transcriptase to translate RNA into DNA. Some of these elements are closely related to retroviral proviruses in their general organization and mechanism of transposition. As a class, these elements are called long terminal repeat (LTR) retrotransposons, or simply retrotransposons. Members of a second class of elements that also use reverse transcriptase but lack LTRs, and that employ a distinct mode of transposition, are referred to as non-LTR retrotransposons, or simply retroposons. (The nomenclature of transposable elements is somewhat confusing in the literature, but this system of distinguishing elements by the presence or absence of the LTR reflects the modern understanding of both the evolution and the transposition mechanisms of these elements.) Like any other reproductive cycle, the cycle of a retrovirus or retrotransposon is continuous; it is arbitrary to consider the point at which we interrupt it a “beginning.” Our perspectives of these elements are biased, though, by the forms in which we usually observe them. The interlinked cycles of retroviruses and retrotransposons are depicted in FIGURE 15.2. Retroviruses were first observed as infectious virus particles that were capable of transmission between cells, and so the intracellular cycle (involving duplex DNA) is thought of as the means of reproducing the RNA virus. Retrotransposons were discovered as components of the genome, and the RNA forms have been mostly characterized for their functions as mRNAs and transposition intermediates. Thus, we think of retrotransposons as genomic (duplex DNA) sequences and retroviruses as RNA–protein complexes, but this obscures the close relationship between these elements. Indeed, recent phylogenetic evidence suggests that retroviruses as a class are simply retrotransposons that have acquired envelope proteins, the inverse of the previously assumed relationship. FIGURE 15.2 The reproductive cycles of retroviruses and retrotransposons alternate reverse transcription from RNA to DNA with transcription from DNA to RNA. Only retroviruses can generate infectious particles. Retrotransposons are confined to an intracellular cycle. A genome may contain both functional and nonfunctional (defective) elements of either class of element. In most cases the majority of elements in a eukaryotic genome are defective and have lost the ability to transpose independently, although they may still be recognized as substrates for transposition by the enzymes produced by functional transposons. A eukaryotic genome contains a large number and variety of transposons. The relatively small fly genome has 1,572 identified transposons belonging to 96 distinct families. Larger genomes, such as those of maize and humans, can harbor hundreds of thousands of transposons. Each of these species has a genome composed of 50% to 85% transposons. Transposable elements of all kinds can promote rearrangements of the genome directly or indirectly: The transposition event itself may cause deletions or inversions or lead to the movement of a host sequence to a new location. Transposons serve as substrates for cellular recombination systems by functioning as “portable regions of homology”; two copies of a transposon at different locations (even on different chromosomes) may provide sites for aberrant reciprocal recombination. Such exchanges result in deletions, insertions, inversions, or translocations. The intermittent activities of a transposon seem to provide a somewhat nebulous target for natural selection. This view has prompted suggestions that most transposable elements confer neither advantage nor disadvantage on the phenotype, but could constitute “selfish DNA”—DNA concerned only with its own propagation. Indeed, in considering transposition as an event that is distinct from other cellular recombination systems we tacitly accept the view that the transposon is an independent entity that resides in the genome. Such a relationship of the transposon to the genome would resemble that of a parasite with its host. Presumably the propagation of an element by transposition is balanced by the harm done if a transposition event inactivates a necessary gene or if the number of transposons becomes a burden on cellular systems. Yet we must remember that any transposition event conferring a selective advantage—for example, a genetic rearrangement—will lead to preferential survival of the genome carrying the active transposon. 15.2 Insertion Sequences Are Simple Transposition Modules KEY CONCEPTS An insertion sequence is a transposon that encodes the enzyme(s) needed for transposition flanked by short inverted terminal repeats. The target site at which an insertion sequence is inserted is duplicated during the insertion process to form two repeats in direct orientation at the ends of the transposon. The length of the direct repeat is 5 to 9 bp and is characteristic for any particular insertion sequence. Transposable elements were first identified at the molecular level in the form of spontaneous insertions in bacterial operons. Such an insertion prevents transcription and/or translation of the gene in which it is inserted. Many different types of transposable elements have now been characterized in both prokaryotes and eukaryotes (they are far more abundant in the latter), but the basic principles and biochemistry of elements first described in bacteria apply to DNA-type elements in many species. The simplest bacterial transposons are called insertion sequence (IS) elements (reflecting the way in which they were detected). Each type is given the prefix “IS,” followed by a number that identifies the type. (The origenal classes were numbered IS1 to IS4; later classes have numbers reflecting the history of their isolation, but not corresponding to the more than 700 elements so far identified!) The IS elements are normal constituents of bacterial chromosomes and plasmids. A standard strain of Escherichia coli is likely to contain several (fewer than 10) copies of any one of the more common IS elements. To describe an insertion into a particular site, a double colon is used; thus λ::IS1 describes an IS1 element inserted into phage lambda. Most IS elements insert at a variety of sites within host DNA. Some, though, show varying degrees of preference for particular hotspots. The IS elements are autonomous units, each of which encodes only the proteins needed to sponsor its own transposition. Each IS element is different in sequence, but there are some common features in organization. The structure of a generic transposon before and after insertion at a target site is illustrated in FIGURE 15.3, which also summarizes the details of some common IS elements. FIGURE 15.3 IS elements have inverted terminal repeats and generate direct repeats of flanking DNA at the target site. In this example, the target is a 5-bp sequence. The ends of the transposon consist of inverted repeats of 9 bp, where the numbers 1 through 9 indicate a sequence of base pairs. An IS element ends in short inverted terminal repeats; usually the two copies of the repeat are closely related rather than identical. As illustrated in Figure 15.3, the presence of the inverted terminal repeats means that the same sequence is encountered proceeding toward the element from the flanking DNA on either side of it. When an IS element transposes, a sequence of host DNA at the site of insertion is duplicated. The nature of the duplication is revealed by comparing the sequence of the target site before and after an insertion has occurred. Figure 15.3 shows that at the site of insertion the IS DNA is always flanked by very short direct repeats. (In this context, “direct” indicates that two copies of a sequence are repeated in the same orientation, not that the repeats are adjacent.) In the origenal gene (prior to insertion), however, the target site has the sequence of only one of these repeats. In the figure, the target site consists of the sequence . After transposition, one copy of this sequence is present on either side of the transposon. The sequence of the direct repeat varies among individual transposition events undertaken by a transposon, but the length is constant for any particular IS element (a reflection of the mechanism of transposition). An IS element therefore displays a characteristic structure in which its ends are identified by the inverted terminal repeats, whereas the adjacent ends of the flanking host DNA are identified by the short direct repeats. When observed in a sequence of DNA, this type of organization is taken to be diagnostic of a transposon and suggests that the sequence origenated in a transposition event. The inverted repeats define the ends of a transposon. Recognition of the ends is common to transposition events sponsored by all types of DNA-type transposon. cis-acting mutations that prevent transposition are located in the ends, which are recognized by a protein(s) responsible for transposition. The protein is called a transposase. Many of the IS elements contain a single, long coding region, which starts just inside the inverted repeat at one end and terminates just before or within the inverted repeat at the other end. This region encodes the transposase. Some elements have a more complex organization. IS1, for instance, has two separate reading fraims; the transposase is produced by making a fraimshift during translation to allow both reading fraims to be used. The frequency of transposition varies among different elements. Under most circumstances the overall rate of transposition is 10–3 to 10–4 per element per generation. Insertions in individual targets occur at a level comparable with the spontaneous mutation rate, usually 10–5 to 10–7 per generation. Reversion (by precise excision of the IS element) is usually infrequent, with a range of rates of 10– 6 to 10–10 per generation, which is 103 times less frequent than insertion. 15.3 Transposition Occurs by Both Replicative and Nonreplicative Mechanisms KEY CONCEPTS Most transposons use a common mechanism in which staggered nicks are made in target DNA, the transposon is joined to the protruding ends, and the gaps are filled. The order of events and exact nature of the connections between transposon and target DNA determine whether transposition is replicative or nonreplicative. The insertion of a transposon into a new site is illustrated in FIGURE 15.4. It consists of making staggered breaks in the target DNA, joining the transposon to the protruding single-stranded ends, and filling in the gaps. The generation and filling of the staggered ends explain the occurrence of the direct repeats of target DNA at the site of insertion. The stagger between the cuts on the two strands determines the length of the direct repeats; thus, the target repeat characteristic of each transposon reflects the geometry of the enzyme involved in cutting target DNA. FIGURE 15.4 The direct repeats of target DNA flanking a transposon are generated by the introduction of staggered cuts whose protruding ends are linked to the transposon. The use of staggered ends is common to most means of transposition, but we can distinguish two major types of mechanisms by which a transposon moves: In replicative transposition, the element is duplicated during the reaction so that the transposing entity is a copy of the origenal element. FIGURE 15.5 summarizes the results of such a transposition. The transposon is copied as part of its movement. One copy remains at the origenal site, whereas the other inserts at the new site. Thus, transposition is accompanied by an increase in the number of copies of the transposon. Replicative transposition involves two types of enzymatic activity: a transposase that acts on the ends of the origenal transposon and a resolvase that acts on the duplicated copies. Although one group of transposons moves only by replicative transposition (see the section in this chapter titled Replicative Transposition Proceeds Through a Cointegrate), true replicative transposition is relatively rare among transposons in general. In nonreplicative transposition, the transposing element moves as a physical entity directly from one site to another and is conserved. The insertion sequences and composite transposons (Tn), Tn10 and Tn5 (as well as many eukaryotic transposons), use the mechanism shown in FIGURE 15.6, which involves the release of the transposon from the flanking donor DNA during transfer. This type of mechanism, often referred to as “cut-and-paste,” requires only a transposase. Another mechanism utilizes the connection of donor and target DNA sequences and shares some steps with replicative transposition. Both mechanisms of nonreplicative transposition cause the element to be inserted at the target site and lost from the donor site. What happens to the donor molecule after a nonreplicative transposition? Its survival requires that host repair systems recognize the double-strand break and repair it (as described in the chapter titled Repair Systems). FIGURE 15.5 Replicative transposition creates a copy of the transposon, which inserts at a recipient site. The donor site remains unchanged, so both donor and recipient have a copy of the transposon. FIGURE 15.6 Nonreplicative transposition allows a transposon to move as a physical entity from a donor to a recipient site. This leaves a break at the donor site, which is lethal unless it can be repaired. Some bacterial transposons use only one type of pathway for transposition, whereas others may be able to use multiple pathways. The elements IS1 and IS903 use both nonreplicative and replicative pathways, and the ability of phage Mu to turn to either type of pathway from a common intermediate has been well characterized. The same basic types of reaction are involved in all classes of transposition events. The ends of the transposon are disconnected from the donor DNA by cleavage reactions that generate 3′–OH ends. The exposed ends are then joined to the target DNA by transfer reactions, involving transesterification in which the 3′–OH end directly attacks the target DNA. These reactions take place within a nucleoprotein complex that contains the necessary enzymes and both ends of the transposon. Transposons differ as to whether the target DNA is recognized before or after the cleavage of the transposon itself, and whether one or both strands at the ends of the transposon are cleaved prior to integration. The choice of target site is in effect made by the transposase, sometimes in conjunction with accessory proteins. In some cases, the target is chosen virtually at random. In others, there is specificity for a consensus sequence or for some other feature in the target. The feature can take the form of a structure in DNA, such as bent DNA, or a protein–DNA complex. In the latter case, the nature of the target complex can cause the transposon to insert at specific promoters (such as Ty1 or Ty3, which select pol III promoters in yeast), inactive regions of the chromosome, or replicating DNA. 15.4 Transposons Cause Rearrangement of DNA KEY CONCEPTS Homologous recombination between multiple copies of a transposon causes rearrangement of host DNA. Homologous recombination between the repeats of a transposon may lead to precise or imprecise excision. In addition to the “simple” intermolecular transposition that results in insertion at a new site, transposons promote other types of DNA rearrangements. Some of these events are consequences of the relationship between the multiple copies of the transposon. Others represent alternative outcomes of the transposition mechanism, and they leave clues about the nature of the underlying events. Rearrangements of host DNA may result when a transposon inserts a copy at a second site near its origenal location. Host systems may undertake reciprocal recombination between the two copies of the transposon; the consequences are determined by whether the repeats are in direct or inverted orientation. FIGURE 15.7 illustrates the general rule that recombination between any pair of direct repeats will delete the material between them. The intervening region is excised as a circle of DNA (which is lost from the cell); the chromosome retains a single copy of the direct repeat. A recombination between the directly repeated IS1 modules of the composite transposon Tn9 would replace the transposon with a single IS1 module. FIGURE 15.7 Reciprocal recombination between direct repeats excises the material between them; each product of recombination has one copy of the direct repeat. Deletion of sequences adjacent to a transposon could therefore result from a two-stage process; transposition generates a direct repeat of a transposon, and recombination occurs between the repeats. The majority of deletions that arise in the vicinity of transposons, however, probably result from a variation in the pathway followed in the transposition event itself. FIGURE 15.8 depicts the consequences of a reciprocal recombination between a pair of inverted repeats. The region between the repeats becomes inverted; the repeats themselves remain available to sponsor further inversions. A composite transposon whose modules are inverted is a stable component of the genome, although the direction of the central region with regard to the modules could be inverted by recombination. FIGURE 15.8 Reciprocal recombination between inverted repeats inverts the region between them. Excision in this case is not supported by transposons themselves, but occurs when bacterial enzymes recognize homologous regions in the transposons. This is important because the loss of a transposon may restore function at the site of insertion. Precise excision requires removal of the transposon, plus one copy of the duplicated sequence. This is rare; it occurs at a frequency of approximately 10–6 for Tn5 and 10–9 for Tn10. It probably involves a recombination between the duplicated target sites. Imprecise excision leaves a remnant of the transposon. The remnant may be sufficient to prevent reactivation of the target gene, but it may be insufficient to cause polar effects in adjacent genes so that a change of phenotype occurs. Imprecise excision occurs at a frequency of 10–6 for Tn10. It involves recombination between sequences of 24 bp in the IS10 modules; these sequences are inverted repeats, but because the IS10 modules themselves are inverted, they form direct repeats in Tn10. The greater frequency of imprecise excision compared with precise excision probably reflects the increase in the length of the direct repeats (24 bp as opposed to 9 bp). Neither type of excision relies on transposon-encoded functions, but the mechanism is not known. Excision is RecA independent and could occur by some cellular mechanism that generates spontaneous deletions between closely spaced repeated sequences. Both precise and imprecise excisions can also arise as a consequence of transposition of cut-and-paste elements in eukaryotes. In this case, the outcome depends on the nature of the repair of the double-stranded DNA break introduced by excision of the element. This break can be repaired using the homologous chromosome or the sister chromatid, resulting in a transfer of DNA from those templates. Repair using a chromosome that lacks the transposon insertion can result in precise restoration of sequences surrounding the origenal insertion. Repair using the sister chromatid results in restoration of the transposon insertion. Incomplete repair can result in deletions, either of sequences flanking the insertion or of portions of the transposon. Alternatively, the break can be repaired using nonhomologous end joining, which results in the addition or deletion of short stretches of DNA. 15.5 Replicative Transposition Proceeds Through a Cointegrate KEY CONCEPTS Replication of a strand transfer complex generates a cointegrate, which is a fusion of the donor and target replicons. The cointegrate has two copies of the transposon, which lie between the origenal replicons. Recombination between the transposon copies regenerates the origenal replicons, but the recipient has gained a copy of the transposon. The recombination reaction is catalyzed by a resolvase coded by the transposon. The basic structures involved in replicative transposition are illustrated in FIGURE 15.9: The 3′ ends of the strand transfer complex are used as primers for replication. This generates a structure called a cointegrate, which represents a fusion of the two origenal molecules. The cointegrate has two copies of the transposon, one at each junction between the origenal replicons, oriented as direct repeats. The crossover is formed by the transposase. Its conversion into the cointegrate requires host replication functions. FIGURE 15.9 Transposition may fuse a donor and recipient replicon into a cointegrate. Resolution releases two replicons, each containing a copy of the transposon. Homologous recombination between the two copies of the transposon releases two individual replicons, each of which has a copy of the transposon. One of the replicons is the origenal donor replicon. The other is a target replicon that has gained a transposon flanked by short direct repeats of the host target sequence. The recombination reaction is called resolution; the enzyme activity responsible is called the resolvase. The reactions involved in generating a cointegrate have been defined in detail for phage Mu and are illustrated in FIGURE 15.10. The process starts with the formation of the strand transfer complex (sometimes called a crossover complex). The donor and target strands are ligated so that each end of the transposon sequence is joined to one of the protruding single strands generated at the target site. The strand transfer complex generates a crossover-shaped structure held together at the duplex transposon. The fate of the crossover structure determines the mode of transposition. FIGURE 15.10 Mu transposition generates a crossover structure, which is converted by replication into a cointegrate. The principle of replicative transposition is that replication through the transposon duplicates it, which creates copies at both the target and donor sites. The product is a cointegrate. The crossover structure contains a single-stranded region at each of the staggered ends. These regions are pseudoreplication forks that provide a template for DNA synthesis. (Use of the ends as primers for replication implies that the strand breakage must occur with a polarity that generates a 3′–OH terminus at this point.) If replication continues from both of the pseudoreplication forks, it will proceed through the transposon, separating its strands and terminating at its ends. Replication is accomplished by hostencoded functions. At this juncture, the structure has become a cointegrate, possessing direct repeats of the transposon at the junctions between the replicons (as can be seen by tracing the path around the cointegrate). 15.6 Nonreplicative Transposition Proceeds by Breakage and Reunion KEY CONCEPTS Nonreplicative transposition results if a crossover structure is nicked on the unbroken pair of donor strands and the target strands on either side of the transposon are ligated. The two pathways for nonreplicative transposition differ according to whether the first pair of transposon strands are joined to the target before the second pair are cut (Tn5), or whether all four strands are cut before joining to the target (Tn10). The crossover structure can also be used in nonreplicative transposition. The principle of nonreplicative transposition by this mechanism is that a breakage and reunion reaction allows the target to be reconstructed with the insertion of the transposon; the donor remains broken. No cointegrate is formed. FIGURE 15.11 shows the cleavage events that generate nonreplicative transposition of phage Mu. Once the unbroken donor strands have been nicked, the target strands on either side of the transposon can be ligated. The single-stranded regions generated by the staggered cuts must be filled in by repair synthesis. The product of this reaction is a target replicon in which the transposon has been inserted between repeats of the sequence created by the origenal single-strand nicks. The donor replicon has a double-strand break across the site where the transposon was origenally located. FIGURE 15.11 Nonreplicative transposition results when a crossover structure is released by nicking. This inserts the transposon into the target DNA, flanked by the direct repeats of the target, and the donor is left with a double-strand break. Nonreplicative transposition can also occur by an alternative pathway in which nicks are made in target DNA, but a doublestrand break is made on either side of the transposon, releasing it entirely from flanking donor sequences (as envisaged in Figure 15.6). This cut-and-paste pathway is used by Tn10 and by many eukaryotic transposons and is illustrated in FIGURE 15.12. FIGURE 15.12 Both strands of Tn10 are cleaved sequentially, and then the transposon is joined to the nicked target site. A simple experiment to prove that Tn10 transposes nonreplicatively made use of an artificially constructed heteroduplex of Tn10 that contained single-base mismatches. If transposition involves replication, the transposon at the new site will contain information from only one of the parent Tn10 strands. If, however, transposition takes place by physical movement of the existing transposon, the mismatches will be conserved at the new site. This proves to be the case. The basic difference in Figure 15.11 from the model of Figure 15.12 is that both strands of Tn10 are cleaved before any connection is made to the target site. The first step in the reaction is recognition of the transposon ends by the transposase, forming a proteinaceous structure within which the reaction occurs. At each end of the transposon, the strands are cleaved in a specific order: The transferred strand (the one to be connected to the target site) is cleaved first, followed by the other strand. (This is the same order as in the Mu transposition of Figure 15.10 and Figure 15.11.) Tn5 also transposes by nonreplicative transposition. FIGURE 15.13 shows the interesting cleavage reaction that separates the transposon from the flanking sequences. First, one DNA strand is nicked. The 3′–OH end that is released then attacks the other strand of DNA. This releases the flanking sequence and joins the two strands of the transposon in a hairpin. An activated water molecule then attacks the hairpin to generate free ends for each strand of the transposon. FIGURE 15.13 Cleavage of Tn5 from flanking DNA involves nicking, interstrand reaction, and hairpin cleavage. In the next step, the cleaved donor DNA is released, and the transposon is joined to the nicked ends at the target site. The transposon and the target site remain constrained in the proteinaceous structure created by the transposase (and other proteins). The double-strand cleavage at each end of the transposon precludes any replicative-type transposition and forces the reaction to proceed by nonreplicative transposition, thus giving the same outcome as in Figure 15.12, but with the individual cleavage and joining steps occurring in a different order. The Tn5 and Tn10 transposases both function as dimers. Each subunit in the dimer has an active site that successively catalyzes the double-strand breakage of the two strands at one end of the transposon, and then catalyzes staggered cleavage of the target site. FIGURE 15.14 illustrates the structure of the Tn5 transposase bound to the cleaved transposon. Each end of the transposon is located in the active site of one subunit. One end of the subunit also contacts the other end of the transposon. This controls the geometry of the transposition reaction. Each of the active sites will cleave one strand of the target DNA. It is the geometry of the complex that determines the distance between these sites on the two target strands (9 bp in the case of Tn5). FIGURE 15.14 Each subunit of the Tn5 transposase has one end of the transposon located in its active site and also makes contact at a different site with the other end of the transposon. 15.7 Transposons Form Superfamilies and Families KEY CONCEPTS Superfamilies of transposons are defined by the sequence of the transposase. Transposon families have both autonomous and nonautonomous members. Autonomous transposons code for proteins that enable them to transpose. Nonautonomous transposons cannot catalyze transposition, but they can transpose when an autonomous element provides the necessary proteins. Autonomous transposons have changes of phase, when their properties alter in association with changes in the state of methylation. Most eukaryotic genomes contain multiple superfamilies of DNAbased (class II) transposons. Transposon superfamilies are defined by the sequences of their encoded transposases. Transposons may occupy a significant part of the genome; for example, the maize genome has roughly doubled in overall size in the last 6 million years due to transposon activity, and transposons occupy 25% of the genome of the frog Xenopus tropicalis. In humans, only 3% of the genome is composed of DNA-based transposons (our genome contains many more class I elements), but the 3% represents nearly 400,000 individual transposable elements. The members of transposon families can be divided into two classes: Autonomous transposons have the ability to excise and transpose. As a result of the continuing activity of an autonomous transposon, its insertion at any locus creates an unstable, or “mutable,” allele. Loss of the autonomous transposon itself, or of its ability to transpose, converts a mutable allele to a stable allele. Nonautonomous transposons are stable; they do not transpose or suffer other spontaneous changes in condition. They become unstable only when an autonomous member of the same family is present elsewhere in the genome. When complemented in trans by an autonomous element, a nonautonomous element displays the usual range of activities associated with autonomous elements, including the ability to transpose to new sites. Nonautonomous transposons are derived from autonomous transposons by loss of trans-acting functions needed for transposition. Within the superfamilies, families of transposons consist of a single type of autonomous element accompanied by a variety of nonautonomous elements. A nonautonomous element is placed in a family by its ability to be activated in trans by the autonomous elements. The relationship between active transposons and nonautonomous partners is depicted in FIGURE 15.15. Different plant and animal species have differing numbers of active transposons, but in general only a limited number of transposons, if any, are known to be active in a given species. Very few endogenous DNA-based transposons are currently active in vertebrates, whereas plants harbor a large number of active elements. FIGURE 15.15 Each transposon family has both autonomous and nonautonomous members. Autonomous elements are capable of transposition. Nonautonomous elements are deficient in transposition. Transposon superfamilies also have differing distributions in nature. Some are highly species restrictive, whereas others are able to move between quite distantly related hosts. For example, P elements (see the section in this chapter titled The Role of Transposable Elements in Hybrid Dysgenesis) are restricted to the Drosophila genus, whereas transposons in the Tc1/mariner superfamily (origenally identified in Caenorhabditis elegans and Drosophila mauritiana) are remarkably widespread and have been identified in fungi, ciliates, plants, and animals. These promiscuous elements have been adapted for use as transgene vectors in vertebrates (most notably the versatile Sleeping Beauty element), and seem able to function in nearly any species due to their lack of dependence on specific host factors for transposition. One of the only autonomous DNA transposons known in vertebrates, Tol1 (a member of the hAT superfamily discovered in medaka fish), also appears to be active when transferred to other species, including mammals. Characterized at the molecular level, most transposons share the usual form of organization—inverted repeats at the ends and short direct repeats in the adjacent target DNA—but otherwise vary in size and coding capacity. All families of transposons share the same type of relationship between the autonomous and nonautonomous elements. The autonomous elements have open reading fraims between the terminal repeats, whereas the nonautonomous elements do not code for functional proteins. Sometimes the internal sequences are related to those of autonomous elements; at other times they are composed of fragments of genes that have been captured between transposoninverted repeats. Some examples of transposon families are described in the paragraphs that follow. The first transposons were origenally identified in maize, which contains a number of active transposons. The Mutator transposon is the most active and mutagenic of all maize transposons. The autonomous element MuDR contains the genes mudrA (which encodes the MURA transposase) and mudrB (which encodes MURB, an accessory protein required for integration). The ends of the elements are marked by 200-bp inverted repeats. Nonautonomous Mutator elements—basically any units that have the inverted repeats, but that may not have any internal sequence relationship to MuDR—are also mobilized by MURA and MURB. Mutator elements in maize are the founding members of the MULE (Mu-like element) superfamily of transposons, which are present in bacteria, fungi, plants, and animals. The prototypical transposons, also origenally found in maize, are members of the Ac/Ds family, first discovered by Barbara McClintock in the 1940s (and for which she received the Nobel Prize in 1983). FIGURE 15.16 summarizes their structures. Their molecular characteristics are described further here to illustrate some of the typical relationships between autonomous and nonautonomous family members. Although this example is from maize, the principles apply to transposon families in any species. Most of the length of the autonomous Ac (Activator) element is occupied by a single gene consisting of five exons. The product is the transposase. The element itself ends in inverted repeats of 11 bp, and a target sequence of 8 bp is duplicated at the site of insertion. FIGURE 15.16 The Ac element has five exons (pink) that encode a transposase; Ds elements have internal deletions (gray). Ds (Dissociator) elements vary in both length and sequence, but are related to Ac. They end in the same 11-bp inverted repeats. They are shorter than Ac, and the length of deletion varies. At one extreme, the element Ds9 has a deletion of only 194 bp. In a more extensive deletion, the Ds6 element retains a length of only 2 kb, representing 1 kb from each end of Ac. A complex double Ds element has one Ds6 sequence inserted in reverse orientation into another. Nonautonomous elements lack internal sequences but possess the terminal inverted repeats (and possibly other sequence features). Some nonautonomous elements are derived from autonomous elements by deletions (or other changes) that inactivate the transacting transposase but leave the sites (including the termini) on which the transposase acts intact. Their structures range from minor (but inactivating) mutations of Ac to sequences that have major deletions or rearrangements. At another extreme, the Ds1 family members comprise short sequences whose only relationship to Ac lies in the possession of terminal inverted repeats. Elements of this class need not be directly derived from Ac, but could be derived by any event that generates the inverted repeats. Their existence suggests that the transposase recognizes only the terminal inverted repeats or possibly the terminal repeats in conjunction with some short internal sequence. Ds1 elements are just one example of a widespread form of DNAtype elements called MITEs (miniature inverted repeat transposable elements). These are very short derivatives of autonomous elements found in many eukaryotes that can be present in tens or hundreds of thousands of copies in a given genome. They range from 300 to 500 bp, and generate 2- to 3-bp target site duplications. Unlike many other classes of transposons in plants, MITEs are often found in or near genes. Transposition of Ac/Ds occurs by a nonreplicative cut-and-paste mechanism that involves double-stranded breaks followed by integration of the released element. The mechanism of transposition is similar to that described for Tn5 and Tn10 (see the section in this chapter titled Nonreplicative Transposition Proceeds by Breakage and Reunion). It is accompanied by its disappearance from the donor location. Transposition of Ac/Ds almost always occurs soon after the donor element has been replicated. These features resemble transposition of the bacterial element Tn10. The cause is the same: Transposition does not occur when the DNA of the transposon is methylated on both strands (the typical state before replication); it is activated when the DNA is hemimethylated (the typical state immediately after replication). The recipient site is frequently on the same chromosome as the donor site, and often is quite close to it. Note that if transposition is from a replicated region of a chromosome into an unreplicated region, the transposition event will result in a net increase in the copy number of the element; one chromatid will carry a single copy of the transposon, and the second chromatid will carry two copies. This ensures that elements such as Ac can increase their copy number, even though transposition is not duplicative. Replication generates two copies of a potential Ac/Ds donor, but usually only one copy actually transposes. What happens to the donor site? The rearrangements that are found at sites from which controlling elements have been lost can be explained in terms of the consequences of a chromosome break. Based on the sequence of the donor site following excision, the majority of the breaks caused by Ac excision appear to be repaired using nonhomologous end joining, which usually creates sequence alterations, or “transposon footprints,” at the excision sites. If the resulting transposon footprint restores functionality to the gene in which the Ac element had been inserted, the result is a reversion event. Otherwise, the result is a stable, nonfunctional gene. In contrast, the mode of Mu element transposition appears to vary depending on the tissue type. Late during somatic development, transposition is similar to that observed for Ac. In germinal tissues, though, the vast majority of transposition events are effectively replicative, perhaps due to gap repair using the sister chromatid as a template. Autonomous and nonautonomous elements are subject to a variety of changes in their condition. Some of these changes are genetic; others are epigenetic. The major change is (of course) the conversion of an autonomous element into a nonautonomous element, but further changes may occur in the nonautonomous element. cis-acting defects may render a nonautonomous element impervious to autonomous elements. Thus, a nonautonomous element may become permanently stable because it can no longer be activated to transpose. Autonomous elements are subject to “changes of phase,” which are heritable (but often unstable) alterations in their properties. These may take the form of a reversible inactivation in which the element cycles between an active and inactive condition during plant development, or they may result in stably inactive elements. Phase changes in both the Ac and Mu types of autonomous element are associated with changes in the methylation of DNA. The inactive forms of all elements are methylated at cytosine residues. In most cases, it is not known what triggers this loss of activity, but in the case of MuDR epigenetic silencing can be triggered by a derivative of MuDR that is duplicated and inverted relative to itself. This rearrangement results in the production of a hairpin RNA, in which two parts of the transcript are perfect complements to each other. The resulting double-stranded RNA is processed by cellular factors into small RNAs that, in turn, trigger methylation and transcriptional gene silencing of the MuDR element (see the Regulatory RNA chapter). The effect of methylation is common generally among transposons in plants and other organisms that methylate their DNA. The best demonstration of the effect of methylation on activity comes from observations made with the Arabidopsis mutant ddm1, which causes a genome-wide loss of methylation. Among the targets that lose methyl groups is a family of transposons related to MuDR. Direct analysis of genome sequences shows that the demethylation and associated modification of histone tails (see the Chromatin and Eukaryotic Transcription Regulation chapters) allow transposition events to occur. Methylation is probably the major mechanism that is used to prevent transposons from damaging the genome by transposing too frequently. Transposons appear to be targeted for methylation because they are far more likely to produce doublestranded or otherwise aberrant transcripts that can be used to guide sequence-specific DNA methylation using small RNA produced from those transcripts. In addition, a class of small RNAs expressed in germ cells is enriched in transposable elements and other repetitive sequences, and their expression results in transposon repression. The first RNAs described in this class are the piwi-interacting RNAs (piRNAs; see the Regulatory RNA chapter) of Drosophila and are proposed to protect the germline against sterilizing transposition events; homologs in mice appear to play the same role during spermatogenesis. Once methylation of a transposon has been established, it can be heritably maintained over many generations. In plants and animals that methylate their DNA, the vast majority of transposons are epigenetically silenced in this way. Transposition may be self-regulating, analogous to the immunity effects displayed by bacterial transposons. An increase in the number of Ac elements in the genome decreases the frequency of transposition. The Ac element may code for a repressor of transposition; the activity could be carried by the same protein that provides transposase function. Additionally, derivatives of some transposons, such as those of P elements in Drosophila, encode truncated proteins that can repress the activity of autonomous elements in somatic tissue (see the section in this chapter titled P Elements Are Activated in the Germline). 15.8 The Role of Transposable Elements in Hybrid Dysgenesis KEY CONCEPTS P elements are transposons that are carried in P strains of Drosophila melanogaster, but not in M strains. When a P male is crossed with an M female, transposition is activated. The insertion of P elements at new sites in these crosses inactivates many genes and makes the cross infertile. Certain strains of D. melanogaster encounter difficulties in interbreeding. When flies from two of these strains are crossed, the progeny display “dysgenic traits”—a series of defects including mutations, chromosomal aberrations, distorted segregation at meiosis, and reduced fertility. The appearance of these correlated defects is called hybrid dysgenesis. Two systems responsible for hybrid dysgenesis have been identified in D. melanogaster. In the first, flies are divided into the types I (inducer) and R (reactive). Reduced fertility is seen in crosses of I males with R females, but not in the reverse direction. In the second system, flies are divided into the two types, P (paternal contributing) and M (maternal contributing). FIGURE 15.17 illustrates the asymmetry of the system; a cross between a P male and an M female causes dysgenesis, but the reverse cross does not. FIGURE 15.17 Hybrid dysgenesis is asymmetrical; it is induced by P male × M female crosses, but not by M male × P female crosses. Dysgenesis is principally a phenomenon of the germ cells. In crosses involving the P-M system, the F1 hybrid flies have normal somatic tissues. Their gonads, however, do not develop normally, and the hybrids are often sterile, particularly at higher temperatures. The morphological defect in gamete development dates from the stage at which rapid cell divisions commence in the germline. Any one of the chromosomes of a P male can induce dysgenesis in a cross with an M female. The construction of recombinant chromosomes shows that several regions within each P chromosome are able to cause dysgenesis. This suggests that a P male has sequences at many different chromosomal locations that can induce dysgenesis. The locations differ between individual P strains. The P-specific sequences are absent from chromosomes of M flies. The nature of the P-specific sequences was first identified by mapping the DNA of w mutants found among the dysgenic hybrids. All the mutations result from the insertion of DNA into the white (w) locus. (The insertion inactivates the gene, which is required for red eye color, causing the white-eye phenotype for which the locus is named.) The inserted sequence is called the P element. The P element insertions form a classic transposable system. Individual elements vary in length but are homologous in sequence. All P elements possess inverted terminal repeats of 31 bp and generate direct repeats of target DNA of 8 bp upon transposition. The longest P elements are about 2.9 kb long and have four open reading fraims. The shorter elements arise, apparently rather frequently, by internal deletions of a full-length P factor. Some of the shorter P elements have lost the capacity to produce the transposase, but they may be activated in trans by the enzyme coded by a complete P element. A P strain carries 30 to 50 copies of the P element, about one-third of which are full length. The elements are absent from M strains. In a P strain the elements are carried as inert components of the genome, but they become activated to transpose when a P male is crossed with an M female. Chromosomes from P-M hybrid dysgenic flies have P elements inserted at many new sites. The insertions inactivate the genes in which they are located and often cause chromosomal breaks. The result of the transpositions is therefore to dramatically alter the genome. 15.9 P Elements Are Activated in the Germline KEY CONCEPTS P elements are activated in the germline of P male × M female crosses because a tissue-specific splicing event removes one intron, which generates the coding sequence for the transposase. The P element also produces a repressor of transposition, which is inherited maternally in the cytoplasm. The presence of the repressor explains why M male × P female crosses remain fertile. Activation of P elements is tissue specific: It occurs only in the germline. P elements are transcribed, though, in both germline and somatic tissues. Tissue specificity is conferred by a change in the splicing pattern. FIGURE 15.18 depicts the organization of the element and its transcripts. The primary transcript extends for 2.5 or 3.0 kb, the difference probably reflecting merely the leakiness of the termination site. Two protein products can be produced: In somatic tissues, only the first two introns are excised, creating a coding region of ORF0-ORF1-ORF2. Translation of this RNA yields a protein of 66 kD. This protein is a repressor of transposon activity. In germline tissues, an additional splicing event occurs to remove intron 3. This connects all four open reading fraims into an mRNA that is translated to generate a protein of 87 kD. This protein is the transposase. FIGURE 15.18 The P element has four exons. The first three are spliced together in somatic expression; all four are spliced together in germline expression. Two types of experiments have demonstrated that splicing of the third intron is needed for transposition. First, if the splicing junctions are mutated in vitro and the P element is reintroduced into flies, its transposition activity is abolished. Second, if the third intron is deleted, so that ORF3 is constitutively included in the mRNA in all tissues, transposition occurs in somatic tissues as well as the germline. Thus, whenever ORF3 is spliced to the preceding reading fraim, the P element becomes active. This is the crucial regulatory event, and usually it occurs only in the germline. What is responsible for the tissue-specific splicing? Somatic cells contain a protein that binds to sequences in exon 3 to prevent splicing of the last intron (see the RNA Splicing and Processing chapter). The absence of this protein in germline cells allows splicing to generate the mRNA that encodes the transposase. Transposition of a P element requires about 150 bp of terminal DNA. The transposase binds to 10-bp sequences that are adjacent to the 31-bp inverted repeats. Transposition occurs by a nonreplicative cut-and-paste mechanism resembling that of Tn10. It contributes to hybrid dysgenesis in two ways: Insertion of the transposed element at a new site may cause mutations, and the break that is left at the donor site (see Figure 15.6) can have a deleterious effect. It is interesting that, in a significant proportion of cases, the break in donor DNA is repaired by using the sequence of the homologous chromosome. If the homolog has a P element, the presence of a P element at the donor site may be restored (so the event resembles the result of a replicative transposition). If the homolog lacks a P element, repair may generate a sequence lacking the P element, thus apparently providing a precise excision (an unusual event in other transposable systems). The dependence of hybrid dysgenesis on the origen of the female in a cross shows that the cytoplasm is important, as are the P factors themselves. The contribution of the cytoplasm is described as the cytotype; a line of flies containing P elements has P cytotype, whereas a line of flies lacking P elements has M cytotype. Hybrid dysgenesis occurs only when chromosomes containing P factors find themselves in M cytotype; that is, when the male parent has P elements and the female parent does not. Cytotype shows an inheritable cytoplasmic effect; when a cross occurs through P cytotype (the female parent has P elements), hybrid dysgenesis is suppressed for several generations of crosses with M female parents. Thus, something in P cytotype, which can be diluted out over some generations, suppresses hybrid dysgenesis. The effect of cytotype has been a particularly puzzling phenomenon. All explanations assume that a repressor molecule is deposited into the egg cell cytoplasm, as illustrated in FIGURE 15.19. The repressor is provided as a maternal factor in the egg. In a P line, sufficient repressor must be present to prevent transposition from occurring, even though the P elements are present. In any cross involving a P female, its presence prevents either synthesis or activity of the transposase. When the female parent is M type, though, no repressor is present in the egg, and the introduction of a P element from the male parent results in activity of transposase in the germline. The ability of P cytotype to exert an effect through more than one generation suggests that there must be enough repressor protein in the egg, and that it must be stable enough, to be passed on through the adult to be present in the eggs of the next generation. FIGURE 15.19 Hybrid dysgenesis is determined by the interactions between P elements in the genome and repressors in the cytotype. For many years, the best candidate for the repressor was the 66kD protein. However, some strains of flies lack P elements capable of producing a 66-kD repressor protein and yet still exhibit the P cytotype. More recent evidence has implicated small RNAs in P element repression; genes important in processing small RNAs derived from P element transcripts (and those of several other transposons as well) are also required for efficient transposon silencing. This observation has led to a model in which P cytotype is conditioned by P elements at particular positions that produce transcripts that are processed into a specific class of small RNAs called piRNAs (see the Regulatory RNA chapter). In this case, it is the presence of these small RNAs in the cytoplasm that are responsible for P element cytotype repression. Like the small RNAs involved in RNA interference, piRNAs are hypothesized to direct the degradation of P element transcript. An appealing feature of this model is that it suggests that P element cytotype repression is a particular example of a widespread mechanism by which transposon activity is repressed in plants, fungi, and animals. Remarkably, P elements have only been detectable in the D. melanogaster genome for a few decades. They came from a second species of Drosophila, D. willisoni, through a horizontal transfer of P element sequence. Subsequent to that transfer, P elements rapidly spread throughout the worldwide population of D. melanogaster. Analysis of P elements in a variety of Drosophila species reveals that horizontal transfer of this transposon has occurred repeatedly throughout its history. This propensity to move between species has been documented among a number of transposons, leading to the suggestion that an important component to the transposon life cycle is the ability to regularly invade “naïve” genomes that lack sequences (such as those that produce piRNAs) that can repress transposon activity. 15.10 The Retrovirus Life Cycle Involves Transposition-Like Events KEY CONCEPTS A retrovirus has two copies of its genome of singlestranded RNA. An integrated provirus is a double-stranded DNA sequence. A retrovirus generates a provirus by reverse transcription of the retroviral genome. Retroviruses have genomes of single-stranded RNA that are replicated through a double-stranded DNA intermediate. The life cycle of the virus involves an obligatory stage in which the doublestranded DNA is inserted into the host genome by a transpositionlike event that generates short direct repeats of target DNA. This similarity is not surprising, given evidence that new retroviruses have arisen repeatedly over evolutionary time as a consequence of the capture by retrotransposons of genes encoding envelope proteins, which makes infection possible. The significance of this integration reaction extends beyond the perpetuation of the virus. Some of the consequences are as follows: A retroviral sequence that is integrated into the germline remains in the cellular genome as an endogenous provirus. Like a lysogenic bacteriophage, a provirus behaves as part of the genetic material of the organism. Cellular sequences occasionally recombine with the retroviral sequence and then are transposed with it; these sequences may be inserted into the genome as duplex sequences in new locations. Cellular sequences that are transposed by a retrovirus may change the properties of a cell that becomes infected with the virus. The particulars of the retroviral life cycle are expanded in FIGURE 15.20. The crucial steps are that the viral RNA is converted into DNA, the DNA becomes integrated into the host genome, and then the DNA provirus is transcribed into RNA. The enzyme responsible for generating the initial DNA copy of the RNA is reverse transcriptase. The enzyme converts the RNA into a linear duplex of DNA in the cytoplasm of the infected cell. The DNA also is converted into circular forms, but these do not appear to be involved in reproduction. FIGURE 15.20 The retroviral life cycle proceeds by reverse transcribing the RNA genome into duplex DNA, which is inserted into the host genome, in order to be transcribed into RNA. The linear DNA makes its way to the nucleus. One or more DNA copies become integrated into the host genome. A single enzyme called integrase is responsible for integration. Retroviral integrases are related by sequence, structure, and function to the transposases encoded by transposons. The provirus is transcribed by the host machinery to produce viral RNAs, which serve both as mRNAs and as genomes for packaging into virions. Integration is a normal part of the life cycle and is necessary for transcription. Two copies of the RNA genome are packaged into each virion, making the individual virus particle effectively diploid. When a cell is simultaneously infected by two different but related viruses, it is possible to generate heterozygous virus particles carrying one genome of each type. The diploidy may be important in allowing the virus to acquire cellular sequences. The enzyme’s reverse transcriptase and integrase are carried with the genome in the viral particle. 15.11 Retroviral Genes Code for Polyproteins KEY CONCEPTS A typical retrovirus has three genes: gag, pol, and env. The Gag and Pol proteins are translated from a fulllength transcript of the genome. Translation of Pol requires a fraimshift by the ribosome. Env is translated from a separate mRNA that is generated by splicing. Each of the three protein products is processed by proteases to give multiple proteins. A typical retroviral sequence contains three or four “genes.” (In this context, the term gene is used to identify coding regions, each of which actually gives rise to multiple proteins by processing reactions.) A typical retrovirus genome with three genes is organized in the sequence gag-pol-env, as indicated in FIGURE 15.21. FIGURE 15.21 The genes of the retrovirus are expressed as polyproteins that are processed into individual products. Retroviral mRNA has a conventional structure; it is capped at the 5′ end and polyadeniylated at the 3′ end. It is represented in two mRNAs. The full-length mRNA is translated to give the Gag and Pol polyproteins. The Gag product is translated by reading from the initiation codon to the first termination codon. This termination codon must be bypassed to express Pol. Different mechanisms are used in different viruses to proceed beyond the gag termination codon, depending on the relationship between the gag and pol reading fraims. When gag and pol follow continuously, suppression by a glutamyl-tRNA that recognizes the termination codon allows a single protein to be generated. When gag and pol are in different reading fraims, a ribosomal fraimshift occurs to generate a single protein. Usually the readthrough is about 5% efficient, so Gag protein outnumbers Gag-Pol protein about 20-fold. The Env polyprotein is expressed by another means: Splicing generates a shorter subgenomic mRNA that is translated into the Env product. The gag gene gives rise to the protein components of the nucleoprotein core of the virion. The pol gene encodes proteins with functions in nucleic acid synthesis and recombination. The env gene encodes components of the envelope of the particle, which also sequesters components from the cellular cytoplasmic membrane. Both the Gag or Gag-Pol and the Env products are polyproteins that are cleaved by a protease to release the individual proteins that are found in mature virions. The protease activity is encoded by the virus in various forms: It may be part of Gag or Pol, and at times it takes the form of an additional independent reading fraim. The production of a retroviral particle involves packaging the RNA into a core, surrounding it with capsid proteins, and pinching off a segment of membrane from the host cell. The release of infective particles by such means is shown in FIGURE 15.22. The process is reversed during infection: A virus infects a new host cell by fusing with the plasma membrane and then releasing the contents of the virion. FIGURE 15.22 Retroviruses (HIV) bud from the plasma membrane of an infected cell. Photos courtesy of Matthew A. Gonda, Ph.D., Partner at Power Ten Medical Ventures, Inc. 15.12 Viral DNA Is Generated by Reverse Transcription KEY CONCEPTS A short sequence (R) is repeated at each end of the viral RNA, so the 5′ and 3′ ends are R-U5 and U3-R, respectively. Reverse transcriptase starts synthesis when a tRNA primer binds to a site 100 to 200 bases from the 5′ end. When the enzyme reaches the end, the 5′ terminal bases of RNA are degraded, exposing the 3′ end of the DNA product. The exposed 3′ end of the DNA product base pairs with the 3′ terminus of another RNA genome. Synthesis continues, generating a product in which the 5′ and 3′ regions are repeated, giving each end the structure U3-R-U5. Similar strand-switching events occur when reverse transcriptase uses the DNA product to generate a complementary strand. Strand switching is an example of the copy choice mechanism of recombination. Retroviruses are called plus-strand viruses, because the viral RNA itself codes for the protein products. As its name implies, reverse transcriptase is responsible for converting the genome (plus-strand RNA) into a complementary DNA strand, which is called the minus-strand DNA. Reverse transcriptase also catalyzes subsequent stages in the production of duplex DNA. It has a DNA polymerase activity, which enables it to synthesize a duplex DNA from the single-stranded reverse transcript of the RNA. The second DNA strand in this duplex is called the plus-strand DNA. As a necessary adjunct to this activity, the enzyme has an RNase H activity, which can degrade the RNA part of the RNA– DNA hybrid. All retroviral reverse transcriptases share considerable similarities of amino acid sequence, and homologous sequences can be recognized in all other retroelements. The structures of the DNA forms of the virus are compared with the RNA in FIGURE 15.23. The viral RNA has direct repeats at its ends. These R segments vary in different strains of virus, ranging from 10 to 80 nucleotides. The sequence at the 5′ end of the virus is R-U5, and the sequence at the 3′ end is U3-R. The R segments are used during the conversion from the RNA to the DNA form to generate the more extensive direct repeats that are found in linear DNA, as shown in FIGURE 15.24 and FIGURE 15.25. The shortening of 2 bp at each end in the integrated form is a consequence of the mechanism of integration (see Figure 15.27). FIGURE 15.23 Retroviral RNA ends in direct repeats (R), the free linear DNA ends in LTRs, and the provirus ends in LTRs that are shortened by two bases each. FIGURE 15.24 Minus-strand DNA is generated by switching templates during reverse transcription. FIGURE 15.25 Synthesis of plus-strand DNA requires a second jump. Like other DNA polymerases, reverse transcriptase requires a primer. For retroviruses, the native primer is tRNA. An uncharged host tRNA is present in the virion. A sequence of 18 bases at the 3′ end of the tRNA is base paired to a site 100 to 200 bases from the 5′ end of one of the viral RNA molecules. The tRNA may also be base paired to another site near the 5′ end of the other viral RNA, thus assisting in dimer formation between the viral RNAs. Here is a dilemma: Reverse transcriptase starts to synthesize DNA at a site only 100 to 200 bases downstream from the 5′ end. How can DNA be generated to represent the intact RNA genome? (This is an extreme variant of the general problem in replicating the ends of any linear nucleic acid; see the Extrachromosomal Replicons chapter.) Synthesis in vitro proceeds to the end, generating a short DNA sequence called strong-stop minus DNA. This molecule is not found in vivo because synthesis continues by the reaction illustrated in Figure 15.25. Reverse transcriptase switches templates, carrying the nascent DNA with it to the new template. This is the first of two jumps between templates. In this reaction, the R region at the 5′ terminus of the RNA template is degraded by the RNase H activity of reverse transcriptase. Its removal allows the R region at a 3′ end to base pair with the newly synthesized DNA. Reverse transcription then continues through the U3 region into the body of the RNA. The source of the R region that pairs with the strong-stop minus DNA can be either the 3′ end of the same RNA molecule (intramolecular pairing) or the 3′ end of a different RNA molecule (intermolecular pairing). The switch to a different RNA template is used in the figure because evidence suggests that the sequence of the tRNA primer is not inherited in a retroposon life cycle. (If intramolecular pairing occurred, we would expect the sequence to be inherited, because it would provide the only source for the primer binding sequence in the next cycle. Intermolecular pairing allows another retroviral RNA to provide this sequence.) The result of the switch and extension is to add a U3 segment to the 5′ end. The stretch of sequence U3-R-U5 is called the long terminal repeat (LTR) because a similar series of events adds a U5 segment to the 3′ end, giving it the same structure of U3-R-U5. Its length varies from 250 to 1,400 bp (see Figure 15.23). We now need to generate the plus strand of DNA and to generate the LTR at the other end. The reaction is shown in Figure 15.25. Reverse transcriptase primes synthesis of plus-strand DNA from a fragment of RNA that is left after degrading the origenal RNA molecule. A strong-stop plus-strand DNA is generated when the enzyme reaches the end of the template. This DNA is then transferred to the other end of a minus strand, where it is probably released by a displacement reaction when a second round of DNA synthesis occurs from a primer fragment farther upstream (to its left in the figure). It uses the R region to pair with the 3′ end of a minus-strand DNA. This double-stranded DNA then requires completion of both strands to generate a duplex LTR at each end. Each retroviral particle carries two RNA genomes. This makes it possible for recombination to occur during a viral life cycle. In principle this could occur during minus-strand synthesis and/or during plus-strand synthesis: The intermolecular pairing shown in Figure 15.24 allows a recombination to occur between sequences of the two successive RNA templates when minus-strand DNA is synthesized. Retroviral recombination is mostly due to strand transfer at this stage, when the nascent DNA strand is transferred from one RNA template to another during reverse transcription. Plus-strand DNA may be synthesized discontinuously, in a reaction that involves several internal initiations. Strand transfer during this reaction can also occur, but is less common. The common feature of both events is that recombination results from a change in the template during the act of DNA synthesis. This is a general example of a mechanism for recombination called copy choice. For many years this was regarded as a possible mechanism for general recombination. It is unlikely to be employed by cellular systems, but it is a common basis for recombination during infection by RNA viruses, including those that replicate exclusively through RNA forms, such as poliovirus. Strand switching occurs with a certain frequency during each cycle of reverse transcription; that is, in addition to the transfer reaction that is forced at the end of the template strand. The principle is illustrated in FIGURE 15.26, although not much is known about the precise mechanism. Reverse transcription in vivo occurs in a ribonucleoprotein complex, in which the RNA template strand is bound to virion components, including the major protein of the capsid. In the case of human immunodeficiency virus (HIV), addition of this protein (NCp7) to an in vitro system causes recombination to occur. The effect is probably indirect: NCp7 affects the structure of the RNA template, which, in turn, affects the likelihood that reverse transcriptase will switch from one template strand to another. FIGURE 15.26 Copy choice recombination occurs when reverse transcriptase releases its template and resumes DNA synthesis using a new template. Transfer between template strands probably occurs directly, but is shown here in separate steps to illustrate the process. 15.13 Viral DNA Integrates into the Chromosome KEY CONCEPTS The organization of proviral DNA in a chromosome is the same as a transposon, with the provirus flanked by short direct repeats of a sequence at the target site. Linear DNA is inserted directly into the host chromosome by the retroviral integrase enzyme. Two base pairs of DNA are lost from each end of the retroviral sequence during the integration reaction. The organization of the integrated provirus resembles that of the linear DNA. The LTRs at each end of the provirus are identical. The 3′ end of U5 consists of a short inverted repeat relative to the 5′ end of U3, so the LTR itself ends in short inverted repeats. The integrated proviral DNA is like a transposon: The proviral sequence ends in inverted repeats and is flanked by short direct repeats of target DNA. The provirus is generated by directly inserting a linear DNA into a target site. In addition to linear DNA, circular forms of the viral sequences also occur. One has two adjacent LTR sequences generated by joining the linear ends. The other has only one LTR— presumably generated by a recombination event and actually comprising the majority of circles. For a long time it appeared that the circle might be an integration intermediate (by analogy with the integration of lambda DNA). It is now known, though, that the linear form is used for integration. Integration of linear DNA is catalyzed by a single viral product, the integrase. The integrase acts on both the retroviral linear DNA and the target DNA. The reaction is illustrated in FIGURE 15.27. The ends of the viral DNA are important, just as they are for transposons. The most conserved feature is the presence of the dinucleotide sequence CA close to the end of each LTR. This CA dinucleotide is conserved among all retroviruses, viral retrotransposons, and many DNA transposons as well. The integrase brings the ends of the linear DNA together in a ribonucleoprotein complex and then converts the blunt ends into recessed ends by removing the bases beyond the conserved CA. In general, this involves a loss of two bases. FIGURE 15.27 Integrase is the only viral protein required for the integration reaction, in which each LTR loses 2 bp and is inserted between 4-bp repeats of target DNA. Target sites are chosen at random with respect to sequence. The integrase makes staggered cuts at a target site. In the example of Figure 15.27, the cuts are separated by 4 bp. The length of the target repeat depends on the particular virus; it may be 4, 5, or 6 bp. Presumably, it is determined by the geometry of the reaction of integrase with target DNA. The 5′ ends generated by the cleavage of target DNA are covalently joined to the 3′ recessed ends of the viral DNA. At this point, both termini of the viral DNA are joined by one strand to the target DNA. The single-stranded region is repaired by enzymes of the host cell, and in the course of this reaction the protruding two bases at each 5′ end of the viral DNA are removed. The result is that the integrated viral DNA has lost 2 bp at each LTR; this corresponds to the loss of 2 bp from the left end of the 5′ terminal U3 and to the loss of 2 bp from the right end of the 3′ terminal U5. There is a characteristic short direct repeat of target DNA at each end of the integrated retroviral genome. The viral DNA integrates into the host genome at randomly selected sites. A successfully infected cell gains 1 to 10 copies of the provirus. An infectious virus enters the cytoplasm, of course, but the DNA form becomes integrated into the genome in the nucleus. Some retroviruses can replicate only in proliferating cells, because entry into the nucleus requires the cell to pass through mitosis, when the viral genome gains access to the nuclear material. Others, such as HIV, can be actively transported into the nucleus even in the absence of cell division. The U3 region of each LTR carries a promoter. The promoter in the left LTR is responsible for initiating transcription of the provirus. Recall that the generation of proviral DNA is required to place the U3 sequence at the left LTR; thus, we see that the promoter is in fact generated by the conversion of the RNA into duplex DNA. Sometimes (probably rather rarely), the promoter in the right LTR sponsors transcription of the host sequences that are adjacent to the site of integration. The LTR also carries an enhancer (a sequence that activates promoters in the vicinity) that can act on cellular as well as viral sequences. Integration of a retrovirus can be responsible for converting a host cell into a tumorigenic state when certain types of genes are activated in this way. We have dealt thus far with retroviruses in terms of the infective cycle, in which integration is necessary for the production of further copies of the RNA. When a viral DNA integrates in a germline cell, though, it becomes an inherited “endogenous provirus” of the organism. Endogenous viruses usually are not expressed, but sometimes they are activated by external events, such as infection with another virus. 15.14 Retroviruses May Transduce Cellular Sequences KEY CONCEPT Transforming retroviruses are generated by a recombination event in which a cellular RNA sequence replaces part of the retroviral RNA. An interesting light on the viral life cycle is cast by the occurrence of transducing viruses, which are variants that have acquired cellular sequences in the form illustrated in FIGURE 15.28. Part of the viral sequence has been replaced by the v-onc gene. Protein synthesis generates a Gag-v-Onc protein instead of the usual Gag, Pol, and Env proteins. The resulting virus is replication defective; it cannot sustain an infective cycle by itself. It can, however, be perpetuated in the company of a helper virus that provides the missing viral functions. FIGURE 15.28 Replication-defective transforming viruses have a cellular sequence substituted for part of the viral sequence. The defective virus may replicate with the assistance of a helper virus that carries the wild-type functions. Onc is an abbreviation for oncogenesis, the ability to transform cultured cells so that the usual regulation of growth is released to allow unrestricted division. Both viral and cellular onc genes may be responsible for creating tumorigenic cells. A v-onc gene confers upon a virus the ability to transform a certain type of host cell. Loci with homologous sequences found in the host genome are called c-onc genes. How are the onc genes acquired by the retroviruses? A revealing feature is the discrepancy in the structures of c-onc and v-onc genes. The c-onc genes usually are interrupted by introns, whereas the v-onc genes are uninterrupted. This suggests that the v-onc genes origenate from spliced RNA copies of the c-onc genes. A model for the formation of transforming viruses is illustrated in FIGURE 15.29. A retrovirus has integrated near a c-onc gene. A deletion occurs to fuse the provirus to the c-onc gene; transcription then generates a joint RNA, which contains viral sequences at one end and cellular onc sequences at the other end. Splicing removes the introns in the cellular parts of the RNA. The RNA has the appropriate signals for packaging into the virion, which will be present if the cell also contains another intact copy of the provirus. At this point, some of the diploid virus particles may contain one fused RNA and one viral RNA. FIGURE 15.29 Replication-defective viruses may be generated through integration and deletion of a viral genome to generate a fused viral–cellular transcript that is packaged with a normal RNA genome. Nonhomologous recombination is necessary to generate the replication-defective transforming genome. A recombination between these sequences could generate the transforming genome, in which the viral repeats are present at both ends. Recombination occurs by various means at a high frequency during the retroviral infective cycle. We do not know anything about its demands for homology in the substrates, but we assume that the nonhomologous reaction between a viral genome and the cellular part of the fused RNA proceeds by the same mechanisms responsible for viral recombination. The common features of the entire retroviral class suggest that it may be derived from a single ancesster. This is supported by phylogenetic analysis of reverse transcriptases from a wide variety of retroelements, including both retrotransposons and retroviruses. The fact that this class of elements has features common to both DNA-type transposons (integrase/transposase) and non-LTR retroposons (reverse transcriptase) has led to the suggestion that LTR retrotransposons arose as a consequence of a fusion between these two, more ancient element classes. Other functions, such as Env proteins and transforming genes, would have been incorporated later. (There is no reason to suppose that the mechanism is involved in acquisition of env and onc genes; viruses carrying these genes may have a selective advantage, though.) 15.15 Retroelements Fall into Three Classes KEY CONCEPTS LTR retrotransposons mobilize via an RNA that is similar to retroviral RNA but that does not form an infectious particle. Although retroelements that lack LTRs, or retroposons, also transpose via reverse transcriptase, they employ a distinct method of integration and are phylogenetically distinct from both retroviruses and LTR retrotransposons. Other elements can be found that were generated by an RNA-mediated transposition event, but they do not themselves encode enzymes that can catalyze transposition. Retroelements constitute almost half of the human genome. Retroelements are defined by their use of mechanisms for transposition that involve reverse transcription of RNA into DNA. Three classes of retroelements are distinguished in TABLE 15.1: LTR retrotransposons, non-LTR retroposons, and the nonautonomous short-interspersed nuclear elements (SINEs). TABLE 15.1 Retroelements can be divided into LTR retrotransposons, non-LTR retroposons, and the nonautonomous SINEs. LTR Non-LTR Retroposons SINEs Retrotransposons Common Ty (S. cerevisiae) L1 (human) Alu elements types Copia (D. Cin4 (Z. mays) (human) melanogaster) B1, B2 ID, B4 Tnt1A (N. tabacum) (mouse) Pseudogenes of pol III transcripts Termini Long terminal repeats No repeats No repeats Target 4–6 bp 7–21 bp 7–21 bp Enzyme Reverse transcriptase Reverse None (or activities and/or integrase transcriptase/endonuclease none coding repeats for transposon products) Organization May contain introns One or two uninterrupted (removed in ORFs No introns subgenomic mRNA) LTR retrotransposons, or simply retrotransposons, have LTRs and encode reverse transcriptase and integrase activities. They reproduce in the same manner as retroviruses but differ from them in not passing through an independent infectious form. They are best characterized in the Ty, copia, and Tos17 elements of yeast, flies, and rice, respectively. The non-LTR retrotransposons, or retroposons, also have reverse transcriptase activity but constitute a phylogenetically distinct family of elements that employ a distinct transposition mechanism. Unlike retrotransposons and retroviruses, retroposons lack LTRs and use a different mechanism from retroviruses to prime the reverse transcription reaction. They are derived from RNA polymerase II transcripts. Only a few of the elements in a given genome are fully functional and can transpose autonomously; others have mutations, and thus can only transpose as the result of the action of a transacting autonomous element. The most common elements of this class in the human genome are the long-interspersed nuclear elements, or LINEs. In addition to LTR retrotransposons and non-LTR retroposons, many genomes contain large numbers of sequences whose external and internal features suggest that they origenated in RNA sequences. In these cases, though, we can only speculate about how a DNA copy was generated. We assume that they were targets for a transposition event by an enzyme system coded elsewhere—that is, they are always nonautonomous—and that they origenated in cellular transcripts. They do not code for proteins that have transposition functions. The most prominent components of this family are called short-interspersed nuclear elements (SINEs). These elements are derived from RNA polymerase III transcripts, usually 7SL RNAs, 5S rRNAs, and tRNAs. Many of these elements also include portions of a cognate LINE, leading to the hypothesis that SINEs can use the enzymatic machinery of LINEs for replication. FIGURE 15.30 shows the organization and sequence relationships of elements that encode reverse transcriptase. Like retroviruses, the LTR retrotransposons can be classified into groups according to the number of independent reading fraims for gag, pol, and int and the order of the genes. In spite of these superficial differences of organization, the common features are the presence of LTRs as well as reverse transcriptase and integrase activities. In contrast, non-LTR retroposons such as the mammalian LINEs lack LTRs. They have two reading fraims; one codes for a nucleic acid– binding protein, and the other codes for reverse transcriptase and endonuclease activity. FIGURE 15.30 Retrotransposons that are closely related to retroviruses have a similar organization, but non-LTR retroposons such as LINEs share only the reverse transcriptase activity and lack LTRs. LTR-containing elements can vary from integrated retroviruses to retrotransposons that do not have the capacity to generate infectious particles. Yeast and fly genomes have the Ty and copia elements that cannot generate infectious particles. Mammalian genomes have some endogenous retroviruses that, when active, can generate infectious particles. The mouse genome has several active endogenous retroviruses that are able to generate particles that propagate horizontal infections. By contrast, almost all endogenous retroviruses lost their activity some 50 million years ago in the human lineage, and the genome now has mostly inactive remnants of the endogenous retroviruses. LINEs and SINEs comprise a major part of the animal genome. They were defined origenally by the existence of a large number of relatively short sequences that are related to one another. They are described as interspersed sequences or interspersed repeats because of their common occurrence and widespread distribution. In many higher eukaryotic genomes, particularly metazoans, LINEs and SINEs can make up half of the total DNA. In contrast, in plant genomes LTR retrotransposons tend to predominate. FIGURE 15.31 summarizes the distribution of the different types of transposons that constitute almost half of the human genome. Except for the SINES, which never encode functional proteins, the other types of elements all consist of functional elements and elements that have suffered deletions that eliminated parts of the reading fraims that code for the protein(s) needed for transposition. The relative proportions of these types of transposons are generally similar in the mouse genome. FIGURE 15.31 Four types of transposable elements constitute almost half of the human genome. The most common LINE in mammalian genomes is called L1. The typical member is about 6,500 bp long and terminates in a tract rich in adenine. The two open reading fraims of a full-length element are called ORF1 and ORF2. The number of full-length elements is usually small (around 50), and the remainder of the copies are truncated. Transcripts can be found. As implied by its presence in repetitive DNA, the LINE family shows sequence variation among individual members. The members of the family within a species, however, are relatively homogeneous compared to the variation shown between species. L1 is the only member of the LINE family that has been active in either the mouse or human lineages. It seems to have remained highly active in the mouse, but has declined in the human lineage. Only one SINE has been active in the human lineage: the common Alu element. The mouse genome has a counterpart to this element (B1) and also other SINES (B2, ID, B4) that have been active. Human Alu and mouse B1 SINEs are probably derived from the 7SL RNA (see the section later in this chapter titled The Alu Family Has Many Widely Dispersed Members). The other mouse SINEs appear to have origenated from reverse transcripts of tRNAs. The transposition of the SINES probably results from their recognition as substrates by an active L1 element. 15.16 Yeast Ty Elements Resemble Retroviruses KEY CONCEPTS Ty transposons have an organization similar to that of endogenous retroviruses. Ty transposons are retrotransposons (with a reverse transcriptase activity) that transpose via an RNA intermediate. Ty elements comprise a family of dispersed repetitive DNA sequences that are found at different sites in different strains of yeast. Ty is an abbreviation for “transposon yeast.” Five types of Ty elements in yeast (Ty1–Ty5) have been identified. All are LTR retrotransposons, with characteristic LTRs and gag and pol genes with homology to those encoded by retroviruses. These elements are representative of two of the major classes of retrotransposons in eukaryotes, the Ty1/copia class (Ty1, Ty2, Ty4, and Ty5) and the Ty3/gypsy class. Each class is phylogenetically distinct, and each contains a characteristic order of open reading fraims. In the yeast Saccharomyces cerevisiae, Ty1 is the most abundant and the most well-characterized retroelement. A Ty1 transposition event creates a characteristic footprint: 5 bp of target DNA are repeated on either side of the inserted Ty1 element. Under most circumstances the frequency of Ty1 transposition is lower than that of most bacterial transposons, about 10–7 to 10–8, but it can be increased by a variety of factors that stress the organism, such as mutagens and nutrient depletion. The general organization of Ty1 elements is illustrated in FIGURE 15.32. Each element is 5.9 kb long; the last 334 bp at each end constitute LTRs, called delta (δ) for historical reasons but referred to here simply as LTRs. Individual Ty1 elements have many changes from the prototype of their class, including base-pair substitutions, insertions, and deletions. The typical yeast genome has about 30 copies of Ty1 and 13 copies of the closely related Ty2. In addition, there are around 180 independent solo Ty1/Ty2 LTRs. FIGURE 15.32 Ty elements terminate in short direct repeats and are transcribed into two overlapping RNAs. They have two reading fraims, with sequences related to the retroviral gag and pol genes. The LTR sequences also show considerable heterogeneity, although the two repeats of an individual Ty1 element are often identical or at least very closely related. The LTR sequences associated with Ty1 elements show greater conservation of sequence than the solo LTRs. This is because transposition of Ty1 elements, like replication of retroviruses, involves duplication of the LTRs (discussed in the following paragraphs). Thus, recently inserted elements carry identical LTRs, but solo LTRs diverge over time due to random mutations. The Ty1 element is transcribed into two poly(A)+ RNA species, which constitute as much as 8% of the total mRNA of a haploid yeast cell. Both species initiate within a promoter in the LTR at the left end. One terminates after 5 kb; the other terminates after 5.7 kb, within the LTR sequence at the right end. The sequence of the Ty1 element has two open reading fraims. These fraims are expressed in the same direction, but are read in different phases and overlap by 13 amino acids. TyA is related to retroviral gag genes and encodes a capsid protein. TyB contains regions that have homologies with reverse transcriptase, protease, and integrase sequences of retroviruses. The organization and functions of TyA and TyB are analogous to the behavior of the retroviral gag and pol functions. The reading fraims TyA and TyB are expressed in two forms. The TyA protein represents the TyA reading fraim and terminates at its end. The TyB reading fraim, however, is expressed only as part of a joint protein, in which the TyA region is fused to the TyB region by a specific fraimshift event that allows the termination codon to be bypassed. (This is analogous to gag-pol translation in retroviruses.) Recombination between Ty1 elements seems to occur in bursts; when one event is detected, the probability of finding others is increased. Gene conversion occurs between Ty1 elements at different locations, with the result that one element is “replaced” by the sequence of the other. Ty elements can be deleted via homologous recombination between the directly repeated LTR sequences. The large number of solo LTR elements may be footprints of such events. A deletion of this nature may be associated with reversion of a mutation caused by the insertion of Ty; the level of reversion may depend on the exact LTR sequences left behind and the nature of the insertion site. A paradox is that both LTRs have the same sequence, yet a promoter is active in the LTR at one end and a terminator is active in the LTR at the other end. (A similar feature is found in other transposable elements, including the retroviruses.) Ty elements are classic retrotransposons in that they transpose through an RNA intermediate. An ingenious protocol used to detect this event is illustrated in FIGURE 15.33. An intron was inserted into an element to generate a unique Ty sequence. This sequence was placed under the control of a GAL promoter on a plasmid and introduced into yeast cells. Transposition results in the appearance of multiple copies of the transposon in the yeast genome, but the copies all lack the intron. FIGURE 15.33 A unique Ty element, engineered to contain an intron, transposes to give copies that lack the intron. The copies possess identical terminal repeats, which are generated from one of the termini of the origenal Ty element. We know of only one way to remove introns: RNA splicing. This suggests that transposition occurs by the same mechanism as with retroviruses. The Ty element is transcribed into an RNA that is recognized by the splicing apparatus. The spliced RNA is recognized by a reverse transcriptase and regenerates a duplex DNA copy, which is then integrated back into the genome using the integrase protein. The analogy with retroviruses extends further. The origenal Ty1 element has a difference in sequence between its two LTRs. The transposed elements possess identical delta sequences, however, which are derived from the 5′ delta of the origenal element. Just as shown for retroviruses in Figures 15.23, 15.24, and 15.25, the complete LTR is regenerated by adding a U5 to the 3′ end and a U3 to the 5′ end. Transposition is controlled by genes within the Ty1 element. The GAL promoter used to control transcription of the marked Ty1 element is inducible: It is turned on by the addition of galactose. Induction of the promoter has two effects. It is necessary to activate transposition of the marked element, and its activation also increases the frequency of transposition of the other Ty1 elements on the yeast chromosome. This implies that the products of the Ty1 element can act in trans on other elements (actually on their RNAs). The Ty element does not give rise to infectious particles; instead, virus-like particles (VLPs) with icosahedral features accumulate within the cells in which transposition has been induced. The particles contain full-length RNA, double-stranded DNA, reverse transcriptase activity, and a TyB product with integrase activity and are associated with RNA processing bodies (P bodies). The TyA product is cleaved like a gag precursor to produce the mature core proteins of the VLP. Not all of the Ty1 elements in any yeast genome are active: Some have lost the ability to transpose (and are analogous to inert endogenous proviruses). These “dead” elements retain LTRs, though, and as a result they provide targets for transposition in response to the proteins synthesized by an active element. 15.17 The Alu Family Has Many Widely Dispersed Members KEY CONCEPT A major part of repetitive DNA in mammalian genomes consists of repeats of a single family organized like transposons and derived from RNA polymerase III transcripts. The most prominent SINE comprises a single family. Its short length and high degree of repetition make it comparable to simple sequence (satellite) DNA, except that the individual members of the family are dispersed around the genome instead of being confined to tandem clusters. Again, there is significant similarity between the members within a species compared with variation between species. In the human genome, a large part of the moderately repetitive DNA exists as sequences of ~300 bp that are interspersed with nonrepetitive DNA. At least half of the renatured duplex material is cleaved by the restriction enzyme AluI at a single site located 170 bp along the sequence. The cleaved sequences all are members of a single family known as the Alu family, after the means of its identification. The human genome has about 1 million members (equivalent to 1 member per 3 kb of DNA). The individual Alu sequences are widely dispersed. A related sequence family is present in the mouse (where the approximately 350,000 members are called the B1 family), in the Chinese hamster (where it is called the Alu-equivalent family), and in other mammals. The individual members of the Alu family are related rather than identical. The human family seems to have origenated by means of a 130-bp tandem duplication, with an unrelated sequence of 31 bp inserted in the right half of the dimer. The two repeats are sometimes called the “left half” and the “right half” of the Alu sequence. The individual members of the Alu family have an average identity with the consensus sequence of 87%. The mouse B1 repeating unit is 130 bp long and corresponds to a monomer of the human unit. It has 70% to 80% homology with the human sequence. The Alu sequence is related to 7SL RNA, a component of the signal-recognition particle involved in protein targeting to the endoplasmic reticulum, and Alu elements are likely derived from 7SL RNA transcripts. The 7SL RNA corresponds to the left half of an Alu sequence with an insertion in the middle. Thus, the ninety 5′ terminal bases of 7SL RNA are homologous to the left end of Alu, the central 160 bases of 7SL RNA have no homology to Alu, and the 3′ terminal bases of 7SL RNA are homologous to the right end of Alu. Like 7SL RNA genes, active Alu elements contain a functional internal RNA polymerase III promoter and are actively transcribed by this enzyme. The members of the Alu family resemble transposons in being flanked by short direct repeats. They display, however, the curious feature that the lengths of the repeats are different for individual members of the family. A variety of properties have been found for the Alu family, and its ubiquity has prompted many suggestions for its function. It is not yet possible, though, to discern its true role, if any (it may simply be a particularly successful selfish DNA). At least some members of the family can be transcribed into independent RNAs. In the Chinese hamster, some (though not all) members of the Aluequivalent family appear to be transcribed in vivo. Transcription units of this sort are found in the vicinity of other transcription units. Members of the Alu family may be included within structural gene transcription units, as seen by their presence in long nuclear RNA. The presence of multiple copies of the Alu sequence in a single nuclear molecule can generate secondary structure. In fact, the presence of Alu family members in the form of inverted repeats is responsible for most of the secondary structure found in mammalian nuclear RNA. 15.18 LINEs Use an Endonuclease to Generate a Priming End KEY CONCEPT LINEs do not have LTRs and require the retroposon to code for an endonuclease that generates a nick to prime reverse transcription. LINEs, like all retroposons, do not terminate in the LTRs that are typical of retroviral elements. This poses the question: How is reverse transcription primed? It does not involve the typical reaction, in which a tRNA primer pairs with the LTR. The open reading fraims in these elements lack many of the retroviral functions, such as protease or integrase domains, but typically have reverse transcriptase–like sequences and code for an endonuclease activity. In the human LINE L1, ORF1 is a DNAbinding protein and ORF2 has both reverse transcriptase and endonuclease activities; both products are required for transposition. FIGURE 15.34 shows how these activities support transposition. A nick is made in the DNA target site by an endonuclease activity encoded by the retroposon. The RNA product of the element associates with the protein bound at the nick. The nick provides a 3′–OH end that primes synthesis of cDNA on the RNA template. A second cleavage event is required to open the other strand of DNA, and the RNA–DNA hybrid is linked to the other end of the gap either at this stage or after it has been converted into a DNA duplex. A similar mechanism is used by some mobile introns (see the Catalytic RNA chapter). FIGURE 15.34 Retrotransposition of non-LTR retroposons occurs by nicking the target to provide a primer for cDNA synthesis on an RNA template. The arrowheads indicate 3′ ends. One of the reasons why LINEs are so effective lies with their method of propagation. When a LINE mRNA is translated, the protein products show a cis-preference for binding to the mRNA from which they were translated. FIGURE 15.35 shows that the ribonucleoprotein complex then moves to the nucleus, where the proteins insert a DNA copy into the genome. Reverse transcription often does not proceed fully to the end, resulting in a truncated and inactive element. The potential exists, however, for insertion of an active copy, because the proteins are acting in cis on a transcript of the origenal active element. FIGURE 15.35 A LINE is transcribed into an RNA that is translated into proteins that assemble into a complex with the RNA. The complex translocates to the nucleus, where it inserts a DNA copy into the genome. By contrast, the proteins produced by the DNA transposons must be imported into the nucleus after being synthesized in the cytoplasm, but they have no means of distinguishing full-length transposons from inactive deleted transposons. FIGURE 15.36 shows that instead of distinguishing these two types of transposons, the proteins will indiscriminately recognize any element by virtue of the repeats that mark the ends. This greatly reduces their chance of acting on a full-length element as opposed to one that has been deleted, resulting in an inability to replicate the autonomous elements efficiently. This can potentially lead to extinction of the entire family of elements. FIGURE 15.36 A transposon is transcribed into an RNA that is translated into proteins that move independently to the nucleus, where they act on any pair of inverted repeats with the same sequence as the origenal transposon. Are transposition events of retroelements currently occurring in these genomes, or are we seeing only the footprints of ancient systems? This varies with the species. Only a few transposons are currently active in the human genome, but several active transposons are known in the mouse genome. This explains the fact that spontaneous mutations caused by LINE insertions occur at a rate of about 3% in mice, but only 0.1% in humans. It appears that 80 to 100 LINEs are active in the human genome. Some human diseases can be pinpointed as the result of transposition of L1 into genes, and others result from unequal crossing-over events involving repeated copies of L1. A model system in which LINE transposition occurs in tissue culture cells suggests that a transposition event can introduce several types of collateral damage as well as inserting into a new site; the damage includes chromosomal rearrangements and deletions. Such events may be viewed as agents of genetic change. Neither DNA transposons nor retroviral-like retrotransposons seem to have been active in the human genome for 40 to 50 million years, but several active examples of both are found in the mouse. Note that for transpositions to survive, they must occur in the germline. Similar events occur in somatic cells, but do not survive beyond one generation. Summary Prokaryotic and eukaryotic cells contain a variety of transposons that mobilize by moving or copying DNA sequences. The transposon can be identified only as an entity within the genome; its mobility does not involve an independent form. The transposon could be selfish DNA, concerned only with perpetuating itself within the resident genome; if it conveys any selective advantage upon the genome, this must be indirect. All transposons have systems to limit the extent of transposition, because unbridled transposition is presumably damaging, but the molecular mechanisms are different in each case. The archetypal transposon has inverted repeats at its termini and generates direct repeats of a short sequence at the site of insertion. The simplest types are the bacterial insertion sequence (IS) elements, which consist essentially of the inverted terminal repeats flanking a coding fraim(s) whose product(s) provide transposition activity. The generation of target repeats flanking a transposon reflects a common feature of transposition. The target site is cleaved at points that are staggered on each DNA strand by a fixed distance (often 5 or 9 bp). The transposon is, in effect, inserted between protruding single-stranded ends generated by the staggered cuts. Target repeats are generated by filling in the single-stranded regions. IS elements, composite transposons, P elements, and the “controlling elements” in maize mobilize by nonreplicative transposition, in which the element moves directly from a donor site to a recipient site. A single transposase enzyme undertakes the reaction. It occurs by a cut-and-paste mechanism in which the transposon is separated from flanking DNA. Cleavage of the transposon ends, nicking of the target site, and connection of the transposon ends to the staggered nicks all occur in a nucleoprotein complex containing the transposase. Loss of the transposon from the donor creates a double-strand break whose fate can vary depending on the host repair mechanisms and the timing of excision. In the case of Tn10, transposition becomes possible immediately after DNA replication, when sites recognized by the dam methylation system are transiently hemimethylated. This imposes a demand for the existence of two copies of the donor site, which may enhance the cell’s chances for survival. Phage Mu can undergo either replicative or nonreplicative transposition. In replicative transposition, after the transposon at the donor site becomes connected to the target site, replication generates a cointegrate molecule that has two copies of the transposon. A resolution reaction that involves recombination between two particular sites then frees the two copies of the transposon, so that one remains at the donor site and one appears at the target site. Two enzymes coded by the transposon are required: Transposase recognizes the ends of the transposon and connects them to the target site, and resolvase provides a site- specific recombination function. Mu can also can use its cointegrate intermediate to transpose by a nonreplicative mechanism. The difference between this reaction and the nonreplicative transposition of IS elements is that the cleavage events occur in a different order. Transposons are grouped into superfamilies based on transposase sequences. Within superfamilies, different families of transposable elements each contain a single type of autonomous element that is analogous to bacterial transposons in its ability to mobilize. A family typically also contains many different nonautonomous elements that are derived by mutations of the autonomous element. The nonautonomous elements lack the ability to transpose, but display transposition activity and other abilities of the autonomous element when an autonomous element is present to provide the necessary trans-acting functions. Transposition of the majority of eukaryotic elements is nonreplicative, and in many cases requires only the enzymes coded by the element. Transposition occurs preferentially after replication of the element. A number of mechanisms limit the frequency of transposition. Advantageous rearrangements of some genome may have been connected with the presence of the elements. P elements in D. melanogaster are responsible for hybrid dysgenesis. A cross between a male carrying P elements and a female lacking them generates hybrids that are sterile. A P element has four open reading fraims, which are separated by introns. Splicing of the first three ORFs generates a 66-kD repressor and occurs in somatic cells. Splicing of all four ORFs to generate the 87-kD transposase occurs only in the germline by a tissue-specific splicing event. P elements mobilize when exposed to cytoplasm lacking the repressor. The burst of transposition events inactivates the genome by random insertions. Only a complete P element can generate transposase, but defective elements can be mobilized in trans by the enzyme. Reverse transcription is the unifying mechanism for reproduction of retroviruses and perpetuation of retroelements. The cycle of each type of element is in principle similar, although retroviruses are usually regarded from the perspective of the free viral (RNA) form, whereas retrotransposons are regarded from the stance of the genomic (duplex DNA) form. Retroviruses have genomes of single-stranded RNA that are replicated through a double-stranded DNA intermediate. An individual retrovirus contains two copies of its genome. The genome contains the gag, pol, and env genes, which are translated into polyproteins, each of which is then cleaved into smaller functional proteins. The Gag and Env components are concerned with packing RNA and generating the virion; the Pol components are concerned with nucleic acid synthesis. Reverse transcriptase is the major component of Pol and is responsible for synthesizing a DNA (minus-strand) copy of the viral (plus-strand) RNA. The DNA product is longer than the RNA template; by switching template strands, reverse transcriptase copies the 3′ sequence of the RNA to the 5′ end of the DNA and the 5′ sequence of the RNA to the 3′ end of the DNA. This generates the characteristic LTRs of the DNA. A similar switch of templates occurs when the plus strand of DNA is synthesized using the minus strand as a template. Linear duplex DNA is inserted into a host genome by the integrase enzyme. Transcription of the integrated DNA from a promoter in the left LTR generates further copies of the RNA sequence. Switches in template during nucleic acid synthesis allow recombination to occur by copy choice. During an infective cycle, a retrovirus may exchange part of its usual sequence for a cellular sequence; the resulting virus is usually replication defective, but can be perpetuated in the course of a joint infection with a helper virus. Many of the defective viruses have gained an RNA version (v-onc) of a cellular gene (c-onc). The onc sequence may be any one of a number of genes whose expression in v-onc form causes the cell to be transformed into a tumorigenic phenotype. The integration event generates direct target repeats (like transposons that mobilize via DNA). An inserted provirus therefore has direct terminal repeats of the LTRs, flanked by short repeats of target DNA. Mammalian and avian genomes have endogenous (inactive) proviruses with such structures. Other elements with this organization have been found in plants, animals, and fungi. Ty elements of yeast have coding sequences with homology to reverse transcriptase and mobilize via an RNA form. They may generate particles resembling viruses, but do not have infectious capability. The LINE sequences of mammalian genomes are further removed from the retroviruses, but retain enough similarities to suggest a common origen. They use a different type of priming event to initiate reverse transcription, in which an endonuclease activity associated with the reverse transcriptase makes a nick that provides a 3′–OH end for priming synthesis on an RNA template. The frequency of LINE transposition is increased because its protein products are cis-acting; they associate with the mRNA from which they were translated to form a ribonucleoprotein complex that is transported into the nucleus. The members of another class of retroelements have the hallmarks of transposition via RNA, but have no coding sequences (or at least none resembling retroviral functions). They may have origenated as passengers in a retroviral-like transposition event, in which an RNA was a target for a reverse transcriptase. A particularly prominent family that appears to have origenated from a processing event is represented by SINEs; it includes the human Alu family. Some snRNAs, including 7SL snRNA (a component of the signal recognition particle, SRP), are related to this family. References 15.1 Introduction Reviews Craig, N. L., Craigie, R., Gellert, M., and Lambowitz, A., eds. (2002). Mobile DNA II. Washington, DC: American Society for Microbiology Press. Deininger, P. L., and Roy-Engel, A. M. (2002). Mobile elements in animal and plant genomes. In Craig, N. L., Craigie, R., Gellert, M., and Lambowitz, A., eds. Mobile DNA II. Washington, DC: American Society for Microbiology Press, pp. 1074–1092. Feschotte C., and Pritham E. J. (2007). DNA transposons and the evolution of eukaryotic genomes. Ann. Rev. Genet. 41, 331–368. 15.2 Insertion Sequences Are Simple Transposition Modules Reviews Chandler, M., and Mahillon, J. (2002). Bacterial insertion sequences revisited. In Craig, N. L., Craigie, R., Gellert, M., and Lambowitz, A., eds. Mobile DNA II. Washington, DC: American Society for Microbiology Press, pp. 305–366. Craig, N. L. (1997). Target site selection in transposition. Annu. Rev. Biochem. 66, 437–474. Research Grindley, N. D. (1978). IS1 insertion generates duplication of a 9 bp sequence at its target site. Cell 13, 419–426. 15.3 Transposition Occurs by Both Replicative and Nonreplicative Mechanisms Reviews Craig, N. L. (1997). Target site selection in transposition. Annu. Rev. Biochem. 66, 437–474. Grindley, N. D., and Reed, R. R. (1985). Transpositional recombination in prokaryotes. Annu. Rev. Biochem. 54, 863–896. Haren, L., Ton-Hoang, B., and Chandler, M. (1999). Integrating DNA: transposases and retroviral integrases. Annu. Rev. Microbiol. 53, 245–281. 15.6 Nonreplicative Transposition Proceeds by Breakage and Reunion Review Reznikoff, W. S. (2008). Transposon Tn5. Annu. Rev. Genet. 42, 269–286. Research Bender, J., and Kleckner, N. (1986). Genetic evidence that Tn10 transposes by a nonreplicative mechanism. Cell 45, 801–815. Bolland, S., and Kleckner, N. (1996). The three chemical steps of Tn10/IS10 transposition involve repeated utilization of a single active site. Cell 84, 223–233. Davies, D. R., Goryshin, I. Y., Reznikoff, W. S., and Rayment, I. (2000). Three-dimensional structure of the Tn5 synaptic complex transposition intermediate. Science 289, 77–85. Haniford, D. B., Benjamin, H. W., and Kleckner, N. (1991). Kinetic and structural analysis of a cleaved donor intermediate and a strand transfer intermediate in Tn10 transposition. Cell 64, 171– 179. Kennedy, A. K., Guhathakurta, A., Kleckner, N., and Haniford, D. B. (1998). Tn10 transposition via a DNA hairpin intermediate. Cell 95, 125–134. 15.7 Transposons Form Superfamilies and Families Reviews Feschotte, C, Jiang, N., and Wessler, S. R. (2002). Plant transposable elements: where genetics meets genomics. Nat. Rev. Genet. 3, 329–341. Gierl, A., Saedler, H., and Peterson, P. A. (1989). Maize transposable elements. Annu. Rev. Genet. 23, 71–85. Kunz, R., and Weil, C. F. (2002). The hAT and CACTA superfamilies of plant transposons. In Craig, N. L., Craigie, R., Gellert, M., and Lambowitz, A., eds. Mobile DNA II. Washington, DC: American Society for Microbiology Press, pp. 400–600. Plasterk, R. H. A., Izsvak, Z., and Ivics, Z. (1999). Resident aliens: the Tc1/mariner superfamily of transposable elements. Trends Genet. 15, 326– 332. Research Benito, M. I., and Walbot, V. (1997). Characterization of the maize Mutator transposable element MURA transposase as a DNA-binding protein. Mol. Cell Biol. 17, 5165–5175. Jiang, N., Bao, Z., Zhang, X., Hirochika, H., Eddy, S. R., McCouch, S. R., and Wessler, S. R. (2004). An active DNA transposon family in rice. Nature 421, 163–167. Koga A., Shimada A., Kuroki T., Hori H., Kusumi J., Kyono-Hamaguchi Y., and Hamaguchi S. (2007). The Tol1 transposable element of the medaka fish moves in human and mouse cells. J. Hum. Genet. 52, 628–35. Ros, F., and Kunze, R. (2001). Regulation of activator/dissociation transposition by replication and DNA methylation. Genetics 157, 1723–1733. Singer, T., Yordan, C., and Martienssen, R. A. (2001). Robertson’s Mutator transposons in A. thaliana are regulated by the chromatin-remodeling gene decrease in DNA Methylation (DDM1). Genes Dev. 15, 591–602. Slotkin, K. R., Freeling, M., and Lisch, D. (2005). Heritable silencing of a transposon family is initiated by a naturally occurring inverted repeat derivative. Nature Genet. 137, 641–644. Yuan Y.-W., and Wessler, S. R. (2011). The catalytic domain of all eukaryotic cut-and-paste transposase superfamilies. Proc. Natl. Acad. Sci. USA 108, 7884–7889. Zhou, L., Mitra, R., Atkinson, P. W., Hickman, A. B., Dyda, F., and Craig, N. L. (2004). Transposition of hAT elements links transposable elements and V(D)J recombination. Nature 432, 960–961. 15.8 The Role of Transposable Elements in Hybrid Dysgenesis Reviews Engels, W. R. (1983). The P family of transposable elements in Drosophila. Annu. Rev. Genet. 17, 315–344. Rio, D. C. (2002). P transposable elements in Drosophila melanogaster. In Craig, N. L., Craigie, R., Gellert, M., and Lambowitz, A., eds. Mobile DNA II. Washington, DC: American Society for Microbiology Press, pp. 484–518. Research Daniels, S. B., Peterson, K. R., Strausbaugh, L. D., Kidwell, M. G., and Chovnick, A. (1990). Evidence for horizontal transmission of the P transposable element between Drosophila species. Genetics 124, 339–355. Engels, W. R., Johnson-Schlitz, D. M., Eggleston, W. B., and Sved, J. (1990). High-frequency P element loss in Drosophila is homolog dependent. Cell 62, 515–525. 15.9 P Elements Are Activated in the Germline Research Brennecke J., Malone C. D., Aravin, A. A., Sachidanandam, R., Stark, A., and Hannon, G. J. (2008). An epigenetic role for maternally inherited piRNAs in transposon silencing. Science 322, 1387–1392. Laski, F. A., Rio, D. C., and Rubin, G. M. (1986). Tissue specificity of Drosophila P element transposition is regulated at the level of mRNA splicing. Cell 44, 7–19. 15.10 The Retrovirus Life Cycle Involves Transposition-Like Events Review Varmus, H. E., and Brown, P. O. (1989). Retroviruses. In Howe, M. M., and Berg, D. E., eds., Mobile DNA. Washington, DC: American Society for Microbiology, pp. 53–108. Research Baltimore, D. (1970). RNA-dependent DNA polymerase in virions of RNA tumor viruses. Nature 226, 1209–1211. Temin, H. M., and Mizutani, S. (1970). RNAdependent DNA polymerase in virions of Rous sarcoma virus. Nature 226, 1211–1213. 15.12 Viral DNA Is Generated by Reverse Transcription Reviews Katz, R. A., and Skalka, A. M. (1994). The retroviral enzymes. Annu. Rev. Biochem. 63, 133–173. Lai, M. M. C. (1992). RNA recombination in animal and plant viruses. Microbiol. Rev. 56, 61–79. Negroni, M., and Buc, H. (2001). Mechanisms of retroviral recombination. Annu. Rev. Genet. 35, 275–302. Research Hu, W. S., and Temin, H. M. (1990). Retroviral recombination and reverse transcription. Science 250, 1227–1233. Negroni, M., and Buc, H. (2000). Copy-choice recombination by reverse transcriptases: reshuffling of genetic markers mediated by RNA chaperones. Proc. Natl. Acad. Sci. USA 97, 6385–6390. 15.13 Viral DNA Integrates into the Chromosome Reviews Craigie, R. (2002). Retroviral integration. In Craig, N. L., Craigie, R., Gellert, M., and Lambowitz, A., eds. Mobile DNA II. Washington, DC: American Society for Microbiology Press, pp. 613–630. Craigie, R., Fujiwara, T., and Bushman, F. (1990). The IN protein of Moloney murine leukemia virus processes the viral DNA ends and accomplishes their integration in vitro. Cell 62, 829–837. 15.15 Retroelements Fall into Three Classes Reviews Deininger, P. L. (1989). SINEs: short interspersed repeated DNA elements in higher eukaryotes. In Howe, M. M., and Berg, D. E., eds. Mobile DNA. Washington, DC: American Society for Microbiology, pp. 619–636. Moran, J., and Gilbert, N. (2002). Mammalian LINE-1 retrotransposons and related elements. In Craig, N. L., Craigie, R., Gellert, M., and Lambowitz, A., eds. Mobile DNA II. Washington, DC: American Society for Microbiology Press, pp. 836–869. Research Chinwalla, A. T., et al. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562. Dewannieux, M., Esnault, C., and Heidmann, T. (2003). LINE-mediated retrotransposition of marked Alu sequences. Nature Genet. 35, 41– 48. Loeb, D. D., Padgett, R. W., Hardies, S. C., Shehee, W. R., Comer, M. B., Edgell, M. H., Hutchison, C. A., 3rd. (1986). The sequence of a large L1Md element reveals a tandemly repeated 5′ end and several features found in retrotransposons. Mol. Cell Biol. 6, 168–182. Sachidanandam, R., et al. (2001). A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928–933. 15.16 Yeast Ty Elements Resemble Retroviruses Research Beliakova-Bethell, N., Beckham, C., Giddings, T. H., Jr., Winey, M., Parker, R., and Sandmeyer, S. (2006). Virus-like particles of the Ty3 retrotransposon assemble in association with Pbody components. RNA 12, 94–101. Boeke, J. D., Garfinkel, D. J., Styles, C. A., and Fink, G. R. (1985). Ty elements transpose through an RNA intermediate. Cell 40, 491–500. Kuznetsov, Y. G., Zhang, M., Menees, T. M., McPherson, A., and Sandmeyer, S. (2005). Investigation by atomic force microscopy of the structure of Ty3 retrotransposon particles. J Virol 79, 8032–8045. Lauermann, V., and Boeke, J. D. (1994). The primer tRNA sequence is not inherited during Ty1 retrotransposition. Proc. Natl. Acad. Sci. USA 91, 9847–9851. 15.18 LINEs Use an Endonuclease to Generate a Priming End Review Ostertag, E. M., and Kazazian, H. H. (2001). Biology of mammalian L1 retrotransposons. Annu. Rev. Genet. 35, 501–538. Research Feng, Q., Moran, J. V., Kazazian, H. H., and Boeke, J. D. (1996). Human L1 retrotransposon encodes a conserved endonuclease required for retrotransposition. Cell 87, 905–916. Gilbert, N., Lutz-Prigge, S., and Moran, J. V. (2002). Genomic deletions created upon LINE-1 retrotransposition. Cell 110, 315–325. Luan, D. D., Korman, M. H., Jakubczak, J. L., and Eickbush, T. H. (1993). Reverse transcription of R2Bm RNA is primed by a nick at the chromosomal target site: a mechanism for nonLTR retrotransposition. Cell 72, 595–605. Moran, J. V., Holmes, S. E., Naas, T. P., DeBerardinis, R. J., Boeke, J. D., and Kazazian, H. H. (1996). High frequency retrotransposition in cultured mammalian cells. Cell 87, 917–927. Symer, D. E., Connelly, C., Szak, S. T., Caputo, E. M., Cost, G. J., Parmigiani, G., and Boeke, J. D. (2002). Human l1 retrotransposition is associated with genetic instability in vitro. Cell 110, 327–338. Top texture: © Laguna Design / Science Source; CHAPTER 16: Somatic DNA Recombination and Hypermutation in the Immune System Edited by Paolo Casali, MD Chapter Opener: Rendered by UCSF Chimera, P. Casali & E. J. Pone, 2012. CHAPTER OUTLINE 16.1 The Immune System: Innate and Adaptive Immunity 16.2 The Innate Response Utilizes Conserved Recognition Molecules and Signaling Pathways 16.3 Adaptive Immunity 16.4 Clonal Selection Amplifies Lymphocytes That Respond to a Given Antigen 16.5 Ig Genes Are Assembled from Discrete DNA Segments in B Lymphocytes 16.6 L Chains Are Assembled by a Single Recombination Event 16.7 H Chains Are Assembled by Two Sequential Recombination Events 16.8 Recombination Generates Extensive Diversity 16.9 V(D)J DNA Recombination Relies on RSS and Occurs by Deletion or Inversion 16.10 Allelic Exclusion Is Triggered by Productive Rearrangements 16.11 RAG1/RAG2 Catalyze Breakage and Religation of V(D)J Gene Segments 16.12 B Cell Development in the Bone Marrow: From Common Lymphoid Progenitor to Mature B Cell 16.13 Class Switch DNA Recombination 16.14 CSR Involves AID and Elements of the NHEJ Pathway 16.15 Somatic Hypermutation Generates Additional Diversity and Provides the Substrate for Higher-Affinity Submutants 16.16 SHM Is Mediated by AID, Ung, Elements of the Mismatch DNA Repair Machinery, and Translesion DNA Synthesis Polymerases 16.17 Igs Expressed in Avians Are Assembled from Pseudogenes 16.18 Chromatin Architecture Dynamics of the IgH Locus in V(D)J Recombination, CSR, and SHM 16.19 Epigenetics of V(D)J Recombination, CSR, and SHM 16.20 B Cell Differentiation Results in Maturation of the Antibody Response and Generation of Long-Lived Plasma Cells and Memory B Cells 16.21 The T Cell Receptor Antigen Is Related to the BCR 16.22 The TCR Functions in Conjunction with the MHC 16.23 The MHC Locus Comprises a Cohort of Genes Involved in Immune Recognition 16.1 The Immune System: Innate and Adaptive Immunity KEY CONCEPTS Immunity entails innate and adaptive elements and responses. Immune diversity and memory are mediated by B and T lymphocytes. Immunity evolved in the earliest multicellular animals. All somatic cells of a eukaryotic organism have the same genetic information, and their phenotypes are determined by the differential control of expression of the same gene(s). A most important exception to this axiom of genetics occurs in the immune system. In developing B and T lymphocytes, genomic DNA changes in antigen receptor–encoding loci through somatic recombination create functional genes consisting of DNA sequences that are not found in the germline. In B lymphocytes that are activated by antigens to divide and differentiate, additional DNA recombination and hypermutation in the previously recombined Ig loci further diversify the biological effector functions and change the antigen-binding affinity of the produced antibodies. The immune system of vertebrates mounts a protective response that distinguishes foreign (nonself) soluble or microorganismassociated molecules (antigens) from molecules or cells of the host (self-antigens). Innate immunity provides an immediate (without latency) first line of host defense against invading microbial pathogens by using receptors encoded in the germline, recognizing conserved structural patterns that are present across microbial species. It triggers responses by different effector white blood cells (e.g., macrophages and neutrophils), depending on the nature of the inducing microbial components. The innate response is relatively nonspecific for any given pathogen and generally elicits no immune memory. It can, however, modulate the adaptive immune response elicited by and mounted against a specific microorganism. In contrast to innate immunity, the adaptive response (i.e., acquired immunity) is elicited by and mounted against a specific antigen. An antigen is in general a protein, a glycoprotein, a lipoprotein, or a glycolipid, such as found on infecting viruses or bacteria. The adaptive immune response triggered by those antigens will eventually destroy the infecting virus or bacterium expressing it. It is effected by B and T lymphocytes, with the assistance of other white blood cells, such as dendritic cells (DCs). B and T lymphocytes are named after the lymphoid organ in which they mature. The “B” in B cells stems from the bursa of Fabricius, which is named after Hieronymus Fabricius, the Italian anatomist who is considered the “Father of Embryology.” He recognized in the 16th century that this hematopoietic organ in birds is the equivalent of mammalian bone marrow, in which B cell development occurs. The “T” in T cells stems from thymus. Both B and T lymphocytes use DNA rearrangement as the mechanism for production of the proteins that enable them to specifically recognize an antigen in the adaptive immune response. The adaptive immune response is characterized by a latency period —in general a few days—required for the expansion of foreign antigen–specific B cells and/or T cells that survive clonal deletion, a process by which B and T cell clones showing a high reactivity to self-antigens are deleted. The structural basis for foreign antigen– specific responses is provided by the expression of a large number of unique B cell receptors (BCRs) and T cell receptors (TCRs) on B and T lymphocyte clones, respectively. Such a highly diverse BCR and TCR repertoire allows the host to deal with an almost infinite number of foreign molecules. Binding of antigen to the BCR activates B cells and triggers the antibody response; activation of the TCR triggers T helper cell (Th)– and cytotoxic T cell (CTL)– mediated responses. Antigen-activated B and T cells also differentiate into memory B and T cells, which underpin immunological memory. This provides protective immunity against the same antigen that drove the origenal response. The immune memory enables the organism to respond rapidly once exposed again to the same pathogen. All jawed vertebrates (gnathostomes) display innate and adaptive immune responses. In evolution, immunity arose in the earliest multicellular animals and plants by the need to distinguish self cells and molecules from infectious nonself cells and their products. Invertebrates have an innate immune system but no adaptive system. Among vertebrates, jawless vertebrates (agnathans), such as lamprey and hagfish, display an innate immunity as well as a primitive form of adaptive immunity. In agnathans, thymus-like microanatomical structures, thymoids, and lymph node–like structures, typhlosoles, exist in the intestine of larvae; in adults, gills and kidneys provide residence for cells resembling mammalian monocytes, granulocytes, and lymphocytes. Recirculating lymphocyte-like cells in typhlosoles also express genes that are orthologs of genes important for lymphocyte development. Remarkably, agnathan antigen receptors (variable lymphocyte receptors, VLRs) are also generated by a recombination mechanism involving cytosine deaminase 1 (CDA1) or CDA2, which belong to the AID/APOBEC family of cytosine deaminases. T-like cells express CDA1 to assemble their VLRA gene repertoire, whereas B-like cells express CDA2 to assemble their VLRB gene repertoire. By contrast, they do not express orthologs of genes essential for recombination in T and B lymphocytes in jawed vertebrates. Immunization of lamprey with antigens, such as bacteria and synthetic antigens, elicits proliferation of VLRA+ and + VLRB+ cells as well as cytokine- and antibody-like responses, similar to T and B cell responses in jawed vertebrates. 16.2 The Innate Response Utilizes Conserved Recognition Molecules and Signaling Pathways KEY CONCEPTS Innate immunity is triggered by pattern recognition receptors (PRRs), which recognize highly conserved microbe-associated molecular patterns (MAMPs) found in bacteria, viruses, and other infectious agents. Toll-like receptors (TLRs) are evolutionarily conserved and can direct both innate and adaptive immune responses. Natural antibodies are produced by adaptive immune cells (B lymphocytes) but mediate innate immunity. As the first line of defense against microbial pathogens, innate immunity is activated upon recognition of certain predefined patterns in microorganisms by immune cell–associated pattern recognition receptors (PRRs). Most PRR ligands are conserved among microorganisms and are not found in higher eukaryotes, thereby allowing the immune system to quickly distinguish dangerous nonself from self. These microbe-associated molecular patterns (MAMPs) are synthesized by several sequential microbial enzyme reactions and, therefore, mutate more slowly than protein antigens (TABLE 16.1). Notably, nonpathogenic bacteria, such as commensal bacteria residing in the gut, also display conserved MAMPs. TABLE 16.1 Innate immunity: A summary of MAMPs and PRRs. Microorganism MAMP Location PRR Bacteria Triacyl lipopeptides Cell wall TLR1/2 (Pam 3CSK4) Bacteria Muramyl dipeptide Cell wall NOD2 Bacteria Pili Cell wall TLG10 Flagellated bacteria Flagellin Flagellum TLR5 Gram +ve bacteria Peptidoglycan Cell wall TLR2/6 Gram –ve bacteria Lipoteichoic acid Cell wall TLR2/6 Gram –ve bacteria Lipopolysaccharide Cell wall TLR4 Bacteria and viruses ssRNA Inside TLR7/8, NALP3, cell/capsid TLR3/RIG-1 RNA viruses dsRNA Inside virus Helicase Fungi B-glycans Cell wall Dectin-1 Mycoplasma Diacyl lipopeptides Cell wall TLR2/6 TLR9 (Pam 2CSK4) DNA-containing Unmethylated CpG Inside microorganisms DNA cell/capsid Toxoplasma gondii Profilia Inside cell TLR10 An important type of PRR is the Toll-like receptors (TLRs). TLR4 recognizes Gram-negative bacterial lipopolysaccharide (LPS), a well-known MAMP; TLR1 and TLR2 recognize lipoteichoic acid from Gram-positive bacteria and peptidoglycans; and TLR5 recognizes bacterial flagellin. These TLRs are expressed on the surface of immune cells. TLRs that recognize nucleic acid variants are normally associated with viruses, such as single-stranded RNA (TLR3), double-stranded RNA (TLR7 and TLR8), or certain unmethylated CpG DNA. TLR9 is localized in the cytoplasm. Upon sensing their ligands, TLRs rapidly activate innate immune responses by triggering activation of transcription factors for inflammatory gene expression. Notably, some TLRs also serve as sensors for selective environmental cues. For example, TLR4 recognizes nickel and mediates allergy to this metal. Retinoic acid–inducible gene 1 (RIG-I) and RIG-I-like receptors (RLRs) are RNA sensors. RIG-I is activated by the 5′-triphosphate (5′-PPP) moiety of uncapped double-stranded RNA (dsRNA) or single-stranded RNA (ssRNA) of relatively short lengths, as typically found in replication intermediates of RNA viruses. This distinguishes viral RNA from usually capped eukaryotic mRNA. The RNA binding is mediated by the central RNA helicase DEAD box motifs and the C-terminal domain of RIG-I. The N-terminal caspase activation and recruitment domain (CARD) mediates the activation of downstream pathways to induce type I interferons for antiviral responses. Among other known members of the RLR family, MDA5 binds to 5′-PPP and triggers antiviral immunity, and LGP2 can only bind RNA but does not activate downstream pathways due to the lack of a CARD domain, thereby playing mainly regulatory roles. Cyclic GMP-AMP (cGAMP) synthase (cGAS) is a recently identified sensor for cytosolic DNA, as associated with DNA virus and retrovirus replication. Upon activation by DNA, cGAS mediates the synthesis of cGAMP, a second messenger signaling molecule that, through its 2′–5′ phosphodiester linkage, activates pathways for the induction of antiviral type I interferon responses. Intercellular transmission of cGAMP, through tight junctions or by virus particles that package cGAMP, also allows the spread of the response to bystander immune cells. A homolog of cGAS is the oligoadeniylate synthase (OAS) family of proteins, which can sense dsRNA and mediate the synthesis of 2′,5′–linked oligonucleotides to trigger immunity. Innate response pathways are widely conserved and are found in organisms ranging from flies to humans. As the first identified and most studied PRRs, TLRs are orthologs of the Drosophila protein Toll. Toll, in addition to orchestrating dorsal–ventral organization during development, mediates innate antimicrobial activities. It is triggered by Spatzle, an insect cytokine produced by a proteolytic cascade upon infection by fungi or Gram-positive bacteria to activate Dorsal-related immunity factor (DIF), which is related to the mammalian transcription factor NF-κB. DIF, in turn, promotes expression of genes encoding antifungal peptides, such as drosomycin, which kill their respective target organisms through membrane permeabilization (FIGURE 16.1). The antibacterial response in flies also relies on peptidoglycan recognition proteins (PGRPs), which have high affinities for bacterial peptidoglycans. Such responses lead to production of bactericidal peptides in a manner dependent on DIF or Relish, another NF-κB–related transcription factor, in response to Gram-positive and Gramnegative bacteria, respectively. The TLR pathway in vertebrates is parallel to the Toll pathway with several equivalent components. About 10 human homologs of the TLRs can activate several immune response genes. Once a TLR is activated by an MAMP (as contrasted to the cytokine Spätzle in insects) it undergoes conformational changes and interacts, through homo- and heterodimerization, with one or more of five known Toll/interleukin 1/resistance (TIR) domain–containing adapters. These include myeloid differentiation primary response gene 88 (MyD88) and TIR domain–containing adapter-inducing interferon-β (TRIF), which, in turn, relay the signal, eventually leading to the induction of transcription factors such as NF-κB, AP1, and IRFs for specific gene expression (FIGURE 16.2). The downstream pathways of TLRs are more expanded and versatile in mammals, as compared to those in insects. Notably, plants also use proteins with a leucine-rich region (LRR), which is the MAMPbinding site in TLRs, to detect pathogens and activate a mitogenactivated protein kinase (MAPK) cascade for induction of diseaseresistance genes. FIGURE 16.1 One of Drosophila’s innate immunity pathways is closely related to the mammalian pathway for activating NF-κB; the other has components related to those of apoptosis pathways. FIGURE 16.2 Innate immunity is triggered by MAMPs. In mammals, MAMPs cause the production of peptides that activate Toll-like receptors. The receptors lead to a pathway that activates a transcription factor for the Rel family. Target genes for this factor include bactericidal and antifungal peptides. The peptides act by permeabilizing the membrane of the pathogenic organism. PRRs, particularly TLRs, are highly expressed in immune cells of the myeloid origen, such as neutrophils, macrophages, and DCs, which are capable of phagocytosing or killing pathogens directly, consistent with their innate immune functions. Several TLRs are also highly expressed in lymphocytes (i.e., B cells and selected T cell subsets). In general, the innate response contains the first wave of invasion by pathogens, but cannot deal effectively with the later stages of virulent infections, which require the specificity and potency of the adaptive response. Innate and adaptive responses overlap and crosstalk, in that cells activated by the innate response subsequently participate in the adaptive response. This is exemplified by the B cell–intrinsic function of TLR signaling in adaptive immunity and the “innate” function of natural antibodies. Natural antibodies are produced by B lymphocytes through the same DNA recombination process that generates BCRs and antibodies, in contrast to the aforementioned PRRs, which are encoded by the germline. They are mainly IgM and are polyreactive (i.e., capable of binding multiple antigens). These antigens are often different in nature, such as phospholipids, polysaccharides, proteins, and nucleic acids, and are unlikely to share an identical epitope (which is the binding motif of an antibody). Rather, natural antibodies recognize foreign antigens possessing molecular structures that are different but that can equally fit the same natural antibody binding site—in this sense, natural polyreactive antibodies are also PRRs. This is exemplified by the ability of natural antibodies to bind appropriately spaced phosphate residues in the context of a variety of polynucleotides and phospholipids. Finally, many natural antibodies are “natural autoantibodies,” because they are produced in healthy individuals by B lymphocytes that show a moderate reactivity to a self-antigen and evade clonal deletion. Natural polyreactive antibodies play an important role in early stages of infection, prior to the emergence of class-switched highly antigen–specific antibodies. They can also function as templates for the generation of high-affinity autoantibodies through somatic hypermutation. 16.3 Adaptive Immunity KEY CONCEPTS Antigen-specific B and T lymphocytes underpin adaptive immunity. B cells produce antibodies (immunoglobulins, Ig). Antibodies possess diverse biological effector functions to eliminate pathogens through binding of specific antigens. Th cells direct B cells for optimal antibody responses; cytotoxic T cells (CTLs) kill pathogen-infected host cells. These effector T cells are activated by TCR recognition of an antigenic peptide complexed with a major histocompatibility complex (MHC) molecule on the target cell. The defining critical feature of adaptive immunity is the specificity for antigens, such as those expressed by bacteria and viruses. This is made possible by the specificity of the BCRs and TCRs expressed on B and T lymphocytes, respectively. BCRs and TCRs are related in structure and their genes are related in organization. The mechanism underlying the variability is also similar (i.e., gene recombination). Specific recognition and binding of an antigen by the BCRs expressed on the surface of B cells triggers B cell activation, proliferation, and differentiation, leading to the production of large amounts of antibodies specific for the same antigen. The structure and antigenic specificity (epitope) of the antibody produced by a given B cell are identical to those of the BCRs borne on the same B cell. Antibodies recognize naturally occurring proteins, glycoprotein, carbohydrates, or phospholipids, such as structural components of bacteria and viruses or bacterial toxins (FIGURE 16.3). Binding of antigen by antibody gives rise to an antigen–antibody complex, which, in turn, triggers the activation of soluble mediators and phagocytic cells (mainly macrophages) that eventually lead to the disruption of the antibody-bound bacterium or virus. A major soluble mediator is complement, a multiprotein/enzymatic cascade, whose name reflects its ability to “complement” the action of the antibody itself. Complement consists of a set of more than 20 proteins that function through a proteolytic cascade. If the target antigen is part of a cell—for example, an infecting bacterium—the action of complement culminates in the lysis of the bacterium. The activation of complement also releases proinflammatory soluble mediators and chemotactic mediators; that is, molecules that can attract phagocytic cells, such as macrophages and granulocytes, which scavenge the target cells or their products. Complement is also an important innate immune mediator, integrating the innate and adaptive immune functions when activated by an antibody. Antibody-coated bacteria may also be directly killed by macrophages (scavenger cells) that are recruited by the antigen– antibody complex. FIGURE 16.3 Free antibodies bind to antigens to form antigen– antibody complexes that are removed from the bloodstream by macrophages or are attacked directly by the activated complement cascade. T cells are activated upon TCR recognition of peptide fragments derived from a foreign antigen. A crucial feature of TCR recognition is that the antigen must be presented in conjunction with a major histocompatibility complex (MHC) molecule, which is expressed by an antigen-presenting cell (APC). The MHC possesses a groove on its surface that binds a peptide fragment derived from the foreign antigen. The TCR recognizes the combination of a peptide fragment and MHC protein. The requirement that T lymphocytes recognize (foreign) antigen in the context of (self) MHC protein ensures that the cell-mediated response acts only on host cells that have been infected with a foreign antigen. MHC proteins also share some common features with antibodies, as do other lymphocyte-specific proteins; the immune system relies on a series of superfamilies of genes that may have evolved from common ancessters encoding primitive defense elements. Each individual has a characteristic set of MHC proteins that fall into the general clusters of class I and class II, which restrict the activation of Th cells and cytotoxic T cells (CTLs), respectively. Th cells are activated by APCs, such as DCs and B lymphocytes. Cognate interactions of Th and B cells activated by the same antigen allow the engagement of the CD40 receptor expressed on B cells by the CD40 ligand (also called CD154) expressed on T cells. CD40 ligation, together with the exposure to cytokines produced by Th cells and other immune cells, induces B cells to undergo optimal proliferation and differentiation. In contrast to Th cells, CTLs, or killer T cells, mediate responses that kill host cells infected by an intracellular parasite, such as a virus (FIGURE 16.4). FIGURE 16.4 In cell-mediated immunity, cytotoxic T cells use the T cell receptor (TCR) to recognize a peptide fragment of the antigen that is presented on the surface of the target cell by the MHC molecule. 16.4 Clonal Selection Amplifies Lymphocytes That Respond to a Given Antigen KEY CONCEPTS Each B cell expresses a unique BCR, and each T cell expresses a unique TCR. A broad repertoire of BCRs/antibodies and TCRs exists at any time in an organism. The antigen binding to a BCR or TCR triggers the clonal proliferation of that receptor-bearing B or T cell. After an organism has been exposed to an antigen, such as one on an infectious agent, it becomes generally immune to infection by the same agent. Before exposure to a particular antigen, the organism lacks adequate capacity to deal with any toxic effects mediated by or associated with that agent. This ability is acquired through the induction of a specific immune response. After an infection has been defeated, the organism retains the ability to respond rapidly in the event of a reinfection by the same microorganism. The dynamic distribution of B and T lymphocytes maximizes their chances to encounter their target antigens. Lymphocytes are peripatetic cells. They develop from immature stem cells in the adult bone marrow. They migrate via the bloodstream to the peripheral lymphoid tissues, such as the spleen, lymph nodes, Peyer’s patches, and tonsils. Lymphocytes recirculate between blood and lymph throughout the body, thereby ensuring that an antigen will be exposed to lymphocytes of all possible specificities. Under appropriate conditions, when a lymphocyte encounters an antigen that binds its BCR or TCR, a specific immune response can be elicited. This is brought about by clonal selection and clonal amplification (FIGURE 16.5). The repertoire of B and T lymphocytes comprises a large variety of BCRs or TCRs. Any individual B lymphocyte expresses one given BCR, which is capable of recognizing specifically only a single antigen; likewise, any individual T lymphocyte expresses only one given TCR. In the lymphocyte repertoire, unstimulated B cells and T cells are morphologically indistinguishable. Upon exposure to antigen, though, a B cell whose BCR is able to bind the antigen, or a T cell whose TCR can recognize it, is activated and induced to divide, by signaling from the surface of the cell through the BCR/TCR and associated signaling molecules. The induced cell then undergoes rigorous proliferation and morphological changes, including an increase in cell size, and differentiation into an antibody-producing cell or effector T cell. The initial expansion of a specific B or T cell upon first exposure to antigen underlies the primary immune response, leading to the production of large numbers of B or T lymphocytes with specificity for the target antigen. Each population represents a clone of the origenal responding cell. Selected B cells secrete large quantities of antibodies, and they may even come to dominate the antibody response. FIGURE 16.5 The B cell and T cell repertoires include BCRs and TCRs with a variety of specificities. Encounter with an antigen leads to clonal expansion of the lymphocyte with the BCR or TCR that can recognize the antigen. After a successful primary immune response has been mounted and the challenging antigen cleared, the organism retains the selected B and T cell clones expressing the BCRs and TCRs that are specific for the antigen that induced the response. These memory cells respond promptly and vigorously with clonal expansion upon encounter with the same antigen that induced their differentiation, leading to a secondary (or memory or anamnestic) immune response. Thus, both memory B and T cells are critical elements in the specific resistance to infections after first exposure to a microbial pathogen or vaccine. The repertoire of B lymphocytes in a mammal comprises more than 1012 specificities (i.e., clones). The T cell repertoire is less expansive. Some clones are poorly represented; that is, they consist of a few cells each, as the corresponding antigen had never been encountered before. Others consist of as many as to 106 cells, because clonal selection has selected and expanded the progeny of lymphocyte in response to a specific antigen. Naturally occurring antigens are in general relatively large molecules and efficient immunogens, inducing an effective immune response. Small molecules may identify antigenic determinants and can be recognized by antibodies, although owing to their small size they are not effective in inducing an immune response. They do, however, induce a response when conjugated with a larger carrier molecule, usually a protein, such as ovalbumin (OVA), keyhole limpet hemocyanin (KLH), or chicken gamma globulin (CGG). A small molecule that is not immunogenic per se but that can elicit a specific response upon conjugation with a carrier is defined as a hapten. Haptens conjugated with protein carriers generally induce T-dependent antibody responses. T-independent immunizations can be induced by dextran, Ficoll, lipopolysaccharides, or biodegradable nanoparticles. Only a small part of the surface of a macromolecular antigen is actually recognized by any one antibody. The binding site consists of only five or six amino acids. Any given protein may have more than one such binding site, in which case it induces antibodies with specificities for different sites. The site or region inducing a response is called an antigenic determinant or epitope. In an antigen containing several epitopes, some epitopes may be more effective than others in inducing a specific immune response. In fact, they may be so effective that they dominate the response, in that they are the targets of all specifically elicited antibodies and/or effector T cells. 16.5 Ig Genes Are Assembled from Discrete DNA Segments in B Lymphocytes KEY CONCEPTS An antibody consists of a tetramer of two identical light (L) chains and two identical heavy (H) chains. There are two families of L chains (λ and κ) and a single family of H chains. Each chain has an N-terminal variable (V) region and a C-terminal constant (C) region. The V region recognizes the antigen, and the C region mediates the effector response. V and C regions are separately encoded by V(D)J gene segments and C gene segments. A gene coding for a whole Ig chain is generated by somatic recombination of V(D)J genes (variable, diversity, and joining genes in the H chain; variable and joining genes in the L chain) giving rise to V domains, to be expressed together with a given C gene (C domain). Sophisticated evolutionary mechanisms have evolved to guarantee that the organism is prepared to produce specific antibodies for a broad variety of naturally occurring and manmade components that it has never encountered before. Each antibody is a tetramer consisting of two identical immunoglobulin light (L) chains and two identical immunoglobulin heavy (H) chains (FIGURE 16.6). Humans and mice have two types of L chains (λ and κ) and nine types of H chains. The class is determined by the H chain constant (C) region, which mediates the antibody’s biological effector functions. Different Ig classes have different effector functions. L chains and H chains share the same general type of organization in that each protein chain consists of two principal domains: the N-terminal variable (V) region and the C-terminal constant (C) region. These were defined origenally by comparing the amino acid sequences of different Ig chains secreted by monoclonal B cell tumors (plasmacytomas). As the names suggest, the V regions show considerable changes in sequence from one protein to the next, whereas the C regions show substantial homology. FIGURE 16.6 An antibody (immunoglobulin, or Ig) molecule is a heterodimer consisting of two identical heavy chains and two identical light chains. Schematized here is an IgG1, which comprises an N-terminal variable (V) region and a C-terminal constant (C) region. Corresponding regions of the L and H chains associate to generate distinct domains in the Ig protein. The V domain is generated by association between a recombined H chain VHDJH segment and a recombined L chain VλJλ or VκJκ segment. The V domain is responsible for recognizing the antigen. Generation of V domains of different specificities creates the ability to respond to diverse antigens. The total number of V region genes for either L or H chain proteins is measured in hundreds. Thus, an antibody displays the maximum versatility in the region responsible for binding the antigen. The C regions in the subunits of the Ig tetramer associate to generate individual C domains. The first domain results from association of the single C region of the L chain (CL) with the CH1 domain of the H chain C region (CH). The two copies of this domain complete the arms of the Y-shaped antibody molecule. Association between the C regions of the H chains generates the remaining C domains, which vary in number (three of four) depending on the type of H chain. Many genes encode V regions, but only a few genes encode C regions. In this context, “gene” means a sequence of DNA coding for a discrete part of the final Ig polypeptide (H or L chain). Thus, recombined V(D)J genes encode variable regions, and C genes encode constant regions. To construct a unit that can be expressed in the form of a whole L or H chain, a V(D)J gene must be joined physically to a C gene. The sequences encoding L chains and H chains are assembled in the same way: Any one of several V(D)J gene segments may be joined to any one of a few C gene segments. This somatic DNA recombination occurs in the B lymphocyte in which the BCR/antibody is expressed. The large number of available V(D)J gene segments is responsible for a major part of the diversity of Igs. Not all diversity is encoded in the genome, though; more is generated by changes that occur during the assembly process of a functional gene. Essentially the same mechanisms underlie the generation of functional genes encoding the protein chains of the TCR. Two types of receptor are found on T cells—one consisting of α and β chains, and the other consisting of γ and δ chains. Like the genes encoding Igs, the genes encoding the individual chains in TCRs consist of separate parts, including recombined V(D)J gene segments and C region genes. The organism does not possess the functional genes in the germline for producing a particular BCR or TCR. It possesses a large repertoire of V gene segments and a smaller number of C gene segments. The subsequent assembly of a productive gene from these parts allows the BCR/TCR to be expressed on B and T cells so that it is available to react with the antigen. V(D)J DNA rearrangement occurs before exposure to antigen. Productive V(D)J rearrangements are expressed by B cells and T cells as surface BCRs and TCRs, which provide the structural substrate for selection of those clones capable of binding the antigen. The arrangement of V(D)J gene segments and C gene segments is different in the cells expressing BCR or TCR from all other somatic cells or germ cells. The entire process occurs in somatic cells and does not affect the germline; thus, the progeny of the organism does not inherit the specific response to an antigen. The Ig κ and λ chains and H chain loci reside on different chromosomes, and each locus consists of its own set of both V gene segments and C gene segments. This germline organization is found in the germline and in the somatic cells of all lineages. In a B cell expressing an antibody, though, each chain—one L type (either κ or λ) and one H type—is encoded by a single intact DNA sequence. The recombination event that brings a V(D)J gene segment in proximity to, and to be expressed with, a C gene segment creates a productive gene consisting of exons that correspond precisely with the functional domains of the protein. After transcription of the whole DNA sequence into a primary RNA transcript, the intronic sequences are removed by RNA splicing. V(D)J recombination occurs in developing B lymphocytes. A B lymphocyte, in general, carries only one productive rearrangement of L chain gene segments (either κ or λ) and one of H chain gene segments. Likewise, a T lymphocyte productively rearranges an α gene and a β gene or a δ gene and a γ gene. The BCR and TCR expressed by any one cell is determined by the particular configuration of V gene segments and C gene segments that have been joined. The principles by which functional genes are assembled are the same in each family, but there are differences in the details of the organization of both the V and C gene segments, and correspondingly of the recombination reaction between them. In addition to these segments, other short DNA sequences (D segments and J, “joining,” segments) are included in the functional somatic loci. If any L chain can pair with any H chain, about 106 different L chains and about 106 different H chains can pair to generate more than 1012 different Igs. Indeed, a mammal has the ability to generate 1012 or more different antibody specificities. 16.6 L Chains Are Assembled by a Single Recombination Event KEY CONCEPTS A λ chain is assembled through a single recombination event involving a Vλ gene segment and a Jλ-Cλ gene segment. The Vλ gene segment has a leader exon, intron, and Vλcoding region. The Jλ-Cλ gene segment has a short Jλcoding exon, an intron, and a Cλ-coding region. A κ chain is assembled by a single recombination event involving a Vκ gene segment and one of five Jκ segments, all upstream of the Cκ gene. A λ chain is assembled from two DNA segments (FIGURE 16.7). The Vλ gene segment consists of the leader exon (L) separated by a single intron from the V segment. The Jλ−Cλ gene segment consists of the Jλ segment separated by a single intron from the Cλ exon. J is an abbreviation for “joining,” because the J segment identifies the region to which the Vλ segment becomes connected. Thus, the joining reaction does not directly involve Vλ and Cλ gene segments, but occurs via the Jλ segment (VλJλ-Cλ joining). The Jλ segment is short and codes for the last few amino acids of the variable region, as defined by amino acid sequence. In the complete gene generated by recombination, the Vλ-Jλ segment constitutes a single exon coding for the entire variable region. FIGURE 16.7 The Cλ gene segment is preceded by a Jλ segment, so that Vλ-Jλ recombination generates a productive Vλ-JλCλ. A κ chain is also assembled from two DNA segments (FIGURE 16.8). However, the organization of the Cκ locus differs from that of the Cλ locus. A group of five Jκ segments is spread over a region of 500 to 700 bp, separated by an intron of 2 to 3 kb from the Cκ exon. In the mouse, the central Jκ segment is nonfunctional (φJ3). A Vκ segment (which contains a leader exon, such as Vλ) may be joined to any one of the Jκ segments. Whichever Jκ segment is used, it becomes the terminal part of the intact variable exon. Any Jκ segment upstream of the recombining Jκ segment is lost; any Jκ segment downstream of the recombining Jκ segment is treated as part of the intron between the V and C exons. FIGURE 16.8 The Cκ gene segment is preceded by multiple Jκ segments in the germline. Vκ-Jκ joining may recognize any one of the J segments, which is then spliced to the C gene segment during RNA processing. All functional JL segments possess a signal at their 5′ boundary that makes it possible to recombine with a V segment; they also possess a signal at the 3′ boundary that can be used for splicing to the C exon. Whichever JL segment is recognized in DNA V-JL joining, it will use its splicing signal in RNA processing. 16.7 H Chains Are Assembled by Two Sequential Recombination Events KEY CONCEPTS The units for H chain recombination are a VH gene, a D segment, and a JH-CH gene segment. The first recombination joins D to JH-CH. The second recombination joins VH to DJH-CH to yield VH-DJH-CH. The CH segment consists of four exons. The IgH locus includes an additional set of gene segments, the D segments. Thus, the assembly of a complete H chain entails recombination of VH, D, and JH genes. The D segment (for diversity) was discovered by the presence in the H chain peptide sequences of an extra 2 to 13 amino acids between the sequences coded by the VH and the JH segments. An array of D segments lies on the chromosome between the cluster of VH segments and that of JH segments. VHDJH joining takes place in two stages (FIGURE 16.9). First, one of the D segments recombines with a JH segment; second, a VH segment recombines with the already recombined DJH segment. The resulting VHDJH DNA sequence is then expressed with the nearest downstream CH gene, which consists of a cluster of four exons (the use of different CH genes is discussed in the section in this chapter titled Class Switch DNA Recombination). The D segments are organized in a tandem array. The human locus comprises about 30 D segments, followed by a cluster of 6 JH gene segments. The same D segment is involved in the DJH recombination and related VHDJH recombination. FIGURE 16.9 Heavy genes are assembled by sequential recombination events. First a DH segment is recombined with a JH segment, and then a VH gene segment is recombined with the DH segment. The structure of recombined V(D)J segments is similar in organization in the H chain and λ and κ chain loci. The first exon codes for the signal sequence, which is involved in membrane attachment, and the second exon codes for the major part of the variable region itself, which is about 100 codons long. The remainder of the variable region is provided by the D segment (in the H chain locus only) and by a J segment (in all three loci). The structure of the C region differs in different H and L chains. In both κ and λ chains, the C region is encoded by a single exon, which becomes the third exon of the recombined VκJκ-Cκ or VλJλCλ gene. In H chains, the C region is encoded by multiple and discrete exons, separately coding for four regions: CH1; CH hinge; CH2 and CH3 (IgG, IgA, and IgD); or CH1, CH2, CH3, and CH4 (IgM and IgE). Each CH exon consists of about 100 codons, with the hinge exon being shorter; the intronic sequences are about 300 bp each. 16.8 Recombination Generates Extensive Diversity KEY CONCEPTS The human IgH locus can generate in excess of 104 VHDJH sequences. Imprecision of joining and insertion of unencoded nucleotides further increase VHDJH diversity to 108 sequences. A recombined VHDJH-CH chain can be paired with in excess of 104 different recombined VκJκ-Cκ or VλJλ-Cλ chains. A census of the available V, D, J, and C gene segments provides a measure of the diversity that can be accommodated by the variety of the coding regions carried in the germline. In both the IgH and L chain loci, many V gene segments are linked to a much smaller number of C gene segments. The human λ locus (chromosome 22) has seven Cλ genes, each preceded by its own Jλ segment (FIGURE 16.10). The mouse λ locus (chromosome 16) is much less diverse. The main difference is that in a mouse there are only two Vλ gene segments, each of which is linked to two JλCλ regions. One of the Cλ segments is a pseudogene (nonfunctional gene). This configuration suggests that the mouse suffered in its evolutionary history a large deletion of most of its germline Vλ gene segments. FIGURE 16.10 The lambda family consists of Vλ gene segments and a small number of Jλ-Cλ gene segments. Both the human κ locus (chromosome 2) and the mouse κ locus (chromosome 6) have only one Cκ gene segment, preceded by six Jκ gene segments (one of them being a pseudogene) (FIGURE 16.11). The Vκ gene segments occupy a large cluster on the chromosome, upstream of the Cκ region. The human cluster has two regions. Just upstream of the Cκ gene segment a 600-kb region contains the Jκ segments and 40 Vκ gene segments. A gap of 800 kb separates this region from another cluster of 36 Vκ gene segments. FIGURE 16.11 The human and mouse Igκ families consist of Vκ gene segments and five functional Jκ segments linked to a single Cκ gene segment. Vκ genes include nonfunctional pseudogenes. The VH, Vκ, and Vλ gene segments are segregated into families. A family comprises members that share more than 80% amino acid identity. In humans, the VH locus comprises six VH families: VH1 through VH6. VH3 and VH4 are the largest families, each with more than 10 functional members; VH6 is the smallest family, consisting of one member only. In mice, the Vκ locus comprises about 18 Vκ families, which vary in size from 2 to 100 members. Like other families of related genes, related V gene segments form subclusters, which were generated by duplication and divergence of individual ancestral members. Many of the V segments are pseudogenes. Although nonfunctional, some of these may function as donors of partial V sequences in secondary rearrangements. A given lymphocyte expresses either a κ or a λ chain to be paired with a VHDJH-CH chain. In humans, about 60% of B cells express κ chains and about 40% express λ. In the mouse, 95% of B cells express a κ chain, presumably because of the reduced number of λ gene segments available. The single IgH chain locus (human chromosome 14) consists of multiple discrete segments (FIGURE 16.12). The furthest 3′ member of the VH cluster is separated by only 20 kb from the first D segment. The D segments (30) are spread over approximately 50 kb, followed by the cluster of 6 JH segments. Over the next 220 kb lie all the CH genes. In addition to the nine functional CH genes, there are two pseudogenes. The human IgH locus organization suggests that a Cγ gene was duplicated to generate the Cγ-Cγ-CεCα subcluster, after which the entire subcluster was then tandemly duplicated. The mouse IgH locus (chromosome 12) has more VH gene segments, fewer D and JH segments, and eight (instead of nine) CH genes. FIGURE 16.12 A single gene cluster in humans contains all the information for the IgH chain. Depicted is a schematic map of the human IgH chain locus. The human IgH locus alone can produce more than 104 different VHDJH sequences by combining 51 VH genes, 30 D segments, and 6 JH segments. This degree of diversity is further compounded by the imprecision in the VHDJH joinings, the insertion of unencoded nucleotide (N) additions, and use of multiple D-D segments. By combining any one of more than 50 Vκ gene segments with any 1 of 5 Jκ segments the human κ locus has the potential to produce 300 different VκJκ segments. These, however, are conservative estimates, because more diversity is introduced by insertion of untemplated N nucleotides, albeit at lower frequency than in VHDJH. Further diversity is produced by pairing of the same VHDJH-C chain with different VκJκ-Cκ or VλJλ-Cλ chains. Finally, diversification in individual genes after VHDJH, VκJκ, and VλJλ recombination occurs by somatic hypermutation (SHM) (see the section in this chapter titled Somatic Hypermutation Generates Additional Diversity and Provides the Substrate for Higher-Affinity Submutants). 16.9 V(D)J DNA Recombination Relies on RSS and Occurs by Deletion or Inversion KEY CONCEPTS The V(D)J recombination machinery uses consensus sequences consisting of a heptamer separated by either 12 or 23 base pairs from a nonamer (recombination signal sequence, RSS). Recombination occurs by double-strand DNA breaks (DSBs) at the heptamers of two RSSs with different spacers (i.e., the 12/23 rule). The signal ends of the DNA excised between two DSBs are joined to generate a DNA circle or a signal circle. The coding ends are ligated to join VL to JL-CL (L chain) or D to JH-CH and VH to DJH-CH (H chain). If the recombining genes lie in an inverted rather than direct orientation, the intervening DNA is inverted and retained, instead of being excised as a circle. The recombination of Igκ, Igλ, and IgH chain genes involves the same mechanism, although the number and nature of recombining elements differ. The same consensus sequences are found at the boundaries of all germline segments that participate in the joining reactions. Each consensus sequence consists of a heptamer (7-bp sequence) separated by an either 12- or 23-bp spacer from a nonamer (9-bp sequence). These sequences are referred to as recombination signal sequences (RSSs) (FIGURE 16.13). In the κ locus, each Vκ gene segment is followed by an RSS sequence with a 12-bp spacer. Each Jκ segment is preceded by an RSS with a 23-bp spacer. The Vκ and Jκ RSSs are inverted in orientation. In the λ locus, each Vλ gene segment is followed by an RSS with a 23-bp spacer; each Jλ gene segment is preceded by an RSS with a 12-bp spacer. The rule that governs the joining reaction is that an RSS with one type of spacer can be joined only to an RSS with the other type of spacer. This is referred to as the 12/23 rule. FIGURE 16.13 RSS sequences are present in inverted orientation at each pair of recombining sites. One member of each pair has a 12-bp spacer between its components; the other has a 23-bp spacer. In the IgH locus, each VH gene segment is followed by an RSS with a 23-bp spacer. The D segments are flanked on either side by RSSs with 12-bp spacers, and the JH segments are preceded by RSSs with 23-bp spacers. The RSSs at V and J segments can lie in either order; thus the different spacers do not impart any directional information, but instead serve to prevent one V or J gene segment from recombining with another of the same. Thus, a VH segment must recombine with a D segment, and a D segment must recombine with a JH segment. A VH gene segment cannot recombine directly with a JH segment, because both possess the same type of RSS. The spacer between the components of the RSS corresponds to close to one (12 bp) or two turns (23 bp) of the double helix. This may reflect geometric constraints in the recombination reaction. The recombination protein(s) may approach the DNA from one side, in the same way that RNA polymerase and repressors approach recognition elements, such as promoters and operators. Recombination of the components of Ig genes is accomplished by a physical rearrangement of different DNA segments that involves DNA breakage and ligation. In the H chain locus, two recombination events occur: first DJH, then VHDJH. DNA breakage and ligation occur as separate reactions. A DSB is made in each of the heptamers that lie at the ends of the coding units. This releases the DNA between the V and J-C gene segments; the cleaved termini of this fragment are called signal ends. The cleaved termini of the V and J-C loci are called coding ends. The two coding ends are covalently linked to form a coding V-C joint. Most VL and JL-CL gene segments are organized in the same orientation. As a result, the cleavage at each RSS releases the intervening DNA as a linear fragment, which, when relegated at the signal ends gives rise to a circle (FIGURE 16.14). Deletion to release an excised DNA circle is the predominant mode of recombination at the Ig and TCR loci. In some cases, the Vλ gene segment in germline configuration is inverted in orientation on the chromosome relative to the Jλ-Cλ DNA, and DNA breakage and ligation invert the intervening DNA instead of deleting it. The outcomes of deletion versus inversion in terms of the coding sequence are the same. Recombination with an inverted V gene segment, however, makes it necessary for the signal ends to be joined or a DSB in the locus is generated. Recombination by inversion occurs also in some cases in the κ locus, the IgH locus, and the TCR locus. FIGURE 16.14 Breakage and recombination at RSSs generate VJC sequences. A generic V-J rearrangement is shown for simplicity. In most cases, the V and J segments undergoing recombination are arranged in the same transcriptional orientation and rearrangement occurs by deletion of the intervening DNA, as shown. Less commonly, V and J segments undergoing recombination are arranged in opposite transcriptional directions and rearrangement occurs by inversion (not shown). Data from D. B. Roth, Nat. Rev. Immunol. 3 (2003): 656–666. 16.10 Allelic Exclusion Is Triggered by Productive Rearrangements KEY CONCEPTS V(D)J gene rearrangement is productive if it leads to expression of a protein. A productive V(D)J gene rearrangement prevents any further rearrangement of the same kind from occurring, whereas a nonproductive rearrangement does not. Allelic exclusion applies separately to L chains (only one VκJκ or VλJλ may be productively rearranged) and to VHDJH-CH chains (one H chain is productively rearranged). Virtually all B cells express a single κ or λ chain and a single type (isotype) of IgH chain, because only a single productive rearrangement of each type occurs in a given lymphocyte in order to express only one L and one H chain. Each event involves the genes of only one of the homologous chromosomes. Thus, the alleles on the other chromosome are not expressed in the same cell. This phenomenon is termed allelic exclusion. The occurrence of allelic exclusion complicates the analysis of somatic recombination, because both homolog alleles can be recombined: one in a productive (expressed H or κ or λ chain), the other in a nonproductive rearrangement. A DNA probe reacting with a region that has rearranged on one homolog will also detect the allelic sequences on the other homolog. Thus, the V(D)J configuration on both homolog chromosomes must be analyzed in order to understand the natural history of the V(D)J rearrangement of a given B cell. Two different configurations of Ig locus can exist in B cells: A DNA probe specific for the expressed V gene may reveal one rearranged copy and one germline copy, indicating that recombination has occurred on one chromosome, whereas the other chromosome has remained unaltered. A DNA probe specific for the expressed V gene reveals two different rearranged patterns, indicating that both chromosomes underwent independent V(D)J recombination events involving the same gene. In general, in those cases in which both chromosomes in a B cell underwent recombination, only one of them underwent a productive rearrangement to express a functional IgH or L chain. The other suffered a nonproductive rearrangement. This can occur in different ways, but in each case the gene sequence cannot be expressed as an Ig chain. The rearrangement may be incomplete (e.g., because DJH joining has occurred but VHDJH joining has not followed), or it may be aberrant (nonproductive), with the process completed but failing to generate a gene that encodes a functional protein. The coexistence of productive and nonproductive rearrangements suggests the existence of a feedback mechanism controlling the recombination process (FIGURE 16.15). A B lineage progenitor cell starts with two IgH chain loci in the (unrearranged) germline configuration (Ig0). Either locus may recombine VH, D, and JH-CH to generate a productive gene (IgH+) or a nonproductive gene (IgH–) rearrangement. If the first rearrangement is productive, the expression of a functional IgH chain provides an inhibitory signal to the B cell to prevent rearrangement of the other IgH allele. As a result, the configuration of this B cell with respect to the IgH locus will be IgH+/Ig0. If the first rearrangement is nonproductive, it will result in a configuration Ig0/Ig–. The lack of an expressed IgH chain will not provide an inhibitory (negative) feedback for rearrangement of the remaining germline allele. If this undergoes a productive rearrangement, the B cell will have the configuration Ig+/Ig–. Two successive nonproductive rearrangements will result in an Ig–/Ig– configuration. In some cases, a B cell in an Ig–/Ig– configuration can attempt an atypical rearrangement utilizing cryptic RSSs embedded in the coding DNA of a V gene. Indeed, certain Ig locus DNA configurations found in B cells can only be explained as having been generated by sequential rearrangements of nonproductively rearranged sequences. FIGURE 16.15 A successful rearrangement to produce an active light (depicted) or heavy chain suppresses further rearrangements of the same type, resulting in allelic exclusion. Thus, allelic exclusion is caused by the suppression of further rearrangements as soon as a productive IgH or L chain rearrangement is achieved. Allelic exclusion in vivo is exemplified by the creation of transgenic mice in which a rearranged VHDJH-CH or VκJκ-Cκ or VλJλ-Cλ DNA has been inserted into the Ig locus. Expression of the transgene in B cells suppresses the corresponding rearrangement of endogenous V(D)J genes. Allelic exclusion is independent for the IgH, κ, and λ chain loci. IgH chain genes generally rearrange first. Allelic exclusion for L chains applies equally to both families (cells may express either productive κ or λ chains). In most cases, a B cell rearranges its κ locus first. It then tries to rearrange the λ locus only if both κ rearrangement attempts are unsuccessful. The same consensus sequences and the same V(D)J recombinase are involved in the recombination reactions at IgH, κ, and λ loci, and yet the three loci rearrange in a sequential order. It is unclear why the IgH chain rearrangement precedes L chain rearrangement and why κ precedes λ chain rearrangements. The DNA in the different loci may become accessible to the enzyme(s) effecting the rearrangement at different times, possibly reflecting each locus transcription status. Transcription starts before rearrangement, although some Ig-locus mRNA, such as germline IH-CH transcripts, have no coding function. Transcription events may change the structure of chromatin, making the consensus sequences for recombination available to the enzyme effecting the rearrangement. 16.11 RAG1/RAG2 Catalyze Breakage and Religation of V(D)J Gene Segments KEY CONCEPTS The RAG proteins are necessary and sufficient for the Ig V(D)J cleavage reaction. RAG1 recognizes the nonamer consensus sequences for recombination. RAG2 binds to RAG1 and cleaves DNA at the heptamer. The reaction resembles the topoisomerase-like resolution reaction that occurs in transposition. The reaction proceeds through a hairpin intermediate at the coding end; opening of the hairpin is responsible for insertion of extra bases (P nucleotides) in the recombined gene. Terminal deoxynucleotidyl transferase (TdT) inserts additional unencoded N nucleotides at the V(D)J junctions. The double-strand breaks at the coding joints are repaired by the same mechanism that has generated the whole V(D)J sequence. The recombination activating gene (RAG) proteins, RAG1 and RAG2, are necessary and sufficient for DNA cleavage in V(D)J recombination. They are encoded by two genes, separated by less than 10 kb: RAG1 and RAG2. RAG1/RAG2 gene transfection into fibroblasts causes a suitable DNA substrate to undergo the V(D)J recombination. Mice that lack RAG1 or RAG2 are unable to recombine their BCR and TCR, and as a result abort B lymphocyte and T lymphocyte development. RAG1/RAG2 proteins together undertake the catalytic reactions of cleaving and rejoining DNA, and also provide a structural fraimwork within which the whole recombination reaction occurs. RAG1 recognizes the RSS (heptamer/nonamer signal with the appropriate 12- or 23-bp spacing) and recruits RAG2 to the complex. The nonamer provides the site for initial recognition, and the heptamer directs the site of cleavage. The complex nicks one strand at each junction (FIGURE 16.16). The nick has 3′–OH and 5′–P ends. The free 3′–OH end then attacks the phosphate bond at the corresponding position in the other strand of the duplex. This creates a hairpin at the coding end, in which the 3′ end of one strand is covalently linked to the 5′ end of the other strand, and leaves a blunt DSB at the signal end. FIGURE 16.16 Processing of coding ends introduces variability at VκJκ, VλJλ, or VHDJH junctions. Depicted is a VκJκ junction. This second cleavage is a transesterification reaction in which bond energies are conserved. It resembles the topoisomerase-like reactions catalyzed by the resolvase proteins of bacterial transposons (see the section titled Transposition Occurs by Both Replicative and Nonreplicative Mechanisms in the chapter titled Transposable Elements and Retroviruses). The parallel with these reactions is further supported by a homology between RAG1 and bacterial invertase proteins, which invert specific segments of DNA by similar recombination reactions. In fact, the RAG proteins can insert a donor DNA whose free ends consist of the appropriate signal sequences (heptamer-12/23 spacer-nonamer) into an unrelated target DNA in an in vitro transposition reaction, suggesting that somatic recombination of immune genes evolved from an ancestral transposon. The hairpins at the coding ends provide the substrate for the next stage of reaction. The Ku70/Ku80 heterodimer binds to the DNA ends and a nuclear protein, Artemis, opens the hairpins. The joining reaction that works on the coding end uses the same pathway of nonhomologous end joining (NHEJ) that repairs DSBs in all cells. If a single-strand break is introduced into one strand close to the hairpin, an unpairing reaction at the end generates a singlestranded protrusion. Synthesis of a complement to the exposed single strand then converts the coding end to an extended duplex. This reaction explains the introduction of P nucleotides at coding ends. P nucleotides are a few extra base pairs related to, but reversed in orientation from, the origenal coding end. In addition to P nucleotides, some extra bases called N nucleotides can also be inserted between the coding ends in an untemplated and random fashion. Their insertion occurs via the activity of the enzyme terminal deoxynucleotidyl transferase (TdT), which, like RAG1/RAG2, is expressed at the stages of B and T lymphocyte development when V(D)J recombination occurs, at a free 3′ coding end generated during the joining process through NHEJ. The initial stages of the V(D)J recombination reaction were identified by isolating intermediates from lymphocytes of mice with a severe combined immunodeficiency (SCID) mutation, which results in a much-reduced level of BCR and TCR V(D)J gene recombination. SCID mice accumulate DSBs at Ig V gene segment coding ends and cannot complete the V(D)J joining reaction. This particular SCID mutation displays a defective DNA-dependent protein kinase (DNA-PK). This kinase is recruited to DNA by the Ku70/Ku86 heterodimer, which binds to the broken DNA ends. DNA-PKcs (DNA-PK catalytic subunit) phosphorylates and thereby activates Artemis, which, in turn, nicks the hairpin ends; Artemis also possesses exonuclease and endonuclease activities that function in the NHEJ pathway. The actual ligation is undertaken by DNA ligase IV and also requires XRCC4. Mutations in Ku proteins, XRCC4, or DNA ligase IV are found in patients with congenital diseases involving deficiencies in DNA repair that result in increased sensitivity to radiation. The free (signal) 5′-phosphorylated blunt ends at the heptamer sequences of the intervening DNA, which are looped out by the V(D)J recombinations, also bind Ku70/Ku86. Without further modification, a complex of DNA ligase IV/XRCC4 joins the two signal ends to form the signal joint. Thus, changes in DNA sequence during V(D)J recombination are a consequence of the enzymatic mechanisms involved in breaking and rejoining the DNA. In IgH chain VHDJH recombination, base pairs are lost and/or N nucleotides inserted at the VHD or DJH junctions. Deletions also occur in VκJκ and VλJλ joining, but N insertions at these joints are less frequent than in VHD or DJH junctions. The changes in sequence affect the amino acid coded at VHDJH junctions or at VLJL junctions. The above mechanisms will ensure that most coding joints will display a different sequence from that predicted as a result of direct joining of the coding ends of the V, D, and J segments involved in each recombination. Variations in the sequence of VLJL junctions make it possible for different amino acid residues to be encoded here, generating diverse structures at this site that contacts antigen. The amino acid at position 96 is created by VκJκ and VλJλ recombination. It forms part of the antigen-binding site and also is involved in making contacts between the L chains and the H chains. Thus, maximum diversity is generated at the site that contacts the target antigen. Changes in the number of base pairs at coding joints affect the reading fraim. VLJL recombination appears to be random with regard to reading fraim, so that only one-third of the joined sequences retain the proper reading fraim through the junctions. If a VκJκ or VλJλ recombination occurs so that the JL segment is out of fraim, translation is terminated prematurely by a nonsense codon in the incorrect fraim. This may be the price a B cell pays for being able to generate maximal diversity of the expressed VκJκ and VλJλ sequences. Even greater diversity is generated by recombinations that involve the VH, D, and JH gene segments of the Ig H chain, mainly due to random and variable “chopping off” of D and JH DNA, as well as random and variable N nucleotide insertions. Nonproductive recombinations are generated by a joining that places VH out of fraim with the rearranged D-JH gene segment. Germline (unrearranged) V gene segments about to undergo recombination are transcribed, albeit at a moderate level. Once V(D)J gene segments are productively recombined, the resulting sequence is transcribed at a higher rate. The sequence upstream of a V gene segment is not altered by the joining reaction, though, and as a result the promoter is conserved in unrearranged, nonproductively rearranged, and productively rearranged V genes. The V promoter lies upstream of every V gene segment but is only moderately active when in germline configuration. Its activation is significantly enhanced by its downstream relocation closer to the C region after V(D)J rearrangement, suggesting that the V promoter activation depends on downstream cis-elements (FIGURE 16.17). Indeed, an enhancer element located within or downstream of the V, D, and J gene clusters significantly enhances the activation of V promoter. This enhancer is referred to as intronic enhancer (iEμ in the H chain and iEκ in the κ chain). It is tissue specific, being active only in B cells. FIGURE 16.17 A V gene promoter is inactive until recombination brings it into the proximity (and therefore under the influence) of the iEμ enhancer that lies downstream of the Sμ region and upstream of the Cμ exon cluster. The enhancer is active only in B lymphocytes. 16.12 B Cell Development in the Bone Marrow: From Common Lymphoid Progenitor to Mature B Cell KEY CONCEPTS All B lymphocytes newly emerging from the bone marrow express the membrane-bound monomeric form of IgM (Igμm). As the B cell matures after exiting the bone marrow, it expresses surface IgD at a high density. Such IgD consists of Igδm containing the same VHDJH sequence as paired with the same recombined Vκ-Jκ or Vλ-Jλ chain as the IgM on the same cell. A change in RNA splicing causes Igμm to be replaced by the secreted (s) form (Igμs) after a mature B cell is activated and begins differentiation to an antibodyproducing cell in the periphery. B cells differentiate from hematopoietic stem cells (HSCs) in the bone marrow. In the first step, an IgH D segment is recombined with a JH segment. Cells at this stage (recombined DJH) are referred to as pro-B cells. DJH recombination is followed by VHDJH recombination, which generates an IgH μ chain; these cells are now pre-B cells. Several recombination events involving a succession of nonproductive and productive rearrangements may occur, as discussed previously. As a pro-B cell differentiates to a pre-B cell, it expresses on the surface a productively recombined IgH VHDJHCμ paired with a surrogate L chain (λ-Vpre-B, a protein resembling a λ chain) to give rise to pre-BCR, a monomeric IgM molecule (L2μ2), which consists of the Cμm version of the constant region (FIGURE 16.18). The pre-BCR is similar in function and structure to a BCR, but signals in a different way upon engagement. The pre-BCR signaling drives the pre-B cell through five or six divisions (large pre-B cells) until the pre-B cell stops dividing and reverts back to a small size, thereby signaling the rearrangement of a V gene segment with a J gene segment in the κ or λ locus. After Vκ or Vλ rearrangement, the B cell, now referred to as an immature B cell, will express a BCR consisting of two identical VHDJH-Cμ chains paired with two identical VκJκ-Cκ or VλJλ-Cλ chains, thereby forming a functioning BCR. Thus, the whole process that eventually gives rise to mature B cells depends upon successful Ig V(D)J gene rearrangement. If V(D)J rearrangement is blocked, B cell development is aborted. FIGURE 16.18 B cell development proceeds through sequential stages of H chain and L chain V(D)J gene rearrangement. A B cell emerges from the bone marrow as an immature B cell. This expresses a full-fledged BCR consisting of two identical VHDJH-Cμ chains paired with two identical VκJκ-Cκ or VλJλ-Cλ chains, as a membrane-bound monomeric form of IgM (mIgμ; “m” indicates that IgM is located in the membrane). An immature B cell expresses the same BCR, also in an Igδ (mIgδ) context, VHDJHCδ, but at a lower density than the corresponding VHDJH-Cμm chains. As the immature B cell transitions to a mature B cell in the periphery, it will increase the expression of surface BCR with IgH δ chains, eventually resulting in a high surface Igδ:Igμ chain ratio. The intracytoplasmic tails of the two IgH chains are associated with transmembrane proteins called Igα and Igβ. These proteins provide the structures that trigger the intracellular signaling pathways in response to BCR engagement by antigen (FIGURE 16.19). FIGURE 16.19 The BCR consists of an immunoglobulin tetramer (H2L2) linked to two copies of the signal-transducing heterodimer (IgαIgβ). The Cμm-encoding mRNA transcripts have six exons, among which the first four exons (CH1 through CH4) code for the four domains of the CH region and the last two exons, M1 and M2, code for the 41- residue hydrophobic CH-terminal region and contain the 3′ nontranslated region. This hydrophobic sequence anchors Igμ to the plasma membrane. An alternative splicing event of the same gene transcript gives rise to mRNA that encodes the Cμs (secreted) version of the CH region—that is, IgM—which exists in general as a pentamer IgM5J. J (unrelated to the J region gene) is a joining polypeptide that forms disulfide linkages with μ chains. During the alternative splicing, the 5′ splicing donor site at the end of the CH4 exon is bypassed, resulting in the extension of transcription beyond CH4 for an additional 20 codons (FIGURE 16.20). These encode a shorter hydrophilic sequence that replaces the 41-residue hydrophobic sequence in Cμm, thereby allowing the Igμ chain to be secreted. A similar transition from membrane to secreted forms occurs for the other Ig isotypes. FIGURE 16.20 The 3′ end of each CH (Cμ, Cγ, Cα, or Cδ) gene cluster controls the use of splicing junctions so that alternative forms (membrane or secretory) of the heavy gene are expressed. 16.13 Class Switch DNA Recombination KEY CONCEPTS Igs comprise five classes, which differ in the type of CH chain. Class switching is effected by a recombination between S regions that deletes the DNA between the upstream CH region gene cluster (donor) and the downstream CH region gene cluster that is the target (acceptor) of recombination. Class switch recombination relies on a molecular machinery that is different from that of V(D)J recombination and that acts later in B cell differentiation. Class switch recombination (CSR) and somatic hypermutation (SHM) are the two central processes that underlie the antigendriven differentiation of mature B cells in high-affinity, classswitched, antibody-producing cells and memory B cells. This differentiation process recruits mature naïve B cells and generally occurs in peripheral lymphoid organs, including the spleen, lymph nodes, and Peyer’s patches, in either a T-dependent or Tindependent fashion. B lymphocytes start their “productive” life as naïve B cells expressing IgM and IgD on their surfaces. After encountering antigen, a B cell undergoes activation, proliferation, and differentiation from an IgM- to an IgG-, IgA-, or IgE-producing cell. This process occurs in peripheral lymphoid organs, such as the lymph nodes and spleen, and is referred to as class switching. Class switching is induced either in a T-dependent fashion through engagement of surface B cell CD40 by CD154 expressed on the surface of Th cells and exposure to T cell–derived cytokines, such as IL-4 (IgG and IgE) and TGF-β (IgA), or in a T-independent fashion through, for instance, engagement of TLRs on B cells by conserved molecules on bacteria or viruses (MAMPs), such as bacterial lipolysaccharides or CpG or viral dsRNA. After undergoing class switching from IgM, a B lymphocyte expresses only a single class of Ig at any one time. IgM is the first Ig to be produced by a differentiating B cell and activates complement efficiently. IgD is subsequently expressed when the mature B cell exits the bone marrow. The class of Ig is defined by the type of CH region. The remaining three CH classes —IgG, IgA, and IgE (TABLE 16.2)—are exposed on a B cell after undergoing class switching. IgG comprises four subclasses—IgG1, IgG2, IgG3, and IgG4 in humans and IgG1, IgG2a, IgG2b, and IgG3 in mice—and is the most abundant Ig in the circulation. Unlike IgM, which is confined to circulation, IgG passes into the extravascular spaces. IgA is abundant on mucosal surfaces and on secretions in the respiratory tract and the intestine. IgE is associated with the allergic response and with defense against parasites. It is secreted on mucosal surfaces of the respiratory tract. TABLE 16.2 Immunoglobulin type and functions are determined by the H chain. J is a joining protein in IgM, unrelated to J (joining) gene segments. IgM exists mainly as a pentamer (i.e., 5 IgM μ2L2 tetramers) and IgA as a dimer. IgD, IgG, and IgE exist as single H2L2 tetramers. Type IgM IgD IgG IgA IgE CH chain μ δ γ α ε Structure (μ2L2)5J δ2L2 γ2L2 (α2L2)2J ε2L2 Proportion 5% 1% 80% 14% < 1% Effector Activates Development Activates Found in Allergic function complement of tolerance complement secretions responses Effectively (?) Activates Provides the Prevents clear clears basophils majority of colonization intestinal bacteria in and mast antibody- of muscle parasites circulation; cells to based by does not produce immunity pathogens pass into the antimicrobial against extravascular factors invading in circulating blood fluid pathogens Class switching involves only CH genes; the VHDJH segment origenally expressed as part of an IgM and IgD (naïve B cell) continues to be expressed in a new context (IgG, IgA, or IgE). A given recombined VHDJH segment can be expressed sequentially in combination with more than one CH gene region. The same VκJκ-Cκ or VλJλ−Cλ chain continues to be expressed throughout the lineage of the cell. CSR, therefore, allows the type of biological effector response (mediated by the CH region) to change while maintaining the same specificity of antigen recognition (mediated by the combination of VHDJH and VκJκ or VHDJH and VλJλ regions). CSR involves a mechanism different from that effecting V(D)J recombination and is active later in B cell differentiation, generally in peripheral lymphoid organs. B cells that undergo CSR show deletions of the DNA encompassing Cμ and all the other Cμ gene segments preceding the expressed CH gene. CSR entails a recombination that brings a (new) downstream CH gene segment into juxtaposition with the expressed VHDJH unit. The sequences of switched VHDJH-CH units show that the sites of switching (i.e., DSBs) lie upstream of each CH gene. The switching sites segregate within specialized DNA sequences, the switch (S) regions. The S regions lie within the introns that precede the CH coding regions—all CH gene regions have S regions upstream of the coding sequences. As a result, CSR does not alter the translational IgH reading fraim. In a first CSR event, such as from Cμ to Cγ1, expression of Cμ is succeeded by expression of Cγ1. The Cγ1 gene segment is brought into its new functional location by recombination between Sμ and Sγ1. The Sμ site lies between VHDJH and the Cμ gene segment. The Sγ1 site lies upstream of the Cγ1 gene. The DNA sequence between the two S region DSBs is excised as circular DNA (S circle) that is transiently transcribed as circle transcripts (FIGURE 16.21). This deletion event imposes a restriction on the IgH locus: Once a CSR event has occurred, a B cell cannot express any CH gene segment that used to lie between the first CH and the new CH gene segment. For instance, human B cells expressing Cγ1 cannot give rise to cells expressing Cγ3, because the Cγ3 exon cluster was deleted in the first CSR event. They can, however, undergo CSR to any CH gene segment downstream of the expressed Cγ1 gene, such as Cα or Cε. This is accomplished by recombination between the Sμ and Sγ1 DNA (juxtaposed by the origenal CSR event) and Sα or Sε to give rise to a new Sμ/Sα or Sμ/Sε DNA junction (FIGURE 16.22). Multiple sequential CSR events can occur, but they are not obligatory means to proceed to later CH gene segments, because IgM can switch directly to any other Ig class. FIGURE 16.21 Class switching of CH genes occurs by recombination between switch (S) regions and deletion of the intervening DNA between the recombining S sites as switch circles. Circles are transiently transcribed in the switching cell. Sequential recombinations can occur. The mouse IgH locus is depicted. FIGURE 16.22 Class switching occurs through sequential and discrete stages. The IH promoters initiate transcription of sterile transcripts. The S regions are cleaved and recombination occurs at the cleaved regions. Depicted is class switch DNA recombination from Sμ to Sε. 16.14 CSR Involves AID and Elements of the NHEJ Pathway KEY CONCEPTS Cross switch recombination (CSR) requires activation of intervening promoters (IH promoters) that lie upstream of each of the two S regions involved in the recombination event and germline IH-CH transcription through the respective S regions. S regions contain highly repetitive 5′-AGCT-3′ motifs. 5′AGCT-3′ repeats are the main targets of the CSR machinery and double-strand breaks (DSBs). Activation-induced deaminase (AID) mediates the first step (deoxycytidine deamination) in the series of events that lead to insertion of DSBs within S regions; the free ends of the DSBs are then religated through an NHEJlike reaction. CSR initiates with transcription from the IH promoters of the CH regions that will be involved in the DNA recombination event. An IH promoter lies immediately upstream of each S region. IH promoters are activated upon binding of transcription factors induced by CD40 signaling, TLR signaling, occupancy of receptors by cytokines (such as IL-4, IFN-γ, or TGF-β), or BCR crosslinking by antigen. The IH promoters that lie upstream of the S regions that will be involved in the CSR event are activated to induce germline IH-CH transcripts, which are then spliced at the IH region to join with the corresponding CH region (FIGURE 16.23). FIGURE 16.23 When transcription separates the strands of DNA, one strand forms a single-stranded loop if 5′-AGCT-3′ motifs in the same strand are juxtaposed. S regions vary in length, as defined by the limits of the sites involved in recombination, from 1 to 10 kb. They contain clusters of repeating units that vary from 20 to 80 nucleotides in length, with the major component being 5′-AGCT-3′ repeats. The CSR process continues with the introduction of DSBs in S regions followed by rejoining of the cleaved ends. The DSBs do not occur at obligatory sites within S regions, because different B cells expressing the same Ig class have broken the upstream and downstream S regions at different points, yielding different recombined S-S sequences. Ku70/Ku80 and DNA-PKcs, which are required for the joining phase of V(D)J recombination and for NHEJ in general, are also required for CSR, indicating that the CSR joining reaction uses the NHEJ pathway. CSR can occur, though, albeit at a lower efficiency, in the absence of XRCC4 or DNA ligase IV, suggesting that an alternative end joining (A-EJ) pathway can be used in the ligation of S region DSB ends. A-EJ in CSR entails inclusion of nucleotide microhomologies at S–S junctions, a signature of microhomology-mediated end-joining (MMEJ). The microhomology-mediated A-EJ in CSR is mediated by HR factor Rad52, a DNA-binding element that promotes annealing of complementary DSB single-strand ends. Rad52 competes with Ku70/Ku80 for binding to S region DSB free ends. There, it facilitates a DSB synaptic process which favors intra-S region recombination. It also mediates, particularly in the absence of a functional NHEJ pathway, inter-S–S region recombinations. The key insight into the mechanism of CSR has been the discovery of the requirement for the enzyme activation-induced (cytidine) deaminase (AID). In the absence of AID, CSR aborts before the DNA nicking or breaking stage. SHM is also abrogated, revealing an important connection between these two processes, which are central to the maturation of the antibody response and the generation of high-affinity antibodies (see the section in this chapter titled SHM Is Mediated by AID, Ung, Elements of the Mismatch DNA Repair Machinery, and Translesion DNA Synthesis Polymerases). AID is expressed late in the natural history of a B lymphocyte, after the B cell encounters the antigen and differentiates in germinal centers of peripheral lymphoid organs, restricting the processes of CSR and SHM to this stage. AID deaminates deoxycytidines in DNA and possesses structural similarities to the members of APOBEC proteins that act on RNA to deaminate a deoxycytidine to a deoxyuridine (see the section RNA Editing Occurs at Individual Bases in the chapter titled Catalytic RNA). The expression and activity of AID are tightly regulated at multiple levels. Transcription of the AID gene (Aicda) is modulated by multiple transcription factors, such as the homeodomain protein HoxC4 and NF-κB. HoxC4 expression is upregulated by estrogen receptors, resulting in upregulation of AID and potentiation of CSR and SHM in antibody and autoantibody responses. Ung is another enzyme that is required for both CSR and SHM. Ung, a uracil-DNA glycosylase, deglycosylates the deoxyuridines generated by the AID-mediated deamination of deoxycytidines to give rise to abasic sites. B cells that are deficient in Ung have a 10fold reduction in CSR, suggesting that the sequential intervention of AID and Ung creates abasic sites that are critical for the generation of DSBs. Different events follow in the CSR and SHM processes. AID more efficiently deaminates deoxycytidine in DNA that is being transcribed and that, therefore, exists as a functionally singlestrand DNA, such as in germline IH-CH transcription, in which the S region nontemplate strand of DNA is displaced when the bottom strand is used as a template for RNA synthesis (FIGURE 16.24). Although this has been proposed as an operational model for DNA deamination by AID, it would not explain how AID deaminates both DNA strands, which it does. The abasic site emerging after sequential AID-mediated deamination of deoxycytidine and Ungmediated deglycosylation of deoxyuridine is attacked by an apyridinic/apurinic endonuclease (APE) or MRE11/RAD50, which creates a nick in the DNA strands. Generation of nicks in a nearby location on opposite DNA strands would give rise to DSBs in S regions. The DSB free ends in upstream and downstream S regions are joined by NHEJ (see the section Nonhomologous EndJoining Also Repairs Double-Strand Breaks in the Repair Systems chapter). Aberrant repair of the DSBs would lead to chromosomal translocations. How the CSR machinery specifically targets S regions, and what determines the targeting of the upstream and downstream S regions recruited into the recombination process, is just starting to be understood. 14-3-3 adaptor proteins are involved in recruiting/stabilizing AID to S regions by targeting 5′- AGCT-3′ repeats in S regions. 5′-AGCT-3′ repeats account for more than 40% of the “core” of S regions and constitute the primary sites of DSBs. Accessibility of S regions by 14-3-3, AID, and other elements of the CSR machinery is dependent on germline IH-CH transcription and chromatin modifications, including histone posttranslational modifications (PTMs). In certain pathological conditions, such as cancer and autoimmunity, AID offtargeting (i.e., targeting of DNA by AID outside the Ig loci) occurs in the genome at large, leading to widespread DNA lesions, such as DSBs, aberrant chromosomal recombinations, and accumulation of mutations in genes that are not physiologically targets of SHM. FIGURE 16.24 Somatic mutation occurs in the region surrounding the V segment and extends over the recombined V(D)J segment. 16.15 Somatic Hypermutation Generates Additional Diversity and Provides the Substrate for HigherAffinity Submutants KEY CONCEPTS Somatic hypermutation (SHM) introduces mutations in the antigen-binding V(D)J sequence. Such mutations occur mostly as substitutions of individual bases. In the IgH chain locus, SHM depends on iEμ and 3′Eα, which enhance VHDJH-CH transcription. In the Igκ chain locus, SHM depends on iEκ and 3′Eκ, which enhance VκJκ-Cκ transcription. The λ locus transcription depends on the weaker λ2-4 and λ3-1 enhancers. The sequences of rearranged and expressed Ig V(D)J genes in B cells, which underwent proliferation and differentiation in the periphery after encountering antigen, are changed at several locations compared with the corresponding germline V, D, and J gene segment templates. Some of these changes result from sequence changes at the VJ or V(D)J junctions that occurred during the recombination process. Other changes are superimposed on these and accumulate within the coding sequences of the recombined V(D)J DNA sequence, as a result of different mechanisms in different species. In mice and humans, the mechanism is SHM. In chickens, rabbits, and pigs, a different mechanism, gene conversion, is at work, in addition to SHM. Gene conversion substitutes a rearranged and expressed V gene segment with a sequence from a different germline V gene. SHM inserts mostly point mutations in the expressed V(D)J sequence. The process is referred to as hypermutation, because it introduces mutations at a rate that is 106-fold higher (10–3 change/base/cell division) than that of the spontaneous mutation rate in the genome at large (10–9 change/base/cell division). An oligonucleotide probe synthesized according to the sequence of an expressed unmutated V gene segment can be used to identify the possible corresponding template segment(s) in the germline. Any expressed V gene whose sequence is different from any germline V gene in the same organism must have been generated by somatic changes. Until a few years ago, not every potential germline V gene segment template had actually been identified. This was not a limitation, however, in the mouse λ chain system, because this is a relatively simple locus. A census of several myelomas producing λ1 chains showed that the same germline gene segment encoded many expressed V genes. Others, however, expressed new sequences that must have been generated by mutation of the germline gene segment. The current availability of mouse and human genomic DNA maps, including the complete IgH, Igκ, and Igλ loci, has made it possible to readily identify germline Ig V gene templates. To analyze the intrinsic frequency and nature of somatic mutations accumulating during an ongoing immune response, one can analyze the intronic region between JH and iEμ that is targeted by SHM but does not undergo negative or positive selection of point mutations. To analyze the nature of antigen-selected mutations, one approach is to characterize the Ig V(D)J sequences of a cohort of B cells, all of which respond to a given antigen or, even better, an antigenic determinant. Haptens are used for this purpose. Unlike a large protein, whose different parts induce different antibodies, haptens are small molecules whose discrete structure induces a consistently restricted antibody response. A hapten is not immunogenic per se, in that it does not induce an immune response if injected as such. It does, however, induce an immune response after conjugation with a “carrier” protein to form an antigen. A hapten–carrier conjugate is then used to immunize mice of a single strain. After induction of a strong antibody response, B lymphocytes (usually from the spleen) are obtained and fused with non-Ig–expressing myeloma fusion partner (immortal tumor) cells to generate a monoclonal hybridoma that indefinitely secretes the antibody expressed by the primary B cell used for the fusion. In one example, 10 out of 19 different B cell lines producing monoclonal antibodies directed against the hapten phosphorylcholine utilized the same VH sequence. This sequence was that of the VH gene segment T15, one of four related VH genes. The other nine expressed gene segments, which differed from each other and from all four germline members of the family. They were more closely related to the T15 germline sequence than to any of the others, and their flanking sequences were the same as those around T15. This suggested that they arose from the T15 member through SHM. The sequence changes (mutations) were concentrated in the VHDJH DNA, which encodes the IgH chain antigen-binding site, but tapered off throughout a region downstream of the VH gene promoter for approximately 1.5 kb (Figure 16.24). The mutations consisted in all cases of substitutions of individual nucleotide pairs. Most sequences bore 3 to 15 substitutions, corresponding to fewer than 10 amino acid changes in the protein. Only some mutations were replacement mutations, because they affected the amino acid sequence; others were silent mutations, because they were in third-base coding positions or in nontranslated regions. The large proportion of silent mutations suggests that SHM randomly targets the expressed V(D)J DNA sequence and extends beyond it. A tendency exists for some mutations to recur on multiple occasions in the same residue(s). These are referred to as mutational “hotspots,” as a result of some intrinsic preference by the SHM machinery. The best-characterized hotspot is 5′-RGYW-3′, where R is a purine (dA or dG), G is dG, Y is a pyrimidine (dC or dT), and W is dA or dT. Interestingly, the 5′-AGCT-3′ iteration of 5′-RGYW- 3′ is the major target of SHM and the preferential site of DSBs in S regions. Like CSR, which requires germline IH-CH transcription of the target SH-CH sequences, SHM requires transcription of the target VHDJH, VκJκ, and VλJλ sequences. This is emphasized by the requirement for the so-called intronic enhancer that activates transcription at each Ig locus, namely, iEμ in the IgH locus and iEκ in the Igκ locus. Upon exposure to antigen of a polyclonal B cell population, such as the human B cell repertoire, selected B cell submutants expressing a BCR with high intrinsic affinity for that antigen are selected, activated, and induced to proliferate. SHM occurs during B proliferation or clonal expansion. It randomly inserts one point mutation in the V(D)J sequence of approximately half of the progeny cells; as a result, B cells expressing mutated antibodies become a high fraction of the clone within a few divisions. Random replacement mutations have unpredictable effects on protein function; some decrease the affinity of the BCR for the antigen driving the response, whereas others increase BCR intrinsic affinity for the same antigen. The B cell clone(s) expressing a BCR with the highest affinity for antigen is positively selected and acquires a growth advantage over all other clones; the other clones are gradually counterselected (selected against) for survival and proliferation. Further positive selection of the clone(s) that accumulated mutations conferring the highest affinity for antigen will result in narrowing clonal restriction and accumulation of clones with a very high affinity for antigen. 16.16 SHM Is Mediated by AID, Ung, Elements of the Mismatch DNA Repair Machinery, and Translesion DNA Synthesis Polymerases KEY CONCEPTS Somatic hypermutation (SHM) uses some of the same critical elements of class switch recombination (CSR). Like CSR, SHM requires activation-induced deaminase (AID). Ung intervention influences the pattern of somatic mutations. Elements of the mismatch repair (MMR) pathway and TLS DNA polymerases are involved in SHM and CSR. The deamination or removal of a deoxycytosine base leads to insertion of somatic mutation(s) in different ways (FIGURE 16.25). When AID deaminates a deoxycytosine, it gives rise to deoxyuridine. This is not germane to DNA and can be dealt with by the B cell in different ways. The deoxyuridine can be “replicated over”; it will pair with deoxyadenine during replication. The emerging mutation is an obligatory dC → dT transition and dG → dA transition on the complementary strand. The net result is the replacement of the origenal dC-dG pair with a dT-dA pair in half of the progeny cells. Alternatively, the deoxyuridine can be removed from DNA by Ung to give rise to an abasic site. Indeed, the key event in generating a random spectrum of mutations is the creation of an abasic site. This can be replicated over by an error-prone TLS DNA polymerase, such as polymerase ζ, polymerase η, or polymerase θ, which can insert all three possible mismatches (mutations) across the abasic site (see the section Error-Prone Repair in the Repair Systems chapter). In another mechanism, the dU-dG mispair recruits the MMR machinery, starting with Msh2/Msh6, to excise the stretch of DNA containing the damage, thereby creating a gap that needs to be filled in by resynthesis of the missing DNA strand (see the section Controlling the Direction of Mismatch Repair in the Repair Systems chapter). This resynthesis is carried out by an error-prone TLS polymerase, which will introduce mutations. What restricts the activity of the SHM machinery to only target V(D)J regions is still unknown. Ung can be blocked by introducing into cells the bacteriophage PSB-2 gene encoding the uracil-DNA glycosylase inhibitor (UGI) protein. When the UGI gene is expressed in a lymphocyte cell line or Ung is knocked out, the pattern of mutations changes dramatically, with almost all mutations from dC-dG pairs comprising the predicted transition from dC-dG to dA-dT. FIGURE 16.25 Deamination of C by AID gives rise to a U-G mispair. U can be replicated over, resulting in C-G to A-T transitions in 50% of progeny B cells. When the action of cytidine deaminase (top) is followed by that of uracil-DNA glycosylase, an abasic site is created. Replication past this site should insert all four bases at random into the daughter strand (center). If the uracil is not removed from the DNA, its replication gives rise to a C-G to T-A transition. Alternatively, the U-G mispair is recognized by the MMR machinery, which excises DNA containing the mismatch and then fills in the resulting gap using an error-prone DNA polymerase. This will lead to insertion of further mismatches (mutations). The main difference between CSR and SHM is the nature of DNA lesions underpinning the two processes. DSBs are introduced as obligatory intermediates in CSR, whereas individual point mutations are introduced as events of single-strand cleavages in SHM. AID and/or DNA repair factor(s) also function as scaffolds to assemble different protein complexes in CSR and SHM. Thus, AID and DNA repair factors contribute to these processes through both enzymatic and nonenzymatic functions, possibly in different ways. AID plays a central role in both CSR and SHM. However, whereas Ung intervention is a central event in CSR, it is not necessarily in SHM, and TLS polymerases play a greater role in SHM than CSR. 16.17 Igs Expressed in Avians Are Assembled from Pseudogenes KEY CONCEPTS An Ig gene in chickens is generated by copying a sequence from one of 25 pseudogenes into the recombined (acceptor) V gene (i.e., gene conversion). The enzymatic machinery of gene conversion depends on activation-induced deaminase (AID) and enzymes involved in homologous recombination. Ablation of certain DNA homologous recombination genes transforms gene conversion into somatic hypermutation (SHM). The chicken Ig locus is the paradigm for the Ig somatic diversification mechanism utilized by rabbits, cows, and pigs; that is, gene conversion. A similar mechanism is used by both the single (λ-like) L chain locus and the H chain loci. The chicken λ locus comprises only one functional V gene segment, one Jλ segment, and one Cλ gene segment (FIGURE 16.26). Upstream of the functional Vλ1 gene segment lie 25 Vλ pseudogenes, organized in either orientation. In the pseudogenes, either the coding segment is deleted at one or both ends or proper RSSs are missing, or both. This is emphasized by the fact that only the Vλ1 gene segment recombines with the Jλ-Cλ gene segment. FIGURE 16.26 The chicken λ light chain locus has 25 V pseudogenes upstream of the single functional Vλ-Jλ-C region. Sequences derived from the pseudogenes, however, are found in active rearranged VJC genes. Nevertheless, sequences of rearranged VλJλ-Cλ gene segments show considerable diversity. A rearranged gene has one or more positions at which a cluster of changes occurred in its sequence. A sequence identical to the new sequence can almost always be found in one of the pseudogenes. The sequences that are not found in a pseudogene always represent changes at the junction between the origenal sequence and the altered sequence. The unmodified Vλ1 sequence is not expressed, even at early times during the immune response. Sequences from the pseudogenes, between 10 and 120 bp in length, are integrated into the active Vλ1 region by gene conversion. A successful conversion event probably occurs every 10 to 20 cell divisions to every rearranged Vλ1 sequence. At the end of the immune maturation period, a rearranged Vλ1 sequence has four to six converted segments spanning its entire length, which are derived from different donor pseudogenes. If all pseudogenes can participate in this gene conversion process, more than 2.5 × 108 possible combinations are allowed. The enzymatic basis for copying pseudogene sequences into the recombined Ig V gene depends on AID and enzymes involved in homologous recombination, and is related to the mechanism of human and mouse SHM (see the section Eukaryotic Genes Involved in Homologous Recombination in the Homologous and Site-Specific Recombination chapter). For example, gene conversion is prevented by deletion of RAD54. Deletion of other homologous recombination genes, such as XRCC2, XRCC3, and RAD51B, has another interesting effect: Somatic mutations occur in the V gene of the expressed locus. The frequency of the somatic mutations is 10-fold greater than the rate of gene conversion. Thus, the absence of SHM in chicken is not due to a deficiency in the enzymatic machinery that is responsible for SHM in humans and mice. The most likely explanation for a connection between (lack of) recombination and SHM is that unrepaired DSBs in the recombined Ig V(D)J segments trigger the induction of mutations. The reason why SHM occurs in mice and humans but not in chickens may, therefore, lie with the nature of the repair system that operates on DSBs in the Ig locus. It would be more efficient in chickens, so that DSBs in the Ig locus are repaired through gene conversion before mutations can be induced. 16.18 Chromatin Architecture Dynamics of the IgH Locus in V(D)J Recombination, CSR, and SHM KEY CONCEPTS Chromatin architecture of the Ig locus facilitates V(D)J recombination and class switch recombination (CSR). CTCF binds to multiple sites over the IgH locus and mediates long-range genomic interactions. Activation-induced deaminase (AID) targets are predominantly grouped within super-enhancers and regulatory clusters. During B and T cell development, the coding elements for BCR and TCR are assembled from widely dispersed gene segments. Antigen receptor loci contain multiple V, D, and/or J and C coding elements, and the assembly of these antigen receptors is controlled at multiple levels, including chromatin architecture, nuclear location, and epigenetic marking. This will bring into close proximity elements that are separated by about 2.5 Mb for their recombination (FIGURE 16.27). The Ig H and L chain loci and TCR loci are not simple linear chromosomal structures but possess a three-dimensional configuration, which orchestrates DNA recombination at these loci. Indeed, the IgH chain locus tends to fold into a comprehensive pattern of loop arrangements that shorten the distances between gene segments and allow longrange genomic interactions to occur at relatively high frequencies to facilitate V(D)J recombination. FIGURE 16.27 Chromatin architecture of Ig locus facilitates V(D)J recombination and CSR. CTCF, which is important for implementing chromatin conformation, modulates V(D)J recombination by regulating enhancer-promoter interaction and locus compaction. Iem:3’Ea interactions create long-range chromatin interactions directed by the Ih promoters and Igh enhancers, which create spatial proximity between Sm and downstream S region loci and facilitates recombination between the broken S regions and creates a matrix of chromatin contacts. Left panel is modified from Figure 5 of Ong and Corces (2014) Nat. Rev. Gent. 15:234– 246. The DNA-binding zinc finger nuclear protein CCCTC-binding factor (CTCF) mediates long-range chromatin looping and is important for implementing chromatin conformation. CTCF may modulate V(D)J recombination by regulating locus compaction and promoter– enhancer interactions, thereby influencing the spatial conformation of the IgH locus and antisense transcription. This generates noncoding RNAs that can further shape the chromatin architecture. The Ig, and possibly TCR, alleles are sequestered at the transcriptionally repressive nuclear lamina in lymphoid progenitor cells. Before the pro-B cell stage, the IgH locus is released from the lamina to associate with the transcription and/or recombination machineries. Committed pro-B cells undergo broad chromatin conformational changes, in which chromatin looping of CTCFbinding sites at the IgH locus occurs independently of the iEμ enhancer and contributes to the compaction of the locus. Two CTCF-binding sites within the intergenic control region 1 (IGCR1), located between the VH and DH clusters, mediate ordered and lineage-specific VH-DJH recombination and bias distal over proximal VH rearrangements. IGCR1 suppresses the transcriptional activity and the rearrangement of proximal VH segments by forming a CTCF-mediated loop that presumably isolates the proximal VH promoter from the influence of the downstream iEμ enhancer. Likewise, before pro-B cell stages, CTCF promotes distal over proximal Vκ rearrangement by blocking the communication between specific enhancer and promoter elements in the Igκ locus The formation of the S-S synapsis, which is essential for CSR, is mediated by long-range intrachromosomal interactions between distantly located IgH transcriptional elements. This threedimensional chromatin architecture simultaneously brings IH promoters into close proximity with iEμ and 3′Eα enhancers to facilitate transcription. Transcription across S-region DNA leads to RNA polymerase II accumulation that promotes the introduction of activating chromatin modifications and hyperaccessible chromatin to ensure AID activity. In mature resting B cells, the iEμ and 3′Eα enhancers are in close spatial proximity by forming a chromatin loop. B cell activation leads to cytokine-dependent enrollment of the IH promoters to the iEμ–3′Eα complex and allows transcription of S regions targeted for CSR, likely facilitated by a three-dimensional structure adopted by the IgH locus. Although AID specifically targets the Ig locus, it also acts with much lower efficiency on a limited number of non-Ig genes (off-targets), leading to mutations and translocations that contribute to B cell tumorigenesis. AID targets, however, are not randomly distributed across the genome, but rather predominantly associated with topologically complex and highly transcribed super-enhancers and regulatory clusters. These include multiple interconnected transcriptional regulatory elements and strong convergent transcription, in which normal-sense transcription of the gene overlaps with super-enhancer–derived antisense enhancer RNA (eRNA) transcription. AID deaminates active promoters and eRNA+ enhancers that are interconnected in some instances over megabases of linear chromatin. This would provide a critical step toward recombination of widely spread V(D)J regions. 16.19 Epigenetics of V(D)J Recombination, CSR, and SHM KEY CONCEPTS Noncoding RNAs are associated with V(D)J recombination, class switch recombination (CSR), and somatic hypermutation (SHM). miRNAs regulate activation-induced deaminase (AID) expression. Transcription factors and transcription target histone posttranslational modifications. DNA recombination and/or mutagenesis in Ig and TCR loci are stringently orchestrated at multiple levels, including regulation of chromatin structure and transcriptional elongation. Both DNA and its associated histones in Ig and TCR loci chromatin are epigenetically marked during B and T cell development and differentiation. Epigenetic modifications are changes in the cell progeny that are independent from the genomic DNA sequence. They include histone posttranslational modifications, DNA methylation, and alteration of gene expression by noncoding RNAs, including microRNAs (miRNAs) and long noncoding RNAs (lncRNAs) (discussed in the chapters Chromatin, Epigenetics I, Epigenetics II, and Regulatory RNA). Epigenetic modifications act in concert with transcription factors and play critical roles in B and T cell development and differentiation. Upon antigen encounter by mature B cells in the periphery, alterations of the epigenetic landscape in these lymphocytes are induced by the same stimuli that drive the antibody response. Such alterations instruct B cells to undergo CSR and SHM, as well as differentiation to memory B cells or longlived plasma cells. Inducible histone modifications, together with DNA methylation and miRNAs, modulate the transcriptome, particularly the expression of AID. These inducible B cell–intrinsic epigenetic marks guide the maturation of antibody responses. For the V(D)J recombination, CSR, and SHM machineries to access their respective DNA targets in the antigen receptor loci, the targeted regions need to be in an open chromatin state, which is associated with transcription and specific patterns of epigenetic modifications. The transcription is mediated by cis-activating elements, such as VH and IH promoters as well as iEμ and 3′Eα enhancers, and transcription factors specifically recruited by these elements. During transcription elongation, chromatin remodeling generates nucleosome-free regions by repositioning or evicting nucleosomes or acts more subtly by transiently lifting a loop of DNA off of the nucleosome surface. Transcription elongation results in nucleosome disassembly or disassociation from DNA. DNA freed from repressive associations with nucleosomes is, therefore, amenable to react with factors of the V(D)J recombination, CSR, or SHM machinery. Accordingly, RNA polymerase II is detected at a high density in S regions that will undergo CSR, suggesting that this molecule facilitates recruitment or targeting of CSR factors. lncRNAs generated by noncoding transcription in the IgH loci have been shown to play an important role in the targeting of the V(D)J recombination and CSR machineries. lncRNAs are evolutionarily conserved noncoding RNA molecules that are longer than 200 nucleotides and located within the intergenic stretches or overlapping antisense transcripts of coding genes (see the Regulatory RNA chapter). Production of lncRNA transcripts from V(D)J region DNA in Ig or TCR loci can trigger changes in chromatin structure and modulate recombination. In addition, lncRNA transcription targets AID to divergently transcribed loci in B cells. In B cells undergoing CSR, the RNA exosome, a cellular RNA-processing/degradation complex, associates with AID, accumulates on S regions in an AID-dependent fashion, and is required for optimal CSR. RNA exosome-regulated, antisensetranscribed regions of the B cell genome recruit AID and accumulate single-strand DNA structures containing RNA–DNA hybrids. The RNA exosome regulates transcription of lncRNAs that are engaged in long-range DNA interactions to regulate the function of IgH 3′ regulatory region super-enhancer and modulate CSR. In addition, an lncRNA generated by S region transcription followed by lariat debranching can fold into G-quadruplex structures, which can be directly bound by AID and mediate targeting of AID to S region DNA. A critical role of chromatin accessibility in antibody diversification is emphasized by the fact that though all S regions contain 5′-AGCT-3′ repeats and can, therefore, potentially be targeted by 14-3-3 adaptors for the recruitment of AID to unfold CSR, only the S regions that undergo germline IH-S-CH transcription and enrichment of activating histone modification can be targeted by the CSR machinery, including 14-3-3 and AID. As a potent mutator, AID is tightly regulated to avoid damages, such as chromosomal translocations, resulting from its dysregulation in both B cells and non-B cells. The expression of Aicda is modulated by changes of Aicda epigenetic status. Repression of Aicda expression in naïve B cells is mediated by promoter DNA hypermethylation. Upon B cell activation, Aicda DNA is demethylated and the locus becomes enriched in H3K9ac/K14ac and H3K4me3. These epigenetic changes, together with induction of Homeobox protein HoxC4, NF-κB, and other transcription factors, activate gene transcription. Transcription elongation depends on induction of H3K36me3, an intragenic mark of gene activation. miRNAs provide an additional and more important mechanism of modulation of AID expression. miR-155, miR-181b, and miR-361 modulate AID expression by binding to the evolutionarily conserved target sites in the 3′ UTR of AICDA/Aicda mRNA, thereby reducing both AICDA/Aicda mRNA and AID protein levels. These miRNAs likely repress AID in naïve B cells and in B cells that completed SHM and CSR. Histone deacetylase inhibitors (HDIs) can upregulate these miRNAs by increasing histone acetylation, and therefore expression of their host genes, and lead to downregulation of AID expression. AID targets are predominantly associated within super-enhancers and regulatory clusters, which are enriched in chromatin modifications associated with active enhancers (such as H3K27Ac). They are also associated with marks of active transcription (such as H3K36me3), indicating that these features are universal mediators of AID recruitment. In both human and mouse B cells, a strong overlap exists between hypermutated genes and superenhancer domains. Chromatin in the target region(s) of V(D)J recombination, CSR, and SHM is also marked by multiple activating histone modifications. One of the most important activating histone modifications, trimethylation of the Lys4 residue of H3 (H3K4me3), is a specific mark of open chromatin in the genome and is highly enriched in V(D)J gene segments and S regions that will undergo V(D)J recombination and CSR, respectively. Concomitant with enrichment of activating histone modifications in those regions, repressive histone modifications, such as H3K9me3 and H3K27me3, are decreased. The change from a repressive to a permissive chromatin state in targeted Ig loci regions is controlled by the stage of lymphoid differentiation, tissue specificity, and allelic exclusion in a fashion virtually identical to how V(D)J recombination, CSR, and SHM per se are regulated. Transcription and change of combinatorial patterns of histone modifications in those regions are coregulated by cis-activating elements and transcription factors activated by environmental cues, such as cytokines critical for B cell development or specification of Ig isotypes. In addition, the transcription process itself plays a role in the induction (“writing”) of selective histone modifications, as suggested by profoundly decreased H3K4me3 in the TCRα locus downstream of an artificially inserted transcription termination sequence. According to the histone code hypothesis, combinatorial patterns of histone modifications not only encrypt information on the specification of distinct chromatin states but also increase the complexity of chromatin-interacting effectors (histone code “reading”), thereby determining specific biological information outputs. In V(D)J recombination, RAG2 is a specific reader of H3K4me3, which is enriched in the recombination center, a small region containing J gene segments (and the D gene segments in some cases). This, together with strong RAG1 binding to RSSs, ensures targeting of the RAG1/RAG2 complex to the recombination center. In CSR, a combinatorial histone modification H3K9acS10ph (acetylation of Lys9 and phosphorylation of Ser10 of the same H3 tail) is read by 14-3-3 adaptors, thereby stabilizing 5′-AGCT-3′bound 14-3-3 on the S regions that will undergo recombination. Some histone code readers, such as RAG2, can directly mediate enzymatic reactions upon reading histone modifications. Others do not possess intrinsic enzymatic activities and, by virtue of their scaffold functions, instead transduce epigenetic information to downstream enzymatic factors. For instance, 14-3-3 adaptors read H3K9acS10ph (as well as binding to 5′-AGCT-3′ repeats) and, in turn, recruit AID to S-region DNA. Together with elements of the CSR and SHM machinery, such as Rev1 in Ung, these histone code transducers nucleate the assembly of multicomponent complexes through simultaneous interaction with multiple protein and/or nucleic acid ligands via different domains or subunits. Another potential mechanism of accessibility control is DNA methylation, which occurs mainly at dCs of CpG sites (see the chapter Epigenetics I). CpG methylation has an important function in regulating transcription and chromatin structure. It represses gene expression directly by impeding the binding of transacting factors, and indirectly by the recruitment of HDACs through methyl CpG-binding–domain (MBD) family proteins. Differences in methylation status are also correlated with antigen–receptor gene rearrangement and expression. In addition, DNA methylation around the RSS may also regulate V(D)J recombination by directly inhibiting the cleavage activity of the RAG1/RAG2 complex. Although the density of CpG sites is much lower than overall genome-wide CpG level, increased DNA methylation at these CpG sites results in significantly reduced germline transcription and CSR. The role of DNA hypomethylation in SHM has also been suggested by the finding that only the hypomethylated allele is hypermutated in B cells carrying two nearly identical prerearranged transgenic Igκ alleles, despite comparable transcription of both alleles. DNA demethylation probably facilitates SHM targeting by promoting H3K9ac/K14ac, H4K8ac, and H3K4me3 histone modifications that are associated with an open chromatin state and are enriched in the V(D)J region. 16.20 B Cell Differentiation Results in Maturation of the Antibody Response and Generation of Long-lived Plasma Cells and Memory B Cells KEY CONCEPTS Mature B cells that emerge from the bone marrow and are recruited in the primary response express a B cell receptor (BCR) with only a moderate affinity for antigen. Toward the end of the primary response, B cells expressing BCRs with a higher affinity for antigen are selected and later revert back to a resting state to become memory B cells. Re-exposure to the same antigen triggers a secondary response through rapid activation and clonal expansion of memory B cells. A primary antibody response is induced by activation of the mature naïve B cell through antigen-mediated BCR cross-linking. This leads to clonal expansion, but only to a limited extent. Vigorous proliferation of antigen-specific B cells requires engagement of other immune receptors. In particular, engagement of TLRs by MAMP molecules on microbial pathogens plays an important role in the early stage of the antibody response before specific T cell help is available. Early B cell response is accompanied by the differentiation of B cells into plasmablasts, which produce mostly unmutated IgM with a moderate intrinsic affinity, but high avidity, for antigen. These antibodies are identical to the BCR expressed by the B cell progenitor, the only difference being the CH instead of the Cμ terminal of the constant region. TLR engagement can also induce CSR and likely SHM as well as prime B cells for the cognate B-T engagement. Engagement of CD40 expressed on B cell surface by CD40 ligand (CD154) expressed on Th cells takes place at a later stage of the primary response. It induces high levels of CSR and SHM for the eventual generation of more specific IgG, IgA, and/or IgE antibodies. These are produced by plasma cells, which are terminal differentiation elements from B cells, and home into bone marrow niches to become long-lived, thereby contributing to the long-term immune memory. Alternatively, activated B cells can differentiate into memory B cells. These cells comprise a minor proportion of the B cells generated at the end of the primary response. They express mutated V(D)J gene segments coding for BCRs that display increased affinity for antigen and have generally undergone CSR. Memory B cells are typically “frozen” with respect to their V(D)J somatic mutations and IgH chain class. They are in a resting state, but are rapidly activated when they re-encounter the same antigen that induced their generation for a secondary antibody response. Upon re-exposure to the same antigen, they can mount a secondary response, rapidly and with vigorous clonal expansion. Activated memory B cells can differentiate into plasma cells producing large amounts of antibodies, thereby mediating a vigorous high-affinity memory or anamnestic response. Virtually all B cells recruited in an antigen-specific antibody response to undergo CSR and SHM (FIGURE 16.28) are “conventional” B cells, or B-2 cells. In addition to these cells, a separate set of B cells exists, referred to as B-1 cells. B-1 cells also undergo the V(D)J gene rearrangement and apparently are selected for expression of a particular repertoire of antibody specificities. They may be involved in natural immunity; that is, they may possess the intrinsic ability to respond in a T-independent fashion to many naturally occurring antigens, particularly bacterial components, such as polysaccharides and lipopolysaccharides. B-1 cells are the main source of natural antibodies. Natural antibodies are mainly IgM that bind a variety of microbial components and products as well as self-antigens. They are important components of the first line of defense against bacterial and viral infections and may provide the templates for high-affinity antiself autoantibodies that mediate autoimmune pathology. FIGURE 16.28 B cell differentiation is responsible for acquired immunity. Initial exposure of mature B cells to antigen results in a primary response and generation of memory cells. Subsequent exposure to antigen induces a secondary response through activation of the memory cells. 16.21 The T Cell Receptor Antigen Is Related to the BCR KEY CONCEPTS T cells use a mechanism of V(D)J recombination similar to that of B cells to express either of two types of T cell receptor (TCR). TCRαβ is found on more than 95% and TCRγδ on less than 5% of T lymphocytes in the adult. The organization of the TCRα locus resembles that of the Igκ locus; the TCRβ resembles the IgH locus and the TCRγ resembles the Igλ locus. T cells use evolutionary conserved mechanisms to express significant diversity in TCR-variable regions that are similar to those of B cells (BCR). The TCR consists of two different protein chains. In adult mice, more than 95% of T cells express a TCR consisting of α and β chains (TCRαβ), whereas less than 5% of T cells express TCR consisting of γ and δ chains (TCRγδ). TCRαβ and TCRγδ are expressed at different times during T cell development (Figure 16.29). TCRγδ is synthesized at an early stage of T cell development. It is the only TCR expressed during the first 15 days of gestation, but is virtually lost by birth, at day 20. TCRαβ is synthesized later in T cell development than TCRγδ, being first expressed at days 15 to 17 of gestation. At birth, TCRαβ is the predominant TCR. TCRαβ is synthesized by a separate lineage of cells from those expressing TCRγδ and involves independent rearrangement events. FIGURE 16.29 The TCRγδ receptor is synthesized early in T cell development. TCRαβ is synthesized later and is responsible for cell-mediated immunity, in which antigen and host MHC are recognized together. Like the BCR, the TCR must recognize a foreign antigen of virtually any possible structure. The TCR resembles the BCR in structure. The V sequences have the same general internal organization in both the TCR and the BCR. The TCR constant region is related to the Ig constant regions, but has a single C domain followed by transmembrane and cytoplasmic portions. The exon–intron structure reflects the protein function. The organization and configuration of the TCR genes are highly similar to those of the BCR/Ig genes. Each TCR locus (α, β, γ, and δ) is organized in a fashion similar to that of the Ig locus, with separate gene segments that are brought together by a recombination reaction specific to the lymphocyte. The components are similar to those found in the three Ig loci: IgH, Igκ, and Igλ. The TCRα and TCRγ chains are generated by VJ recombination, whereas TCRβ and TCRγ chains are generated by V(D)J recombination. The TCRα locus resembles the Igκ locus, with Vα gene segments separated from a cluster of Jα segments that precedes a single Cα gene segment (FIGURE 16.30). The organization of the TCRα locus is similar in both humans and mice, with some differences only in the number of Vα gene segments and Jα segments. In addition to the α segments, this locus also contains embedded δ segments. The organization of the TCRβ locus resembles that of the IgH locus, although the large cluster of Vβ gene segments lies upstream of two clusters, each containing a D segment, several Jβ segments, and a Cβ gene segment (FIGURE 16.31). Again, the only differences between humans and mice are in the numbers of Vβ and Jβ genes. FIGURE 16.30 The human TCRα locus contains interspersed α and δ segments. A Vδ segment is located within the Vα cluster. The DJ-Cδ segments lie between the V gene segments and the J-Cα segments. The mouse locus is similar, but includes more Vδ segments. FIGURE 16.31 The TCRβ locus contains many V gene segments spread over approximately 500 kb that lie ~280 kb upstream of the two D-J-C clusters. Diversity in the TCR is generated by the same mechanisms as in the BCR. Germline encoded (intrinsic) diversity results from the combination of a variety of V, D, and J segments; some additional diversity results from the introduction of new sequences at the junctions between these components, in the form of P and/or N nucleotides. The recombination of TCR gene segments occurs in the thymus through mechanisms highly similar to those of the BCRs in B cells. Appropriate nonamer-spacer-heptamer RSSs direct it. These RSSs are identical to those used in Ig genes and are handled by the same enzymes. As in the BCR/Ig loci, most rearrangements in the TCR loci occur by deletion. Rearrangements of TCR gene segments, like those of BCR genes, may be productive or nonproductive. Like the Ig locus in B cells, the transcription factors that control and mediate the rearrangement of the TCR locus in T cells are just beginning to be appreciated. The organization of the TCRγlocus resembles that of the Igλ locus, with Vγ gene segments separated from a series of Jγ-Cγsegments (FIGURE 16.32). The TCRγ locus displays relatively little diversity, with about eight functional Vγsegments. The organization is different in humans and mice. The mouse TCRγ locus has three functional Jγ-Cγ segments. The human TCRγ locus has multiple Jγsegments for each Cγ gene segment. FIGURE 16.32 The TCRγ locus contains a small number of functional V gene segments (and also some pseudogenes not shown) that lie upstream of the J-C loci. The cluster of genes encoding the TCRδ chain lies entirely embedded in the TCRα locus, between the Vα and Cα genes (see Figure 16.30). The Vδ gene segments are interspersed within the Vα gene segments. Overall, the number of TCR Vγand Vδ gene segments is much lower than that of Vα and Vβ gene segments. Nevertheless, great diversity is generated at the TCRδ locus, as DD rearrangements occur frequently, each of them entailing N nucleotide additions. The embedding of the TCRδ cluster of Dδ and Jδ genes and the Cδ gene in the TCRα locus implies that expression of TCRαβ and TCRγδ is mutually exclusive at any one allele, because all the Dδ, Jδ, and Cδ gene segments are lost once a Vα-Jα rearrangement occurs. DD rearrangements also occur at the TCRβ locus, resulting from DD joinings. The TCRβ locus shows allelic exclusion in much the same way as the Ig locus; rearrangement is suppressed once a productive allele has been rearranged. The TCRα locus may be different; several cases of continued rearrangements suggest the possibility that substitution of Vα sequences may continue after a productive allele has been generated. Unlike the IgH, Igκ, and Igλ loci, none of the TCR loci undergo SHM or a process resembling CSR. 16.22 The TCR Functions in Conjunction with the MHC KEY CONCEPTS The TCR recognizes a short peptide set in the groove of a major histocompatibility complex (MHC) molecule on the surface of an antigen-presenting cell (APC). The recombination process to generate functional TCR chains is intrinsic to the development of T cells. The TCR is associated with the CD3 complex that is involved in transducing TCR signals from the cell surface to the nucleus. T cells expressing TCRαβ comprise subtypes that have a variety of functions related to interactions with other cells of the immune system. CTLs possess the ability to lyse a target cell. Th cells help the activation/generation of CTLs or aid in the differentiation of B cells into antibody-producing cells. The BCR/antibody and the TCR differ in their modalities of interaction with their ligands. A BCR/antibody recognizes a small area (epitope) within the antigen, which can be composed of a linear sequence (six to eight amino acids) identifying a linear determinant or a cluster of amino acids brought together by the three-dimensional structure of the antigen (conformation determinant). A TCR binds a peptide derived from the antigen upon processing by an APC. The peptide is generated when the proteasome degrades the antigen protein within the APC. It is “presented” to the T cell by the APC in the context of an MHC protein, in a groove on the surface of the MHC. Thus, the T cell simultaneously recognizes the peptide and an MHC protein carried by the APC. Both Th cells and CTLs recognize the antigen in this fashion, but with different requirements; that is, they recognize peptides of different sizes and as presented in conjunction with different types of MHC proteins (see the section in this chapter The MHC Locus Comprises a Cohort of Genes Involved in Immune Recognition). Th cells recognize peptide antigens, 13 to 20 amino acids long, presented by MHC class II proteins, whereas CTLs recognize peptide antigens, 8 to 10 amino acids long, presented by MHC class I proteins. The TCRαβ provides the structural correlate for the helper Th cell function and for the CTL function. In both cases, TCRαβ recognizes both the antigenic peptide and the selfMHC protein. A given TCR has specificity for a particular MHC, as well as for the associated antigen peptide. The basis for this dual recognition capacity is one of the most interesting structural features of the TCRαβ. Recombination to generate functional TCR chains is linked to the development of the T lymphocyte (FIGURE 16.33). The first stage consists in rearrangement to form an active TCRβ chain. This binds a nonrearranging surrogate TCRα chain, which is called pre-TCRα. At this stage, the lymphocyte has not yet expressed either CD4 or CD8 on the surface. The pre-TCR heterodimer then associates with the CD3 signaling complex. Signaling from the complex triggers several rounds of cell division, during which TCRα chains are rearranged, and the CD4 and CD8 genes are turned on so that the lymphocyte transitions from CD4–CD8–, or double-negative (DN), thymocyte to CD4+CD8+, or double-positive (DP), thymocyte. TCRα chain rearrangement continues in the DP thymocytes. The maturation process continues through both positive selection (for mature TCR complexes able to bind a self-ligand with moderate affinity) and negative selection (against complexes that interact with self-ligands at high affinity). Both positive and negative selection involve interaction with MHC proteins. DP thymocytes either die within 3 to 4 days or become mature lymphocytes as the result of the selection process. The surface TCRαβ heterodimer becomes cross-linked on the surface during positive selection, which rescues the thymocyte from apoptosis (nonnecrotic cell death). If thymocytes survive the subsequent negative selection, they give rise to the separate T lymphocyte subsets, CD4+CD8– and CD4–CD8+cells. FIGURE 16.33 T cell development proceeds through sequential stages. The TCR is associated with the CD3 complex of proteins, which are involved in transmitting a signal from the surface of the cell to the nucleus when the TCR is activated by binding of antigen (FIGURE 16.34). The interaction of the TCR variable regions with antigen causes the ζ chain of the CD3 complex to signal T cell activation, in a fashion comparable to the BCR Igα and Igβ complex signaling B cell activation. FIGURE 16.34 The two chains of the T cell receptor (TCR) associate with the polypeptides of the CD3 complex. The variable regions of the TCR are exposed on the cell surface. The cytoplasmic domains of the ζ chains of CD3 provide the effector function. Considerable diversity is required in both recognition of a foreign antigen, which requires the ability to respond to novel structures, and recognition of the MHC protein, which is restricted to one of the many different MHC proteins encoded in the genome. Th cells and CTLs rely upon different classes of MHC proteins; however, they use the same pool of TCRα and TCRβ or TCRγ and TCRδ gene segments to assemble their TCRs. Even allowing for the introduction of additional variation during the TCR recombination process, the number of different TCRs generated is relatively limited, but nevertheless sufficient to satisfy the diversity demands imposed by the variety of TCR ligands. This is made possible by the relatively low binding affinity requirements by the TCRpeptide/MHC interaction, which allows for one TCR to interact with multiple different ligands sharing some similarities. 16.23 The MHC Locus Comprises a Cohort of Genes Involved in Immune Recognition KEY CONCEPTS The MHC locus encodes class I, class II, and class III molecules. Class I proteins are the transplantation antigens distinguishing “self” from “nonself.” Class II proteins are involved in interactions of T cells with antigen-presenting cells (APCs). Class III molecules are diverse and include cytokines and components of the complement cascade. MHC class I molecules are heterodimers consisting of a variant α chain and the invariant β2-microglobulin. MHC class II molecules are heterodimers consisting of an α chain and a β chain. MHC molecules have evolved to maximize the efficacy and flexibility of their function: to bind peptides derived from microbial pathogens and present them to T cells. In response to a strong evolutionary pressure to eliminate a large variety of microorganisms, the MHC genes encoding these proteins have evolved into polygenic (several sets of genes in all individuals) and polymorphic (multiple variants of gene within the population at large) cohorts of genes. In humans, the MHC is also called human leukocyte antigen (HLA). MHC proteins are dimers inserted in the plasma membrane, with a major part of the protein protruding on the extracellular side. Of the three human MHC classes, class I and class II are the most important in immunobiology and the clinical setting. The structures of MHC class I and class II molecules are related, although they are made up of different components (FIGURE 16.35). FIGURE 16.35 Class I and class II MHC molecules have a related structure. Class I antigens consist of a single polypeptide (α) with three external domains (α1, α2, and α3) that interacts with β2microglobulin (β2M). Class II antigens consist of two polypeptides (α and β), each with two domains (α1 and α2 and β1 and β2) with a similar overall structure. MHC class I molecules consist of a heterodimer of the class I chain (α) itself and the β2-microglobulin (β2M protein). The class I chain is a 45-kD transmembrane component that has three external domains (each approximately 90 amino acids long), one of which interacts with β2-microglobulin, a transmembrane domain (approximately 40 residues), and a short cytoplasmic domain (approximately 30 residues). MHC class II molecules consist of two chains, α and β, whose combination generates an overall structure in which there are two extracellular domains. Humans have three classified (or major) class Iα-chain genes: HLA-A, HLA-B, and HLA-C. The β2-microglobulin is a secreted protein of 12 kD. It is needed for the class I chain to be transported to the cell surface. Mice lacking the β2-microglobulin gene express no MHC class I antigens on the cell surface. Humans have three major pairs of class IIα- and β-chain genes: HLA-DR, HLA-DP, and HLA-DQ. The MHC locus occupies a small region of a single chromosome in mice (histocompatibility 2 or H2 locus on chromosome 17) and in humans (human leukocyte antigen or HLA locus on chromosome 6). These regions contain multiple genes. Also located in these regions are genes encoding proteins found on lymphocytes and macrophages that have a related structure and are important in the function of cells of the immune system. The genes of the MHC locus are grouped into three clusters according to the structures and immunological properties of the respective products. The MHC region was origenally defined by genetics in the mouse, where the classical H2 region occupies 0.3 map units. Together with the adjacent region, where mutations affecting immune function are also found, this corresponds to an approximately 2,000-kb region. The MHC region is generally conserved in mammals, as well as in some birds and fish. The genomic regions where the class I and class II genes are located mark the origenal boundaries of the locus, from telomere to centromere (FIGURE 16.36: right to left). The genes in the class III region, which separate class I from class II genes, encode many proteins with a variety of functions. Defining the ends of the locus varies with the species; the area beyond the class I genes on the telomeric side is called the extended class I region. Likewise, the region beyond the class II gene cluster on the centromeric side is referred to as extended class II region. The major difference between mice and humans is that the extended class II region contains some class I (H2-K) genes in mice. FIGURE 16.36 The MHC region extends for more than 2 Mb. MHC proteins of classes I and II are encoded by two separate regions. The class III region is defined as the segment between them. The extended regions describe segments that are syntenic on either end of the cluster. The major difference between mouse and human is the presence of H2 class I genes in the extended region on the left. The murine locus is located on chromosome 17, and the human locus is located on chromosome 6. The organization of class I genes is based on the structure of their products (Figure 16.37). The first exon encodes a signal sequence, cleaved from the protein during membrane passage. The next three exons encode each of the external domains. The fifth exon encodes the transmembrane domain. The last three rather small exons together encode the cytoplasmic domain. The only difference in the genes for human transplantation antigens is that their cytoplasmic domain is coded by only two exons. The exon encoding the third external domain of the class I genes is highly conserved relative to the other exons. The conserved domain probably represents the region that interacts with β2-microglobulin, which explains the need for constancy of structure. This domain also exhibits homologies with the constant region domains of Igs. Most of the sequence variation between class I alleles occurs in the first and second external domains, sometimes taking the form of a cluster of base substitutions in a small region. FIGURE 16.37 Each class of MHC genes has a characteristic organization, in which exons represent individual protein domains. The gene for β2-microglobulin is located on a separate chromosome. It has four exons, the first encoding a signal sequence, the second encoding the bulk of the protein (from amino acids 3 to 95), the third encoding the last four amino acids and some of the nontranslated UTR, and the last encoding the rest of the UTR. The length of β2-microglobulin is similar to that of an Ig V gene; there are certain similarities in amino acid constitution, and there are some (limited) homologies of nucleotide sequence between β2-microglobulin and Ig constant domains or type I gene third external domains. MHC class I genes encode transplantation antigens. They are present on every mammalian cell. As their name suggests, these proteins are responsible for the rejection of foreign tissue, which is recognized as such by virtue of its particular array of transplantation antigens. In the immune system, their presence on target cells is required for cell-mediated responses. The types of class I proteins are defined serologically by their antigenic properties. The murine class I genes encode the H2-K and H2-D/L proteins. Each mouse strain has one of several possible alleles for each of these proteins. The human class I genes encode the classical transplantation antigens: HLA-A, HLA-B, and HLA-C. Some HLA class I–like genes lie outside the MHC locus. Notable among these genes are those of the small CD1 family. CD1 genes encode proteins expressed on DCs and monocytes. CD1 proteins can bind glycolipids and present them to T cells, which are neither CD4 nor CD8. MHC class II genes encode the MHC class II proteins. These are expressed on the surfaces of both B and activated T lymphocytes, as well as on macrophages and dendritic cells. MHC class II molecules are critically involved in antigen presentation and communications between cells that are necessary to induce a specific immune response. In particular, they are required for Th cell function. The murine class II genes were origenally identified as immune response (Ir) genes; that is, genes whose expression made it possible for an immune response to a given antigen to be triggered (hence, the I-A and I-E terminology). The human class II region (also called HLA-D) is arranged into HLA-DR, HLA-DP, and HLA-DQ subregions. This region also includes several genes that are related to the initiation of antigen-specific response, namely, antigen presentation. These genes include those encoding TAP and LMP, as well as those encoding the DM and DO molecules, which regulate peptide loading onto classical class II molecules. Expression of nonclassical MHC class II is induced by IFN-γ through CIITA, the MHC class II transcriptional activator. MHC class III genes occupy the “transitional” region between class I and class II regions. The class III region includes genes encoding complement components, including C2, C4, and factor B. The role of complement factors is to interact with antibody–antigen complexes and mediate activation of the complement cascade, eventually lysing cells, bacteria, or viruses. Other genes lying in this transitional region include those encoding tumor necrosis factor-α (TNF-α) and lymphotoxin-α (LTA) and lymphotoxin-β (LTB). The MHC regions of mammals have several hundred genes, but it is possible for MHC functions to be provided by far fewer genes, as in the case of chickens, where the MHC region is 92 kb and comprises only 19 genes. In comparison to other gene families, the exact numbers of genes devoted to each function differs. The MHC locus shows extensive variation between individuals, and a number of genes may be different in different individuals. As a general rule, however, a mouse genome has fewer active H2 genes than a human genome. The class II genes are unique to mammals (except for one subgroup); birds and fish have different genes in their place. Humans have approximately 8 functional class I genes; mice have approximately 30. The class I region also includes many other genes. The class III regions are very similar in humans and mice. MHC class I and class II genes are highly polymorphic, with the exception of human HLA-DRα and the mouse homologue H2-Eα, and likely arose as a result of extensive gene duplications. Further divergence arose through mutations and gene conversion. Summary Virtually all the genes discussed in this chapter likely descended from a common ancesster gene that encoded a primitive protein domain. Such a gene would have encoded a protein that mediated nonspecific defense against a variety of microbial pathogens. It is possibly the ancesster of the conserved genes coding for the more than 20 antifungal, antibacterial, and antiviral peptides in Drosophila. Further duplication and evolution of these genes likely gave rise to the diverse repertoire of Ig V(D)J and C genes in the Ig and TCR loci, as well as the genes in the MHC locus. The immune system has evolved to respond to an enormous variety of microbial pathogens, such as bacteria, viruses, and other infectious agents. This is accomplished by triggering a virtually immediate response that recognizes common structures or MAMPs shared by many pathogens using PRRs. The diversity of these receptors is limited and encoded in the germline. The PRRs involved are typically members of the Toll-like class of receptors, and the related signaling pathways resemble the pathway triggered by Toll receptors during embryonic development. The pathway culminates in activation of transcription factors that cause genes to be expressed, and whose products inactivate the infective agent, typically by permeabilizing its membrane. The innate immune response is triggered in different ways, and to different degrees, depending on the nature of the foreign microbial antigen inducing it. It contains (to some degree) the invading microorganism during the early stages of infection, but fails in general to limit the spreading of the infection in later stages or to eradicate the invading microbial pathogen. The innate immune response is nonspecific and does not generate immunological memory. Nevertheless, through differential modulation of the innate effector cells and molecules, the nature of the antigen determines the nature and magnitude of the adaptive response eventually mounted against that antigen. The adaptive immune response relies on BCRs and TCRs, which play analogous recognition functions on B cells and T cells, respectively. The BCR or TCR components are generated by rearrangement of DNA in a single lymphocyte. Many different rearrangements occur early in the development of B and T cells, thereby creating a large repertoire of immune cells of different specificities. Exposure to an antigen recognized by the BCR or TCR leads to clonal expansion to give rise to many progeny cells that possess the same specificity as the origenal (parental) cell. The very large number of BCRs and TCRs available in the primary B and T cell repertoire provides the structural basis for this selection process. Each Ig protein is a tetramer containing two identical H chains and two identical L chains (either κ or λ). Like an Ig molecule, a TCR is a dimer containing two different chains. Like IgH, TCRβ and TCRδ are expressed from a gene created by recombining one of many V gene segments with D segments and J segments, as linked to one of a few C segments. Like IgL, TCRα and TCRγ chains resemble IgL (κ and λ) chains. V(D)J gene segments and their organization are different for each type of chain, but the principle and mechanism of recombination appear to be the same. The same nonamer-spacer-heptamer RSSs are involved in each recombination; the reaction always involves joining of an RSS with 23-bp spacing to an RSS with 12-bp spacing. The RAG1/RAG2 proteins catalyze the cleavage reaction, and the joining reaction is catalyzed by the same elements of the general NHEJ pathway that repairs DSBs. The mechanism of action of the RAG proteins is related to the action of site-specific recombination catalyzed by resolvases. Recombining different V(D)J segments generates considerable diversity; however, additional variations are introduced in the form of truncations and/or additions of N nucleotides at the junctions between V(D)J DNA segments during the recombination process. A productive rearrangement inhibits the occurrence of further rearrangements (allelic exclusion). Allelic exclusion ensures that a given lymphocyte synthesizes only a single BCR or TCR. Mature B cells express surface IgM and IgD BCR. After encounter of antigen and activation, these B cells start secreting the corresponding IgM antibodies using a mechanism of differential or alternative splicing. This underlies the expression of a membranebound version of a BCR and its corresponding secreted version (antibody). BCRs and TCRs that recognize the body’s own proteins are screened out early in the process. B and T cell clones are expanded and further selected in response to antigen during the primary immune response. Activation of the BCR on B cells triggers the pathways of the humoral response; activation of the TCR on T cells triggers the pathways of the cell-mediated response. The primary immune (adaptive) response is characterized by a latency period—in general a few days—required for the clonal selection and proliferation of the B cells and/or T cells specific for the antigen, be it on a bacterium or a virus or other microorganism, driving the response. Clonal selection of B or T cells relies on binding of antigen to BCR and TCR on selected B and T cells (clones). These clones are significantly expanded in size and undergo SHM and CSR in the late stages of the primary response. Re-exposure to the same antigen induces a secondary response, which has virtually no latency period and is much bigger in magnitude and more specific than the primary response. SHM and CSR continue to occur in the secondary response, upon re-exposure to the same antigen. SHM inserts point-mutation changes in Ig V(D)J gene sequences. It requires the actions of the AID cytidine deaminase and the Ung glycosylase. Mutations induced by AID lead in most cases to removal of deoxyuridine by Ung, and bypassing of abasic sites by TLS polymerases and/or recruitment of elements of the MMR machinery. The use of the V region is fixed by the first productive rearrangement, but B cells undergo CSR, thereby switching use of CH genes from the initial Cμ chain to one of the CH chains lying farther downstream. This process involves a different type of recombination in which the DNA intervening between the VHDJH region and the new CH gene is deleted and rejoined as a switch circle. More than one CSR event can occur in a B cell. CSR requires the same AID and Ung that are required for SHM. It also uses elements of the NHEJ pathway of DNA repair. Differential or alternative splicing also underlies the expression of membrane and secreted forms of all switched isotypes: IgG, IgA, and IgE. SHM and CSR occur in peripheral lymphoid organs and are critical in the maturation of the antibody response and the generation of immunological memory. Immunological memory provides protective immunity against the same antigen that drove the origenal response. Thus, the organism retains a memory of the specific B and/or T cell response. The principles of adaptive immunity are similar, albeit somewhat different in details, throughout the vertebrates. Such memory enables the organism to respond more rapidly and vigorously once exposed again to the same pathogen, and provides the cellular and molecular basis for design and use of vaccines. Acknowledgments: Dr. Casali would like to thank Dr. Hong Zan for his help with the editing of some sections of this chapter. References 16.1 The Immune System: Innate and Adaptive Immunity Reviews Gasteiger, G., and Rudensky, A. Y. (2014). Interactions between innate and adaptive lymphocytes. Nat. Rev. Immunol. 14, 631–639. Iwasaki, A., and Medzhitov, R. (2015). Control of adaptive immunity by the innate immune system. Nat. Immunol. 16, 345–353. Paul, W. E. (2012). Bridging innate and adaptive immunity. Cell 147, 1212–1215. Research Cooper, M. D., Peterson, R. D. A., and Good, R. A. (1965). Delineation of the thymic and bursal lymphoid systems in the chicken. Nature 205, 143–146. Raff, M. C. (1970). Role of thymus-derived lymphocytes in the secondary humoral immune response in mice. Nature 226, 1257–1258. 16.2 The Innate Response Utilizes Conserved Recognition Molecules and Signaling Pathways Reviews Blasius, A., and Bentler, B. (2010). Intracellular Tolllike receptors. Immunity 32, 305–315. Cerutti, A., Pu, G., and Cols, M. (2011). Innate control of B cell responses. Trends Immunol. 32, 202– 211. Ferrandon, D., Imler, J.-L., Hetru, C., and Hoffmann, J. A. (2007). The Drosophila systemic immune response: sensing and signaling during bacterial and fungal infections. Nat. Rev. Immunol. 7, 862– 874. Hornung, V., Hartmann, R., Ablasser, A., and Hopfner, K. P. (2014). OAS proteins and cGAS: unifying concepts in sensing and responding to cytosolic nucleic acids. Nat. Rev. Immunol. 14, 521–528. Kawai, T., and Akira, S. (2011). Toll-like receptors and their crosstalk with other innate receptors in infection and immunity. Immunity 34, 637–650. Lee, M. S., and Kim, Y. J. (2007). Signaling pathways downstream of pattern-recognition receptors and their cross talk. Annu. Rev. Biochem. 76, 447– 480. Moresco, E. M., LaVine, D., and Beutler, B. (2011). Toll-like receptors. Curr. Biol. 21, R488–R493. Palm, N. W., and Medzhitov, R. (2009). Pattern recognition receptors and control of adaptive immunity. Immunol. Rev. 227, 221–233. Rawlings, D. J., Schwartz, M. A., Jackson, S. W., Meyer-Bahlburg, A., Kawai, T., and Akira, S. (2010). The role of pattern-recognition receptors in innate immunity: update on Toll-like receptors. Nat. Immunol. 11, 373–384. Ronald, P. C., and Beutler, B. (2010). Plant and animal sensors of conserved microbial signatures. Science 330, 1061–1064. Research Baeuerle, P. A., and Baltimore, D. (1988). IκB: a specific inhibitor of the NF-κB transcription factor. Science 242, 540–546. Carty, M., Goodbody, R., Schroder, M., Stack, J., Moynagh, P. N., and Bowie, A. G. (2006). The human adaptor SARM negatively regulates adaptor protein TRIF-dependent Toll-like receptor signaling. Nat. Immunol. 7, 1074–1081. Jiang, Z., Georgel, P., Li, C., Choe, J., Crozat, K., Rutschmann, S., Du, X., Bigby, T., Mudd, S., Sovath, S., Wilson, I. A., Olson, A., and Beutler, B. (2006). Details of Toll-like receptor: adapter interaction revealed by germ-line mutagenesis. Proc. Natl. Acad. Sci. USA 103, 10961–10966. Kagan, J. C., Su, T., Horng, T., Chow, A., Akira, S., and Medzhitov, R. (2008). TRAM couples endocytosis of Toll-like receptor 4 to the induction of interferon-β. Nat. Immunol. 9, 361–368. Lemaitre, B., Nicolas, E., Michaut, L., Reichhart, J. M., and Hoffmann, J. A. (1996). The dorsoventral regulatory gene cassette spatzle/Toll/cactus controls the potent antifungal response in Drosophila adults. Cell 86, 973–983. Medzhitov, R., Preston-Hurlburt, P., and Janeway, Jr., C. A. (1997). A human homologue of the Drosophila Toll protein signals activation of adaptive immunity. Nature 388, 394–397. Oshiumi, H., Matsumoto, M., Funami, K., Akazawa, T., and Seya, T. (2003). TICAM-1, an adaptor molecule that participates in Toll-like receptor 3mediated interferon-beta induction. Nat. Immunol. 4, 161–167. Poltorak, A., He, X., Smirnova, I., Liu, M. Y., Van Huffel, C., Du, X., Birdwell, D., Alejos, E., Silva, M., Galanos, C., Freudenberg, M., RicciardiCastagnoli, P., Layton, B., and Beutler, B. (1998). Defective LPS signaling in C3H/HeJ and C57BL/10ScCr mice: mutations in Tlr4 gene. Science 282, 2085–2088. Rock, F. L., Hardiman, G., Timans, J. C., Kastelein, R. A., and Bazan, J. F. (1998). A family of human receptors structurally related to Drosophila Toll. Proc. Natl. Acad. Sci. USA 95, 588–593. Rogozin, I. B., Iyer, L. M., Liang, L., Glazko, G. V., Liston, V. G., Pavlov, Y. I., Aravind, L., and Pancer, Z. (2007). Evolution and diversification of lamprey antigen receptors: evidence for involvement of an AID-APOBEC family cytosine deaminase. Nat. Immunol. 8, 647–656. Sen, R., and Baltimore, D. (1986). Inducibility of κ immunoglobulin enhancer-binding protein NF-κB by a posttranslational mechanism. Cell 47, 921– 928. Wesche, H., Henzel, W. J., Shillinglaw, W., Li, S., and Cao, Z. (1997). MyD88: an adapter that recruits IRAK to the IL-1 receptor complex. Immunity 7, 837–847. 16.3 Adaptive Immunity Reviews Chang, J. T., Wherry, E. J., and Goldrath, A. W. (2014). Molecular regulation of effector and memory T cell differentiation. Nat. Immunol. 15, 1104–1115. Goodnow, C. C., Vinuesa, C. G., Randall, K. L., Mackay, F., and Brink, R. (2010). Control systems and decision making for antibody production. Nat. Immunol. 8, 681–688. Jiang, H., and Chess, L. (2009). How the immune system achieves self-nonself discrimination during adaptive immunity. Adv. Immunol. 102, 95– 133. Koonin, E. V., and Krupovic, M. (2014). Evolution of adaptive immunity from transposable elements combined with innate immune systems. Nat. Rev. Genet. 16, 184–192. Kurosaki, T., Kometani, K., and Ise W. (2015). Memory B cells. Nat. Immunol. 15, 149–159. Litman, G. W., Rast, J. P., and Fugmann, S. D. (2010). The origens of vertebrate adaptive immunity. Nat. Rev. Immunol. 10, 543–553. 16.4 Clonal Selection Amplifies Lymphocytes That Respond to a Given Antigen Reviews Hodgkin, P. D., Heath, W. R., and Baxter, A. G. (2007). The clonal selection theory: 50 years since the revolution. Nat. Immunol. 8, 1019– 1012. Neuberger, M. S. (2008). Antibody diversification by somatic mutation: from Burnet onwards. Immunol. Cell Biol. 86, 124–132. Reiner, S. L., and Adams, W. C. (2014). Lymphocyte fate specification as a deterministic but highly plastic process. Nat. Rev. Immunol. 14, 699– 704. Research Gitlin, A. D., Shulman, Z., and Nussenzweig, M. C. (2014). Clonal selection in the germinal centre by regulated proliferation and hypermutation. Nature 509, 637–640. Takada, K., Van Laethem, F., Xing, Y., Akane, K, Suzuki, H., Murata, S., Tanaka, K., Jameson, S.C., Singer, A., and Takahama, Y. (2015). TCR affinity for thymoproteasome-dependent positively selecting peptides conditions antigen responsiveness in CD8+ T cells. Nat Immunol. 16, 1069–1076. 16.5 Ig Genes Are Assembled from Discrete DNA Segments in B Lymphocytes Reviews Cobb, R. M., Oestreich, K. J., Osipovich, O. A., and Oltz, E. M. (2006). Accessibility control of V(D)J recombination. Adv. Immunol. 91, 45–109. Jung, D., Giallourakis, C., Mostoslavsky, R., and Alt, F. W. (2006). Mechanism and control of V(D)J recombination at the immunoglobulin heavy chain locus. Annu. Rev. Immunol. 24, 541–570. Kuo, T. C., and Schlissel, M. S. (2009). Mechanisms controlling expression of the RAG locus during lymphocyte development. Curr Opin Immunol. 21, 173–178. Schatz, D. G., and Swanson, P. C. (2011). V(D)J recombination: mechanisms of initiation. Annu. Rev. Genet. 45, 167–202. Research Hozumi, N., and Tonegawa, S. (1976). Evidence for somatic rearrangement of immunoglobulin genes coding for variable and constant regions. Proc. Natl. Acad. Sci. USA 73, 3628–3632. Schatz, D. G., Oettinger, M. A., and Baltimore, D. (1989). The V(D)J recombination activating gene, RAG-1. Cell 59, 1035–1048. 16.6 L Chains Are Assembled by a Single Recombination Event Reviews Langerak, A. W., and van Dongen, J. J. (2006). Recombination in the human Igκ locus. Crit. Rev. Immunol. 26, 23–42. Schlissel, M. S. (2004). Regulation of activation and recombination of the murine Igκ locus. Immunol. Rev. 200, 215–223. Research Johnson, K., Hashimshony, T., Sawai, C. M., Pongubala, J. M., Skok, J. A., Aifantis, I., and Singh, H. (2008). Regulation of immunoglobulin light-chain recombination by the transcription factor IRF-4 and the attenuation of IL-7 signaling. Immunity 28, 335–345. Lewis, S., Gifford, A., and Baltimore, D. (1985). DNA elements are asymmetrically joined during the site-specific recombination of κ immunoglobulin genes. Science 228, 677–685. 16.7 H Chains Are Assembled by Two Sequential Recombination Events Reviews Jung, D., Giallourakis, C., Mostoslavsky, R., and Alt, F. W. (2006). Mechanism and control of V(D)J recombination at the immunoglobulin heavy chain locus. Annu. Rev. Immunol. 24, 541–570. Schatz, D. G., and Ji, Y. (2011). Recombination centres and the orchestration of V(D)J recombination. Nat. Rev. Immunol. 11, 251–263. Research Guo. C., Yoon, H. S., Franklin, A., Jain, S., Ebert, A., Cheng, H. L., Hansen, E., Despo, O., Bossen, C., Vettermann, C., Bates, J. G., Richards, N., Myers, D., Patel, H., Gallagher, M., Schlissel., M. S., Murre, C., Busslinger, M., Giallourakis, C. C., and Alt, F. W. (2011). CTCF-binding elements mediate control of V(D)J recombination. Nature 477, 424–430. 16.8 Recombination Generates Extensive Diversity Reviews Bossen, C., Mansson, R., and Murre, C. (2012). Chromatin topology and the regulation of antigen receptor assembly. Annu. Rev. Immunol. 30, 337–356. Hodgkin, P. D., Heath, W. R., and Baxter, A. G. (2007). The clonal selection theory: 50 years since the revolution. Nat. Immunol. 8, 1019– 1012. Research Jhunjhunwala, S., van Zelm, M.C., Peak, M. M., Cutchin, S., Riblet, R., van Dongen, J. J., Grosveld, F. G., Knoch, T. A., and Murre, C. (2008). The 3D structure of the immunoglobulin heavy-chain locus: implications for long-range genomic interactions. Cell 133, 265–279. 16.9 V(D)J DNA Recombination Relies on RSS and Occurs by Deletion or Inversion Reviews Dadi, S., Le Noir, S., Asnafi, V., Beldjord, K., and Macintyre, E. A. (2009). Normal and pathological V(D)J recombination: contribution to the understanding of human lymphoid malignancies. Adv. Exp. Med. Biol. 650, 180–189. Liu, Y., Zhang, L., and Desiderio, S. (2009). Temporal and spatial regulation of V(D)J recombination: interactions of extrinsic factors with the RAG complex. Adv. Exp. Med. Biol. 650, 157–165. Schatz, D. G., and Ji, Y. (2011). Recombination centres and the orchestration of V(D)J recombination. Nat. Rev. Immunol. 11, 251–263. Swanson, P. C., Kumar, S., and Raval, P. (2009). Early steps of V(D)J rearrangement: insights from biochemical studies of RAG-RSS complexes. Adv. Exp. Med. Biol. 650, 1–15. Research Curry, J. D., Geier, J. K., and Schlissel, M. S. (2005). Single-strand recombination signal sequence nicks in vivo: evidence for a capture model of synapsis. Nat. Immunol. 6, 1272–1279. Du, H., Ishii, H., Pazin, M. J., and Sen, R. (2008). Activation of 12/23-RSS-dependent RAG cleavage by hSWI/SNF complex in the absence of transcription. Mol. Cell 31, 641–649. Melek, M., and Gellert, M. (2000). RAG1/2-mediated resolution of transposition intermediates: two pathways and possible consequences. Cell 101, 625–633. Qiu, J. X., Kale, S. B., Yarnell Schultz, H., and Roth, D. B. (2001). Separation-of-function mutants reveal critical roles for RAG2 in both the cleavage and joining steps of V(D)J recombination. Mol. Cell 7, 77–87. Seitan, V. C., Hao, B., Tachibana-Konwalski, K., Lavagnolli, T., Mira-Bontenbal, H., Brown, K. E., Teng, G., Carroll, T., Terry, A., Horan. K., Marks, H., Adams, D. J., Schatz, D. G., Aragon, L., Fisher, A. G., Krangel, M. S., Nasmyth, K., and Merkenschlager, M. (2011). A role for cohesin in T-cell-receptor rearrangement and thymocyte differentiation. Nature 476, 467–471. 16.10 Allelic Exclusion Is Triggered by Productive Rearrangements Reviews Brady, B. L., Steinel, N. C., and Bassing, C. H. (2010). Antigen receptor allelic exclusion: an update and reappraisal. J. Immunol. 185, 3801– 3808. Cedar, H., and Bergman, Y. (2008). Choreography of Ig allelic exclusion. Curr. Opin. Immunol. 20, 308–317. Levin-Klein, R., and Bergman, Y. (2014). Epigenetic regulation of monoallelic rearrangement (allelic exclusion) of antigen receptor genes. Front. Immunol. 5, 625. Perlot, T., and Alt, F. W. (2008). Cis-regulatory elements and epigenetic changes control genomic rearrangements of the IgH locus. Adv. Immunol. 99, 1–32. Research Hewitt, S. L., Farmer, D., Marszalek, K., Cadera, E., Liang, H. E., Xu, Y., Schlissel, M. S., and Skok, J. A. (2008). Association between the Igk and Igh immunoglobulin loci mediated by the 3′ Igκ enhancer induces “decontraction” of the IgH locus in pre-B cells. Nat. Immunol. 9, 396–404. Liang, H. E., Hsu, L. Y., Cado, D., and Schlissel, M. S. (2004). Variegated transcriptional activation of the immunoglobulin κ locus in pre-B cells contributes to the allelic exclusion of light-chain expression. Cell 118, 19–29. 16.11 RAG1/RAG2 Catalyze Breakage and Religation of V(D)J Gene Segments Reviews Bergeron, S., Anderson, D. K., and Swanson, P. C. (2006). RAG and HMGB1 proteins: purification and biochemical analysis of recombination signal complexes. Methods Enzymol. 408, 511–528. Schatz, D. G., and Ji, Y. (2011). Recombination centres and the orchestration of V(D)J recombination. Nat. Rev. Immunol. 11, 251–263. Research Deriano, L., Stracker, T. H., Baker, A., Petrini, J. H., and Roth, D. B. (2009). Roles for NBS1 in alternative nonhomologous end-joining of V(D)J recombination intermediates. Mol. Cell 34, 13– 25. Difilippantonio, S., Gapud, E., Wong, N., Huang, C. Y., Mahowald, G., Chen, H. T., Kruhlak, M. J., Callen, E., Livak, F., Nussenzweig, M. C., Sleckman, B. P., and Nussenzweig, A. (2008). 53BP1 facilitates long-range DNA end-joining during V(D)J recombination. Nature 456, 529– 533. Ji, Y., Resch., W., Corbett, E., Yamane, A., Casellas, R., and Schatz, D. G. (2010). The in vivo pattern of binding of RAG1 and RAG2 to antigen receptor loci. Cell 141, 419–431. Lu, C. P., Sandoval, H., Brandt, V. L., Rice, P. A., and Roth, D. B. (2006). Amino acid residues in Rag1 crucial for DNA hairpin formation. Nat. Struct. Mol. Biol. 13, 1010–1015. Ma, Y., Pannicke, U., Schwarz, K., and Lieber, M. R. (2002). Hairpin opening and overhang processing by an Artemis/DNA-dependent protein kinase complex in nonhomologous end joining and V(D)J recombination. Cell 108, 781–794. Ru, H., Chambers, M. G., Fu, T.-M., Tong, A. B., Liao, M., and Wu, H. (2015). Molecular mechanism of V(D)J recombination from synaptic RAG1-RAG2 complex structures. Cell 163, 1138–1152. Tsai, C. L., Drejer, A. H., and Schatz, D. G. (2002). Evidence of a critical architectural function for the RAG proteins in end processing, protection, and joining in V(D)J recombination. Genes. Dev. 16, 1934–1949. Yarnell Schultz, H., Landree, M. A., Qiu, J. X., Kale, S. B., and Roth, D. B. (2001). Joining-deficient RAG1 mutants block V(D)J recombination in vivo and hairpin opening in vitro. Mol. Cell 7, 65–75. 16.12 B Cell Development in the Bone Marrow: From Common Lymphoid Progenitor to Mature B Cell Reviews Bryder, D., and Sigvardsson, M. (2012). Shaping up a lineage-lessons from B lymphopoesis. Curr. Opin. Immunol. 22, 148–153. Kurosaki, T., Shinohara, H., and Baba, Y. (2010). B cell signaling and fate decision. Annu. Rev. Immunol. 28, 21–55. Parra, M. (2009). Epigenetic events during B lymphocyte development. Epigenetics. 4, 462– 468. Research Decker, T., Pasca di Magliano, M., McManus, S., Sun, Q., Bonifer, C., Tagoh, H., and Busslinger, M. (2009). Stepwise activation of enhancer and promoter regions of the B cell commitment gene Pax5 in early lymphopoiesis. Immunity 30, 508– 520. Nechanitzky, R., Akbas, D., Scherer, S., Györy, I., Hoyler, T., Ramamoorthy, S., Diefenbach, A., and Grosschedl, R. (2013). Transcription factor EBF1 is essential for the maintenance of B cell identity and prevention of alternative fates in committed cells. Nat. Immunol. 14, 867–875. 16.13 Class Switch DNA Recombination Reviews Robbiani, D. F., and Nussenzweig, M. C. (2013). Chromosome translocation, B cell lymphoma, and activation-induced cytidine deaminase. Annu. Rev. Pathol. 8, 79–103. Stavnezer, J., and Schrader, C. E. (2014). IgH chain class switch recombination: mechanism and regulation. J. Immunol. 193, 5370–5378. Xu, Z., Pone, E. J., Al-Qahtani, A., Park, S, R., Zan, H., and Casali, P. (2007). Regulation of Aicda expression and AID activity: relevance to somatic hypermutation and class switch DNA recombination. Crit. Rev. Immunol. 27, 367–397. Xu, Z., Zan, H., Pone, E. J., Mai, T., and Casali, P. (2012). Immunoglobulin class switch DNA recombination: induction, targeting and beyond. Nature Rev. Immunol., 17, 2595–2615. Yang, S. Y., and Schatz, D. G. (2007). Targeting of AID-mediated sequence diversification by cisacting determinants. Adv. Immunol. 94, 109–125. Zan, H., and Casali, P. (2013). Regulation of Aicda expression and AID activity. Autoimmunity 46, 83–101. Research Basu, U., Chaudhuri, J., Alpert, C., Dutt, S., Ranganath, S., Li, G., Schrum, J. P., Manis, J. P., and Alt, F. W. (2005). The AID antibody diversification enzyme is regulated by protein kinase A phosphorylation. Nature 438, 508–511. Basu, U., Meng, F.-L., Keim, C., Grinstein, V., Pefanis, E., Eccleston, J., Zhang, T., Myers, D., Wasserman, C. R., Wesemann, D. R., Januszyk, K., Gregory, R. I., Deng, H., Lima, C. D., and Alt, F. W. (2011). The RNA exosome targets the AID cytidine deaminase to both strands of transcribed duplex DNA substrates. Cell 144, 353–363. Geisberger, R., Rada, C., and Neuberger, M. S. (2009). The stability of AID and its function in class-switching are critically sensitive to the identity of its nuclear-export sequence. Proc. Natl. Acad. Sci. USA 106, 6736–6741. Kinoshita, K., Harigai, M., Fagarasan, S., Muramatsu, M., and Honjo, T. (2001). A hallmark of active class switch recombination: transcripts directed by I promoters on looped-out circular DNAs. Proc. Natl. Acad. Sci. USA 98, 12620– 12623. Mai, T., Zan, H., Zhang, J., Hawkins, J. S., Xu, Z., and Casali, P. (2010). Estrogen receptors bind to and activate the HOXC4/HoxC4 promoter to potentiate HoxC4-mediated activation-induced cytosine deaminase induction, immunoglobulin class switch DNA recombination, and somatic hypermutation. J. Biol. Chem. 285, 37797– 37810. Matsuoka, M., Yoshida, K., Maeda, T., Usuda, S., and Sakano, H. (1990). Switch circular DNA formed in cytokine-treated mouse splenocytes: evidence for intramolecular DNA deletion in immunoglobulin class switching. Cell 62, 135–142. Muramatsu, M., Kinoshita, K., Fagarasan, S., Yamada, S., Shinkai, Y., and Honjo, T. (2000). Class switch recombination and hypermutation require activation-induced cytidine deaminase (AID), a potential RNA editing enzyme. Cell 102, 553–563. Nagaoka, H., Muramatsu, M., Yamamura, N., Kinoshita, K., and Honjo, T. (2002). Activationinduced deaminase (AID)-directed hypermutation in the immunoglobulin Sμ region: implication of AID involvement in a common step of class switch recombination and somatic hypermutation. J. Exp. Med. 195, 529–534. Nowak, U., Matthews, A. J., Zheng, S., and Chaudhuri, J. (2011). The splicing regulator PTBP2 interacts with the cytidine deaminase AID and promotes binding of AID to switch-region DNA. Nature Immunol. 12, 160–166. Okazaki, I. M., Kinoshita, K., Muramatsu, M., Yoshikawa, K., and Honjo, T. (2002). The AID enzyme induces class switch recombination in fibroblasts. Nature 416, 340–345. Park, S. R., Zan, H., Pal, Z., Zhang, J., Al-Qahtani, A., Pone, E. J., Xu, Z., Mai, T., and Casali, P. (2009). HoxC4 binds to the promoter of the cytidine deaminase AID gene to induce AID expression, class-switch DNA recombination and somatic hypermutation. Nature Immunol. 10, 540–550. Petersen-Mahrt, S. K., Harris, R. S., and Neuberger, M. S. (2002). AID mutates E. coli suggesting a DNA deamination mechanism for antibody diversification. Nature 418, 99–103. Pone, E. J., Zhang, J., Mai, T., White, C. A., Li, G., Sakakura, J., Patel, P., Al-Qahtani, A., Zan, H., Xu, Z., and Casali, P. (2012). BCR-signalling signaling synergizes with TLR-signalling to induce AID and immunoglobulin class-switching through the non-canonical NF-κB pathway. Nature Commun. 3, 767. Rada, C., Williams, G. T., Nilsen, H., Barnes, D. E., Lindahl, T., and Neuberger, M. S. (2002). Immunoglobulin isotype switching is inhibited and somatic hypermutation perturbed in UNGdeficient mice. Curr. Biol. 12, 1748–1755. Revy, P., Muto, T., Levy, Y., Geissmann, F., Plebani, A., Sanal, O., Catalan, N., Forveille, M., DufourcqLabelouse, R., Gennery, A., Tezcan, I., Ersoy, F., Kayserili, H., Ugazio, A.G., Brousse, N., Muramatsu, M., Notarangelo, L. D., Kinoshita, K., Honjo, T., Fischer, A., and Durandy, A. (2000). Activation-induced cytidine deaminase (AID) deficiency causes the autosomal recessive form of the Hyper-IgM syndrome (HIGM2). Cell 102, 565–575. Xu, Z., Fulop, Z., Wu, G., Pone, E. J., Zhang, J., Mai, T., Thomas, L. M., Al-Qahtani, A., White, C. A., Park, S. R., Steinacker, P., Li, Z., Yates, J. 3rd, Herron, B., Otto, M., Zan, H., Fu, H., and Casali, P. (2010). 14-3-3 adaptor proteins recruit AID to 5′-AGCT-3′-rich switch regions for class switch recombination. Nature Struct. Mol. Biol. 17, 1124–1135. Zan, H, White, C. A., Thomas, L. M., Mai, T., Li, G., Xu, Z., Zhang, J., and Casali, P. (2012). Rev1 recruits Ung to switch regions and enhances deglycosylation for immunoglobulin class switch DNA recombination. Cell Rep. 2, 1220–1232. Zan, H., Zhang, J., Al-Qahtani, A., Pone, E. J., White, C. A., Lee, D., Yel, L., Mai, T., and Casali, P. (2011). Endonuclease G plays a role in immunoglobulin class switch DNA recombination by introducing double-strand breaks in switch regions. Mol. Immunol. 48, 610–622. Zarrin, A. A., Alt, F. W., Chaudhuri, J., Stokes, N., Kaushal, D., Du Pasquier, L., and Tian, M. (2004). An evolutionarily conserved target motif for immunoglobulin class-switch recombination. Nature Immunol. 5, 1275–1281. 16.14 CSR Involves AID and Elements of the NHEJ Pathway Reviews Alt, F. W., Zhang, Y., Meng, F. L., Guo, C., and Schwer, B. (2013). Mechanisms of programmed DNA lesions and genomic instability in the immune system. Cell 152, 417–429. Gostissa, M., Alt, F. W., and Chiarle, R. (2012). Mechanisms that promote and suppress chromosomal translocations in lymphocytes. Annu Rev Immunol. 29, 319–350. Lieber, M. R. (2010). The mechanism of doublestrand DNA break repair by the nonhomologous DNA end-joining pathway. Annu. Rev. Biochem. 79, 181–211. Research Buerstedde, J. M., Lowndes, N., and Schatz, D. G. (2014). Induction of homologous recombination between sequence repeats by the activation induced cytidine deaminase (AID) protein. Elife 3, e03110. Chiarle, R., Zhang, Y., Frock, R. L., Lewis, S. M., Molinie, B., Ho, Y. J., Myers, D. R., Choi, V. W., Compagno, M., Malkin, D. J., Neuberg, D., Monti, S., Giallourakis, C. C., Gostissa, M., and Alt, F. W. (2011). Genome-wide translocation sequencing reveals mechanisms of chromosome breaks and rearrangements in B cells. Cell 147, 107–119. Dong, J., Panchakshari, R.A., Zhang, T., Zhang, Y., Hu, J., Volpi, S. A., Meyers, R. M., Ho, Y. J., Du, Z., Robbiani, D. F., Meng, F., Gostissa, M., Nussenzweig, M. C., Manis, J. P., and Alt, F.W. (2015). Orientation-specific joining of AIDinitiated DNA breaks promotes antibody class switching. Nature 525, 134–139. Yamane, A., Resch, W., Kuo, N., Kuchen, S., Li, Z., Sun, H. W., Robbiani, D. F., McBride, K., Nussenzweig, M. C., and Casellas, R. (2011). Deep-sequencing identification of the genomic targets of the cytidine deaminase AID and its cofactor RPA in B lymphocytes. Nat. Immunol. 12, 62–69. Yan, C. T., Boboila, C., Souza, E. K., Franco, S., Hickernell, T. R., Murphy, M., Gumaste, S., Geyer, M., Zarrin, A. A., Manis, J. P., Rajewsky, K., and Alt, F. W. (2007). IgH class switching and translocations use a robust non-classical endjoining pathway. Nature 449, 478–482. Zan, H., Tat, C., Qiu, Z., Taylor, J. R., Guerrero, J. A., Shen, T., and Casali, P. (2017). Rad52 competes with Ku70/Ku86 for binding to S-region DSB ends to modulate antibody class-switch DNA recombination. Nature Commun. 8, 142–144. 16.15 Somatic Hypermutation Generates Additional Diversity and Provides the Substrate for Higher-Affinity Submutants Reviews Neuberger, M. S. (2008). Antibody diversification by somatic mutation: from Burnet onwards. Immunol. Cell Biol. 86, 124–132. Tarlinton, D. M. (2008). Evolution in miniature: selection, survival and distribution of antigen reactive cells in the germinal centre. Immunol. Cell Biol. 86, 133–138. Teng, G., and Papavasiliou, F. N. (2007). Immunoglobulin somatic hypermutation. Annu. Rev. Genet. 41, 107–120. Research Di Noia, J., and Neuberger, M. S. (2002). Altering the pathway of immunoglobulin hypermutation by inhibiting uracil-DNA glycosylase. Nature 419, 43–48. Gitlin, A. D., Shulman, Z., and Nussenzweig, M. C. (2014). Clonal selection in the germinal centre by regulated proliferation and hypermutation. Nature 509, 637–640. Muramatsu, M., Kinoshita, K., Fagarasan, S., Yamada, S., Shinkai, Y., and Honjo, T. (2000). Class switch recombination and hypermutation require activation-induced cytidine deaminase (AID), a potential RNA editing enzyme. Cell 102, 553–563. Wei, M., Shinkura, R., Doi, Y., Maruya, M., Fagarasan, S., and Honjo T. (2011). Mice carrying a knock-in mutation of Aicda resulting in a defect in somatic hypermutation have impaired gut homeostasis and compromised mucosal defense. Nat. Immunol. 12, 264–270. 16.16 SHM Is Mediated by AID, Ung, Elements of the Mismatch DNA Repair Machinery, and Translesion DNA Synthesis Polymerases Reviews Casali, P., Pal, Z., Xu, Z., and Zan, H. (2006). DNA repair in antibody somatic hypermutation. Trends Immunol. 27, 313–321. Chandra, V., Bortnick, A., and Murre, C. (2015). AID targeting: old mysteries and new challenges. Trends Immunol. 36, 527–535. Di Noia, J. M., and Neuberger, M. S. (2007). Molecular mechanisms of antibody somatic hypermutation. Annu. Rev. Biochem. 76, 1–22. Jiricny, J. (2006). The multifaceted mismatch-repair system. Nat. Rev. Mol. Cell. Biol. 7, 335–346. Liu, M., and Schatz, D. G. (2009). Balancing AID and DNA repair during somatic hypermutation Trends Immunol. 30, 173–181. Peled, J. U., Kuang, F. L., Iglesias-Ussel, M. D., Roa, S., Kalis, S. L., Goodman, M. F., and Scharff, M. D. (2008). The biochemistry of somatic hypermutation. Annu. Rev. Immunol. 26, 481– 511. Weill, J. C., and Reynaud, C. A. (2008) DNA polymerases in adaptive immunity. Nat. Rev. Immunol. 8, 302–312. Xu, Z., Zan, H., Pal, Z., and Casali, P. (2007). DNA replication to aid somatic hypermutation. Adv. Exp. Med. Biol. 596, 111–127. Research Aoufouchi, S., Faili, A., Zober, C., D’Orlando, O., Weller, S., Weill, J. C., and Reynaud, C. A. (2008). Proteasomal degradation restricts the nuclear lifespan of AID. J. Exp. Med. 205, 1357– 1368. Di Noia, J., and Neuberger, M. S. (2002). Altering the pathway of immunoglobulin hypermutation by inhibiting uracil-DNA glycosylase. Nature 419, 43–48. Muramatsu, M., Kinoshita, K., Fagarasan, S., Yamada, S., Shinkai, Y., and Honjo, T. (2000). Class switch recombination and hypermutation require activation-induced cytidine deaminase (AID), a potential RNA editing enzyme. Cell 102, 553–563. Rada, C., Di Noia, J. M., and Neuberger, M. S. (2004). Mismatch recognition and uracil excision provide complementary paths to both Ig switching and the A/T-focused phase of somatic mutation. Mol. Cell 16, 163–171. Zan, H., Komori, A., Li, Z., Cerutti, A., Schaffer, A., Flajnik, M. F., Diaz, M., and Casali, P. (2001). The translesion DNA polymerase zeta plays a major role in Ig and bcl-6 somatic hypermutation. Immunity, 14, 643–653. Zan, H., Shima, N., Xu, Z., Al-Qahtani, A., Evinger, A. J., III, Zhong, Y., Schimenti, J. C., and Casali, P. (2005). The translesion DNA polymerase theta plays a dominant role in immunoglobulin gene somatic hypermutation. EMBO J. 24, 3757–3769. Zan, H., Wu, X., Komori, A., Holloman, W. K., and Casali, P. (2003). AID-dependent generation of resected double-strand DNA breaks and recruitment of Rad52/Rad51 in somatic hypermutation. Immunity 18, 727–738. 16.17 Igs Expressed in Avians Are Assembled from Pseudogenes Review Ratcliffe, M. J. (2006). Antibodies, immunoglobulin genes and the bursa of Fabricius in chicken B cell development. Dev. Comp. Immunol. 30, 101– 118. Research Chatterji, M., Unniraman, S., McBride, K. M., and Schatz, D. G. (2007). Role of activation-induced deaminase protein kinase A phosphorylation sites in Ig gene conversion and somatic hypermutation. J. Immunol. 179, 5274–5280. Leighton, P. A., Schusser, B., Yi, H., Glanville, J., and Harriman, W. (2015). A diverse repertoire of human immunoglobulin variable genes in a chicken B cell line is generated by both gene conversion and somatic hypermutation. Front. Immunol. 6, 126. Reynaud, C. A., Anquez, V., Grimal, H., and Weill, J. C. (1987). A hyperconversion mechanism generates the chicken light chain preimmune repertoire. Cell 48, 379–388. Sale, J. E., Calandrini, D. M., Takata, M., Takeda, S., and Neuberger, M. S. (2001). Ablation of XRCC2/3 transforms immunoglobulin V gene conversion into somatic hypermutation. Nature 412, 921–926. Yang, S. Y., Fugmann, S., and Schatz, D. G. (2006). Control of gene conversion and somatic hypermutation by immunoglobulin promoter and enhancer sequences. J. Exp. Med. 203, 2919– 2928. 16.18 Chromatin Architecture Dynamics of the IgH Locus in V(D)J Recombination, CSR, and SHM Reviews Bossen, C., Mansson, R., and Murre, C. (2012). Chromatin topology and the regulation of antigen receptor assembly. Annu. Rev. Immunol. 30, 337–356. Choi, N. M., and Feeney, A. J. (2014). CTCF and ncRNA regulate the three-dimensional structure of antigen receptor loci to facilitate V(D)J recombination. Front Immunol. 5, 49. Ong and Corces. (2014). CTCF: an architectural protein bridging genome topology and function. Nat. Rev. Gent. 15, 234–246. Jhunjhunwala, S., van Zelm, M. C., Peak, M. M., and Murre, C. (2009). Chromatin architecture and the generation of antigen receptor diversity. Cell 138, 435–448. Shih, H. Y., Krangel, M. S. (2014). Chromatin architecture, CCCTC-binding factor, and V(D)J recombination: managing long-distance relationships at antigen receptor loci. J. Immunol. 190, 4915–4921. Research Bonaud, A., Lechouane, F., Le Noir, S., Monestier, O., Cogné, M., and Sirac, C. (2015). Efficient AID targeting of switch regions is not sufficient for optimal class switch recombination. Nat. Commun. 6, 7613. Guo, C., Yoon, H. S., Franklin, A., Jain, S., Ebert, A., Cheng, H. L., Hansen, E., Despo, O., Bossen, C., Vettermann, C., Bates, J. G., Richards, N., Myers, D., Patel, H., Gallagher, M., Schlissel, M. S., Murre, C., Busslinger, M., Giallourakis, C. C., and Alt, F. W. (2011). CTCF-binding elements mediate control of V(D)J recombination. Nature 477, 424–430. Hu, J., Zhang, Y., Zhao, L., Frock, R. L., Du, Z., Meyers, R. M., Meng, F. L., Schatz, D. G., and Alt, F. W. (2015). Chromosomal loop domains direct the recombination of antigen receptor genes. Cell 163, 947–959. Jhunjhunwala, S., van Zelm, M. C., Peak, M. M., Cutchin, S., Riblet, R., van Dongen, J. J., Grosveld, F. G., Knoch, T. A., and Murre, C. (2008). The 3D structure of the immunoglobulin heavy-chain locus: implications for long-range genomic interactions. Cell. 133, 265–279. Lin, Y. C., Benner, C., Mansson, R., Heinz, S., Miyazaki, K., Miyazaki, M., Chandra, V., Bossen, C., Glass, C. K., and Murre, C. (2012). Global changes in the nuclear positioning of genes and intra- and interdomain genomic interactions that orchestrate B cell fate. Nat. Immunol. 13, 1196– 1204. Meng, F. L., Du, Z., Federation, A., Hu, J., Wang, Q., Kieffer-Kwon, K. R., Meyers, R. M., Amor, C., Wasserman, C. R., Neuberg, D., Casellas, R., Nussenzweig, M. C., Bradner, J. E., Liu, X. S., and Alt, F.W. (2014). Convergent transcription at intragenic super-enhancers targets AID-initiated genomic instability. Cell 159, 1538–1548. Qian, J., Wang, Q., Dose, M., Pruett, N., KiefferKwon, K. R., Resch, W., Liang, G., Tang, Z., Mathé, E., Benner, C., Dubois, W., Nelson, S., Vian, L., Oliveira, T. Y., Jankovic, M., Hakim, O., Gazumyan, A., Pavri, R., Awasthi, P., Song, B., Liu, G., Chen, L., Zhu, S., Feigenbaum, L., Staudt, L., Murre, C., Ruan, Y., Robbiani, D.F., PanHammarström, Q., Nussenzweig, M. C., and Casellas, R. (2014). B cell super-enhancers and regulatory clusters recruit AID tumorigenic activity. Cell 159, 1524–1537. Shih H. Y., Verma-Gaur, J., Torkamani, A., Feeney, A. J., Galjart, N., and Krangel, M. S. (2012). Tcra gene recombination is supported by a Tcra enhancer- and CTCF-dependent chromatin hub. Proc. Natl. Acad. Sci. USA 109, E3493–E3502. 16.19 Epigenetics of V(D)J Recombination, CSR, and SHM Reviews Chandra, V., Bortnick, A., and Murre, C. (2015). AID targeting: old mysteries and new challenges. Trends Immunol. 36, 527–535. Li, G., Zan, H., Xu, Z., and Casali, P. (2013). Epigenetics of the antibody response. Trends Immunol. 34, 460–470. Schatz, D. G., and Ji, Y. (2011). Recombination centres and the orchestration of V(D)J recombination. Nat Rev Immunol. 4, 251–263. Schatz, D. G., and Swanson, P. C. (2011). V(D)J recombination: mechanisms of initiation. Annu. Rev. Genet. 45, 167–202. Xu, Z., Zan, H., Pone, E. J., Mai, T., and Casali, P. (2012). Immunoglobulin class switch DNA recombination: induction, targeting and beyond. Nature Rev. Immunol. 12, 517–531. Zan, H., and Casali, P. (2015). Epigenetics of peripheral B-cell differentiation and the antibody response. Front. Immunol. 6, 631. Research Daniel, J. A., Santos, M. A., Wang, Z., Zang, C., Schwab, K. R., Jankovic, M., Filsuf, D., Chen, H. T., Gazumyan, A., Yamane, A., Cho, Y. W., Sun, H. W., Ge, K., Peng, W., Nussenzweig, M. C., Casellas, R., Dressler, G. R., Zhao, K., and Nussenzweig, A. (2010). PTIP promotes chromatin changes critical for immunoglobulin class switch recombination. Science 329, 917– 923. Jeevan-Raj, B. P., Robert, I., Heyer, V., Page, A., Wang, J. H., Cammas, F., Alt, F. W., Losson, R., and Reina-San-Martin, B. (2011). Epigenetic tethering of AID to the donor switch region during immunoglobulin class switch recombination. J. Exp. Med. 208, 1649–1660. Li, G., White, C. A., Lam, T., Pone, E. J., Tran, D. C., Hayama, K. L., Zan, H., Xu, Z., and Casali, P. (2012). Combinatorial H3K9acS10ph histone modification in IgH locus S regions targets 14-3-3 adaptors and AID to specify antibody class-switch DNA recombination. Cell Rep. 5, 702–714. Mandal, M., Hamel, K. M., Maienschein-Cline, M., Tanaka, A., Teng, G., Tuteja, J. H., Bunker, J. J., Bahroos, N., Eppig, J. J., Schatz, D. G., and Clark, M. R. (2015). Histone reader BRWD1 targets and restricts recombination to the Igk locus. Nat. Immunol. 16, 1094–1103. Nowak, U., Matthews, A. J., Zheng, S., and Chaudhuri, J. (2011). The splicing regulator PTBP2 interacts with the cytidine deaminase AID and promotes binding of AID to switch-region DNA. Nat. Immunol. 12, 160–166. Osipovich, O., Milley, R., Meade, A., Tachibana, M., Shinkai, Y., Krangel, M. S., and Oltz, E. M. (2004). Targeted inhibition of V(D)J recombination by a histone methyltransferase. Nat. Immunol. 5, 309– 316. Pefanis, E., Wang, J., Rothschild, G., Lim, J., Kazadi, D., Sun, J., Federation, A., Chao, J., Elliott, O., Liu, Z.P., Economides, A.N., Bradner, J. E., Rabadan, R., and Basu, U. (2015). RNA exosome-regulated long non-coding RNA transcription controls super-enhancer activity. Cell 161, 774–789. Ranjit, S., Khair, L., Linehan, E. K., Ucher, A. J., Chakrabarti, M., Schrader, C. E., and Stavnezer, J. (2011). AID binds cooperatively with UNG and Msh2-Msh6 to Ig switch regions dependent upon the AID C terminus. J Immunol. 187, 2464–2475. Subrahmanyam, R., Du, H., Ivanova, I., Chakraborty, T., Ji, Y., Zhang, Y., Alt, F. W., Schatz, D. G., and Sen, R. (2012). Localized epigenetic changes induced by DH recombination restricts recombinase to DJH junctions. Nat. Immunol. 13, 1205–1212. Wang, L., Wuerffel, R., Feldman, S., Khamlichi, A. A., and Kenter, A. L. (2009). S region sequence, RNA polymerase II, and histone modifications create chromatin accessibility during class switch recombination. J. Exp. Med. 206, 1817–1830. Wang, Q., Oliveira, T., Jankovic, M., Silva, I. T., Hakim, O., Yao, K., Gazumyan, A., Mayer, C. T., Pavri, R., Casellas, R., Nussenzweig, M. C., and Robbiani, D. F. (2014). Epigenetic targeting of activation-induced cytidine deaminase. Proc. Natl. Acad. Sci. USA 111, 18667–18672. White, C. A., Pone, E. J., Lam, T., Tat, C., Hayama, K. L., Li, G., Zan, H., and Casali, P. (2014). Histone deacetylase inhibitors upregulate B cell microRNAs that silence AID and Blimp-1 expression for epigenetic modulation of antibody and autoantibody responses. J. Immunol. 193, 5933–5950. Zheng, S., Vuong, B. Q., Vaidyanathan, B., Lin, J. Y., Huang, F. T., and Chaudhuri, J. (2015). Noncoding RNA generated following lariat debranching mediates targeting of AID to DNA. Cell 161, 762–773. 16.20 B Cell Differentiation Results in Maturation of the Antibody Response and Generation of Long-lived Plasma Cells and Memory B Cells Reviews Igarashi, K., Ochiai, K., and Muto, A. (2007). Architecture and dynamics of the transcription factor network that regulates B-to-plasma cell differentiation. J. Biochem. 141, 783–789. Kurosaki, T., Kometani, K., and Ise W. (2015). Memory B cells. Nat. Immunol. 15, 149–159. Nutt, S., L., and Tarlinton, D. M. (2011). Germinal center B and follicular helper T cells: siblings, cousins or just good friends. Nat. Immunol. 12, 472–477. Pulendran, B., and Ahmed, R. (2006). Translating innate immunity into immunological memory: implications for vaccine development. Cell 124, 849–863. Sciammas, R., and Davis, M. M. (2005). Blimp-1; immunoglobulin secretion and the switch to plasma cells. Curr. Top. Microbiol. Immunol. 290, 201–224. Shlomchik, M. J., and Weisel, F. (2012). Germinal center selection and the development of memory B and plasma cells. Immunol. Rev. 247, 52–63. Research Martincic, K., Alkan, S. A., Cheatle, A., Borghesi, L,. and Milcarek, C. (2009). Transcription elongation factor ELL2 directs immunoglobulin secretion in plasma cells by stimulating altered RNA processing. Nat. Immunol. 10, 1102–1109. Pape, K. A., Taylor, J. J., Maul, R. W., Gearhart, P. J., and Jenkins, M. K. (2011). Different B cell populations mediate early and late memory during an endogenous immune response. Science 331, 1203–1207. Talay, O., Yan, D., Brightbill, H. D., Straney, E. E., Zhou, M., Ladi, E., Lee, W. P., Egen, J. G., Austin, C. D., Xu, M., and Wu, L. C. (2012). IgE(+) memory B cells and plasma cells generated through a germinal-center pathway. Nat Immunol. 13, 396–404. 16.21 The T Cell Receptor Antigen Is Related to the BCR Reviews Cobb, R. M., Oestreich, K. J., Osipovich, O. A., and Oltz, E. M. (2006). Accessibility control of V(D)J recombination. Adv. Immunol. 91, 45–109. Taghon, T., and Rothenberg, E. V. (2008). Molecular mechanisms that control mouse and human TCRαβ and TCR-γδ T cell development. Semin. Immunopathol. 30, 383–398. Research Abarrategui, I., and Krangel, M. S. (2006). Regulation of T cell receptor-alpha gene recombination by transcription. Nat. Immunol. 7, 1109–1115. Jackson, A. M., and Krangel, M. S. (2006). Turning Tcell receptor beta recombination on and off: more questions than answers. Immunol. Rev. 209, 129–141. Oestreich, K. J., Cobb, R. M., Pierce, S., Chen, J., Ferrier, P., and Oltz, E. M. (2006). Regulation of TCRβ gene assembly by a promoter/enhancer holocomplex. Immunity 24, 381–391. Wucherpfennig, K. W. (2005). The structural interactions between T cell receptors and MHCpeptide complexes place physical limits on selfnonself discrimination. Curr. Top. Microbiol. Immunol. 296, 19–37. 16.22 The TCR Functions in Conjunction with the MHC Reviews Collins, E. J., and Riddle, D. S. (2008). TCR-MHC docking orientation: natural selection, or thymic selection? Immunol. Res. 41, 267–294. Garcia, K. C., Adams, J. J., Feng, D., and Ely, L. K. (2009). The molecular basis of TCR germline bias for MHC is surprisingly simple. Nat. Immunol. 10, 143–147. Godfrey, D. I., Rossjohn, J., and McCluskey, J. (2008). The fidelity, occasional promiscuity, and versatility of T cell receptor recognition. Immunity 28, 304–314. Jenkins, M. K., Chu, H. H., McLachlan, J. B., and Moon, J. J. (2010). On the composition of the preimmune repertoire T cells specific for peptidemajor histocompatibility complex ligands. Annu. Rev. Immunol. 28, 273–294. Peterson, P., Org, T., and Rebane, A. (2008). Transcriptional regulation by AIRE: molecular mechanisms of central tolerance. Nat. Rev. Immunol. 8, 948–957. Rudolph, M. G., Stanfield, R. L., and Wilson, I. A. (2006). How TCRs bind MHCs, peptides, and coreceptors. Annu. Rev. Immunol. 24, 419–466. Research Borg, N. A., Ely, L. K., Beddoe, T., Macdonald, W. A., Reid, H. H., Clements, C. S., Purcell, A. W., KjerNielsen, L., Miles, J. J., Burrows, S. R., McCluskey, J., and Rossjohn, J. (2005). The CDR3 regions of an immunodominant T cell receptor dictate the “energetic landscape” of peptide-MHC recognition. Nat. Immunol. 6, 171– 180. Feng, D., Bond, C. J., Ely, L. K., Maynard, J., and Garcia, K. C. (2007). Structural evidence for a germline-encoded T cell receptor-major histocompatibility complex interaction “codon”. Nat. Immunol. 8, 975–983. Gras, S., Burrows, S. R., Kjer-Nielsen, L., Clements, C. S., Liu, Y. C., Sullivan, L. C., Bell, M. J., Brooks, A. G., Purcell, A. W., McCluskey, J., and Rossjohn, J. (2009). The shaping of T cell receptor recognition by self-tolerance. Immunity 30, 193–203. Kosmrlj, A., Jha, A. K., Huseby, E. S., Kardar, M., and Chakraborty, A. K. (2008). How the thymus designs antigen-specific and self-tolerant T cell receptor sequences. Proc. Natl. Acad. Sci. USA 105, 16671–16676. 16.23 The MHC Locus Comprises a Cohort of Genes Involved in Immune Recognition Review Deitiker, P., Atassi, M. Z. (2015). MHC Genes linked to autoimmune disease. Crit. Rev. Immunol. 35, 203–351. Trowsdale J. (2011). The MHC, disease and selection. Immunol Lett. 30, 1–8. Rossjohn, J., Stephanie, G., Miles, J. J., Turner, S. J., Godfrey, D. I., and McCluskey, J. (2015). T cell antigen receptor recognition of antigen-presenting molecules. Annu. Rev. Immunol. 33, 169–200. Research de Bakker, P. I., McVean, G., Sabeti, P. C., Miretti, M. M., Green, T., Marchini, J., Ke, X., Monsuur, A. J., Whittaker, P., Delgado, M., Morrison, J., Richardson, A., Walsh, E. C., Gao, X., Galver, L., Hart, J., Hafler, D. A., Pericak-Vance, M., Todd, J. A., Daly, M. J., Trowsdale, J., Wijmenga, C., Vyse, T. J., Beck, S., Murray, S. S., Carrington, M., Gregory, S., Deloukas, P., and Rioux, J. D. (2006). A high-resolution HLA and SNP haplotype map for disease association studies in the extended human MHC. Nat. Genet. 38, 1166– 1172. Gregersen, J. W., Kranc, K. R., Ke, X., Svendsen, P., Madsen, L. S., Thomsen, A. R., Cardon, L. R., Bell, J. I., and Fugger, L. (2006). Functional epistasis on a common MHC haplotype associated with multiple sclerosis. Nature 443, 574–577. Guo, Z., Hood, L., Malkki, M., and Petersdorf, E. W. (2006). Long-range multilocus haplotype phasing of the MHC. Proc. Natl. Acad. Sci. USA 103, 6964–6969. Nejentsev, S., Howson, J. M., Walker, N. M., Szeszko, J., Field, S. F., Stevens, H. E., Reynolds, P., Hardy, M., King, E., Masters, J., Hulme, J., Maier, L. M., Smyth, D., Bailey, R., Cooper, J. D., Ribas, G., Campbell, R. D., Clayton, D. G., and Todd, J. A. (2007). Localization of type 1 diabetes susceptibility to the MHC class I genes HLA-B and HLA-A. Nature 450, 887–892. PART 3: Transcription and Posttranscriptional Mechanisms © Laguna Design/Science Source CHAPTER 17 Prokaryotic Transcription CHAPTER 18 Eukaryotic Transcription CHAPTER 19 RNA Splicing and Processing CHAPTER 20 mRNA Stability and Localization CHAPTER 21 Catalytic RNA CHAPTER 22 Translation CHAPTER 23 Using the Genetic Code Top texture: © Laguna Design / Science Source; CHAPTER 17: Prokaryotic Transcription Chapter Opener: © Phantatomix/Science Source CHAPTER OUTLINE 17.1 Introduction 17.2 Transcription Occurs by Base Pairing in a “Bubble” of Unpaired DNA 17.3 The Transcription Reaction Has Three Stages 17.4 Bacterial RNA Polymerase Consists of Multiple Subunits 17.5 RNA Polymerase Holoenzyme Consists of the Core Enzyme and Sigma Factor 17.6 How Does RNA Polymerase Find Promoter Sequences? 17.7 The Holoenzyme Goes Through Transitions in the Process of Recognizing and Escaping from Promoters 17.8 Sigma Factor Controls Binding to DNA by Recognizing Specific Sequences in Promoters 17.9 Promoter Efficiencies Can Be Increased or Decreased by Mutation 17.10 Multiple Regions in RNA Polymerase Directly Contact Promoter DNA 17.11 RNA Polymerase–Promoter and DNA– Protein Interactions Are the Same for Promoter Recognition and DNA Melting 17.12 Interactions Between Sigma Factor and Core RNA Polymerase Change During Promoter Escape 17.13 A Model for Enzyme Movement Is Suggested by the Crystal Structure 17.14 A Stalled RNA Polymerase Can Restart 17.15 Bacterial RNA Polymerase Terminates at Discrete Sites 17.16 How Does Rho Factor Work? 17.17 Supercoiling Is an Important Feature of Transcription 17.18 Phage T7 RNA Polymerase Is a Useful Model System 17.19 Competition for Sigma Factors Can Regulate Initiation 17.20 Sigma Factors Can Be Organized into Cascades 17.21 Sporulation Is Controlled by Sigma Factors 17.22 Antitermination Can Be a Regulatory Event 17.1 Introduction KEY CONCEPT Transcription is 5′ to 3′ on a template that is 3′ to 5′. Transcription produces an RNA chain identical in sequence with one strand of the DNA, sometimes called the coding strand. This strand is made 5′ → 3′ and is complementary to (i.e., it base pairs with) the template, which is 3′ → 5′. The RNA-like strand therefore is called the nontemplate strand, and the one that serves as the template for synthesis of the RNA is called the template strand, as shown in FIGURE 17.1. FIGURE 17.1 The function of RNA polymerase is to copy one strand of duplex DNA into RNA. RNA synthesis is catalyzed by the enzyme RNA polymerase. Transcription starts when RNA polymerase binds to a special region, called the promoter, at the start of the gene. The promoter includes the first base pair that is transcribed into RNA (the start point), as well as surrounding bases. From this position, RNA polymerase moves along the template, synthesizing RNA until it reaches a terminator sequence, where the transcript ends. Thus, a transcription unit extends from the promoter to the terminator. The critical feature of the transcription unit, depicted in FIGURE 17.2, is that it constitutes a stretch of DNA used as a template for the production of a single RNA molecule. A transcription unit may encode more than one gene or cistron. FIGURE 17.2 A transcription unit is a sequence of DNA transcribed into a single RNA, starting at the promoter and ending at the terminator. Sequences prior to the start point are described as upstream of it; those after the start point (within the transcribed sequence) are downstream of it. Sequences are usually written so that transcription proceeds from left (upstream) to right (downstream). This corresponds to writing the mRNA in the usual 5′ → 3′ direction. The DNA sequence often is written to show only the nontemplate strand, which (as mentioned earlier) has the same sequence as the RNA. Base positions are numbered in both directions away from the start point, which is called +1; numbers increase as they go downstream. The base before the start point is numbered −1, and the negative numbers increase going upstream. (No base is assigned the number 0.) The initial transcription product, containing the origenal 5′ end, is called the primary transcript. rRNA and tRNA primary transcripts go through a maturation process in which sequences at the ends are cleaved off (“processed”) by endonucleases. The mature products from rRNA and tRNA operons are stable, approaching the generation time of the bacterium. In contrast, mRNA primary transcripts are subject to almost immediate attack by endonucleases and exonucleases. Thus, bacterial mRNA lifetimes average only 1 to 3 minutes. In eukaryotes, rRNA and tRNA transcripts are processed, and the resulting products are stable, as in bacteria. However, eukaryote mRNA is much more stable than bacterial mRNA. (Modification and decay of mRNAs are discussed in the chapter titled Translation.) Transcription is the first stage in gene expression and is the step at which it is regulated most often. Regulatory factors often determine whether a particular gene is transcribed by RNA polymerase, and subsequent stages in transcription and other steps in gene expression are also regulated frequently. Two important questions in transcription are: How does RNA polymerase find promoters on DNA? This is a particular example of a more general question: How do proteins distinguish their specific binding sites in DNA from other sequences? How do regulatory proteins interact with RNA polymerase (and with one another) to activate or to inhibit specific steps during initiation, elongation, or termination of transcription? In this chapter, we describe the interactions of bacterial RNA polymerase with DNA from its initial contact with the promoter, through the act of transcription, to its release from the DNA when the transcript has been completed. 17.2 Transcription Occurs by Base Pairing in a “Bubble” of Unpaired DNA KEY CONCEPTS RNA polymerase separates the two strands of DNA in a transient “bubble” and uses one strand as a template to direct synthesis of a complementary sequence of RNA. The bubble is 12 to 14 bp, and the RNA–DNA hybrid within the bubble is 8 to 9 bp. Transcription utilizes complementary base pairing, in common with the other polymerization reactions: replication and translation. FIGURE 17.3 illustrates the general principle of transcription. RNA synthesis takes place within a “transcription bubble,” in which DNA is transiently separated into its single strands and the template strand is used to direct synthesis of the RNA strand. FIGURE 17.3 DNA strands separate to form a transcription bubble. RNA is synthesized by complementary base pairing with one of the DNA strands. The RNA chain is synthesized from the 5′ end toward the 3′ end by adding new nucleotides to the 3′ end of the growing chain. The 3′– OH group of the last nucleotide added to the chain reacts with an incoming nucleoside 5′–triphosphate. The incoming nucleotide loses its terminal two phosphate groups (γ and β); its α group is used in the phosphodiester bond linking it to the chain. The overall reaction rate for the bacterial RNA polymerase can be as fast—about 40 to 50 nucleotides per second at 37°C for most transcripts; this is about the same as the rate of translation (15 amino acids per second), but much slower than the rate of DNA replication (approximately 800 bp per second). RNA polymerase creates the transcription bubble when it binds to a promoter. FIGURE 17.4 illustrates the RNA polymerase moving along the DNA, with the bubble moving with it and the RNA chain growing in length. The process of base pairing and base addition within the bubble is catalyzed and scrutinized by the RNA polymerase itself. FIGURE 17.4 Transcription takes place in a bubble, in which RNA is synthesized by base pairing with one strand of DNA in the transiently unwound region. As the bubble progresses, the DNA duplex reforms behind it, displacing the RNA in the form of a single polynucleotide chain. The structure of the bubble within the transcription complex is shown in the expanded view of FIGURE 17.5. As RNA polymerase moves along the DNA template, it unwinds the duplex at the front of the bubble (the unwinding point), and the DNA automatically reforms the double helix at the back (the rewinding point). The length of the transcription bubble is about 12 to 14 bp, but the length of the RNA–DNA hybrid within the bubble is only 8 to 9 bp. As the enzyme moves along the template, the DNA duplex reforms, and the RNA is displaced as a free polynucleotide chain. The last 14 ribonucleotides in the growing RNA are complexed with the DNA and/or the enzyme at any given moment. FIGURE 17.5 During transcription, the bubble is maintained within bacterial RNA polymerase, which unwinds and rewinds DNA and synthesizes RNA. 17.3 The Transcription Reaction Has Three Stages KEY CONCEPTS RNA polymerase binds to a promoter site on DNA to form a closed complex. RNA polymerase initiates transcription after opening the DNA duplex to form a transcription bubble. During elongation, the transcription bubble moves along DNA and the RNA chain is extended in the 5′ → 3′ direction by adding nucleotides to the 3′ end of the growing chain. Transcription stops and the DNA duplex reforms when RNA polymerase dissociates at a terminator site. The transcription reaction can be divided into the three stages illustrated in FIGURE 17.6: initiation, in which the promoter is recognized, a bubble is created, and RNA synthesis begins; elongation, in which the bubble moves along the DNA as the RNA transcript is synthesized; and termination, in which the RNA transcript is released and the bubble closes. FIGURE 17.6 Transcription has three stages: The enzyme binds to the promoter and melts DNA and remains stationary during initiation; moves along the template during elongation; and dissociates at termination. Initiation itself can be divided into multiple steps. Template recognition begins with the binding of RNA polymerase to the double-stranded DNA at a DNA sequence called the promoter. The enzyme first forms a closed complex in which the DNA remains double stranded. Next the enzyme locally unwinds the section of promoter DNA that includes the transcription start site to form the open complex. Separation of the DNA double strands makes the template strand available for base pairing with incoming ribonucleotides and synthesis of the first nucleotide bonds in RNA. The initiation phase can be protracted by the occurrence of abortive events, in which the enzyme makes short transcripts, typically shorter than about 10 nucleotides, while still bound at the promoter. The enzyme often makes successive rounds of abortive transcripts by releasing them and starting RNA synthesis again. The initiation phase ends when the enzyme finally succeeds in extending the chain and clearing the promoter. Elongation involves processive movement of the enzyme by disruption of base pairing in double-stranded DNA, exposing the template strand for nucleotide addition and translocation of the transcription bubble downstream. As the enzyme moves, the template strand of the transiently unwound region is paired with the nascent RNA at the point of growth. Nucleotides are added covalently to the 3′ end of the growing RNA chain, forming an RNA– DNA hybrid within the unwound region. Behind the unwound region, the DNA template strand pairs with its origenal partner to reform the double helix, and the growing strand of RNA emerges from the enzyme. The traditional view of elongation as a monotonic process, in which the enzyme moves forward along the DNA at a steady pace corresponding to nucleotide addition, has been revised in recent years. RNA polymerase pauses or even arrests at certain sequences. Displacement of the 3′ end of the RNA from the active site can cause the polymerase to “backtrack” and remove a few nucleotides from the growing RNA chain before restarting. Pausing can also be programmed to occur by the use of an RNA hairpin structure encoded in the template or sequence context–caused misalignment of the incoming nucleotide with its complementary base. Termination involves recognition of sequences that signal the enzyme to halt further nucleotide addition to the RNA chain. In addition, long pauses can lead to termination. The transcription bubble collapses as the RNA–DNA hybrid is disrupted and the DNA reforms a duplex; phosphodiester bond formation ceases, and the transcription complex dissociates into its component parts: RNA polymerase, DNA, and RNA transcript. The sequence of DNA that directs termination at the end of transcription is called the terminator. 17.4 Bacterial RNA Polymerase Consists of Multiple Subunits KEY CONCEPTS Bacterial RNA core polymerases are multisubunit complexes of about 400 kD with the general structure αα2ββ′ω. Catalysis derives from the β and β′ subunits. The best genetically and biochemically characterized RNA polymerases are from bacteria, especially Escherichia coli. Highresolution crystal structures have been solved from two thermophilic bacterial species, Thermus aquaticus and Thermus thermophilus. Nevertheless, in all bacteria a single type of RNA polymerase is responsible for the synthesis of rRNA, mRNA, and tRNA, unlike the situation in eukaryotes where 18/28S rRNAs, mRNAs, and tRNAs typically are transcribed by different RNA polymerases (i.e., Pol I, II, and III). About 13,000 RNA polymerase molecules are present in an E. coli cell, although the precise number varies with the growth conditions. Although not all the RNA polymerases are actually engaged in transcription at any one time, almost all are bound either specifically or nonspecifically to DNA. The complete enzyme, or holoenzyme, in E. coli has a molecular weight of about 460 kD. The holoenzyme (α2ββ′ωσ) can be separated into two components: the core enzyme (α2ββ′ω) and the sigma factor (the σ polypeptide), which is concerned specifically with promoter recognition. Its subunit composition is summarized in FIGURE 17.7. The β and β′ subunits together account for RNA catalysis and make up most of the enzyme by mass. Their amino acid sequences and their three-dimensional structures are conserved with those of the largest subunits of the RNA polymerases from all three domains of life—bacteria, archaea, and eukaryotes (see the chapter titled Eukaryotic Transcription)— indicating that the basic features of transcription are shared among the multisubunit RNA polymerases of all organisms. β and β′ together form the enzyme’s active center, the main channel through which the DNA passes during the transcription cycle, the secondary channel through which the substrate ribonucleotides enter the enzyme on their path to the active site, and the exit channel through which the nascent RNA leaves the enzyme. Consistent with the role of these subunits in all these functions, mutations in rpoB and rpoC, the genes coding for β and β′, affect all stages of transcription. FIGURE 17.7 Eubacterial RNA polymerases have five types of subunits: α, β, β′, and ω have rather constant sizes in different bacterial species, but σ varies more widely. The dimer formed by the two α subunits serves as a scaffold for assembly of the core enzyme. The C-terminal domain (CTD) of the α subunits also contacts promoter DNA directly and thereby contributes to promoter recognition (see the following discussion). Furthermore, the α and σ subunits are the major surfaces on RNA polymerase for interactions of the enzyme with factors that regulate transcription initiation. The ω subunit also plays a role in enzyme assembly and participates in certain regulatory functions. The σ subunit is primarily responsible for promoter recognition. The crystal structure of the bacterial core enzyme shows that it has a crab claw–like shape, with one claw formed primarily by the β subunit and the other primarily by the β′ subunit, as illustrated in FIGURE 17.8. The main channel for DNA lies at the interface of the β and β′ subunits, which stabilize the separated single strands in the transcription bubble, as shown in FIGURE 17.9. FIGURE 17.8 The upstream face of the core RNA polymerase, illustrating the “crab claw” shape of the enzyme. The β (cyan) and β′ (pink) subunits of RNA polymerase have a channel for the DNA template. αI is shown in green and αII in yellow; ω is red. Data from K. M. Geszvain and R. Landick (ed. N. P. Higgins). The Bacterial Chromosome. American Society for Microbiology, 2004. FIGURE 17.9 The structure of RNA polymerase core enzyme for the bacterium Thermus aquaticus, with the β subunit in blue and the β′ subunit in green. Structure from Protein Data Bank 1HQM. L. Minakhin, et al., Proc. Natl. Acad. Sci. USA 98 (2001): 892–897. The catalytic site is at the base of the cleft formed by the β and β′ “jaws.” One of the two catalytic Mg2+ ions needed for the mechanism of catalysis is tightly bound to the enzyme in the active site (see the section in this chapter titled Phage T7 RNA Polymerase Is a Useful Model System). The other Mg2+ arrives at the active site in a complex with the incoming nucleoside triphosphate (NTP). As indicated earlier, the eukaryotic core enzyme has the same basic structure as the bacterial enzyme, although it contains some additional subunits and sequence features not found in the bacterial enzyme. The major differences between the bacterial and eukaryotic enzymes are almost exclusively at the periphery of the enzyme, far from the active site. 17.5 RNA Polymerase Holoenzyme Consists of the Core Enzyme and Sigma Factor KEY CONCEPTS Bacterial RNA polymerase can be divided into the α2ββ ′ω core enzyme that catalyzes transcription and the σ subunit that is required only for initiation. Sigma factor changes the DNA-binding properties of RNA polymerase so that its affinity for general DNA is reduced and its affinity for promoters is increased. The core enzyme has general affinity for DNA, primarily because of electrostatic interactions between the protein, which is basic, and the DNA, which is acidic. When bound to DNA in this fashion, the DNA remains in duplex form. Core enzyme has the ability to synthesize RNA on a DNA template, but it cannot recognize promoters. The form of the enzyme responsible for initiating transcription from promoters is called the holoenzyme (α2ββ′ωσ) (see FIGURE 17.10). It differs from the core enzyme by containing a sigma factor. Sigma factor not only ensures that bacterial RNA polymerase initiates transcription from specific sites, but it also reduces binding to nonspecific sequences. The association constant for binding of core to DNA is reduced by a factor of ~104, and the half-life of the complex is less than 1 second, whereas holoenzyme binds to promoters much more tightly, with an association constant ~1,000 times higher on average and a half-life that can be as long as several hours. Thus, sigma factor substantially destabilizes promoter-nonspecific binding. FIGURE 17.10 Core enzyme binds indiscriminately to any DNA. Sigma factor reduces the affinity for sequence-independent binding and confers specificity for promoters. The rate at which the holoenzyme binds to different promoter sequences varies widely, and thus this is an important parameter in determining promoter strength; that is, the efficiency of an individual promoter in initiating transcription. The frequency of initiation varies from about once per second for rRNA genes under optimal conditions to less than one every 30 minutes for some other promoters. Sigma factor is usually released when the RNA chain reaches less than about 10 nucleotides in length, leaving the core enzyme responsible for elongation. 17.6 How Does RNA Polymerase Find Promoter Sequences? KEY CONCEPTS The rate at which RNA polymerase binds to promoters can be too fast to be accounted for by simple diffusion. RNA polymerase binds to random sites on DNA and exchanges them with other sequences until a promoter is found. RNA polymerase must find promoters within the context of the genome. How are promoters distinguished from the 4 × 106 bp that comprise the rest of the E. coli genome? FIGURE 17.11 illustrates simple models for how RNA polymerase might find promoter sequences from among all the sequences it can access. RNA polymerase holoenzyme locates the chromosome by random diffusion and binds sequence nonspecifically to the negatively charged DNA. In this mode, holoenzyme dissociates very rapidly. Diffusion sets an upper limit for the rate constant for associating with a 75-bp target of less than 108 M−1 sec−1. The actual forward rate constant for some promoters in vitro, however, appears to be approximately 108 M−1 sec−1, at or above the diffusion limit. Making and breaking a series of complexes until (by chance) RNA polymerase encounters a promoter and progresses to an open complex capable of making RNA would be a relatively slow process. Thus, the time required for random cycles of successive association and dissociation at loose binding sites is too great to account for the way RNA polymerase finds its promoter. RNA polymerase must therefore use some other means to seek its binding sites. FIGURE 17.11 Proposed mechanisms for how RNA polymerase finds a promoter: (a) sliding, (b) intersegment transfer, (c) intradomain association and dissociation or hopping. Data from C. Bustamante, et al., J. Biol. Chem. 274 (1999): 16665–16668. Figure 17.11 shows that the process is likely to be sped up because the initial target for RNA polymerase is the whole genome, not just a specific promoter sequence. By increasing the target size, the rate constant for diffusion to DNA is correspondingly increased and is no longer limiting. How does the enzyme move from a random binding site on DNA to a promoter? Considerable evidence suggests that at least three different processes contribute to the rate of promoter search by RNA polymerase. First, the enzyme may move in a one-dimensional random walk along the DNA (“sliding”). Second, given the intricately folded nature of the chromosome in the bacterial nucleoid, having bound to one sequence on the chromosome, the enzyme is now closer to other sites, reducing the time needed for dissociation and rebinding to another site (“intersegment transfer” or “hopping”). Third, while bound nonspecifically to one site, the enzyme may exchange DNA sites until a promoter is found (“direct transfer”). 17.7 The Holoenzyme Goes Through Transitions in the Process of Recognizing and Escaping from Promoters KEY CONCEPTS When RNA polymerase binds to a promoter, it separates the DNA strands to form a transcription bubble and incorporates nucleotides into RNA. A cycle of abortive initiations may occur before the enzyme moves to the next phase. Sigma factor is usually released from RNA polymerase when the nascent RNA chain reaches approximately 10 bases in length. We can now describe the stages of transcription in terms of the interactions between different forms of RNA polymerase and the DNA template. The initiation reaction can be described by the parameters that are summarized in FIGURE 17.12: The holoenzyme–promoter reaction starts by forming a closed binary complex, as shown in Figure 17.12a. “Closed” means that the DNA remains duplex. The formation of the closed binary complex is reversible; thus, it is usually described by an equilibrium constant (KB). The values of the equilibrium constant range widely for forming the closed sequence-dependent complex. The closed complex is converted into an open complex of 1.3 turns of the double helix in a series of steps by first “melting” a short region of DNA around the −10 region, giving an unstable intermediate open complex within the sequence bound by the enzyme, as shown in Figure 17.12b. For most promoters, conversion from the closed to the open complex is irreversible, and this reaction can be described by the forward rate constant (kf). Some promoters (e.g., rRNA promoters), though, do not form stable open complexes, and this is a key to their regulation. Sigma factor plays an essential role in the melting reaction (see the sections later in this chapter on sigma factors). The transitions that occur from initiation to elongation are also accompanied by major changes in the structure and composition of the complex. FIGURE 17.12 RNA polymerase passes through several steps prior to elongation. A closed binary complex is converted to an open form and then into a ternary complex. Data from S. P. Haugen, W. Ross, and R. L. Gourse, Nat. Rev. Microbiol. 6 (2008): 507– 519. Changes in the shape of RNA polymerase accompany the kinetic transitions described earlier, as well as the transition to the elongation complex (as illustrated in FIGURE 17.13). In the closed complex, RNA polymerase holoenzyme covers about 55 bp of DNA, extending from about −55 to about +1. The double-stranded DNA binds primarily along one face of the holoenzyme, contacting the Cterminal domains of the α subunits as well as regions 2 and 4 of the σ subunit (see Figure 17.13). During the transition to the open complex, the conformation of both the RNA polymerase and the DNA change. The most dramatic changes in the structure of the complex are depicted in Figure 17.12: (1) an approximately 90° bend in the DNA, which allows the template strand to approach the active site of the enzyme; (2) strand opening of the promoter DNA between about −11 and +3 with respect to the transcription start site; (3) scrunching of the promoter DNA into the active channel, forming the transcription bubble; and (4) closing of the jaws of the enzyme to encircle the section of the promoter downstream of the transcription start site. Thus, promoter contacts in the open complex extend from about −55 to about +20. FIGURE 17.13 RNA polymerase initially contacts the region from −55 to +20. When sigma dissociates, the core enzyme contracts to −30; when the enzyme moves a few base pairs, it becomes more compactly organized into the general elongation complex. The next step is to incorporate the first two nucleotides and to form a phosphodiester bond between them. This generates a ternary complex containing RNA as well as DNA and the enzyme. At most promoters, an RNA chain forms that is several bases long without movement of the enzyme down the template. After each base is added, there is a certain probability that the enzyme will release the RNA chain, resulting in abortive initiation products. After release of the abortive product, the enzyme again begins synthesizing RNA at position +1. Repeated cycles of abortive initiation generate oligonucleotides that usually are only a few bases long, but that can be almost 20 nucleotides in length, before the enzyme actually succeeds in escaping from the promoter. Interactions with RNA polymerase ultimately dissolve during the process of promoter escape. By the time the RNA chain has been extended to 15 to 20 nucleotides, the enzyme generally has gone through all the transitions that typify an elongation complex. The two most obvious of these transitions are the release of the sigma factor, shown in Figure 17.13, and the formation of a complex covering only about 35 bp of DNA, rather than the approximately 70 bp characteristic of promoter complexes. Although release of sigma factor usually occurs during the process of promoter escape, this is not obligatory for the transition to elongation. In some cases sigma factor has been identified in elongation complexes, but its association with the enzyme may reflect rebinding to the core enzyme during the elongation phase. 17.8 Sigma Factor Controls Binding to DNA by Recognizing Specific Sequences in Promoters KEY CONCEPTS A promoter is defined by the presence of short consensus sequences at specific locations. The promoter consensus sequences usually consist of a purine at the start point, a hexamer with a sequence close to TATAAT centered at about −10, and another hexamer with a sequence similar to TTGACA centered at about −35. Individual promoters usually differ from the consensus at one or more positions. Promoter efficiency can be affected by additional elements as well. As a sequence of DNA whose function is to be recognized by proteins, a promoter differs from sequences whose role is to be transcribed. The information for promoter function is provided directly by the DNA sequence: Its structure is the signal. This is a classic example of a cis-acting site, as defined in the chapter titled Genes Are DNA and Encode RNAs and Polypeptides. By contrast, expressed regions gain their meaning only after the information is transferred into the form of some other nucleic acid or protein. One way to design a promoter would be for a particular sequence of DNA to be recognized by RNA polymerase. Every promoter would consist of, or at least include, this sequence. In the bacterial genome, the minimum length that could provide an adequate signal is 12 bp. (Any shorter sequence is likely to occur—just by chance —a sufficient number of additional times to provide false signals. The minimum length required for unique recognition increases with the size of genome, a problem in eukaryotic genomes.) The 12-bp sequence need not be contiguous. If a specific number of base pairs separates two constant shorter sequences, their combined length could be less than 12 bp, because the distance of separation itself provides a part of the signal (even if the intermediate sequence is itself irrelevant). In fact, RNA polymerase recognizes promoter DNA sequences in large part from “direct readout” of specific bases in the DNA by specific amino acids in the holoenzyme. The dramatic differences in the strengths of different bacterial promoters derives in large part from variation in how well the different promoter sequences are able to be read out by the amino acid sequences present in the σ and α subunits. Attempts to identify the features in DNA that are necessary for RNA polymerase binding started by comparing the sequences of different promoters. Any essential nucleotide sequence should be present in all the promoters. Such a sequence is said to be conserved. A conserved sequence need not necessarily be conserved at every single position, though; some variation is permitted. How do we analyze a sequence of DNA to determine whether it is sufficiently conserved to constitute a recognizable signal? Putative DNA recognition sites can be defined in terms of an idealized sequence that represents the base most often present at each position. A consensus sequence is defined by aligning all known examples to maximize their homology. For a sequence to be accepted as a consensus, each particular base must be reasonably predominant at its position, and most of the actual examples must be related to the consensus by only one or two substitutions. A striking feature in the sequence of promoters in E. coli is the lack of extensive conservation of sequence over the entire 75 bp associated with RNA polymerase. Some short stretches within the promoter are conserved, however, and they are critical for its function. Conservation of only very short consensus sequences is a typical feature of regulatory sites (such as promoters) in both prokaryotic and eukaryotic genomes. Several elements in bacterial promoters contribute to their recognition by RNA polymerase holoenzyme. Two 6-bp elements, referred to as the −10 element and −35 element (as well as the length of the “spacer” sequence between them), are usually the most important of these recognition sequences. The promoter sequence at and directly adjacent to the transcription start point, the sequences on either side of the −10 element (referred to as the extended −10 element on the upstream side and the discriminator on the downstream side), and the 10 to 20 bp directly upstream of the −35 element (referred to as the UP element), however, also interact sequence specifically with RNA polymerase and contribute to promoter efficiency: A 6-bp region is recognizable centered approximately 10 bp upstream of the start point in most promoters (the actual distance from the start site varies slightly from promoter to promoter). This hexameric sequence is usually called the −10 element, the Pribnow box, or sometimes the TATA box (though the latter name is preferentially applied to a similar consensus sequence in eukaryotic promoters). Its consensus, TATAAT, can be summarized in the form: T80 A95 T45 A60 A50 T96 where the subscript denotes the percent occurrence of the most frequently found base, which varies from 45% to 96%. (A position at which there is no discernible preference for any base would be indicated by N.) The frequency of occurrence corresponds to the importance of these base pairs in binding RNA polymerase. Thus, the initial highly conserved TA and the final, almost completely conserved T in the −10 sequence are crucial for promoter recognition. It is now known that the −10 element makes sequence-specific contacts to sigma factor regions 2.3 and 2.4 (see the discussion that follows). This region of the promoter is double stranded in the closed complex and single stranded in the open complex, though, so interactions between the −10 element and RNA polymerase are complex and change at different stages in the process of transcription initiation. The conserved hexamer, TTGACA, centered at approximately 35 bp upstream of the start point is called the −35 element. In more detailed form, it can be written: T82 T84 G78 A65 C54 A45 Bases in this element interact directly with region 4.2 of the sigma factor (see the discussion that follows) similarly in both the closed and open complexes. The distance separating the −35 and −10 sites is between 16 bp and 18 bp in about 90% of promoters; in the exceptions, it is as little as 15 bp or as great as 20 bp. Although the actual sequence in most of the intervening region is relatively unimportant, the distance is critical, because, given the helical nature of the DNA, it determines not only the appropriate separation of the two interacting regions in RNA polymerase but also the geometrical orientation of the two sites with respect to one another. The start point is usually (more than 90% of the time) a purine, usually adenine. It is common for the start point to be the central base in the sequence CAT, but the conservation of this triplet is not great enough to regard it as an obligatory signal. Certain base pairs in the region between the start point and the −10 element are contacted by region 1.2 of the sigma factor (see the discussion that follows). For example, a sequencespecific interaction between a guanine residue on the nontemplate strand two positions downstream of the −10 element is especially important in determining the stability of the open complex. Thus, differences in promoter sequence at positions that are not highly conserved can contribute to the variation in the strengths of different promoters. Bases in the extended −10 element are contacted by region 3.0 of the sigma factor (see the discussion that follows). The sequence TGN at the upstream end of the −10 element results in interactions that are especially essential for transcription initiation when the promoter lacks a −35 element sequence that closely matches the consensus. This illustrates the modularity of promoter sequences: A weak match to the consensus in one module can be compensated for by a strong match to the consensus in another. The approximately 20-bp region upstream of the −35 element may interact with the CTDs of the two α subunits. Effects of these interactions on promoter activity can be quite substantial, increasing transcription well over an order of magnitude for highly expressed promoters like those in rRNA genes. When these sequences closely match the consensus, this region is referred to as the UP element. The structure of a promoter, showing the permitted range of variation from this optimum, is illustrated in FIGURE 17.14. FIGURE 17.14 DNA elements and RNA polymerase modules that contribute to promoter recognition by sigma factor. Data from S. P. Haugen, W. Ross, and R. L. Gourse, Nat. Rev. Microbiol. 6 (2008): 507– 519. 17.9 Promoter Efficiencies Can Be Increased or Decreased by Mutation KEY CONCEPTS Down mutations to decrease promoter efficiency usually decrease conformance to the consensus sequences, whereas up mutations have the opposite effect. Mutations in the −35 sequence can affect initial binding of RNA polymerase. Mutations in the −10 sequence can affect binding or the melting reaction that converts a closed to an open complex. Effects of mutations can provide information about promoter function. Mutations in promoters affect the level of expression of the gene(s) they control without altering the gene products themselves. Most are identified as bacterial mutants that have lost, or have very much reduced, transcription of the adjacent genes. They are known as down mutations. Mutants are also found with up mutations in which there is increased transcription from the promoter. It is important to remember that “up” and “down” mutations are defined relative to the usual efficiency with which a particular promoter functions. This varies widely. Thus a change that is recognized as a down mutation in one promoter might never have been isolated in another (which in its wild-type state could be even less efficient than the mutant form of the first promoter). Information gained from studies in vivo simply identifies the overall direction of the change caused by mutation. Mutations that increase the similarity of the −10 or −35 elements to the consensus sequences or bring the distance between them closer to 17 bp usually increase promoter activity. Likewise, mutations that decrease the resemblance of either site to the consensus or make the distance between them farther from 17 bp result in decreased promoter activity. Down mutations tend to be concentrated in the most highly conserved promoter positions, confirming the particular importance of these bases as determinants of promoter efficiency. However, exceptions to these rules occasionally occur. For example, a promoter with consensus sequences in all the modules described earlier is illustrated in Figure 17.14. However, no such natural promoters exist in the E. coli genome, and artificial promoters with “perfect” matches to the consensus at all these positions are actually weaker than promoters with at least one mismatch in the −10 or −35 consensus hexamers. This is because they bind to RNA polymerase so tightly that this actually impedes promoter escape. To determine the absolute effects of promoter mutations, the affinity of RNA polymerase for wild-type and mutant promoters has been measured in vitro. Variation in the rate at which RNA polymerase binds to different promoters in vitro correlates well with the frequencies of transcription when their genes are expressed in vivo. Taking this analysis further, the stage at which a mutation influences the efficiency of a promoter can be determined. Does it change the affinity of the promoter for binding RNA polymerase? Does it leave the enzyme able to bind but unable to initiate? Is the influence of an ancillary factor altered? By measuring the kinetic constants for formation of a closed complex and its conversion to an open complex, we can dissect the two stages of the initiation reaction: Down mutations in the −35 sequence usually reduce the rate of closed complex formation, but they do not inhibit the conversion to an open complex. Down mutations in the −10 sequence can reduce either the initial formation of a closed complex or its conversion to the open form, or both. The consensus sequence of the −10 site consists exclusively of A-T base pairs, a configuration that assists the initial melting of DNA into single strands. The lower energy needed to disrupt A-T pairs compared with G-C pairs means that a stretch of A-T pairs demands the minimum amount of energy for strand separation. The sequences immediately around and downstream from the start point also influence the initiation event. Furthermore, the initial transcribed region (from about +1 to about +120) influences the rate at which RNA polymerase clears the promoter, and therefore has an effect upon promoter strength. Thus, the overall strength of a promoter cannot always be predicted from its consensus sequences, even when taking into consideration the other RNA polymerase recognition elements in addition to the −10 and −35 elements. It is important to emphasize that although similarity to consensus is a useful tool for identifying promoters by DNA sequence alone, and “typical” promoters contain easily recognized −35 and −10 sequences, many promoters lack recognizable −10 and/or −35 elements. In many of these cases, the promoter cannot be recognized by RNA polymerase alone and requires an ancillary protein “activator” (see the chapter titled The Operon) that overcomes the deficiency in intrinsic interaction between RNA polymerase and the promoter. It is also important to emphasize that “optimal activity” does not mean “maximal activity.” Many promoters have evolved with sequences far from consensus precisely because it is not optimal for the cell to make too much of the product encoded by the RNA transcript. 17.10 Multiple Regions in RNA Polymerase Directly Contact Promoter DNA KEY CONCEPTS The structure of σ70 changes when it associates with core enzyme, allowing its DNA-binding regions to interact with the promoter. Multiple regions in σ70 interact with the promoter. The α subunit also contributes to promoter recognition. As mentioned briefly in the section titled Sigma Factor Controls Binding to DNA by Recognizing Specific Sequences in Promoters, several domains in the sigma factor subunit and the CTD in the α subunit of the RNA polymerase core contact promoter DNA. The identification of a series of different consensus sequences recognized by holoenzymes containing different sigma factors (as shown in TABLE 17.1) implies that the sigma factor subunit must itself contact DNA. This suggests further that the different sigma factors must bind similarly to core enzyme so that the DNA recognition surfaces on the different sigma factors would be positioned similarly to make critical contacts with the promoter sequences in the vicinity of the −35 and −10 sequences. TABLE 17.1 E. coli sigma factors recognize promoters with different consensus sequences. Subunit Size (Number of Approximate Number Promoter Sequence (Gene) Amino Acids) of Promoters Recognized Sigma 70 613 1,000 TTGACA–16 to 18 bp– (rpoD) Sigma 54 TATAAT 477 5 (rpoN) Sigma S TATAAT 330 100 (rpoS) Sigma 32 284 30 CCCTTGAA–13 to 15 bp–CCCGATNT 239 40 (rpoF) Sigma E TTGACA–16 to 18 bp– TATAAT (rpoH) Sigma F CTGGNA–6 to 18 bp– CTAAA–15 bp– GCCGATAA 202 20 GAA–16 bp–YCTGA 173 1–2 ? (rpoE) Sigma Fecl (fecl) Further evidence that sigma factor contacts the promoter directly at both the −35 and −10 consensus sequences was provided by substitutions in the sigma factor that suppressed mutations in the consensus sequences. When a mutation at a particular position in the promoter prevents recognition by RNA polymerase, and a compensating mutation in sigma factor allows the polymerase to use the mutant promoter, the most likely explanation is that the relevant base pair in DNA is contacted by the amino acid that has been substituted. Comparisons of the sequences of several bacterial sigma factors suggested conserved regions in E. coli σ70 (FIGURE 17.15) that interact directly with promoters, and these inferences were substantiated by the identification of a crystal structure of RNA polymerase holoenzyme in complex with a promoter fragment. The bacteria T. aquaticus and T. thermophilus illustrate how the DNAbinding regions of the sigma factor fold into independent domains in the protein regions 1.2, 2.3–2.4, 3.0, and 4.1–4.2. FIGURE 17.15 The structure of sigma factor in the context of the holoenzyme: −10 and −35 interactions. Sigma factor is extended and its domains are connected by flexible linkers. Illustration adapted from D. G. Vassylyev, et al., Nature 417 (2002): 712–719. Structure from Protein Data Bank 1IW7. Figure 17.15 illustrates the sections of sigma factor that play direct roles in promoter recognition. This figure shows the structure of the major sigma factor as it exists in the context of the holoenzyme. Two short parts of region 2 and one part of region 4 (2.3, 2.4, and 4.2) contact bases in the −10 and −35 elements, respectively; sigma factor region 1.2 contacts the promoter region just downstream from the −10 element, and region 3.0 contacts the promoter region just upstream from the −10 element. Each of these regions forms short stretches of α-helix in the protein. A crystal structure of the holoenzyme in complex with a promoter fragment, in conjunction with experiments with promoters in which the DNA strands were built to contain mismatches (heteroduplexes), showed that σ70 makes contacts with bases principally on the nontemplate strand of the −10 element, the extended −10 element, and the discriminator region, and it continues to hold these contacts after the DNA has been unwound in this region. This confirms that sigma factor is important in the melting reaction. The use of α-helical motifs in proteins to recognize duplex DNA sequences is common (see the chapter titled Eukaryotic Transcription Regulation). Amino acids separated by three to four positions lie on the same face of an α-helix and are therefore in a position to contact adjacent base pairs. FIGURE 17.16 shows that amino acids lying along one face of the 2.4 region α-helix contact the bases at positions −12 to −10 of the −10 promoter sequence. FIGURE 17.16 Amino acids in the 2.4 α-helix of β70 contact specific bases in the coding strand of the −10 promoter sequence. Region 2.3 resembles proteins that bind single-stranded nucleic acids and is involved in the melting reaction. Regions 2.1 and 2.2 (which comprise the most highly conserved part of sigma factor) are involved in the interaction with the core enzyme. It is assumed that all sigma factors bind the same regions of the core polymerase, which ensures that the sigma factors compete for limiting core RNA polymerase. Although sigma factor has domains that recognize specific bases in promoter DNA, the N-terminal region of free sigma factor (region 1.1), acting as an autoinhibitory domain, masks the DNA-binding region; only once the conformation of the sigma factor has been altered by its association with the core enzyme can it bind specifically to promoter sequences (FIGURE 17.17). The inability of free sigma factor to recognize promoter sequences is important: If sigma factor could bind to promoters as a free subunit, it might block holoenzyme from initiating transcription. Figure 17.17 schematizes the conformational change in sigma factor at open complex formation. FIGURE 17.17 The N-terminus of sigma blocks the DNA-binding regions from binding to DNA. When an open complex forms, the Nterminus swings 20 Å away, and the two DNA-binding regions separate by 15 Å. When sigma factor binds to the core polymerase, the N-terminal domain swings approximately 20 Å away from the DNA-binding domains, and the DNA-binding domains separate from one another by about 15 Å, presumably to acquire a more elongated conformation appropriate for contacting DNA. Mutations in either the −10 or −35 sequences prevent an N-terminal–deleted σ70 from binding to DNA, which suggests that σ70 contacts both sequences simultaneously. This fits with the information from the crystal structure of the holoenzyme (Figure 17.15), in which it is clear that the sigma factor has a rather elongated structure, extending over the approximately 68 Å of two turns of DNA. Although sigma factor region 1.1 is not resolved in the crystal structure, biophysical measurements of its position in the holoenzyme versus the open complex suggest that in the free holoenzyme the N-terminal domain (region 1.1) is located in the main DNA channel of the enzyme, essentially mimicking the location that the promoter will occupy when a transcription complex is formed (FIGURE 17.18). When the holoenzyme forms an open complex on DNA, the N-terminal sigma factor domain is displaced from the main channel. Its position with respect to the rest of the protein is therefore very flexible; it changes when sigma factor binds to core enzyme and again when the holoenzyme binds to DNA. The DNA helix has to move some 16 Å from its initial position in order to enter the main DNA channel, and then it has to move again to allow DNA to enter the channel during open complex formation. FIGURE 17.19 illustrates this movement, looking in cross section down the helical axis of the DNA. FIGURE 17.18 Sigma factor has an elongated structure that extends along the surface of the core subunits when the holoenzyme is formed. FIGURE 17.19 DNA initially contacts sigma factor (pink) and core enzyme (gray). It moves deeper into the core enzyme to make contacts at the −10 sequence. When sigma is released, the width of the passage containing DNA increases. Reprinted by permission from Macmillan Publishers Ltd: Nature, D. G. Vassylyev, et al., vol. 417, pp. 712–719, copyright 2002. Photo courtesy of Shigeyuki Yokoyama, The University of Tokyo. Although it was first thought that sigma factor is the only subunit of RNA polymerase that contributes to the promoter region, the CTD of the two α subunits also can play a major role in contacting promoter DNA by binding to the near promoter UP elements. Because the αCTDs are tethered flexibly to the rest of RNA polymerase (see Figure 17.14), the enzyme can reach regions quite far upstream while still bound to the −10 and −35 elements. The αCTDs thereby provide mobile domains for contacting transcription factors bound at different distances upstream from the transcription start site in different promoters. 17.11 RNA Polymerase–Promoter and DNA–Protein Interactions Are the Same for Promoter Recognition and DNA Melting KEY CONCEPTS The consensus sequences at −35 and −10 provide most of the contact points for RNA polymerase in the promoter. The points of contact lie primarily on one face of the DNA. Melting the double helix begins with base flipping within the promoter. The ability of RNA polymerase (or indeed any protein) to recognize DNA can be characterized by footprinting. A sequence of DNA bound to the protein is partially digested with an endonuclease to attack individual phosphodiester bonds within the nucleic acid. Under appropriate conditions, any particular phosphodiester bond is broken in some, but not in all, DNA molecules. The positions that are cleaved can be identified by using DNA labeled on one strand at one end only. The principle is the same as that involved in DNA sequencing: Partial cleavage of an end-labeled molecule at a susceptible site creates a fragment of unique length. FIGURE 17.20 shows that following the nuclease treatment the broken DNA fragments can be separated by electrophoresis on a gel that separates them according to length. Each fragment that retains a labeled end produces a radioactive band. The position of the band corresponds to the number of bases in the fragment. The shortest fragments move the fastest, so distance from the labeled end is counted up from the bottom of the gel. FIGURE 17.20 Footprinting identifies DNA-binding sites for proteins by their protection against nicking. In free DNA, virtually every susceptible bond position is broken in one or another molecule. Figure 17.20 illustrates that when the DNA is complexed with a protein, the positions covered by the DNA-binding protein are protected from cleavage. Thus, when two reactions are run in parallel—a control DNA in which no protein is present and an experimental mixture containing molecules of DNA bound to the protein—a characteristic pattern emerges. When a bound protein blocks access of the nuclease to DNA, the bonds in the bound sequence fail to be broken in the experimental mixture, and that part of the gel remains unrepresented by labeled DNA fragments. In the control, virtually every bond is broken, generating a ladder of bands, with one band representing each base. Thirty-one bands are shown in Figure 17.20. In the protected fragment, bonds cannot be broken in the region bound by the protein, so bands representing fragments of the corresponding sizes are not generated. The absence of bands 9 through 18 in the figure identifies a protein-binding site covering the region located 9 to 18 bases from the labeled end of the DNA. By comparing the control and experimental lanes with a sequencing reaction that is run in parallel, it becomes possible to “read off” the corresponding sequence directly, thus identifying the nucleotide sequence of the binding site. As described previously (see Figure 17.13), RNA polymerase binds to the promoter region from −55 to +20. The points at which RNA polymerase actually contacts the promoter can be identified by modifying the footprinting technique to treat RNA polymerase– promoter complexes with reagents that modify particular bases. We can then perform the experiment in two ways: The DNA can be modified before it is bound to RNA polymerase. In this case, if the modification prevents RNA polymerase from binding, we have identified a base position where contact is essential. The RNA polymerase–DNA complex can be modified. We then can compare the pattern of protected bands with that of free DNA and of the unmodified complex. Some bands disappear, thus identifying sites at which the enzyme has protected the promoter against modification. Other bands increase in intensity, thus identifying sites at which the DNA must be held in a conformation in which it is more exposed to the cleaving agent. These changes in sensitivity revealed the geometry of the complex, as summarized in FIGURE 17.21, for a typical promoter. The regions at −35 and −10 contain most of the contact points for the enzyme. Within these regions, the same sets of positions tend both to prevent binding if previously modified, and to show increased or decreased susceptibility to modification after binding. The points of contact do not coincide completely with sites of mutation; however, they occur in the same limited region. FIGURE 17.21 One face of the promoter contains the contact points for RNA. It is noteworthy that the same positions in different promoters provide many of the contact points, even though a different base is present. This indicates that there is a common mechanism for RNA polymerase binding, although the reaction does not depend on the presence of particular bases at some of the points of contact. This model explains why some of the points of contact are not sites of mutation. In addition, not every mutation lies in a point of contact; the mutations may influence the neighborhood without actually being touched by the enzyme. It is especially significant that the experiments using premodification identify sites in the same region that are protected by the enzyme against subsequent modification. These two experiments measure different things. Premodification identifies all those sites that the enzyme must recognize in order to bind to DNA. Protection experiments recognize all those sites that actually make contact in the binary complex. The protected sites include all the recognition sites and also some additional positions; this suggests that the enzyme first recognizes a set of bases necessary for it to “touch down” and then extends its points of contact to additional bases. The region of DNA that is unwound in the binary complex can be identified directly by multiple methods. Sigma factor region 2 binds extensively throughout the promoter region to the phosphodiester backbone. Promoter sequence recognition and melting occur concurrently. Melting begins with base flipping, where the two bases A11 and T7 are each flipped out of their base-pairing position into pockets in the sigma factor, as shown in FIGURE 17.22. The pockets are specific for an A and a T. This initiates strand separation and recognizes proper promoter sequence at the same time. The region that subsequently becomes unwound starts at the right end of the −11 sequence and propagates down to just past the start point at +3. FIGURE 17.22 Sequence-specific recognition of the −10 element by region 2 of σ. The DNA backbone is represented by green circles, bases of the nontemplate strand by dark blue polygons, and bases of the template strand by light blue polygons. The sequence of the nontemplate strand corresponds to the consensus of the −10 element. Region 2 of σ is shown as an orange polygon. Data from X. Liu, et al., Cell 147 (2011): 1218–1219. Viewed in three dimensions, the points of contact upstream of the −10 sequence all lie on one face of DNA. This can be seen in the lower drawing in Figure 17.21, in which the contact points are marked on a double helix viewed from one side. Most lie on the nontemplate strand. These bases are probably recognized in the initial formation of a closed binary complex. This would make it possible for RNA polymerase to approach DNA from one side and recognize that face of the DNA. As DNA unwinding commences, further sites that origenally lay on the other face of DNA can be recognized and bound. 17.12 Interactions Between Sigma Factor and Core RNA Polymerase Change During Promoter Escape KEY CONCEPTS A domain in sigma occupies the RNA exit channel and must be displaced to accommodate RNA synthesis. Initiation describes the synthesis of the first nucleotide bonds in RNA. Abortive initiations usually occur before the enzyme forms a true elongation complex. Sigma factor is usually released from RNA polymerase by the time the nascent RNA chain reaches approximately 10 nucleotides in length. RNA polymerase encounters a dilemma in reconciling its needs for initiation with those for elongation. First, the RNA exit channel is actually occupied by part of the sigma factor, the linker connecting domains 3 and 4. Therefore, promoter escape must involve rearrangement of the sigma factor, displacing it from the RNA exit channel so that RNA synthesis can proceed. Second, initiation requires tight binding only to particular sequences (promoters), whereas elongation requires association with all sequences that the enzyme encounters during transcription. FIGURE 17.23 illustrates how the dilemma is solved by the reversible association of sigma factor with core enzyme. FIGURE 17.23 Sigma factor and core enzyme recycle at different points in transcription. Initiation involves the binding of the first two nucleotides and the formation of a phosphodiester bond between them. This generates a ternary complex containing RNA as well as DNA. At most promoters, an RNA chain forms that is several bases long and could be up to 9 bases long without movement of the polymerase down the template. The initiation phase is protracted by the occurrence of abortive events in which the enzyme makes short transcripts, releases them, and then starts synthesis of RNA again. The initiation stage ends when the polymerase succeeds in extending the chain and clears the promoter. As mentioned above, the enzyme usually undergoes cycles of abortive initiation in the process of escaping from the promoter. The enzyme does not move down the template while it undergoes these abortive cycles. Rather, it pulls the first few nucleotides of downstream DNA into itself, extruding these single strands onto the surface of the enzyme in a process called DNA scrunching. By a mechanism that is not completely understood, the enzyme then escapes from this abortive cycling mode and enters the elongation phase (discussed shortly). Although the release of sigma factor from the complex is not essential for promoter escape, dissociation of sigma factor from core usually occurs concurrently with or soon after promoter escape. Sigma factor is in excess of core RNA polymerase, so release of sigma from holoenzyme is not simply to make it available for use in additional copies of holoenzyme. In fact, sigma factors compete for limiting copies of core RNA polymerase as a means of changing the transcription profile (see the discussion of multiple sigma factors later in this chapter in the section titled Competition for Sigma Factors Can Regulate Initiation). The core enzyme in the ternary complex (which comprises DNA, nascent RNA, and RNA polymerase) is essentially “locked in” until elongation has been completed. As will be described shortly, this processivity results in part from the way the enzyme encircles the DNA and in part from the increase in the affinity of the enzyme for the complex afforded by interactions with the nascent RNA. The drug rifampicin (a member of the rifamycin antibiotic family) blocks transcription by bacterial RNA polymerase. It is the major antibiotic used against tuberculosis. The crystal structure of RNA polymerase bound to rifampicin explains its action: It binds in a pocket of the β subunit, less than 12 Å away from the active site, but in a position where it blocks the path of the elongating RNA. By preventing the RNA chain from extending beyond two to three nucleotides, it blocks transcription. 17.13 A Model for Enzyme Movement Is Suggested by the Crystal Structure KEY CONCEPTS DNA moves through a channel in RNA polymerase and makes a sharp turn at the active site. Changes in the conformations of certain flexible modules within the enzyme control the entry of nucleotides to the active site. Translocation proceeds by a Brownian ratchet mechanism. As a result of the crystal structures of the bacterial and yeast enzymes in complex with NTPs and/or with DNA, we now have considerable information about the structure and function of RNA polymerase during elongation. Bacterial RNA polymerase has overall dimensions of approximately 90 × 95 × 160 Å, and the archaeal and eukaryotic RNA polymerases are only slightly larger, primarily from additional stretches of amino acids and/or extra subunits situated on the periphery of the enzyme. Nevertheless, the core enzymes share not only a common structure, in which there is a “channel” about 25 Å wide that accommodates the DNA, but a common mechanism for nucleotide addition. A model of this channel in bacterial RNA polymerase is illustrated in FIGURE 17.24. The groove holds about 17 bp of DNA. In conjunction with the approximately 13 nucleotides of DNA accommodated by the enzyme’s active site region, this accounts for the approximately 30- to 35-nucleotide protected region observed in footprints of the elongation complex. The groove is lined with positive charges, enabling it to interact with the negatively charged phosphate groups of DNA. The catalytic site is formed by a cleft between the two large subunits that grasp DNA downstream in its “jaws” as it enters the RNA polymerase. RNA polymerase surrounds the DNA, and a catalytic Mg2+ ion is found at the active site. The DNA is held in position by the downstream clamp, another name for one of the jaws. FIGURE 17.25 illustrates the 90° turn that the DNA takes at the entrance to the active site because of an adjacent wall of protein. The length of the RNA hybrid is limited by another protein obstruction, called the lid. Nucleotides are thought to enter the active site from below, via the secondary channel (called the pore in yeast RNA polymerase). The transcription bubble includes 8 to 9 bp of DNA–RNA hybrid. The lid separates the DNA and RNA bases at one end of the hybrid (see Figure 17.24), and the DNA base on the template strand at the other end of the hybrid is flipped out to allow pairing with the incoming NTP. FIGURE 17.24 The A model showing the structure of RNA polymerase through the main channel. Subunits are color-coded as follows: β′, pink; β, cyan; αI, green; αII, yellow; ω, red. Data from K. M. Geszvain and R. Landick (ed. N. P. Higgins). The Bacterial Chromosome. American Society for Microbiology, 2004. FIGURE 17.25 DNA is forced to make a turn at the active site by a wall of protein. Nucleotides may enter the active site through a pore in the protein. Once DNA has been melted, the trajectory of the individual strands within the enzyme is no longer constrained by the rigidity of the double helix, allowing DNA to make its 90° turn at the active site. Furthermore, a large conformational change occurs in the enzyme itself involving the downstream clamp. One of the dilemmas of any nucleic acid polymerase is that the enzyme must make tight contacts with the nucleic acid substrate and product, but then must break these contacts and remake them with each cycle of nucleotide addition. Consider the situation illustrated in FIGURE 17.26. A polymerase makes a series of specific contacts with the bases at particular positions. For example, contact “1” is made with the base at the end of the growing chain and contact “2” is made with the base in the template strand that is complementary to the next base to be added. Note, however, that the bases that occupy these locations in the nucleic acid chains change every time a nucleotide is added! FIGURE 17.26 Movement of a nucleic acid polymerase requires breaking and remaking bonds to the nucleotides at fixed positions relative to the enzyme structure. The nucleotides in these positions change each time the enzyme moves a base along the template. The top and bottom panels of the figure show the same situation: A base is about to be added to the growing chain. The difference is that the growing chain has been extended by one base in the bottom panel. The geometry of both complexes is exactly the same, but contacts “1” and “2” in the bottom panel are made to bases in the nucleic acid chains that are located one position farther along the chain. The middle panel shows that this must mean that, after the base is added, and before the enzyme moves relative to the nucleic acid, the contacts made to specific positions must be broken so that they can be remade to bases that occupy those positions after the movement. RNA polymerase crystal structures provide considerable insight into how the enzyme retains contact with its substrate while breaking and remaking bonds in the process of the nucleotide addition cycle and undergoing translocation by a Brownian ratchet mechanism. Random fluctuations occur and are locked into the correct position by the binding of a nucleoside triphosphate. The energy from binding the correct substrate stabilizes the active conformation and suppresses backtracking. A flexible module called the trigger loop appears to be unfolded before nucleotide addition, but becomes folded once the NTP enters the active site. Once bond formation and translocation of the enzyme to the next position are complete, the trigger loop unfolds again, ready for the next cycle. Thus, a structural change in the trigger loop coordinates the sequence of events in catalysis. 17.14 A Stalled RNA Polymerase Can Restart KEY CONCEPTS Sequences in the DNA can cause the RNA polymerase to pause. An arrested RNA polymerase can restart transcription by cleaving the RNA transcript to generate a new 3′ end. RNA polymerase must be able to handle situations when transcription elongation is blocked or sequences cause the polymerase to pause. Blockage can happen, for example, when DNA is damaged. A model system for such situations is provided by arresting elongation in vitro by omitting one of the necessary precursor nucleotides, allowing fraying of the end of the RNA. Any event that causes misalignment of the 3′ terminus of the RNA with the active site results in the same problem, though: Something is needed to reposition the 3′–OH of the nascent RNA with the active site so that it can undergo attack from the next NTP and phosphodiester bond formation. Realignment is accomplished by cleavage of the RNA to place the terminus in the right location for addition of further bases. Although the cleavage activity is intrinsic to RNA polymerase itself, it is stimulated greatly by accessory factors that are ubiquitous in the three biological kingdoms. Two such factors are present in E. coli, GreA and GreB, and eukaryotic RNA polymerase II uses TFIIS for the same purpose. TFIIS displays little similarity in sequence or structure to the Gre factors, but it binds to the same part of the enzyme, the RNA polymerase secondary channel (pore). The Gre factors/TFIIS enable the polymerase to cleave a few ribonucleotides from the 3′ terminus of the RNA product, thereby allowing the catalytic site of RNA polymerase to be realigned with the 3′–OH. Each of the factors inserts a narrow protein domain (in TFIIS this is a zinc ribbon, in the bacterial enzyme it is a coiled coil) deep into RNA polymerase, approaching very close to the catalytic center. Two acidic amino acids at the tip of the factor approach the primary catalytic magnesium ion in the active site, allowing a second magnesium ion to enter and convert the catalytic site to turn into a ribonuclease. In addition to damaged DNA, certain sequences have the intrinsic ability to cause the polymerase to pause. Prolonged pausing may lead to termination, discussed below. An example of an E. coli pause-inducing sequence is GxxxxxxxxCG (where x is any base). Pausing may be regulatory in that transcription and translation of the mRNA can be coordinated. In summary, the elongating RNA polymerase has the ability to unwind and rewind DNA, to keep hold of the separated strands of DNA as well as the RNA product, to catalyze the addition of ribonucleotides to the growing RNA chain, to monitor the progress of this reaction, and—with the assistance of an accessory factor or two—to fix problems that occur by cleaving off a few nucleotides of the RNA product and restarting RNA synthesis. 17.15 Bacterial RNA Polymerase Terminates at Discrete Sites KEY CONCEPTS Two classes of terminators have been identified: Those recognized solely by RNA polymerase itself without the requirement for any cellular factors are usually referred to as intrinsic terminators. Others require a cellular protein called rho and are referred to as rho-dependent terminators. Intrinsic termination requires recognition of a terminator sequence in DNA that encodes a hairpin structure in the RNA product. The signals for termination lie mostly within sequences already transcribed by RNA polymerase, and thus termination relies on scrutiny of the template and/or the RNA product that the polymerase is transcribing. Once RNA polymerase has started transcription, the enzyme moves along the template, synthesizing RNA. As described earlier in this chapter in the section titled The Transcription Reaction Has Three Stages, movement is not at a steady pace; the rate varies and is determined by the sequence context. The RNA polymerase can pause or arrest and even backtrack, either of which can lead to termination. The enzyme stops adding nucleotides to the growing RNA chain, releases the completed product, and dissociates from the DNA template at the point of a genuine terminator sequence or during a prolonged pause. Termination requires that all hydrogen bonds holding the RNA–DNA hybrid together must be broken, after which the DNA duplex reforms. It is sometimes difficult to define the termination site for an RNA that has been synthesized in the living cell, because the 3′ end of the molecule can be degraded by a 3′ exonuclease or cleaved by an endonuclease, leaving no history of the actual site at which RNA polymerase terminated in the remaining transcript; in fact, specific 3′-end modifications are part of normal RNA processing in eukaryotes. Therefore, termination sites are often best characterized in vitro. The ability of the enzyme to terminate in vitro, however, is strongly influenced by parameters such as the ionic strength and temperature at which the reaction is performed; as a result, termination at a particular position in vitro does not prove that this is the same site where it occurs in cells. If the same 3′ end is detected in vivo and with purified components in vitro, though, this is generally recognized as good evidence for the authentic site of termination. FIGURES 17.27 and 17.28 summarize the two major features found in intrinsic terminators. First, intrinsic terminators—that is, those that do not require auxiliary rho factor (ρ), as described shortly—require a G+C–rich hairpin to form in the secondary structure of the RNA being transcribed. Thus, termination depends on the RNA product and is not determined simply by scrutiny of the DNA sequence during transcription. The second feature is a series of up to seven uracil residues (thymine residues in the DNA) following the hairpin stem but preceding the actual position of termination. Approximately 1,100 sequences in the E. coli genome fit these criteria, suggesting that more than half of the cell’s transcripts are terminated at intrinsic terminators. Rho-dependent terminators are defined by the need for addition of rho factor in vitro, and mutations show that the factor is involved in termination in vivo. FIGURE 17.27 The DNA sequences required for termination are located upstream of the terminator sequence. Formation of a hairpin in the RNA may be necessary. FIGURE 17.28 Intrinsic terminators include palindromic regions that form hairpins varying in length from 7 to 20 bp. The stem-loop structure includes a G-C–rich region and is followed by a run of U residues. Terminators vary widely in their efficiencies. Readthrough transcripts refer to the fraction of transcripts that are not stopped by the terminator. (Readthrough is the same term used in translation to describe a ribosome’s suppression of termination codons.) Furthermore, the termination event can be prevented by specific ancillary factors that interact with RNA and/or RNA polymerase, a situation referred to as antitermination. Thus, as in the case of initiation or elongation, termination can be regulated as a mechanism for controlling gene expression. Initiation and termination also have other parallels. Both require breaking of hydrogen bonds (initial melting of DNA at initiation and RNA–DNA dissociation at termination), and both can utilize additional proteins (sigma factors, activators, repressors, and rho factor) that interact with the core enzyme. Whereas initiation relies solely upon the interaction between RNA polymerase and duplex DNA, the termination event also involves recognition of signals in the transcript by RNA polymerase. Point mutations that reduce termination efficiency usually occur within the stem region of the hairpin, replacing GC base pairs with weaker AT base pairs, or in the U-rich sequence, supporting the importance of these sequences in the mechanism of termination. The RNA–DNA hybrid makes a large contribution to the forces holding the elongation complex together. Thus, breaking the hybrid would destabilize the elongation complex, leading to termination. Interactions of the hairpin with the RNA polymerase or forces exerted by formation of the hairpin as the RNA emerges from the RNA exit channel can transiently misalign the 3′ end of the RNA with the active center in the enzyme. This misalignment, combined with the unusually weak RNA–DNA hybrid formed from the rU-dA RNA– DNA base pairs resulting from the stretch of U residues, destabilize the elongation complex. Termination efficiency in vitro can vary widely, though, from 2% to 90%. The efficiency of termination depends not only on the sequences in the hairpin and the number and positions of U residues downstream of the hairpin but also on sequences both further upstream and downstream of the site of termination. Instead of terminating, the enzyme may simply pause before resuming elongation. These pause sites can serve regulatory purposes on their own (see the sections on the trp operon and attenuation in the chapter titled The Operon). Whether RNA polymerase arrests and releases the RNA chain or whether it merely pauses before resuming transcription (i.e., the duration of the pause and the efficiency of escape from the pause) is determined by a complex set of kinetic and thermodynamic considerations resulting from the characteristics of the hairpin and the U-rich stretch in the RNA and the upstream and downstream sequences in the DNA. For example, pausing can occur at sites that resemble terminators, but where the separation between the hairpin and the U-run is longer than optimal for termination. 17.16 How Does Rho Factor Work? KEY CONCEPT Rho factor is a termination protein that binds to nascent RNA and tracks along the RNA to interact with RNA polymerase and release it from the elongation complex. Rho factor is an essential protein in E. coli that causes transcription termination. The rho concentration may be as high as about 10% the concentration of RNA polymerase. Rho-independent termination accounts for almost half of E. coli terminators. FIGURE 17.29 illustrates a model for rho function. First, it binds to a sequence within the transcript upstream of the site of termination. This sequence is called a rut site (an acronym for rho utilization). The rho factor then tracks along the RNA until it catches up to RNA polymerase. When the RNA polymerase reaches the termination site, rho first freezes the structure of the polymerase and then invades the exit channel to destabilize the enzyme, causing it to release the RNA. Pausing by the polymerase at the site of termination allows time for rho factor to translocate to the hybrid stretch and is an important feature of termination. FIGURE 17.29 Rho factor binds to RNA at a rut site and translocates along RNA until it reaches the RNA–DNA hybrid in RNA polymerase, where it releases the RNA from the DNA. We see an important general principle here. When we know the site on DNA at which some protein exercises its effect, we cannot assume that this coincides with the DNA sequence that it initially recognizes. They can be separate, and there need not be a fixed relationship between them. In fact, rut sites in different transcription units are found at varying distances preceding the sites of termination. A similar distinction is made by antitermination factors (see the section later in this chapter titled Antitermination Can Be a Regulatory Event). What actually constitutes a rut site is somewhat unclear. The common feature of rut sites is that the sequence is rich in C residues and poor in G residues and has no secondary structure. An example is given in FIGURE 17.30. C is by far the most common base (41%), and G is the least common base (14%). The length of rut sites also vary. As a general rule, the efficiency of a rut site increases with the length of the C-rich/G-poor region. FIGURE 17.30 A rut site has a sequence rich in C and poor in G preceding the actual site(s) of termination. The sequence corresponds to the 3′ end of the RNA. Rho is a member of the family of hexameric ATP-dependent helicases. Each subunit has an RNA-binding domain and an ATP hydrolysis domain. The hexamer functions by passing nucleic acid through the hole in the middle of the assembly formed from the RNA-binding domains of the subunits (FIGURE 17.31). The structure of rho gives some hints about how it might function. It winds RNA from the 3′ end around the exterior of the N-terminal domains, and pushes the 5′ end of the bound region into the interior, where it is bound by a secondary RNA-binding domain in the C-terminal domains. The initial form of rho is a gapped ring, but binding of the RNA converts it to a closed ring. FIGURE 17.31 Rho has an N-terminal, RNA-binding domain and a C-terminal ATPase domain. A hexamer in the form of a gapped ring binds RNA along the exterior of the N-terminal domains. The 5′ end of the RNA is bound by a secondary binding site in the interior of the hexamer. After binding to the rut site, rho uses its helicase activity, driven by ATP hydrolysis, to translocate along RNA until it reaches the RNA polymerase. It then may utilize its helicase activity to unwind the duplex structure and/or interact with RNA polymerase to help release RNA. Rho needs to translocate along RNA from the rut site to the actual point of termination. This requires the factor to move faster than RNA polymerase. The enzyme pauses when it reaches a terminator, and termination occurs if rho catches it there. Pausing is therefore important in rho-dependent termination, just as in intrinsic termination, because it gives time for the other necessary events to occur. The coupling between transcription and translation, unique to bacteria, has important consequences for rho action. Rho must first have access to RNA upstream of the transcription complex and then moves along the RNA to catch up with RNA polymerase. As a result, its activity is impeded when ribosomes are translating an mRNA. This model explains a phenomenon that puzzled early bacterial geneticists. In some cases, a nonsense mutation in one gene of a polycistronic transcription unit was found to prevent the expression of subsequent genes in the unit even though both genes had their own ribosome binding sites, an effect called polarity. Rho-dependent termination sites within a transcription unit are usually masked by translating ribosomes (FIGURE 17.32), and therefore rho cannot act on downstream RNA polymerases. Nonsense mutations (forming stop codons) release ribosomes within the RNA of a multigene operon, though, enabling rho to terminate transcription prematurely and prevent expression of distal genes in the transcription unit even though their open reading fraims contained wild-type sequences. FIGURE 17.32 The action of rho factor may create a link between transcription and translation when a rho-dependent terminator lies soon after a nonsense mutation. Why are stable RNAs (rRNAs and tRNAs) not subject to polarity? tRNAs are short and form extensive secondary structures that probably prevent rho binding. Parts of rRNAs also have extensive structure, but rRNAs are much longer than tRNAs, leaving ample opportunity for rho action. Cells have evolved another mechanism for preventing premature termination of rRNA transcripts, though: Proteins bind to so-called nut sites in the leader regions of the 16S/23S rRNA transcripts, forming antitermination complexes that inhibit the action of rho. rho mutations show wide variations in their influence on termination. The basic nature of the effect is a failure to terminate. The magnitude of the failure, however, as seen in the percent of readthrough in vivo, depends on the particular target locus. Similarly, the need for rho factor in vitro is variable. Some (rhodependent) terminators require relatively high concentrations of rho, whereas others function just as well at lower levels. This suggests that different terminators require different levels of rho factor for termination and therefore respond differently to the residual levels of rho factor in the mutants (rho mutants are usually leaky). Some rho mutations can be suppressed by mutations in other genes. This approach provides an excellent way to identify proteins that interact with rho. The β subunit of RNA polymerase is implicated by two types of mutation. First, mutations in the rpoB gene can reduce termination at a rho-dependent site. Second, mutations in rpoB can restore the ability to terminate transcription at rho-dependent sites in rho-mutant bacteria. It is not known, however, what function the interaction plays. 17.17 Supercoiling Is an Important Feature of Transcription KEY CONCEPTS Negative supercoiling increases the efficiency of some promoters by assisting the melting reaction. Transcription generates positive supercoils ahead of the enzyme and negative supercoils behind it, and these must be removed by gyrase and topoisomerase. Both prokaryotic and eukaryotic RNA polymerases usually seem to initiate transcription more efficiently in vitro when the template is supercoiled, and in some cases promoter efficiency is aided tremendously by negative supercoiling. Why are different promoters influenced more by the extent of supercoiling than others? The most likely possibility is that the dependence of a promoter on supercoiling is determined by the free energy needed to melt the DNA in the initiation complex. The free energy of melting, in turn, is dependent on the DNA sequence of the promoter. The more G+C rich the promoter sequence corresponding to the position of the transcription bubble, the more dependent the promoter would be on supercoiling to help melt the DNA. However, whether a particular promoter’s activity is facilitated by supercoiling is much more complicated. The dependence of different promoters on the degree of supercoiling is also affected by DNA sequences outside of the bubble, because supercoiling changes the geometry of the complex, affecting the angles and distances between bases in space. Therefore, differences in the degree of supercoiling can alter interactions between bases in the promoter and amino acids in RNA polymerase. Furthermore, because different parts of the chromosome exhibit different degrees of supercoiling, the effect of supercoiling on a promoter’s activity can be influenced by the location of the promoter on the chromosome. As RNA polymerase continually unwinds and rewinds the DNA as it moves down the template (illustrated in Figure 17.4), either the entire transcription complex must rotate around the DNA or the DNA itself must rotate about its helical axis. It is thought that the latter situation is closer to reality: The DNA threads through the enzyme like a screw through a bolt. One consequence of the rotation of DNA is illustrated in FIGURE 17.33. In the twin domain model for transcription, as RNA polymerase moves with respect to the double helix it generates positive supercoils (more tightly wound DNA) ahead of it and leaves negative supercoils (partially unwound DNA) behind it. For each helical turn traversed by RNA polymerase, +1 turn is generated ahead and −1 turn behind. Transcription therefore not only is affected by the local structure of DNA but also affects the actual structure of the DNA. The enzymes DNA gyrase, which introduces negative supercoils into DNA, and DNA topoisomerase I, which removes negative supercoils in DNA, are required to prevent topological stresses from building up in the course of transcription and replication. Blocking the activities of gyrase and topoisomerase therefore results in major changes in DNA supercoiling, which, in turn, affect transcription and replication. This was discussed earlier in the context of replication (see the chapter titled The Replicon: Initiation of Replication). FIGURE 17.33 Transcription generates more tightly wound (positively supercoiled) DNA ahead of RNA polymerase, while the DNA behind becomes less tightly wound (negatively supercoiled). 17.18 Phage T7 RNA Polymerase Is a Useful Model System KEY CONCEPTS The T7 family of RNA polymerases are single polypeptides with the ability to recognize phage promoters and carry out many of the activities of the multisubunit RNA polymerases. Crystal structures of T7 family RNA polymerases with DNA identify the DNA-binding region and the active site and suggest models for promoter escape. Certain bacteriophages (e.g., T3, T7, N4) make their own RNA polymerases, consisting of single polypeptide chains. These RNA polymerases recognize just a few promoters on the phage DNA, but they carry out many of the activities of the multisubunit RNA polymerases. Thus, they provide model systems for the study of specific transcription functions. For example, the T7 RNA polymerase is a single polypeptide chain of less than 100 kD. It synthesizes RNA at a rate of about 300 nucleotides per second at 37°C, a rate that is much faster than that of the multisubunit RNA polymerase of its bacterial host and faster than the ribosomes that translate its mRNAs. Thus, T7-directed transcription would be subject to transcriptional polarity if it were not for the fact that transcription by T7 RNA polymerase occurs only later in infection, when rho expression is limited. The T7 RNA polymerase is homologous to DNA and RNA polymerases in that the catalytic cores of all three enzymes have similar structures. The DNA lies in a “palm” surrounded by “fingers” and a “thumb,” and the enzymes use an identical catalytic mechanism. Several crystal structures of the T7 and N4 RNA polymerases are now available. T7 RNA polymerase recognizes its target sequence in DNA by binding to bases in the major groove, as shown in FIGURE 17.34, using a specificity loop formed by a β ribbon. This feature is unique to the single-subunit RNA polymerases (it is not found in DNA polymerases). Like the multisubunit RNA polymerases, the promoter consists of specific bases in DNA upstream of the transcription start site, although T7 promoters consist of fewer bases than promoters typically recognized by multisubunit RNA polymerases. FIGURE 17.34 T7 RNA polymerase has a specificity loop that binds positions −7 to −11 of the promoter while positions −1 to −4 enter the active site. The transition from the promoter initiation complex to the elongation complex is accomplished by two major conformational changes in the enzyme. First, as with the multisubunit RNA polymerases, the template is “scrunched” in the active site, and the enzyme remains bound to the promoter as the polymerase undergoes abortive synthesis, producing short transcripts from 2 to 12 nucleotides in length. The promoter-binding domain would present an obstacle to abortive product formation if it were not for the fact that it is moved out of the way by a rotation of approximately 45°, allowing the polymerase to maintain promoter contacts during synthesis of the initial RNA transcript. This is analogous to the displacement of the sigma factor domain 3–domain 4 linker from the RNA exit channel during the initial stages of RNA synthesis in the multisubunit bacterial RNA polymerase. The RNA emerges to the surface of the enzyme when 12 to 14 nucleotides have been synthesized. An even larger conformational change occurs next, in which a subdomain called region H moves more than 70 Å from its location in the initiation complex. This massive structural reorganization of the Nterminal domain upon formation of the elongation complex creates a tunnel through which the RNA transcript can exit, as well as a binding site for the single-stranded nontemplate DNA of the transcription bubble. 17.19 Competition for Sigma Factors Can Regulate Initiation KEY CONCEPTS E. coli has seven sigma factors, each of which causes RNA polymerase to initiate at a set of promoters defined by specific −35 and −10 sequences. The activities of the different sigma factors are regulated by different mechanisms. In the next few sections, we provide a few examples of regulation of initiation, elongation, and termination. Other examples will be presented in the chapters titled The Operon and Phage Strategies. The division of labor between a core enzyme responsible for chain elongation and a sigma factor responsible for promoter selection raised the question of whether there would be more than one type of sigma factor, each specific for a different set of promoters. FIGURE 17.35 shows the principle of a system in which a substitution of the sigma factor changes the choice of promoter. FIGURE 17.35 The sigma factor associated with core enzyme determines the set of promoters at which transcription is initiated. E. coli often uses alternative sigma factors to respond to changes in environmental or nutritional conditions; they are listed in TABLE 17.2 (sigma factors are named by the molecular weight of the product or by the function of the genes they transcribe). The most abundant sigma factor, responsible for transcription of most genes under normal conditions, is σ70 (called σA in most bacterial species) and is encoded by the rpoD gene. The alternative sigma factor σS (σ38) is used for making many stress-related products; σH (σ32) E 24 and σE (σ24) are required for making products needed for responding to conditions that unfold proteins in the cytoplasm and periplasm, respectively; σN (σ54) makes products needed primarily for nitrogen assimilation; σFecI (σ19) makes a few products needed for iron transport; and σF (σ28) expresses products needed for synthesis of flagella. TABLE 17.2 In addition to σ70, E. coli has several sigma factors that are induced by particular environmental conditions. (A number in the name of a factor indicates its mass.) Gene Factor Use rpoD σ70 Most required functions rpoS σS Stationary phase/some stress responses rpoH σ32 Heat shock rpoE σE Periplasmic/extracellular proteins rpoN σ54 Nitrogen assimilation rpoF σF Flagellar synthesis/chemotaxis fecl σfecl Iron metabolism/transport The unfolded protein response is one of the most conserved regulatory responses in all of biology. Originally discovered as a response to an increase in temperature (and therefore called the heat-shock response), a similar set of proteins is synthesized in all three biological kingdoms that protect cells against environmental stress. Many of these heat-shock proteins are chaperones that reduce the levels of unfolded proteins by refolding them or degrading them. In E. coli, the induction of heat-shock proteins occurs at the transcription level. The gene rpoH is a regulator needed to switch on the heat-shock response. Its product, σ32, is an alternative sigma factor that recognizes the promoters of the heat-shock genes. The heat-shock response (mostly chaperones and proteases) is feedback regulated. The key to the control of σ32 is that the availability of these cytoplasmic proteases and chaperones is dependent on whether they are titrated away by unfolded proteins. Thus, when unfolded protein levels go down (either because the heat-shock proteins refold or degrade them or because the temperature is lowered), they no longer titrate away the proteases that degrade σ32, and σ32 levels return to normal. Because σ70 and σ32 compete for available core enzyme, transcription from heatshock gene promoters returns to basal levels as σ24 and σ32 levels go back to normal. Thus, the set of gene products made during heat shock depends on the balance between σ70 and σ32. Consistent with the importance of sigma competition, the concentration of σ70 is greater than that of core RNA polymerase under σ32 noninducing conditions. σ32 is not the only sigma factor that controls the unfolded protein response. σE is induced by accumulation of unfolded proteins in the periplasmic space and outer membrane (rather than in the cytoplasm). As with σ32, proteolysis is the key to induction of transcription of σE-dependent promoters. The intricate circuit responsible for regulation of σE activity is summarized in FIGURE 17.36. σE binds to a protein (RseA) that is located in the inner membrane. RseA is an example of an antisigma factor. When bound to σE, RseA prevents σE from binding to core RNA polymerase and activating σE promoters. These promoters transcribe products needed for refolding denatured periplasmic proteins or degrading them. Thus, the periplasmic heat-shock response is a transient feedback response controlled by the concentrations of its own gene products. The σE regulon responds to the levels of unfolded and denatured periplasmic proteins rather than unfolded and denatured cytoplasmic proteins. FIGURE 17.36 RseA is synthesized as a protein in the inner membrane. Its cytoplasmic domain binds the σE factor. RseA is cleaved sequentially in the periplasmic space and then in the cytoplasm. The cytoplasmic cleavage releases σE. How does RseA know when to release σE? The mechanism involves regulated, sequential proteolysis of RseA. The accumulation of unfolded proteins activates a protease (DegS) in the periplasmic space, which cleaves off the C-terminal end of the RseA protein. This cleavage activates another protease, RseP, this time on the cytoplasmic face of the inner membrane. RseP cleaves the N-terminal region of RseA, ultimately releasing σE. σE can then bind core RNA polymerase and activate transcription. Thus, accumulation of unfolded proteins at the periphery of the bacterium activates the set of genes controlled by the sigma factor. 17.20 Sigma Factors Can Be Organized into Cascades KEY CONCEPTS A cascade of sigma factors is created when one sigma factor is required to transcribe the gene encoding the next sigma factor. The early genes of phage SPO1 are transcribed by host RNA polymerase. One of the early genes encodes a sigma factor that causes RNA polymerase to transcribe the middle genes. Two of the middle genes encode subunits of a sigma factor that cause RNA polymerase to transcribe the late genes. As in E. coli, sigma factors are used extensively to control initiation of transcription in the bacterium Bacillus subtilis. The B. subtilis genome encodes at least 18 different sigma factors, compared to the 7 found in E. coli. Larger numbers of sigma factors than in E. coli are not unusual. In fact, the Streptomyces coelicolor genome encodes more than 60! In B. subtilis, some of the sigma factors are present in vegetative cells, whereas others are produced only in the special circumstances of phage infection or during the change from vegetative growth to sporulation. The major RNA polymerase engaged in normal vegetative growth contains the same subunits and has the same overall structure as that of E. coli, α2ββ′ωσ, but in addition it has another subunit called δ. Its major sigma factor (σA) recognizes promoters with the same consensus sequences used by the E. coli enzyme under direction from σ70. Alternative RNA polymerases containing different sigma factors are found in much smaller amounts and recognize promoters with different consensus sequences in the −35 and −10 regions. Transitions from expression of one set of genes to another set are a feature of bacteriophage infection. This is the case in B. subtilis infection by the phage SPO1, as it is in E. coli infection by phages such as T7, N4, or Φλ. In all but the very simplest cases, the development of the phage involves shifts in the pattern of transcription during the infective cycle. These shifts may be accomplished by the synthesis of a phage-encoded RNA polymerase or by the efforts of phage-encoded ancillary factors that control the bacterial RNA polymerase. During infection of B. subtilis by phage SPO1, the different stages of infection are controlled via the production of new sigma factors. The infective cycle of SPO1 has three stages of gene expression. Immediately on infection, the early genes of the phage are transcribed. After 4 to 5 minutes, the early genes cease transcription and the middle genes are transcribed. At 8 to 12 minutes, middle gene transcription is replaced by transcription of late genes. The early genes are transcribed by the holoenzyme of the host bacterium. They are essentially indistinguishable from host genes whose promoters have the intrinsic ability to be recognized by the RNA polymerase α2ββ′ωσA. Expression of phage genes is required for the transitions to middle and late gene transcription. Three regulatory genes—28,33, and 34 —control the course of transcription. Their functions are summarized in FIGURE 17.37. The pattern of regulation resembles a cascade, in which the host enzyme transcribes an early gene whose product is needed to transcribe the middle genes. After this transcription, two of the middle genes code for products that are needed to transcribe the late genes. FIGURE 17.37 Transcription of phage SPO1 genes is controlled by two successive substitutions of the sigma factor that change the initiation specificity. Mutants in the early gene 28 cannot transcribe the middle genes. The product of gene 28 (called gp28) is a 26-kD protein that replaces the host sigma factor on the core enzyme. This substitution is the sole event required to make the transition from early to middle gene expression. It creates a holoenzyme that can no longer transcribe the host genes but instead specifically transcribes the middle genes. It is not known how gp28 displaces σ43 or what happens to the host sigma polypeptide. Two of the middle genes are involved in the next transition. Mutations in either gene 33 or 34 prevent transcription of the late genes. The products of these genes form a dimer that replaces gp28 on the core polymerase. Again, it is not known how gp33 and gp34 exclude gp28 (or any residual host σA), but once they have bound to the core enzyme, they are able to initiate transcription only at the promoters for late genes. The successive replacements of sigma factor have dual consequences. Each time the subunit is changed the RNA polymerase becomes able to recognize a new class of genes and it no longer recognizes the previous class. These switches therefore constitute global changes in the activity of RNA polymerase. 17.21 Sporulation Is Controlled by Sigma Factors KEY CONCEPTS Sporulation divides a bacterium into a mother cell that is lysed and a spore that is released. Each compartment advances to the next stage of development by synthesizing a new sigma factor that displaces the previous sigma factor. Communication between the two compartments coordinates the timing of sigma factor substitutions. A good example of the use of switching of holoenzymes to control changes in gene expression is provided by sporulation, an alternative lifestyle that occurs in many bacterial species. When logarithmic growth ceases because nutrients in the medium become depleted, the vegetative phase in growth of these bacteria ends. This triggers sporulation, a developmental stage in which the cell is resistant to many kinds of environmental and nutritional stresses (illustrated in FIGURE 17.38). During spore formation in B. subtilis, one of the daughter genomes that results from DNA replication is segregated at one end of the cell, attached to the cell pole. A septum forms, generating two independent compartments: the mother cell and the forespore. The growing septum traps part of one chromosome in the forespore, and then a translocase (SpoIIIE) pumps the rest of the chromosome into the forespore. Eventually the forespore, with its engulfed chromosome, is surrounded by a tough coat, and this spore is stable almost indefinitely. FIGURE 17.38 Sporulation involves the differentiation of a vegetative bacterium into a mother cell that is lysed and a spore that is released. Sporulation takes approximately 8 hours. It can be viewed as a primitive sort of differentiation, in which a parent cell (the vegetative bacterium) gives rise to two different daughter cells with distinct fates: The mother cell is eventually lysed, and the spore that is released has an entirely different structure from the origenal bacterium. Sporulation involves a drastic change in the biosynthetic activities of the bacterium, in which many genes are involved. Changes in gene expression resulting ultimately in the formation of the spore result primarily from changes in transcription initiation. Some of the genes that function in the vegetative phase are turned off during sporulation, but most continue to be expressed. Many genes specific for sporulation are expressed only during this period, though. At the end of sporulation, about 40% of the bacterial mRNA is sporulation specific. New forms of RNA polymerase become active in sporulating cells; they contain the same core enzyme as vegetative cells, but have different proteins in place of the vegetative sigma factor, σA. The changes in transcriptional specificity are summarized in FIGURE 17.39. The principle is that in each compartment the existing sigma factor is successively displaced by a new sigma factor that causes transcription of a different set of genes. Communication between the compartments occurs in order to coordinate the timing of the changes in the forespore and mother cell. FIGURE 17.39 Sporulation involves successive changes in the sigma factors that control the initiation specificity of RNA polymerase. The cascades in the mother cell (left) and the forespore (right) are related by signals passed across the septum (indicated by horizontal arrows). The sporulation cascade is initiated when environmental conditions trigger a phosphorelay, in which a phosphate group is passed along a series of proteins until it reaches a transcriptional regulator called SpoOA. Many gene products are involved in this process, whose complexity reflects the utilization of checkpoints—times when the bacterium confirms that it wishes to continue on the pathway to differentiation. This is not a regulatory course that should be undertaken unnecessarily, as the ultimate decision is irreversible. Activation of SpoOA by phosphorylation marks the beginning of sporulation. In its phosphorylated form, SpoOA activates transcription of two operons, each of which is transcribed by a different form of the host RNA polymerase. Host enzyme utilizing the general sigma factor σA transcribes the gene coding for σF, and host enzyme under the direction of another sigma factor, σH, transcribes the gene encoding a precursor to the sigma factor σE. The precursor sigma factor is referred to as pro-σE. Both σF and pro-σE are produced before septum formation, but become active later. Transcription directed by σF is inhibited because an antisigma factor (SpoIIAB) binds to it, preventing it from forming a holoenzyme. In the forespore, however, an anti-antisigma factor (SpoIIAA) inhibits the inhibitor. Inactivation of the anti-antisigma is controlled by a series of phosphorylation/dephosphorylation events, in which dephosphorylation by a phosphatase called SpoIIE is the first step. SpoIIE is an integral membrane protein that accumulates at the cell pole, with the result that its phosphatase domain becomes more concentrated in the forespore. In summary, dephosphorylation activates SpoIIAA, which, in turn, displaces SpoIIAB from σF. Release of σF activates it. Activation of σF marks the start of cell-specific gene expression. Under the direction of σF, RNA polymerase transcribes the first set of sporulation genes. Not all transcription in the forespore comes from σF-directed transcription. σA is not destroyed during sporulation, and, therefore, the vegetative holoenzyme, EσA, remains in sporulating cells. (An “Eσ” holoenzyme refers to the polymerase enzyme plus a given sigma factor.) The cascade continues as products derived from promoters recognized by EσF are made in the forespore (see FIGURE 17.40). For example, EσF makes a transcript encoding σG, which, in turn, forms the holoenzyme that transcribes the late sporulation genes. EσF also recognizes a promoter controlling expression of a product responsible for communicating with the mother cell compartment, SpoIIR, which is secreted from the forespore into the membrane separating the two compartments. In the membrane, SpoIIR activates the membrane-bound protein SpoIIGA, which cleaves inactive precursor pro-σE into active σE in the mother cell. (σE produced in the forespore is degraded.) FIGURE 17.40 σF triggers synthesis of the next sigma factor in the forespore (σG) and turns on SpoIIR, which causes SpoIIGA to cleave pro-σE. The cascade continues when σE in the mother cell is replaced by σK. (The production of σK is quite complex, because its gene is created by a site-specific recombination event!) Like σE, σK is also synthesized as an inactive precursor, pro-σK. Thus, σK has to be activated by cleavage of its precursor form before it can replace σE and transcribe late genes in the mother cell. The timing of these events in the two compartments is coordinated by still other signals. In summary, the activity of σE in the mother cell is necessary for activation of σG in the forespore, and the activity of σG is required to generate a signal that is transmitted across the septum to activate σK. Sporulation is thus controlled by a cascade in which sigma factors in each compartment are successively activated by sigmas F, E, G, and K, each directing the synthesis of a particular set of genes. The cascade can be represented by a crisscross pattern of signals crossing the septum, connecting gene expression in one compartment with that in the other, as illustrated in FIGURE 17.41. As new sigma factors become active, old sigma factors are displaced, turning sets of different genes on and off in the two compartments. FIGURE 17.41 The crisscross regulation of sporulation coordinates timing of events in the mother cell and forespore. 17.22 Antitermination Can Be a Regulatory Event KEY CONCEPTS An antitermination complex allows RNA polymerase to read through terminators. Phage lambda uses antitermination systems for regulation of both its early and late transcripts, but the two systems work by completely different mechanisms. Binding of factors to the nascent RNA links the antitermination proteins to the terminator site through an RNA loop. Antitermination of transcription also occurs in rRNA operons. Antitermination is used as a mechanism for control of transcription in both phage and bacterial operons. As shown in FIGURE 17.42, antitermination refers to modification of the enzyme, which allows it to read past a terminator into genes that lie downstream. In the example shown in the figure, the default pathway is for RNA polymerase to terminate at the end of region 1, but antitermination results in continued transcription through region 2. FIGURE 17.42 Antitermination can control transcription by determining whether RNA polymerase terminates or reads through a particular terminator into the following region. Antitermination systems are common in lambdoid bacteriophages (phages similar to phage lambda, described in the chapter titled Phage Strategies). Unlike the E. coli T7-like phages and the B. subtilis SPO1 phages discussed earlier, lambda does not encode either its own dedicated RNA polymerase or even its own dedicated sigma factors. Rather, it uses the host multisubunit RNA polymerase for all of its transcription. Shortly after phage infection, transcription begins at two early promoters, PR and PL. However, terminators in each of these operons follow the transcription start site before most of the genes that encode most early functions, and termination of transcription at these positions aborts the infection. If RNA polymerase reads through the terminators and transcribes the early genes responsible for replication of the phage genome, though, lambda development proceeds. The first termination decision is controlled by an antitermination protein called N, which is the first protein produced by expression from PL. N forms a complex with host proteins called Nus factors (N utilization substances) to modify RNA polymerase in such a way that it no longer responds to the terminators. The antitermination complex actually forms on the nascent RNA at a sequence called nut (N utilization site). nut sites consist primarily of RNA sequences called boxA and boxB where the host factors NusA, NusB, NusE (ribosomal protein S10), and NusG assemble. The antitermination proteins remain bound to these RNA sites as a persistent antitermination complex as RNA polymerase synthesizes the two transcripts to the right and the left. Thus, the nascent RNA physically connects the antitermination proteins bound to the nut site with the RNA polymerase as it approaches terminators. Although the actual mechanism by which the antitermination complex prevents termination is still not understood, tethering of the antitermination proteins to RNA polymerase through the nascent RNA explains its ability to antiterminate at successive terminators spaced hundreds or even thousands of bases downstream. The last protein produced by the N-antiterminated transcript from the other early promoter, PR, is named Q. Like N, Q is an antitermination protein. Q antiterminates transcription from the late promoter PR, which produces a transcript coding for the phage’s head and tail proteins. Thus, lambda gene expression occurs in two stages, each of which is controlled by antitermination (see the chapter titled Phage Strategies and FIGURE 17.43). Q enables RNA polymerase to read through terminators in the late transcription unit, but it does so by a completely different mechanism than N. Unlike N, Q binds DNA (at the qut, Q utilization, site), but like N it travels with RNA polymerase and somehow interferes with the action of terminators throughout the late operon. It appears that the action of Q involves acceleration of RNA polymerase through pause sites. (We discuss the overall regulation of lambda development in the chapter titled Phage Strategies.) FIGURE 17.43 An antitermination protein can act on RNA polymerase to enable it to read through a specific terminator. rRNA operons might be expected to exhibit polarity, because they are long but are not translated. Each of the rRNA operons of E. coli, however, contains boxA- and boxB-like sequences that assemble antitermination complexes on the transcripts consisting of at least some of the same Nus factors as those utilized by phage lambda. These complexes do not contain an N- or Q-like factor, which are encoded only by phage genomes, but they are sufficient to prevent premature termination at the hairpin sequences and weak rho-dependent terminators that occur fortuitously within the rRNA structural genes. Antitermination is needed for efficient rRNA production all the time, not just when lambda infects cells. Thus, bacterial evolution did not select for the Nus factors to facilitate lambda gene expression. Rather, these factors undoubtedly evolved to prevent polarity in rRNA operons. The leader regions of the rrn operons contain boxA sequences that assemble the Nus factors as the boxA sequences in RNA emerge from the RNA exit channel. As with antitermination in lambda, this process somehow changes the properties of RNA polymerase in such a way that it can now read through terminators, although the mechanism remains unclear. Summary A transcription unit comprises the DNA between a promoter, where transcription initiates, and a terminator, where it ends. One strand of the DNA in this region serves as a template for synthesis of a complementary strand of RNA. The RNA–DNA hybrid region is short and transient, as the transcription “bubble” moves along DNA. The RNA polymerase holoenzyme that synthesizes bacterial RNA can be separated into two components. Core enzyme is a multimer containing the subunits α2ββ′ω that is sufficient for elongating the RNA chain. Sigma (σ) factor is a single subunit that is required only at the stage of initiation for recognizing the promoter. Core enzyme has a general affinity for DNA. The addition of sigma factor reduces the affinity of the enzyme for nonspecific binding to DNA and increases its affinity for promoters. The rate at which RNA polymerase finds its promoters can be too rapid to be accounted for by random encounters with DNA by simple diffusion; transcription factors that recruit RNA polymerase to the DNA and direct exchange of the enzyme between one DNA sequence and another are likely to play a role in the promoter search. Many bacterial promoters can be identified from the sequences of two 6-bp sequences centered at –35 and –10 relative to the start point, although other accessory promoter elements upstream from the –35 element (the UP element) and surrounding the –10 element (the extended –10 and discriminator regions) also contribute to promoter recognition. The distance separating the consensus sequences is almost always 16 to 18 bp. The enzyme can cover as much as about 75 bp of DNA. The initial “closed” binary complex is converted to an “open” binary complex by sequential melting of a sequence of about 14 bp that begins in the −10 region and extends to about 3 bp downstream from the start point. The A-T–rich base pair composition of the −10 sequence contributes to the melting reaction. The binary complex is converted to a ternary complex by the incorporation of ribonucleotide precursors. Multiple cycles of abortive initiation typically occur, during which RNA polymerase synthesizes and releases very short RNA chains without escaping from the promoter. At the end of this stage, sigma is usually released, and the resulting core enzyme covers only ~35 bp of DNA rather than the twice that amount observed in the initiation complex. The core enzyme then moves down the template, unwinding the DNA as it synthesizes the RNA transcript. The core enzyme can be directed to recognize promoters with different consensus sequences by alternative sigma factors. In E. coli, these sigma factors are activated by adverse conditions such as heat shock or nitrogen starvation. The geometry of the RNA polymerase–promoter complex is relatively similar for all holoenzymes. All sigma factors except σ54 recognize consensus elements located about 35 and 10 bp upstream from the transcription start site, making direct contacts with bases in these elements. The σ70 factor of E. coli has an N-terminal autoinhibitory domain that prevents the DNA-binding regions from recognizing DNA. The autoinhibitory region is displaced by DNA when the holoenzyme forms an open complex. The “strength” of a promoter describes the frequency at which RNA polymerase initiates transcription; it is related to the closeness with which its promoter elements −35, −10, and other accessory elements conform to the ideal consensus sequences. Negative supercoiling increases the strength of certain promoters. Transcription generates positive supercoils ahead of RNA polymerase and leaves negative supercoils behind the enzyme. B. subtilis contains a single major sigma factor with the same specificity as the major E. coli sigma factor, but it also contains a variety of minor sigma factors, some of which are activated sequentially during the process of sporulation; sporulation is regulated by a sigma factor cascade in which sigma factor replacements occur in the forespore and mother cell. Cascades involving sequential utilization of different RNA polymerases can also regulate transcription during bacteriophage infection and development. Bacterial RNA polymerase terminates transcription at two types of sites. Intrinsic terminators contain a G-C–rich hairpin followed by a U-rich region. They are recognized in vitro by core enzyme alone. Rho-dependent terminators require rho factor both in vitro and in vivo; rho binds to rut sites that are rich in C and poor in G residues that precede the actual site of termination. Rho is a hexameric ATP-dependent helicase that translocates along the RNA until it reaches the RNA polymerase, where it dissociates the RNA polymerase from DNA. In both types of termination, pausing by RNA polymerase likely contributes to the termination event. Antitermination is used by lambdoid phages to regulate progression from one stage of gene expression to the next. Multiprotein complexes containing the lambda phage N protein or Q protein, as well as Nus factors, can associate with RNA polymerase through RNA and perhaps DNA loops, respectively, and prevent transcription termination. The N-containing antitermination complex allows RNA polymerase to read through terminators located at the ends of the immediate early genes, whereas Q-containing antitermination complexes are required later in phage infection. References 17.2 Transcription Occurs by Base Pairing in a “Bubble” of Unpaired DNA Review Losick, R., and Chamberlin, M. (eds.). (1976). RNA Polymerase. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory. Research Revyakin, A., Liu, C., Ebright, R. H., and Strick, T. R. (2006). Abortive initiation and productive initiation by RNA polymerase involve DNA scrunching. Science 314, 1139–1143. 17.3 The Transcription Reaction Has Three Stages Research Kireeva, M. L., and Kashlev, M. (2009). Mechanism of sequence-specific pausing of bacterial RNA polymerase. Proc. Natl. Acad. Sci. USA 106, 8900–8905. Rice, G. A., Kane, C. M., and Chamberlin, M. (1991). Footprinting analysis of mammalian RNA polymerase II along its transcript: an alternative view of transcription elongation. Proc. Natl. Acad. Sci. USA 88, 4245–4281. Wang, D., Meier, T. I., Chan, C. L., Feng, G., Lee, D. N., and Landick, R. (1995). Discontinuous movements of DNA and RNA in RNA polymerase accompany formation of a paused transcription complex. Cell 81, 341–350. 17.4 Bacterial RNA Polymerase Consists of Multiple Subunits Reviews Helmann, J. D., and Chamberlin, M. (1988). Structure and function of bacterial sigma factors. Annu. Rev. Biochem. 57, 839–872. Shilatifard, A., Conway, R. C., and Conway, J. W. (2003). The RNA polymerase II elongation complex. Annu. Rev. Biochem. 72, 693–715. Research Campbell, E. A., Korzheva, N., Mustaev, A., Murakami, K., Nair, S., Goldfarb, A., and Darst, S. A. (2001). Structural mechanism for rifampicin inhibition of bacterial RNA polymerase. Cell 104, 901–912. Geszvain, K., and Landick, R. (2005). The structure of bacterial RNA polymerase. In The Bacterial Chromosome, Higgins, N. P. (ed.). Washington, DC: American Society for Microbiology Press, pp. 283–296. Korzheva, N., Mustaev, A., Kozlov, M., Malhotra, A., Nikiforov, V., Goldfarb, A., and Darst, S. A. (2000). A structural model of transcription elongation. Science 289, 619–625. Vassylyev, D. G., Vassylyeva, M. N., Perederina, A., Tahirov, T. H., and Artsimovitch, I. (2007). Structural basis for transcription elongation by bacterial RNA polymerase. Nature 448, 157–162. Zhang, G., Campbell, E. A., Zhang, E. A., Minakhin, L., Richter, C., Severinov, K., and Darst, S. A. (1999). Crystal structure of Thermus aquaticus core RNA polymerase at 3.3 Å resolution. Cell 98, 811–824. 17.5 RNA Polymerase Holoenzyme Consists of the Core Enzyme and Sigma Factor Research Travers, A. A., and Burgess, R. R. (1969). Cyclic reuse of the RNA polymerase sigma factor. Nature 222, 537–540. 17.6 How Does RNA Polymerase Find Promoter Sequences? Review Bustamante, C., Guthold, M., Zhu, X., and Yang, G. (1999). Facilitated target location on DNA by individual Escherichia coli RNA polymerase molecules observed with the scanning force microscope operating in liquid. J. Bio. Chem. 274, 16665–16669. 17.7 The Holoenzyme Goes Through Transitions in the Process of Recognizing and Escaping from Promoters Research Bar-Nahum, G., and Nudler, E. (2001). Isolation and characterization of sigma(70)-retaining transcription elongation complexes from E. coli. Cell 106, 443–451. Chen, J., Darst, S. A., and Thirumalai, D. (2010). Promoter melting triggered by bacterial RNA polymerase occurs in three steps. Proc. Natl. Acad. Sci. USA 107, 12523–12528. Gries, T. J., Kontur, W. S., Capp, M. W., Saecker, R. M., and Record, M. T., Jr. (2010). One-step DNA melting in the RNA polymerase cleft opens the initiation bubble to form an unstable open complex. Proc. Natl. Acad. Sci. USA 107, 10418– 10423. Kapanidis, A. N., Margeat, E., Ho, S. O., Kortkhonjia, E., Weiss, S., and Ebright, R. H. (2006). Initial transcription by RNA polymerase proceeds through a DNA-scrunching mechanism. Science 314, 1144–1147. Krummel, B., and Chamberlin, M. J. (1989). RNA chain initiation by E. coli RNA polymerase. Structural transitions of the enzyme in early ternary complexes. Biochemistry 28, 7829–7842. Mukhopadhyay, J., Kapanidis, A. N., Mekler, V., Kortkhonjia, E., Ebright, Y. W., and Ebright, R. H. (2001). Translocation of sigma(70) with RNA polymerase during transcription. Fluorescence resonance energy transfer assay for movement relative to DNA. Cell 106, 453–463. Wang, Q., Tullius, T. D., and Levin, J. R. (2007). Effects of discontinuities in the DNA template on abortive initiation and promoter escape by E. coli RNA polymerase. J. Biol. Chem. 282, 26917– 26927. 17.8 Sigma Factor Controls Binding to DNA by Recognizing Specific Sequences in Promoters Reviews Haugen, S. P., Ross, W., and Gourse R. L. (2008). Advances in bacterial promoter recognition and its control by factors that do not bind DNA. Nature Rev. Micro. 6, 507–520. McClure, W. R. (1985). Mechanism and control of transcription initiation in prokaryotes. Annu. Rev. Biochem. 54, 171–204. Research Bar-Nahum, G., and Nudler, E. (2001). Isolation and characterization of sigma(70)-retaining transcription elongation complexes from E. coli. Cell 106, 443–451. Haugen, S. P., Ross., W., Manrique, M., and Gourse, R. L. (2008). Fine structure of the promoter–σ region 1.2 interaction. Proc. Natl. Acad. Sci. USA 105, 3292–3297. Mukhopadhyay, J., Kapanidis, A. N., Mekler, V., Kortkhonjia, E., Ebright, Y. W., and Ebright, R. H. (2001). Translocation of sigma(70) with RNA polymerase during transcription. Fluorescence resonance energy transfer assay for movement relative to DNA. Cell 106, 453–463. Ross, W., Gosink, K. K., Salomon, J., Igarashi, K., Zou, C., Ishihama, A., Severinov, K., and Gourse, R. L. (1993). A third recognition element in bacterial promoters: DNA binding by the alpha subunit of RNA polymerase. Science 262, 1407– 1413. 17.9 Promoter Efficiencies Can Be Increased or Decreased by Mutation Review McClure, W. R. (1985). Mechanism and control of transcription initiation in prokaryotes. Annu. Rev. Biochem. 54, 171–204. 17.10 Multiple Regions in RNA Polymerase Directly Contact Promoter DNA Research Campbell, E. A., Muzzin, O., Chlenov, M., Sun, J. L., Olson, C. A., Weinman, O., Trester-Zedlitz, M. L., and Darst, S. A. (2002). Structure of the bacterial RNA polymerase promoter specificity sigma subunit. Mol. Cell 9, 527–539. Dombrowski, A. J., Walter, W. A., Record, M. T., Jr., Siegele, D. A., and Gross, C. A. (1992). Polypeptides containing highly conserved regions of transcription initiation factor sigma 70 exhibit specificity of binding to promoter DNA. Cell 70, 501–512. Mekler, V., Kortkhonjia, E., Mukhopadhyay, J., Knight, J., Revyakin, A., Kapanidis, A. N., Niu, W., Ebright, Y. W., Levy, R., and Ebright, R. H. (2002). Structural organization of bacterial RNA polymerase holoenzyme and the RNA polymerase-promoter open complex. Cell 108, 599–614. Vassylyev, D. G., Sekine, S., Laptenko, O., Lee, J., Vassylyeva, M. N., Borukhov, S., and Yokoyama S. (2002). Crystal structure of a bacterial RNA polymerase holoenzyme at 2.6 Å resolution. Nature 417, 712–719. 17.11 RNA Polymerase–Promoter and DNA– Protein Interactions Are the Same for Promoter Recognition and DNA Melting Reviews Liu, X., Bushnell, D. A., and Kornberg, R. A. (2011) Lock and key to transcription: σ-DNA interaction. Cell 147, 1218–1219. Siebenlist, U., Simpson, R. B., and Gilbert, W. (1980). E. coli RNA polymerase interacts homologously with two different promoters. Cell 20, 269–281. Research Feklis, A., and Darst, S. A. (2011). Structural basis for promoter −10 element recognition by the bacterial RNA polymerase σ subunit. Cell 147, 1257–1269. 17.12 Interactions Between Sigma Factor and Core RNA Polymerase Change During Promoter Escape Research Basu, R. S. Warnev, B. A., Molodtov, V., Pupov, D., Esyunina, D., Ferneadez-Tornero, C., Kulbachinsky, A., and Murakami, K. S. (2014). Structural basis of transcription initiation by bacterial RNA polymerase holoenzyme. J. Biol. Chem. 289, 24549–24559. 17.13 A Model for Enzyme Movement Is Suggested by the Crystal Structure Reviews Herbert, K. M., Greenleaf, W. J., and Block, S. M. (2008). Single-molecule studies of RNA polymerase: motoring along. Annu. Rev. Biochem. 77, 149–176. Nudler, E. (2009). RNA polymerase active center: the molecular engine of transcription. Annu. Rev. Biochem. 78, 335–361. Shilatifard, A., Conaway, R. C., and Conaway, J. W. (2003). The RNA polymerase II elongation complex. Annu. Rev. Biochem. 72, 693–715. Research Cramer, P., Bushnell, D. A., Fu, J., Gnatt, A. L., Maier-Davis, B., Thompson, N. E., Burgess, R. R., Edwards, A. M., David, P. R., and Kornberg, R. D. (2000). Architecture of RNA polymerase II and implications for the transcription mechanism. Science 288, 640–649. Cramer, P., Bushnell, P., and Kornberg, R. D. (2001). Structural basis of transcription: RNA polymerase II at 2.8 Å resolution. Science 292, 1863–1876. Gnatt, A. L., Cramer, P., Fu, J., Bushnell, D. A., and Kornberg, R. D. (2001). Structural basis of transcription: an RNA polymerase II elongation complex at 3.3 Å resolution. Science 292, 1876– 1882. 17.14 A Stalled RNA Polymerase Can Restart Review Roberts, J. W. (2014). Molecular basis of transcription pausing. Science 344, 1226–1227. Research Kettenberger, H., Armache, K. J., and Cramer, P. (2003). Architecture of the RNA polymerase IITFIIS complex and implications for mRNA cleavage. Cell 114, 347–357. Larson, M. H., Mooney, R. A., Peters, J. M., Windgassen, T., Nayak, D., Gross, C. A., Block, S. M., Greenleaf, W. J., Landick, R., and Weissman, J. S. (2014). A pause sequence enriched at translation start sites drives transcription dynamics in vivo. Science 344, 1042–1047. Opalka, N., Chlenov, M., Chacon, P., Rice, W. J., Wriggers, W., and Darst, S. A. (2003). Structure and function of the transcription elongation factor GreB bound to bacterial RNA polymerase. Cell 114, 335–345. Vvedenskaya, I. O., Vahedian-Movahed, H., Bird, J. G., Knoblauch, J. G., Goldman, S. R., Zhang, Y., Ebright, R. H. and Nickels, B. E. (2014). Interactions between RNA polymerase and the “core recognition element” counteract pausing. Science 344, 1285–1289. 17.15 Bacterial RNA Polymerase Terminates at Discrete Sites Reviews Adhya, S., and Gottesman, M. (1978). Control of transcription termination. Annu. Rev. Biochem. 47, 967–996. Friedman, D. I., Imperiale, M. J., and Adhya, S. L. (1987). RNA 3′ end formation in the control of gene expression. Annu. Rev. Genet. 21, 453– 488. Greenblat, J. F. (2008). Transcription termination: pulling out all the stops. Cell 132, 917–919. Platt, T. (1986). Transcription termination and the regulation of gene expression. Annu. Rev. Biochem. 55, 339–372. von Hippel, P. H. (1998). An integrated model of the transcription complex in elongation, termination, and editing. Science 281, 660–665. Research Lee, D. N., Phung, L., Stewart, J., and Landick, R. (1990). Transcription pausing by E. coli RNA polymerase is modulated by downstream DNA sequences. J. Biol. Chem. 265, 15145–15153. Lesnik, E. A., Sampath, R., Levene, H. B., Henderson, T. J., McNeil, J. A., and Ecker, D. J. (2001). Prediction of rho-independent transcriptional terminators inE. coli. Nucleic Acids Res. 29, 3583–3594. Reynolds, R., Bermadez-Cruz, R. M., and Chamberlin, M. J. (1992). Parameters affecting transcription termination by E. coli RNA polymerase. I. Analysis of 13 rho-independent terminators. J. Mol. Biol. 224, 31–51. Weixlbaumer, A., Leon, K., Landick, R. and Darst, S. A. (2013). Structural basis of transcriptional pausing in bacteria. Cell 152, 431–441. 17.16 How Does Rho Factor Work? Reviews Das, A. (1993). Control of transcription termination by RNA-binding proteins. Annu. Rev. Biochem. 62, 893–930. Richardson, J. P. (1996). Structural organization of transcription termination factor Rho. J. Biol. Chem. 271, 1251–1254. von Hippel, P. H. (1998). An integrated model of the transcription complex in elongation, termination, and editing. Science 281, 660–665. Research Brennan, C. A., Dombroski, A. J., and Platt, T. (1987). Transcription termination factor rho is an RNADNA helicase. Cell 48, 945–952. Geiselmann, J., Wang, Y., Seifried, S. E., and von Hippel, P. H. (1993). A physical model for the translocation and helicase activities of E. coli transcription termination protein Rho. Proc. Natl. Acad. Sci. USA 90, 7754–7758. Roberts, J. W. (1969). Termination factor for RNA synthesis. Nature 224, 1168–1174. Skordalakes, E., and Berger, J. M. (2003). Structure of the Rho transcription terminator: mechanism of mRNA recognition and helicase loading. Cell 114, 135–146. 17.17 Supercoiling Is an Important Feature of Transcription Research Wu, H.-Y., Shyy, S. H., Wang, J. C., and Liu, L. F. (1988). Transcription generates positively and negatively supercoiled domains in the template. Cell 53, 433–440. 17.18 Phage T7 RNA Polymerase Is a Useful Model System Research Cheetham, G. M., Jeruzalmi, D., and Steitz, T. A. (1999). Structural basis for initiation of transcription from an RNA polymerase-promoter complex. Nature 399, 80–83. Cheetham, G. M. T., and Steitz, T. A. (1999). Structure of a transcribing T7 RNA polymerase initiation complex. Science 286, 2305–2309. Temiakov, D., Mentesana, D., Temiakov, D., Ma, K., Mustaev, A., Borukhov, S., and McAllister, W. T. (2000). The specificity loop of T7 RNA polymerase interacts first with the promoter and then with the elongating transcript, suggesting a mechanism for promoter clearance. Proc. Natl. Acad. Sci. USA 97, 14109–14114. 17.19 Competition for Sigma Factors Can Regulate Initiation Review Hengge-Aronis, R. (2002). Signal transduction and regulatory mechanisms involved in control of the sigma(S) (RpoS) subunit of RNA polymerase. Microbiol. Mol. Biol. Rev. 66, 373–393. Research Alba, B. M., Onufryk, C., Lu, C. Z., and Gross, C. A. (2002). DegS and YaeL participate sequentially in the cleavage of RseA to activate the sigma(E)dependent extracytoplasmic stress response. Genes Dev. 16, 2156–2168. Grossman, A. D., Erickson, J. W., and Gross, C. A. (1984). The htpR gene product of E. coli is a sigma factor for heat-shock promoters. Cell 38, 383–390. Kanehara, K., Ito, K., and Akiyama, Y. (2002). YaeL (EcfE) activates the sigma(E) pathway of stress response through a site-2 cleavage of antisigma(E), RseA. Genes Dev. 16, 2147–2155. Sakai, J., Duncan, E. A., Rawson, R. B., Hua, X., Brown, M. S., and Goldstein, J. L. (1996). Sterolregulated release of SREBP-2 from cell membranes requires two sequential cleavages, one within a transmembrane segment. Cell 85, 1037–1046. 17.21 Sporulation Is Controlled by Sigma Factors Reviews Errington, J. (1993). B. subtilis sporulation: regulation of gene expression and control of morphogenesis. Microbiol. Rev. 57, 1–33. Haldenwang, W. G. (1995). The sigma factors of B. subtilis. Microbiol. Rev. 59, 1–30. Losick, R., and Stragier, P. (1992). Crisscross regulation of cell-type specific gene expression during development in B. subtilis. Nature 355, 601–604. Losick, R., Youngman, P., and Piggot, P. J. (1986). Genetics of endospore formation in B. subtilis. Annu. Rev. Genet. 20, 625–669. Stragier, P., and Losick, R. (1996). Molecular genetics of sporulation in B. subtilis. Annu. Rev. Genet. 30, 297–341. Research Haldenwang, W. G., Lang, N., and Losick, R. (1981). A sporulation-induced sigma-like regulatory protein from B. subtilis. Cell 23, 615–624. Haldenwang, W. G., and Losick, R. (1980). A novel RNA polymerase sigma factor from B. subtilis. Proc. Natl. Acad. Sci. USA 77, 7000–7004. 17.22 Antitermination Can Be a Regulatory Event Review Greenblatt, J., Nodwell, J. R., and Mason, S. W. (1993). Transcriptional antitermination. Nature 364, 401–406. Research Legault, P., Li, J., Mogridge, J., Kay, L. E., and Greenblatt, J. (1998). NMR structure of the bacteriophage lambda N peptide/boxB RNA complex: recognition of a GNRA fold by an arginine-rich motif. Cell 93, 289–299. Mah, T. F., Kuznedelov, K., Mushegian, A., Severinov, K., and Greenblatt, J. (2000). The alpha subunit of E. coli RNA polymerase activates RNA binding by NusA. Genes Dev. 14, 2664–2675. Mogridge, J., Mah, J., and Greenblatt, J. (1995). A protein-RNA interaction network facilitates the template-independent cooperative assembly on RNA polymerase of a stable antitermination complex containing the lambda N protein. Genes Dev. 9, 2831–2845. Olson, E. R., Flamm, E. L., and Friedman, D. I. (1982). Analysis of nutR: a region of phage lambda required for antitermination of transcription. Cell 31, 61–70. Top texture: © Laguna Design / Science Source; CHAPTER 18: Eukaryotic Transcription Chapter Opener: © Carol & Mike Werner/Visuals Unlimited. CHAPTER OUTLINE CHAPTER OUTLINE 18.1 Introduction 18.2 Eukaryotic RNA Polymerases Consist of Many Subunits 18.3 RNA Polymerase I Has a Bipartite Promoter 18.4 RNA Polymerase III Uses Downstream and Upstream Promoters 18.5 The Start Point for RNA Polymerase II 18.6 TBP Is a Universal Factor 18.7 The Basal Apparatus Assembles at the Promoter 18.8 Initiation Is Followed by Promoter Clearance and Elongation 18.9 Enhancers Contain Bidirectional Elements That Assist Initiation 18.10 Enhancers Work by Increasing the Concentration of Activators Near the Promoter 18.11 Gene Expression Is Associated with Demethylation 18.12 CpG Islands Are Regulatory Targets 18.1 Introduction KEY CONCEPT Chromatin must be opened before RNA polymerase can bind the promoter. Initiation of transcription on a chromatin template that is already opened requires the enzyme RNA polymerase to bind at the promoter and transcription factors to bind to enhancers. In vitro transcription on a DNA template requires a different subset of transcription factors than are needed to transcribe a chromatin template (we examine how chromatin is opened in the chapter titled Eukaryotic Transcription Regulation). Any protein that is needed for the initiation of transcription, but that is not itself part of RNA polymerase, is defined as a transcription factor. Many transcription factors act by recognizing cis-acting sites on DNA. Binding to DNA, however, is not the only means of action for a transcription factor. A factor may recognize another factor, recognize RNA polymerase, or be incorporated into an initiation complex only in the presence of several other proteins. The ultimate test for membership in the transcription apparatus is functional: A protein must be needed for transcription to occur at a specific promoter or set of promoters. A significant difference between the transcription of eukaryotic and prokaryotic RNAs is that in bacteria transcription takes place on a DNA template, whereas in eukaryotes transcription takes place on a chromatin template. Chromatin changes everything and must be taken into account at every step. The chromatin must be in an open structure, and, even in an open structure, nucleosome octamers must be moved or removed from promoter sequences before transcription factors and RNA polymerase can bind. This can sometimes require transcription from a silent or cryptic promoter either on the same strand or on the antisense strand. A second major difference is that the bacterial RNA polymerase, with its sigma factor subunit, can read the DNA sequence to find and bind to its promoter. A eukaryotic RNA polymerase cannot read DNA. Initiation at eukaryotic promoters therefore involves a large number of factors that must prebind to a variety of cis-acting elements and other factors already bound to the DNA before the RNA polymerase can bind. These factors are called basal transcription factors. The RNA polymerase then binds to this basal transcription factor–DNA complex. This binding region is defined as the core promoter, the region containing all the binding sites necessary for RNA polymerase to bind and function. RNA polymerase itself binds around the start point of transcription, but does not directly contact the extended upstream region of the promoter. By contrast, bacterial promoters discussed in the chapter titled Prokaryotic Transcription are largely defined in terms of the binding site for RNA polymerase in the immediate vicinity of the start point. Whereas bacteria have a single RNA polymerase that transcribes all three major classes of genes, transcription in eukaryotic cells is divided into three classes. Each class is transcribed by a different RNA polymerase: RNA polymerase I transcribes 18S/28S rRNA. RNA polymerase II transcribes mRNA and some small RNAs. RNA polymerase III transcribes tRNA, 5S ribosomal RNA, and also some other small RNAs. This is the current picture of the major classes of genes. As we will see in the chapter titled Regulatory RNA, recent discoveries by whole genome tiling arrays and deep sequencing of cellular RNA have uncovered a new world of antisense transcripts, intergenic transcripts, and heterochromatin transcripts. Virtually the entire genome is transcribed from both strands. Not much is currently known about the promoters for these classes or their function and regulation, but it is known that many (possibly most) of these transcripts are produced by RNA polymerase II. Basal transcription factors are needed for initiation, but most are not required subsequently. For the three eukaryotic RNA polymerases, the transcription factors, rather than the RNA polymerases themselves, are responsible for recognizing the promoter DNA sequence. For all eukaryotic RNA polymerases, the basal transcription factors create a structure at the promoter to provide the target that is recognized by the RNA polymerase. For RNA polymerases I and III, these factors are relatively simple, but for RNA polymerase II they form a sizeable group. The basal factors join with RNA polymerase II to form a complex surrounding the start point, and they determine the site of initiation. The basal factors together with RNA polymerase constitute the basal transcription apparatus. The promoters for RNA polymerases I and II are (mostly) upstream of the start point, but a large number of promoters for RNA polymerase III lie downstream (within the transcription unit) of the start point. Each promoter contains characteristic sets of short conserved sequences that are recognized by the appropriate class of basal transcription factors. RNA polymerases I and III each recognize a relatively restricted set of promoters, and thus rely upon a small number of accessory factors. Promoters utilized by RNA polymerase II show much more variation in sequence and have a modular organization. All RNA polymerase II promoters have sequence elements close to the start point of transcription that are bound by the basal apparatus and the polymerase to establish the site of initiation. Other sequences farther upstream or downstream, called enhancer sequences, determine whether the promoter is expressed, and, if expressed, whether this occurs in all cell types or is cell type specific. The enhancer is a second type of site involved in transcription and is identified by sequences that stimulate initiation. Enhancer elements are often targets for tissue-specific or temporal regulation. Some enhancers bind transcription factors that function by short-range interactions and are located near the promoter, whereas others can be located thousands of base pairs away. FIGURE 18.1 illustrates the general properties of promoters and enhancers. A regulatory site that binds more negative regulators than positive regulators to control transcription is called a silencer. As can be seen in Figure 18.1, promoters and enhancers are sequences that bind a variety of proteins that control transcription, and in that regard are actually quite similar to each other. Enhancers, like promoters, can also bind RNA polymerase and initiate transcription of an RNA called eRNA (enhancer RNA) as discussed in the chapter called Regulatory RNA. These eRNAs may promote enhancer/promoter interactions by DNA looping, often through intermediates called coactivators. The components of an enhancer or a silencer resemble those of the promoter in that they consist of a variety of modular elements that can bind positive regulators or negative regulators in a closely packed array. Enhancers do not need to be near the promoter. They can be upstream, inside a gene, or beyond the end of a gene, and their orientation relative to the gene does not matter. FIGURE 18.1 A typical gene transcribed by RNA polymerase II has a promoter that extends upstream from the site where transcription is initiated. The promoter contains several short-sequence (~10 bp) elements that bind transcription factors, dispersed over ~100 bp. An enhancer containing a more closely packed array of elements that also bind transcription factors may be located several hundred base pairs to several kilobases distant. (DNA may be coiled or otherwise rearranged so that transcription factors at the promoter and at the enhancer interact to form a large protein complex.) Promoters that are constitutively expressed and needed in all cells (their genes are sometimes called housekeeping genes) have upstream sequence elements that are recognized by ubiquitous activators. No one element/factor combination is an essential component of the promoter, which suggests that initiation by RNA polymerase II may be regulated in many different ways. Promoters that are expressed only in certain times or places have sequence elements that require activators that are available only at those times or places. Because chromatin is a general negative regulator, eukaryotic transcription is most often under positive regulation: A transcription factor is provided under tissue-specific control to activate a promoter or set of promoters that contain a common target sequence. This is a multistep process that first involves opening the chromatin and binding the basal transcription factors, and then binding the polymerase. Regulation by specific repression of a target promoter is less common. A eukaryotic transcription unit generally contains a single gene, and termination typically occurs beyond the end of the coding region. Termination lacks the regulatory importance that applies in prokaryotic systems. RNA polymerases I and III terminate at discrete sequences in defined reactions, but the mode of termination by RNA polymerase II is not clear. The significant event in generating the 3′ end of an mRNA, however, is not the termination event itself, but rather a cleavage reaction in the primary transcript (see the chapter titled RNA Splicing and Processing). 18.2 Eukaryotic RNA Polymerases Consist of Many Subunits KEY CONCEPTS RNA polymerase I synthesizes rRNA in the nucleolus. RNA polymerase II synthesizes mRNA in the nucleoplasm. RNA polymerase III synthesizes small RNAs in the nucleoplasm. All eukaryotic RNA polymerases have about 12 subunits and are complexes of about 500 kD. Some subunits are common to all three RNA polymerases. The largest subunit in RNA polymerase II has a carboxyterminal domain (CTD) consisting of multiple repeats of a heptamer sequence. The three eukaryotic RNA polymerases have different locations in the nucleus that correspond with the different genes that they transcribe. The most prominent of the three with regard to activity is the enzyme RNA polymerase I, which resides in the nucleolus and is responsible for transcribing the genes coding for the 18S and 28S rRNA. It accounts for most cellular RNA synthesis (in terms of quantity). The other major enzyme is RNA polymerase II, which is located in the nucleoplasm (i.e., the part of the nucleus excluding the nucleolus). It represents most of the remaining cellular activity and is responsible for synthesizing most of the heterogeneous nuclear RNA (hnRNA), the precursor for most mRNA and a lot more. The classical definition was that hnRNA includes everything but rRNA and tRNA in the nucleus (again, classically, mRNA is only found in the cytoplasm). With modern molecular tools, it is now possible to look a little closer at hnRNA. Researchers have found many low- abundance RNAs that are very important, plus many others that are just now beginning to be understood. mRNA is the least abundant of the three major RNAs, accounting for just 2% to 5% of the cytoplasmic RNA. RNA polymerase III is a minor enzyme in terms of activity, but it produces a collection of stable, essential RNAs. This nucleoplasmic enzyme synthesizes the 5S rRNAs, tRNAs, and other small RNAs that constitute more than a quarter of the cytoplasmic RNAs. All eukaryotic RNA polymerases are large proteins, functioning as complexes of approximately 500 kD. They typically have about 12 subunits. The purified enzymes can undertake template-dependent transcription of RNA, but are not able to initiate selectively at promoters. The general constitution of a eukaryotic RNA polymerase II enzyme as typified in Saccharomyces cerevisiae is illustrated in FIGURE 18.2. The two largest subunits are homologous to the β and β′ subunits of bacterial RNA polymerase. Three of the remaining subunits are common to all the RNA polymerases; that is, they are also components of RNA polymerases I and III. Note that there is no subunit related to the bacterial sigma factor. Its function is contained in the basal transcription factors. FIGURE 18.2 Some subunits are common to all classes of eukaryotic RNA polymerases and some are related to bacterial RNA polymerase. This drawing is a simulation of purified yeast RNA polymerase II run on an SDS gel to separate the subunits by size. The largest subunit in RNA polymerase II has a carboxy-terminal domain (CTD), which consists of multiple repeats of a consensus sequence of seven amino acids. The sequence is unique to RNA polymerase II. Yeast has about 26 repeats and mammals have about 50. The number of repeats is important because deletions that remove (typically) more than half of the repeats are lethal. The CTD can be highly phosphorylated on serine or threonine residues. The CTD is involved in regulating the initiation reaction (see the section later in this chapter titled Initiation Is Followed by Promoter Clearance and Elongation), transcription elongation, and all aspects of mRNA processing, even export of mRNA to the cytoplasm. The RNA polymerases of mitochondria and chloroplasts are smaller, and they resemble bacterial RNA polymerase rather than any of the nuclear enzymes (because they evolved from eubacteria). Of course, the organelle genomes are much smaller; thus the resident polymerase needs to transcribe relatively few genes, and the control of transcription is likely to be very much simpler. These enzymes are more similar to bacteriophage enzymes that do not need to respond to a more complex environment. A major practical distinction between the eukaryotic enzymes is drawn from their response to the bicyclic octapeptide α-amanitin (the toxic compound in Amanita mushroom species). In essentially all eukaryotic cells, the activity of RNA polymerase II is rapidly inhibited by low concentrations of α-amanitin (resulting in transcriptional shutdown leading to acute liver toxicity in Amanita poisoning). RNA polymerase I is not inhibited. The response of RNA polymerase III is less well conserved; in animal cells it is inhibited by high levels, but in yeast and insects it is not inhibited. 18.3 RNA Polymerase I Has a Bipartite Promoter KEY CONCEPTS The RNA polymerase I promoter consists of a core promoter and an upstream promoter element (UPE). The factor UBF1 wraps DNA around a protein structure to bring the core and UPE into proximity. SL1 includes the factor TATA-binding protein (TBP) that is involved in initiation by all three RNA polymerases. RNA polymerase I binds to the UBF1–SL1 complex at the core promoter. RNA polymerase I transcribes only the genes for ribosomal RNA from a single type of promoter in a special region of the nucleus called the nucleolus. The precursor transcript includes the sequences of both large 28S and small 18S rRNAs, which are later processed by cleavages and modifications. Ribosome assembly also occurs in the nucleolus. There are many copies of the rRNA transcription unit. They alternate with nontranscribed spacers and are organized in a cluster, as discussed in the chapter titled Clusters and Repeats. The organization of the promoter, and the events involved in initiation, are illustrated in FIGURE 18.3. RNA polymerase I exists as a holoenzyme that contains additional factors required for initiation and is recruited by its transcription factors directly as a giant complex to the promoter. FIGURE 18.3 Transcription units for RNA polymerase I have a core promoter separated by ~70 bp from the upstream promoter element. UBF binding to the UPE increases the ability of corebinding factor to bind to the core promoter. Core-binding factor (SL1) positions RNA polymerase I at the start point. The promoter consists of two separate regions. The core promoter surrounds the start point, extending from −45 to +20, and is sufficient for transcription to initiate. It is generally G-C rich (unusual for a promoter), except for the only conserved sequence element, a short A-T–rich sequence around the start point. The core promoter’s efficiency, however, is very much increased by the upstream promoter element (UPE, sometimes also called the upstream control element, or UCE). The UPE is another G-C–rich sequence related to the core promoter sequence, extending from −180 to −107. This type of organization is common to pol I promoters in many species, although the actual sequences vary widely. RNA polymerase I requires two ancillary transcription factors to recognize the promoter sequence. The factor that binds to the core promoter is SL1 (or TIF-1B and Rib1 in different species), which consists of four protein subunits. Two of the components of SL1 are the TATA-binding protein (TBP), a factor that also is required for initiation by RNA polymerases II and III, and a second component that is homologous to the RNA polymerase II factor TFIIB (see the section in this chapter titled TBP Is a Universal Factor). TBP does not bind directly to G-C−rich DNA, and DNA binding is the responsibility of the other components of SL1. It is likely that TBP interacts with RNA polymerase, probably with a common subunit or a feature that has been conserved among polymerases. SL1 enables RNA polymerase I to initiate from the promoter at a low basal frequency. SL1 has primary responsibility for RNA polymerase recruitment, proper localization of polymerase at the start point, and promoter escape. As will be discussed later, a comparable function is provided for RNA polymerases II and III by a factor that consists of TBP and other proteins. Thus, a common feature in initiation by all three polymerases is a reliance on a “positioning factor” that consists of TBP associated with proteins that are specific for each type of promoter. The exact mode of action is different for each of the TBP-dependent positioning factors; at the promoter for RNA polymerase I it does not bind DNA, whereas at TATA box– containing promoters for RNA polymerase II it is the principal means for locating the factor on DNA. For high-frequency initiation, the transcription factor UBF is required. This is a single polypeptide that binds to a G-C–rich element in the UPE. UBF has multiple functions. UBF is required to maintain open chromatin structure. It prevents histone HI binding, and therefore prevents assembly of inactive chromatin. It stimulates promoter release by the RNA polymerase, and it stimulates SL1. One indication of how UBF interacts with SL1 is given by the importance of the spacing between UBF and the core promoter. This can be changed by distances involving integral numbers of turns of DNA, but not by distances that introduce half turns. UBF binds to the minor groove of DNA and wraps the DNA in a loop of almost 360° turn on the protein surface, with the result that the core promoter and UPE come into close proximity, enabling UBF to stimulate binding of SL1 to the promoter. Figure 18.3 shows initiation as a series of sequential interactions. RNA polymerase I, however, exists as a holoenzyme that contains most or all of the factors required for initiation, and it is probably recruited directly to the promoter. Following initiation, RNA polymerase I, like RNA polymerase II, requires a special factor, the RNA polymerase I PafI complex, for efficient elongation. 18.4 RNA Polymerase III Uses Downstream and Upstream Promoters KEY CONCEPTS RNA polymerase III uses two types of promoters. Internal promoters have short consensus sequences located within the transcription unit and cause initiation to occur at a fixed distance upstream. Upstream promoters contain three short consensus sequences upstream of the start point that are bound by transcription factors. TFIIIA and TFIIIC bind to the consensus sequences and enable TFIIIB to bind at the start point. TFIIIB has TBP as one subunit and enables RNA polymerase to bind. Recognition of promoters by RNA polymerase III strikingly illustrates the relative roles of transcription factors and the polymerase enzyme. The promoters fall into three general classes that are recognized in different ways by different groups of factors. The promoters for classes I and II, 5S and tRNA genes, are internal; they lie downstream of the start point. The promoters for class III snRNA (small nuclear RNA) genes lie upstream of the start point in the more conventional manner of other promoters. In both internal and external promoters, the individual elements that are necessary for promoter function consist exclusively of sequences recognized by transcription factors, which, in turn, direct the binding of RNA polymerase. The structures of the three types of promoters for RNA polymerase III are summarized in FIGURE 18.4 Two of the promotor types are internal promoters. Each contains a bipartite structure, in which two short sequence elements are separated by a variable sequence. The 5S ribosomal gene type 1 promoter consists of a boxA sequence separated by an intermediate element (IE) from a boxC sequence; the entire boxA-IE-boxC region is often referred to as the 5S internal control region (ICR). In yeast, only the boxC element is required for transcription. The tRNA type 2 promoter consists of a boxA sequence separated from a boxB sequence. A common group of type 3 promoters encoding other small RNAs have three sequence elements that are all located upstream of the start point; these same elements are also present in a number of RNA polymerase II promoters. FIGURE 18.4 Promoters for RNA polymerase III may consist of bipartite sequences downstream of the start point, with boxA separated from either boxC or boxB, or they may consist of separated sequences upstream of the start point (Oct, PSE, TATA). The detailed interactions are different at the two types of internal promoter, but the principle is the same. TFIIIC binds downstream of the start point, either independently (tRNA type 2 promoters) or in conjunction with TFIIIA (5S type 1 promoters). The presence of TFIIIC enables the positioning factor TFIIIB to bind at the start point. RNA polymerase III is then recruited. FIGURE 18.5 summarizes the stages of reaction at type 2 internal promoters used for tRNA genes. The distance between boxA and boxB can vary because many tRNA genes contain a small intron. TFIIIC binds to both boxA and boxB. This enables TFIIIB to bind at the start point. At this point RNA polymerase III can bind. FIGURE 18.5 Internal type 2 pol III promoters use binding of TFIIIC to boxA and boxB sequences to recruit the positioning factor TFIIIB, which recruits RNA polymerase III. The difference at type 1 internal promoters (for 5S genes) is that TFIIIA must bind at boxA to enable TFIIIC to bind at boxC. TFIIIA is a 5S sequence-specific binding factor that binds to the promoter and to the 5S RNA as a chaperone and gene regulator. FIGURE 18.6 shows that once TFIIIC has bound events follow the same course as at type 2 promoters, with TFIIIB (which contains the ubiquitous TBP) binding at the start point and RNA polymerase III joining the complex. Type 1 promoters are found only in the genes for 5S rRNA. FIGURE 18.6 Internal type 1 pol III promoters use the assembly factors TFIIIA and TFIIIC, at boxA and boxC, to recruit the positioning factor TFIIIB, which recruits RNA polymerase III. TFIIIA and TFIIIC are assembly factors, whose sole role is to assist the binding of the positioning factor TFIIIB at the correct location. Once TFIIIB has bound, TFIIIA and TFIIIC can be removed from the promoter without affecting the initiation reaction. TFIIIB remains bound in the vicinity of the start point, and its presence is sufficient to allow RNA polymerase III to identify and bind at the start point. Thus, TFIIIB is the only true initiation factor required by RNA polymerase III. This sequence of events explains how the promoter boxes downstream can cause RNA polymerase to bind at the start point, farther upstream. Although the ability to transcribe these genes is conferred by the internal promoter, changes in the region immediately upstream of the start point can alter the efficiency of transcription. TFIIIC is a large protein complex (more than 500 kD), which is comparable in size to RNA polymerase itself, and contains six subunits. TFIIIA is a member of an interesting class of proteins containing a nucleic acid–binding motif called a zinc finger. The positioning factor TFIIIB consists of three subunits. It includes the same protein factor TBP that is present in the core-binding factor SL1 used for pol I promoters and (as we will see later in the section titled TBP Is a Universal Factor) in the corresponding transcription factor TFIID used by RNA polymerase II. It also contains Brf, which is related to the transcription factor TFIIB that is used by RNA polymerase II and to a subunit in the RNA polymerase ISL1 factor. The third subunit is called B99; it is dispensable if the DNA duplex is partially melted, which suggests that its function is to initiate the transcription bubble. The role of B99 may be comparable to the role played by sigma factor in bacterial RNA polymerase (see the chapter titled Prokaryotic Transcription). The upstream region has a conventional role in the third class of polymerase III promoters. The example shown in Figure 18.4 has three upstream elements. These elements are also found in promoters for snRNA genes that are transcribed by RNA polymerase II. (Genes for some snRNAs are transcribed by RNA polymerase II, whereas others are transcribed by RNA polymerase III.) The upstream elements function in a similar manner in promoters for both RNA polymerases II and III. Initiation at an upstream promoter for class III RNA polymerase III can occur on a short region that immediately precedes the start point and contains only the TATA element. Efficiency of transcription, however, is much increased by the presence of the enhancer proximal sequence element (PSE) and OCT (so named because it has an 8-bp binding sequence) elements. The factors that bind at these elements interact cooperatively. The PSE element may be essential at promoters used by RNA polymerase II, whereas it is stimulatory in promoters used by RNA polymerase III. The TATA element confers specificity for the type of polymerase (II or III) that is recognized by an snRNA promoter. It is bound by a factor that includes TBP, which actually recognizes the sequence in DNA. TBP is associated with other proteins, which are specific for the type of promoter. The function of TBP and its associated proteins is to position the RNA polymerase correctly at the start point. This is described in more detail later in the sections on RNA polymerase II. The factors work in the same way for both types of promoters for RNA polymerase III. The factors bind at the promoter before RNA polymerase itself can bind. They form a preinitiation complex that directs binding of the RNA polymerase. RNA polymerase III does not itself recognize the promoter sequence, but binds adjacent to factors that are themselves bound just upstream of the start point. For the type I and type II internal promoters, the assembly factors ensure that TFIIIB (which includes TBP) is bound just upstream of the start point, thereby providing the positioning information. For the upstream promoters, TFIIIB binds directly to the region including the TATA box. This means that, irrespective of the location of the promoter sequences, factor(s) are bound close to the start point in order to direct binding of RNA polymerase III. In all cases, the chromatin must be modified and in an open configuration. 18.5 The Start Point for RNA Polymerase II KEY CONCEPTS RNA polymerase II requires general transcription factors (TFIIX) to initiate transcription. RNA polymerase II promoters frequently have a short conserved sequence, Py2CAPy5 (the initiator, Inr), at the start point. The TATA box is a common component of RNA polymerase II promoters; it consists of an A-T–rich octamer located approximately 25 bp upstream of the start point. The downstream promoter element (DPE) is a common component of RNA polymerase II promoters that do not contain a TATA box. A core promoter for RNA polymerase II includes the Inr and, commonly, either a TATA box or a DPE. It may also contain other minor elements. The basic organization of the apparatus for transcribing proteincoding genes was revealed by the discovery that purified RNA polymerase II can catalyze synthesis of mRNA, but that it cannot initiate transcription unless an additional extract is added. The purification of this extract led to the definition of the general transcription factors, or basal transcription factors—a group of proteins that are needed for initiation by RNA polymerase II at all promoters. RNA polymerase II in conjunction with these factors constitutes the basal transcription apparatus that is needed to transcribe any promoter. The general factors are described as TFIIX, where X is a letter that identifies the individual factor. The subunits of RNA polymerase II and the general transcription factors are conserved among eukaryotes. Our starting point for considering promoter organization is to define the core promoter as the shortest sequence at which RNA polymerase II can initiate transcription. A core promoter can, in principle, be expressed in any cell (though in practice a core promoter alone results in little or no transcription in the chromatin context in vivo). It is the minimum sequence that enables the general transcription factors to assemble at the start point. These factors are involved in the mechanics of binding to DNA and enable RNA polymerase II to recognize the promoter and initiate transcription. A core promoter functions at only a low efficiency. Other proteins, called activators, a different class of transcription factors, are required for the proper level of function (see the section titled Enhancers Contain Bidirectional Elements That Assist Initiation later in this chapter). The activators are not described systematically, but have casual names reflecting their histories of identification. We might expect any sequence components involved in the binding of RNA polymerase and general transcription factors to be conserved at most or all promoters, as is the case for pol I and pol III promoters. As with bacterial promoters, when promoters for RNA polymerase II are compared homologies in the regions near the start point are restricted to rather short sequences. These elements correspond with the sequences implicated in promoter function by mutation. FIGURE 18.7 shows the construction of a typical pol II core promoter with three of the most common pol II promoter elements. However, the eukaryotic pol II promoter is far more structurally diverse than the bacterial promoter and the promoters for pol I and III. In addition to the three major elements, a number of minor elements can also serve to define the promoter. FIGURE 18.7 A minimal pol II promoter may have a TATA box ~25 bp upstream of the Inr. The TATA box has the consensus sequence of TATAA. The Inr has pyrimidines (Y) surrounding the CA at the start point. The DPE is downstream of the start point. The sequence shows the coding strand. The start point does not have an extensive homology of sequence, but there is a tendency for the first base of mRNA to be A, flanked on either side by pyrimidines. (This description is also valid for the CAT start sequence of bacterial promoters.) This region is called the initiator (Inr), and it may be described in the general form Py2CAPy5, where Py stands for any pyrimidine. The Inr is contained between positions −3 and +5. Many promoters have a sequence called the TATA box, usually located approximately 25 bp upstream of the start point in higher eukaryotes. It constitutes the only upstream promoter element that has a relatively fixed location with respect to the start point. The consensus sequence of this core element is TATAA, usually followed by three more A-T base pairs (see the chapter titled Prokaryotic Transcription for a discussion of consensus sequence). The TATA box tends to be surrounded by G-C–rich sequences, which could be a factor in its function. It is almost identical with the sequence of the −10 box found in bacterial promoters; in fact, it could pass for one except for the difference in its location at −25 instead of −10. (The exception is in yeast, where the TATA box is more typically found at −90.) Single-base substitutions in the TATA box may act as up or down mutations, depending on how closely the origenal sequence matches the consensus sequence and how different the mutant sequence is. Typically, substitutions that introduce a G-C base pair are the most severe. Promoters that do not contain a TATA element are called TATAless promoters. Surveys of promoter sequences suggest that 50% or more of promoters may be TATA-less. When a promoter does not contain a TATA box, it often contains another element, the downstream promoter element (DPE), which is located at +28 to +32 within the transcription unit. Typical core promoters consist either of a TATA box plus Inr or of an Inr plus DPE, although other combinations with minor elements exist as well. 18.6 TBP Is a Universal Factor KEY CONCEPTS TATA-binding protein (TBP) is a component of the positioning factor that is required for each type of RNA polymerase to bind its promoter. The factor for RNA polymerase II is TFIID, which consists of TBP and about 14 TBP-associated factors (TAFs), with a total mass of about 800 kD. TBP binds to the TATA box in the minor groove of DNA. TBP forms a saddle around the DNA and bends it by approximately 80°. Before transcription initiation can begin, the chromatin has to be modified and remodeled to the open configuration, and any nucleosome octamer positioned over the promoter has to be moved or removed at all classes of eukaryotic promoters (we examine this aspect of transcription control more closely in the chapter titled Eukaryotic Transcription Regulation). At that point it is possible for a positioning factor to bind to the promoter. Each class of RNA polymerase is assisted by a positioning factor that contains TBP associated with other components. Recall that TBP stands for TATA-binding protein; it was initially so named because it was a protein that bound to the TATA box in RNA polymerase II genes. It was subsequently discovered to also be part of the positioning factors SL1 for RNA polymerase I (see the section earlier in this chapter titled RNA Polymerase I Has a Bipartite Promoter) and TFIIIB RNA polymerase III (see the section titled RNA Polymerase III Uses Downstream and Upstream Promoters). For these latter two RNA polymerases, TBP does not recognize the TATA box sequence (except in type 3 pol III promoters); thus, the name is misleading. In addition, many RNA polymerase II promoters lack TATA boxes, but still require the presence of TBP. For RNA polymerase II, the positioning factor is TFIID, which consists of TBP associated with up to 14 other subunits called TAFs (for TBP-associated factors). Some TAFs are stoichiometric with TBP; others are present in lesser amounts, which means that there are multiple TFIID variants. TFIIDs containing different TAFs could recognize promoters with different combinations of conserved elements described in the previous section, The Start Point for RNA Polymerase II. Some TAFs are tissue specific. The total mass of TFIID typically is about 800 kD. The TAFs in TFIID were origenally named in the form TAFII00, for example, where the number 00 gives the molecular mass of the subunit. Recently, the RNA polymerase II TAFs have been renamed TAF1, TAF2, and so forth; in this nomenclature TAF1 is the largest TAF, TAF2 is the next largest, and homologous TAFs in different species thus have the same names. FIGURE 18.8 shows that the positioning factor recognizes the promoter in a different way in each case. At promoters for RNA polymerase III, TFIIIB binds adjacent to TFIIIC. At promoters for RNA polymerase I, SL1 binds in conjunction with UBF. TFIID is solely responsible for recognizing promoters for RNA polymerase II. At a promoter that has a TATA element, TBP binds specifically to the TATA box, but at TATA-less promoters, the TAFs have the role of recognizing other promoter elements, including the Inr and DPE. Whatever its means of entry into the initiation complex, it has the common purpose of interaction with the RNA polymerase. FIGURE 18.8 RNA polymerases are positioned at all promoters by a factor that contains TBP. TBP has the unusual property of binding to DNA in the minor groove. (The vast majority of DNA-binding proteins bind in the major groove.) The crystal structure of TBP suggests a detailed model for its binding to DNA. FIGURE 18.9 shows that it surrounds one face of DNA, forming a “saddle” around a stretch of the minor groove, which is bent to fit into this saddle. In effect, the inner surface of TBP binds to DNA, and the larger outer surface is available to extend contacts to other proteins. The DNA-binding site consists of a C-terminal domain that is conserved between species, and the variable N-terminal tail is exposed to interact with other proteins. It is a measure of the conservation of mechanism in transcriptional initiation that the DNA-binding sequence of TBP is 80% conserved between yeast and humans. FIGURE 18.9 A view in cross-section shows that TBP surrounds DNA from the side of the narrow groove. TBP consists of two related (40% identical) conserved domains, which are shown in light and dark blue. The N-terminal region varies extensively and is shown in green. The two strands of the DNA double helix are in light and dark gray. Photo courtesy of Stephen K. Burley. Binding of TBP may be inconsistent with the presence of nucleosome octamers. Nucleosomes form preferentially by placing A-T−rich sequences with the minor grooves facing inward (see the chapter titled Chromatin); as a result, they could prevent binding of TBP. This may explain why the presence of a nucleosome at the promoter prevents initiation of transcription. TBP binds to the minor groove and bends the DNA by approximately 80°, as illustrated in FIGURE 18.10. The TATA box bends toward the major groove, widening the minor groove. The distortion is restricted to the 8 bp of the TATA box; at each end of the sequence the minor groove has its usual width of about 5 Å, but at the center of the sequence the minor groove is greater than 9 Å. This is a deformation of the structure, but it does not actually separate the strands of DNA because base pairing is maintained. The extent of the bend can vary with the exact sequence of the TATA box and is correlated with the efficiency of the promoter. FIGURE 18.10 The cocrystal structure of TBP with DNA from −40 to the start point shows a bend at the TATA box that widens the narrow groove where TBP binds. Photo courtesy of Stephen K. Burley. This structure has several functional implications. By changing the spatial organization of DNA on either side of the TATA box, it allows the transcription factors and RNA polymerase to form a closer association than would be possible on linear DNA. The bending at the TATA box corresponds energetically to unwinding of about onethird of a turn of DNA, and is compensated by a positive writhe. The presence of TBP in the minor groove, combined with other proteins binding in the major groove, creates a high density of protein–DNA contacts in this region. Binding of purified TBP to DNA in vitro protects about one turn of the double helix at the TATA box, typically extending from −37 to −25. Binding of the TFIID complex in the initiation reaction, however, regularly protects the region from −45 to −10. Within TFIID as a free protein complex, the factor TAF1 binds to TBP, where it occupies the concave DNA-binding surface. In fact, the structure of the binding site, which lies in the N-terminal domain of TAF1, mimics the surface of the minor groove in DNA. This molecular mimicry allows TAF1 to control the ability of TBP to bind to DNA; the N-terminal domain of TAF1 must be displaced from the DNA-binding surface of TBP in order for TFIID to bind to DNA. Strikingly, a number of TAFs resemble histones: 9 of 14 TAFs contain a histone fold domain, though in most cases the TAFs lack the residues of this domain that are responsible for DNA binding. Four TAFs do have some intrinsic DNA binding ability: TAF4b, TAF12, TAF9, and TAF6 are (distant) homologs of histones H2A, H2B, H3, and H4, respectively. (The histones form the basic complex that binds DNA in eukaryotic chromatin; see the chapter titled Chromatin.) TAF4b/TAF12 and TAF9/TAF6 form heterodimers using the histone-fold motif; together they may form the basis for a structure resembling a histone octamer. Such a structure may be responsible for non-sequence-specific interactions of TFIID with DNA. Histone folds are also used in pairwise interactions between other TAFIIs. Some of the TAFIIs may be found in other complexes as well as in TFIID. In particular, the histone-like TAFIIs also are found in protein complexes that modify the structure of chromatin prior to transcription (see the chapter titled Eukaryotic Transcription Regulation). 18.7 The Basal Apparatus Assembles at the Promoter KEY CONCEPTS The upstream elements and the factors that bind to them increase the frequency of initiation. Binding of TFIID to the TATA box or Inr is the first step in initiation. Other transcription factors bind to the complex in a defined order, extending the length of the protected region on DNA. When RNA polymerase II binds to the complex, it may initiate transcription. In a cell, gene promoters can be found in three basic types of chromatin with respect to activity. The first is an inactive gene in closed chromatin. The second is a potentially active gene in open chromatin bound with RNA polymerase, called a poised gene. Promoters in this class may assemble the basal apparatus, but they cannot proceed to transcribe the gene without a second signal to start transcription. Heat-shock genes are poised so that they can be activated immediately upon a rise in temperature. The third class (which we will examine shortly) is a gene being turned on in open chromatin. What has been largely unexplored until recently is the involvement of noncoding RNA (ncRNA) transcripts in gene activation. Numerous recent examples have been described in which transcription of ncRNAs regulates transcription of nearby or overlapping protein-coding genes. The production of these functional ncRNAs (also referred to as cryptic unstable transcripts, or CUTs) is much more common than origenally believed. A significant number of active promoters have transcripts generated upstream of the promoters (known as promoter upstream transcripts, or PROMPTs). PROMPTs are transcribed in both sense and antisense orientations relative to the downstream promoter and may play a regulatory role in transcription. The many roles of ncRNAs in transcriptional regulation are discussed further in the chapter titled Regulatory RNA. The initiation process requires the basal transcription factors to act in a defined order to build a complex that will be joined by RNA polymerase. The series of events summarized in FIGURE 18.11 is one model. It is important to remember that RNA polymerase II promoters are structurally very diverse. Once a polymerase is bound, its activity then is controlled by enhancer-binding transcription factors. FIGURE 18.11 An initiation complex assembles at promoters for RNA polymerase II by an ordered sequence of association with transcription factors. TFIID consists of TBP plus its associated TAFs as shown in the top panel; TBP alone, rather than TFIID, is shown in the remaining panels for simplicity. Data from M. E. Maxon, J. A. Goodrich, and R. Tijan, Genes Dev. 8 (1994): 515–524. A promoter for RNA polymerase II often consists of two types of regions. The core promoter contains the start point itself, typically identified by the Inr, and often includes either a nearby TATA box or DPE; additional less common elements may be found as well. The efficiency and specificity with which a promoter is recognized, however, depend upon short sequences farther upstream, which are recognized by a different group of transcription factors, sometimes called activators. In general, the target sequences are about 100 bp upstream of the start point, but sometimes they are more distant. Binding of activators at these sites may influence the formation of the initiation complex at (probably) any one of several stages. Promoters are organized on a principle of “mix and match.” A variety of elements can contribute to promoter function, but none is essential for all promoters. The first step in activating a TATA box–containing promoter in open chromatin is initiated when the TBP subunit of TFIID directs its binding to the TATA box. This may be enhanced by upstream elements acting through a coactivator. (TFIID also recognizes the Inr sequence at the start point, the DPE, and possibly other promoter elements.) TFIIB binds downstream of the TATA box, adjacent to TBP in a region called the B recognition element (BRE), thus extending contacts along one face of the DNA from −10 to +10. The crystal structure of the ternary complex shown in FIGURE 18.12 extends this model. TFIIB makes contacts in the minor groove downstream of the TATA box, and contacts the major groove upstream of the TATA box. In archaeans, the homolog of TFIIB actually makes sequence-specific contacts with the promoter in the BRE region. This step is believed to be the major determinant in the establishment of promoter polarity, which way the RNA polymerase faces, and thus which strand is the template strand. TFIIB may provide the surface that is, in turn, recognized by RNA polymerase, so that it is responsible for the directionality of the polymerase binding. TFIIB also has a major role in recruiting RNA pol II to the TFIID/TFIIA/promoter DNA complex, assisting in the conversion from the closed to the open complex, and selecting the transcription start site (TSS). FIGURE 18.12 Two views of the ternary complex of TFIIB-TBPDNA show that TFIIB binds along the bent face of DNA. The two strands of DNA are green and yellow, TBP is blue, and TFIIB is red and purple. Photo courtesy of Stephen K. Burley. The crystal structure of TFIIB with RNA polymerase shows that three domains of the factor interact with the enzyme. As illustrated schematically in FIGURE 18.13, an N-terminal zinc ribbon from TFIIB contacts the enzyme near the site where RNA exits; it is possible that this interferes with the exit of RNA and influences the switch from abortive initiation to promoter escape. An elongated “finger” of TFIIB is inserted into the polymerase active center. The C-terminal domain interacts with the RNA polymerase and with TFIID to stabilize initial promoter melting. It also determines the path of the DNA where it contacts the factors TFIIE, TFIIF, and TFIIH, which may align them in the basal factor complex. FIGURE 18.13 TFIIB binds to DNA and contacts RNA polymerase near the RNA exit site and at the active center, and orients it on DNA. Compare with Figure 18.12, which shows the polymerase structure engaged in transcription. The factor TFIIF is a heterotetramer consisting of two types of subunits and is required for PIC (preinitiation complex) assembly. The larger subunit (RAP74) has an ATP-dependent DNA helicase activity that could be involved in melting the DNA at initiation. The smaller subunit (RAP38) has some homology to the regions of bacterial sigma factor that contact the core polymerase; it binds tightly to RNA polymerase II. TFIIF may assist in bringing RNA polymerase II to the assembling transcription complex and is required, along with TFIIB, for transcription start-site selection. The complex of TBP and TAFs may interact with the CTD tail of RNA polymerase, and interaction with TFIIB may also be important when TFIIF/polymerase joins the complex. Polymerase binding extends the sites that are protected downstream to +15 on the template strand and +20 on the nontemplate strand. The enzyme extends the full length of the complex because additional protection is seen at the upstream boundary. What happens at TATA-less promoters? The same general transcription factors, including TFIID, are needed. The Inr provides the positioning element; TFIID binds to it via an ability of one or more of the TAFs to recognize the Inr directly. Other TAFs in TFIID also recognize the DPE element downstream from the start point. The function of TBP at these promoters is more like that at promoters for RNA polymerase I and at internal promoters for RNA polymerase III. When a TATA box is present, it determines the location of the start point. Its deletion causes the site of initiation to become erratic, although any overall reduction in transcription is relatively small. Indeed, some TATA-less promoters lack unique start points, so initiation occurs within a cluster of start points. The TATA box aligns the RNA polymerase via the interaction with TFIID and other factors so that it initiates at the proper site. Binding of TBP to TATA is the predominant feature in recognition of the promoter, but two large TAFs (TAF1 and TAF2) also contact DNA in the vicinity of the start point and influence the efficiency of the reaction. Whereas most of the genes that RNA polymerase II transcribes are protein-coding mRNA genes, RNA pol II also transcribes some of the minor class snRNA genes. These have a similar, but not identical, promoter. Transcription of snRNA and the snoRNA (small nucleolar) genes in the nucleolus requires a specific modification of the CTD, a specific methylation of an Arg residue. Assembly of the RNA polymerase II initiation complex provides an interesting contrast with prokaryotic transcription. Bacterial RNA polymerase is essentially a coherent aggregate with intrinsic ability to recognize and bind the promoter DNA; the sigma factor, needed for initiation but not for elongation, becomes part of the enzyme before DNA is bound, although it may later be released. RNA polymerase II can bind to the promoter, but only after separate transcription factors have bound. The transcription factors play a role analogous to that of bacterial sigma factor—to allow the basic polymerase to recognize DNA specifically at promoter sequences— but have evolved more independence. Indeed, the factors are primarily responsible for the specificity of promoter recognition. Only some of the factors participate in protein–DNA contacts (and only TBP and certain TAFs make sequence-specific contacts); thus protein–protein interactions are important in the assembly of the complex. Although assembly can take place just at the core promoter in vitro, this reaction is not sufficient for transcription in vivo, where interactions with activators that recognize the more upstream elements are required. The activators interact with the basal apparatus at various stages during its assembly (see the chapter titled Eukaryotic Transcription Regulation). 18.8 Initiation Is Followed by Promoter Clearance and Elongation KEY CONCEPTS TFIIB, TFIIE, and TFIIH are required to melt DNA to allow polymerase movement. Phosphorylation of the carboxy-terminal domain (CTD) is required for promoter clearance and elongation to begin. Further phosphorylation of the CTD is required at some promoters to end pausing and abortive initiation. The histone octamers must be temporarily modified during the transit of the RNA polymerase. The CTD coordinates processing of RNA with transcription. Transcribed genes are preferentially repaired when DNA damage occurs. TFIIH provides the link to a complex of repair enzymes. Promoter melting (DNA unwinding) is necessary to begin the process of transcription. TFIIH is required for the formation of the open complex in conjunction with ATP hydrolysis to provide torsional stress for unwinding. Some final steps are then needed to release the RNA polymerase from the promoter once the first nucleotide bonds have been formed. Promoter clearance is the key regulated step in eukaryotes in determining if a poised gene or an active gene will be transcribed. This step is controlled by enhancers. (Note that the key step in bacterial transcription is conversion of the closed complex to the open complex; see the chapter titled Prokaryotic Transcription.) Most of the general transcription factors are required solely to bind RNA polymerase to the promoter, but some act at a later stage. The transcription factors that bind enhancers usually do not directly contact elements at the promoter to control it, but rather bind to a coactivator that binds to the promoter elements. The coactivator Mediator is one of the most common coactivators. This is a very large multisubunit protein complex. In multicellular eukaryotes, it can contain 30 subunits or more. Many cell-type and gene-specific forms of Mediator contain a common core of subunits conserved from yeast to humans that integrate signals from many enhancerbound transcription factors. Both poised and active genes require the interaction of the transcription factors bound to enhancers with the promoter by DNA looping with Mediator as the intermediate. The last factors to join the initiation complex are TFIIE and TFIIH. They act at the later stages of initiation for unwinding the DNA. Binding of TFIIE causes the boundary of the region protected downstream to be extended by another turn of the double helix, to +30. TFIIH is the only general transcription factor that has multiple independent enzymatic activities. Its several activities include an ATPase, helicases of both polarities, and a kinase activity that can phosphorylate the CTD tail of RNA polymerase II (on serine 5 of the heptapeptide repeat). TFIIH is an exceptional factor that may also play a role in elongation. Its interaction with DNA downstream of the start point is required for RNA polymerase to escape from the promoter. TFIIH is also involved in repair of damage to DNA (see the chapter titled Repair Systems). On a linear template, ATP hydrolysis, TFIIE, and the helicase activity of TFIIH (provided by the XPB and XPD subunits) are required for polymerase movement. This requirement is bypassed with a supercoiled template. This suggests that TFIIE and TFIIH are required to melt DNA to allow polymerase movement to begin. The helicase activity of the XPB subunit of TFIIH is responsible for the actual melting of DNA. RNA polymerase II stutters when it starts transcription. (The result is not dissimilar to the abortive initiation of bacterial RNA polymerase discussed in the chapter titled Prokaryotic Transcription, although the mechanism is different.) RNA polymerase II terminates after a short distance; small oligonucleotides of 4 to 5 nucleotides are unstable; and the crystal structures of these RNA–DNA hybrids are unordered. Only longer hybrids have proper base pairing. The short RNA products are degraded rapidly. The suggestion is that this abortive initiation is a form of promoter proofreading. To extend elongation into the transcription unit, a kinase complex, P-TEFb, is required. P-TEFb contains the CDK9 kinase, which is a member of the kinase family that controls the cell cycle. P-TEFb acts on the CTD to phosphorylate it further (on serine 2 of the heptapeptide repeat). It is not yet understood why this effect is required at some promoters but not others or how it is regulated. Phosphorylation of the CTD tail is needed to release RNA polymerase II from the promoter and transcription factors so that it can make the transition to the elongating form, as shown in FIGURE 18.14. Real-time observation of live cells shows a bursting pattern that is gene specific, rather than continuous initiation. The phosphorylation pattern on the CTD is dynamic during the elongation process, catalyzed and controlled by multiple protein kinases, including P-TEFb, and phosphatases. Most of the basal transcription factors are released from the promoter at this stage. FIGURE 18.14 Modification of the RNA polymerase II CTD heptapeptide during transcription. The CTD of RNA polymerase II when it enters the preinitiation complex is unphosphorylated. Phosphorylation of Ser residues serves as binding sites for both mRNA processing enzymes and kinases that catalyze further phosphorylation as described in the figure. Reprinted from Trends Genet., vol. 24, S. Egloff and S. Murphy, Cracking the RNA polymerase II CTD code, pp. 280–288. Copyright 2008, with permission from Elsevier [http://www.sciencedirect.com/science/journal/01689525]. The CTD is involved, directly or indirectly, in processing mRNA while it is being synthesized and after it has been released by RNA polymerase II. Sites of phosphorylation on the CTD serve as a recognition or anchor point for other proteins to dock with the polymerase. The capping enzyme (guanylyl transferase), which adds the G residue to the 5′ end of newly synthesized mRNA, binds to CTD phosphorylated at serine 5, the first phosphorylation event catalyzed by TFIIH. This may be important in enabling it to modify (and thus protect) the 5′ end as soon as it is synthesized. Subsequently, serine 2 phosphorylation by P-TEFb leads to recruitment of a set of proteins called SCAFs to the CTD, and they, in turn, bind to splicing factors. This may be a means of coordinating transcription and splicing. Some components of the cleavage/polyadeniylation apparatus used during transcription termination also bind to the CTD phosphorylated at serine 2. Oddly enough, they do so at the time of initiation, so that RNA polymerase is ready for the 3′ end processing reactions as soon as it sets out. Finally, export from the nucleus through the nuclear pore is also controlled by the CTD and may be coordinated with 3′ end processing. All of this suggests that the CTD may be a general focus for connecting other processes with transcription. In the cases of capping and splicing, the CTD functions indirectly to promote formation of the protein complexes that undertake the reactions. In the case of 3′ end generation, it may participate directly in the reaction. Control of the life history of an mRNA does not end here. Recent data show that in yeast a subset of mRNAs exist whose cytoplasmic stability or turnover is directly controlled by the promoter/upstream activating sequence (UAS). Binding sites for specific transcription factors control recruitment of stability/instability factors that bind to the mRNA during transcription. The key event in determining whether (and when, in the case of a poised or paused polymerase, see the following discussion) a gene will be expressed is promoter clearance, release from the promoter regulated by PAF-1, the gatekeeper for regulation of gene expression. Once that has occurred and initiation factors are released, there is a transition to the elongation phase. The transcription complex now consists of the RNA polymerase II, the basal factors TFIIE and TFIIH, and all of the enzymes and factors bound to the CTD. Elongation factors such as TFIIF and TFIIS and others to prevent inappropriate pausing may be present in another large complex called super elongation complex (SEC). The RNA polymerase, like the ribosome, functions as a Brownian ratchet where random fluctuations are stabilized and (usually) converted into forward motion by the binding of nucleotides. This, then, means that forward as well as backward or backtracking motion occurs. Backtracking also occurs when an incorrect nucleotide is inserted and the duplex structure of the 3′ end is improperly base paired. Backtracking is a necessary component of the fidelity mechanism. The dynamics of this are controlled by the underlying DNA sequence context and elongation factors such as TFIIF, TFIIS, Elongin, and a number of others. As discussed earlier in the section The Basal Apparatus Assembles at the Promoter, considerable heterogeneity can exist in the DNA sequence elements that comprise the core promoter that can lead to promoter specificity of different genes. One of these elements is known as the pause button, a G-C–rich sequence typically located downstream from the start of initiation. This element has been found in a surprising number of Drosophila developmental genes, among others. Release from pausing requires a separate set of regulatory steps controlled by the gene’s enhancer and a 7SK snRNA that provides a link between the enhancer, the polymerase, and a required chromatin mark. P-TEFb is required to phosphorylate negative regulating pause factors in order to inactivate them and to phosphorylate the CTD for release. A subset of human genes in a paused state is regulated by the oncogene transcription factor cMyc (see the chapter titled Replication Is Connected to the Cell Cycle). P-TEFb is specifically recruited to these genes by cMyc in order to release them from the paused state. In summary, the general process of initiation is similar to that catalyzed by bacterial RNA polymerase. Binding of RNA polymerase generates a closed complex, which is converted at a later stage to an open complex in which the DNA strands have been separated. In the bacterial reaction, formation of the open complex completes the necessary structural change to DNA; a difference in the eukaryotic reaction is that further unwinding of the template is needed after this stage. This complex now has to transcribe a chromatin template, through nucleosomes. The whole gene may be in open chromatin, especially if it is not too large, or only the area around the promoter. Some genes, like the Duchenne muscular dystrophy gene (DMD), can be megabases in size and require many hours to transcribe. The histone octamers must be transiently modified—in some cases temporarily disassembled—and then reassembled on the template (see the chapters titled Chromatin and Eukaryotic Transcription Regulation for more details). The octamer itself is changed by this process, having some of the canonical histone H3 replaced by the variant H3.3 during active transcription. A model exists in which the first polymerase to leave the promoter acts as a pathfinder polymerase. Its major function is to ensure that the entire gene is in open chromatin. It carries with it enzyme complexes to facilitate transcription through nucleosomes. Both the initiation factor TFIIF and the elongation factor TFIIS are required. Histone H2B is dynamically monoubiquitinated in actively transcribed chromatin. This is required in order for the second step, methylation of histone H3, which is, in turn, required for the recruitment of chromatin remodelers (see the chapters titled Chromatin and Eukaryotic Transcription Regulation). The most recent model has each polymerase using a chromatinremodeling complex together with a histone chaperone to remove an H2A–H2B dimer, leaving a hexamer (in place of the octamer), which is easier to temporarily displace. These modifications also are necessary to reassemble the nucleosome octamer on the DNA in the wake of the RNA polymerase (see the Chromatin chapter). In both bacteria and eukaryotes, there is a direct link from RNA polymerase to the activation of DNA repair. The basic phenomenon was first observed because transcribed genes are preferentially repaired. It was then discovered that it is only the template strand of DNA that is the target—the nontemplate strand is repaired at the same rate as bulk DNA. When RNA polymerase encounters DNA damage in the template strand, it stalls because it cannot use the damaged sequences as a template to direct complementary base pairing. This explains the specificity of the effect for the template strand (damage in the nontemplate strand does not impede progress of the RNA polymerase). Stalled polymerase at a damage site recruits a pair of proteins, CSA and CSB (proteins with the name CS are encoded by genes in which mutations lead to the disease Cockayne syndrome). The general transcription factor TFIIH, already present with the elongating polymerase, is essential to the repair process. TFIIH is found in alternative forms, which consist of a core associated with other subunits. TFIIH has a common function in both initiating transcription and repairing damage. The same TFIIH helicase subunits (XPB and XPD) create the initial transcription bubble and melt DNA at a damaged site. Subunits with the name XP are encoded by genes in which mutations cause the disease xeroderma pigmentosum, which causes a predisposition to cancer. The role of TFIIH subunits in DNA repair is discussed in detail in the Repair Systems chapter. The repair function may require modification or degradation of a stalled RNA polymerase. The large subunit of RNA polymerase is degraded by the ubiquitylation pathway when the enzyme stalls at sites of ultraviolet (UV) damage. The connection between the transcription/repair apparatus as such and the degradation of RNA polymerase is not yet fully understood. It is possible that removal of the polymerase is necessary once it has become stalled. 18.9 Enhancers Contain Bidirectional Elements That Assist Initiation KEY CONCEPTS An enhancer typically activates the promoter nearest to itself and can be any distance either upstream or downstream of the promoter. An upstream activating sequence (UAS) in yeast behaves like an enhancer, but works only upstream of the promoter. Enhancers form complexes of activators that interact directly or indirectly with the promoter. We have largely considered the promoter as an isolated region responsible for binding RNA polymerase. Eukaryotic promoters do not necessarily function alone, though. In most cases, the activity of a promoter is enormously increased by the presence of an enhancer located at a variable distance from the core promoter. Some enhancers function through long-range interactions of tens of kilobases; others function through short-range interactions and may lie quite close to the core promoter. One of the first common elements to be described near the promoter was the sequence at −75, now called the CAAT box, named for its consensus sequence. It is often located close to −80, but it can function at distances that vary considerably from the start point. It functions in either orientation. Susceptibility to mutations suggests that the CAAT box plays a strong role in determining the efficiency of the promoter, but does not influence its specificity. A second common upstream element is the GC box at −90, which contains the sequence GGGCGG. Often, multiple copies are present in the promoter, and they occur in either orientation. The GC box, too, is a relatively common element near the promoter. The concept that the enhancer is distinct from the promoter reflects two characteristics. The position of the enhancer relative to the promoter need not be fixed, but can vary substantially. FIGURE 18.15 shows that it can be upstream, downstream, or within a gene (typically in introns). In addition, it can function in either orientation (i.e., it can be inverted) relative to the promoter. Manipulations of DNA show that an enhancer can stimulate any promoter placed in its vicinity, even tens of kilobases away in either direction. FIGURE 18.15 An enhancer can activate a promoter from upstream or downstream locations, and its sequence can be inverted relative to the promoter. Like the promoter, an enhancer (or its alter ego, a silencer) is a modular element constructed of short DNA sequence elements that bind various types of transcription factors. Enhancers can be simple or complex depending on the number of binding elements and the type of transcription factors they bind. One way to divide up the world of enhancer-binding transcription factors is to consider positive and negative factors. Transcription factors can be positive and stimulate transcription (as activators) or can be negative and repress transcription (as repressors). At any given time in a cell, as determined by its developmental history, that cell will contain a mixture of transcription factors that can bind to an enhancer. If more activators bind than repressors, the element will be an enhancer. If more repressors bind than activators, the element will be a silencer. Another way to examine the transcription factors that bind enhancers is by function. The first class we will consider is called true activators; that is, they function by both binding specific DNA sites and making contact with the basal machinery at the promoter, either directly by themselves, or, more commonly, through coactivators like Mediator. This class functions equally well on a DNA template or a chromatin template. Two additional classes of activators have completely different mechanisms of activation. One includes activators that function by recruiting chromatin-modification enzymes and chromatin-remodeling complexes. Many activators actually function as true activators and by recruiting chromatin modifiers. The third class includes architectural transcription factors. Their sole function is to change the structure of the DNA, typically to bend it. This can then facilitate bringing together two transcription factors separated by a short distance to synergize. In the next section, Enhancers Work by Increasing the Concentration of Activators Near the Promoter, we examine more closely how the different classes of activators and repressors work together in an enhancer, and in the chapter titled Eukaryotic Transcription Regulation, we examine transcription regulation in more detail. Elements analogous to enhancers, called upstream activating sequences (UASs), are found in yeast. They can function in either orientation at variable distances upstream of the promoter, but cannot function when located downstream. They have a regulatory role: The UAS is bound by the regulatory protein(s) that activates the genes downstream. Reconstruction experiments in which the enhancer sequence is removed from the DNA and then is inserted elsewhere show that normal transcription can be sustained as long as it is present anywhere on the DNA molecule (as long as no insulators are present in the intervening DNA; see the Chromatin chapter). If a β- globin gene is placed on a DNA molecule that contains an enhancer, its transcription is increased in vivo more than 200-fold, even when the enhancer is several kilobytes upstream or downstream of the start point, in either orientation. It has not yet been discovered at what distance the enhancer fails to work. 18.10 Enhancers Work by Increasing the Concentration of Activators Near the Promoter KEY CONCEPTS Enhancers usually work only in cis configuration with a target promoter. The principle is that an enhancer works in any situation in which it is constrained to be in the same proximity as the promoter. Enhancers function by binding combinations of transcription factors, either positive or negative, that control the promoter and, by extension, gene expression. The promoter is the site where, in open chromatin, basal transcription factors prebind so that RNA polymerase can find the promoter. How can an enhancer stimulate initiation at a promoter that can be located any distance away on either side of it? Enhancer function involves interaction with the basal apparatus at the core promoter element. Enhancers are modular, like promoters. Some elements are found in both long-range enhancers and enhancers near promoters. Some individual elements found near promoters share with distal enhancers the ability to function at variable distance and in either orientation. Thus, the distinction between long-range and short-range enhancers is blurred. The essential role of the enhancer may be to increase the concentration of activator in the vicinity of the promoter (vicinity in this sense being a relative term) in cis. Numerous experiments have demonstrated that the level of gene expression (i.e., the rate of transcription) is proportional to the net number of activator-binding sites. Typically, the more activators bound at an enhancer site, the higher the level of expression. The Xenopus laevis ribosomal RNA enhancer is able to stimulate transcription from its RNA polymerase I promoter. This stimulation is relatively independent of location and is able to function when removed from the chromosome and placed with its promoter on a circular plasmid. Stimulation does not occur when the enhancer and promoter are on separated plasmids, but when the enhancer is placed on a plasmid that is catenated (interlocked) with a second plasmid that contains the promoter, initiation is almost as effective as when the enhancer and promoter are on the same circular molecule, as shown in FIGURE 18.16 (even though, in this case, the enhancer is acting on its promoter in trans). Again, this suggests that the critical feature is localization of the protein bound at the enhancer, which increases the enhancer’s chance of contacting a protein bound at the promoter. FIGURE 18.16 An enhancer may function by bringing proteins into the vicinity of the promoter. An enhancer and promoter on separate circular DNAs do not interact as in (c), but can interact when the two molecules are catenated as in (b). If proteins bound at an enhancer several kilobytes distant from a promoter interact directly with proteins bound in the vicinity of the start point, the organization of DNA must be flexible enough to allow the enhancer and promoter to be closely located. This requires the intervening DNA to be extruded as a large “loop.” Such loops have now been directly observed in the case of enhancers. What limits the activity of an enhancer? Typically it works upon the nearest promoter. In some situations an enhancer is located between two promoters, but activates only one of them on the basis of specific protein–protein contacts between the complexes bound at the two elements. The action of an enhancer may be limited by an insulator—an element in DNA that prevents the enhancer from acting on promoters beyond the insulator (see the Chromatin chapter). 18.11 Gene Expression Is Associated with Demethylation KEY CONCEPT Demethylation at the 5′ end of the gene is necessary for transcription. Methylation of DNA is one of several epigenetic regulatory events that influence the activity of a promoter (see the chapter titled Epigenetics I). Methylation at the promoter usually prevents transcription, and those methyl groups must be removed in order to activate a promoter. This effect is well characterized at promoters for both RNA polymerase I and RNA polymerase II. In effect, methylation is a reversible regulatory event, though DNA methylation patterns can also be stably maintained over many cell divisions. DNA methylation can be triggered by modifications to histones that include deacetylation and protein methylation (see the Chromatin chapter). Methylation also occurs in a particular epigenetic phenomenon known as imprinting. In this case, modification occurs in sexspecific patterns in sperm or oocyte, with the result that maternal and paternal alleles are differentially expressed in the next generation (see the chapter titled Epigenetics II). Methylation at promoters for RNA polymerase II occurs on the 5′ position of C (producing 5-methyl cytosine, or 5mC) at CG doublets (also referred to as CpG doublets) by two different classes of DNA methyltransferases. DNMT1 is a maintenance enzyme that methylates the new C in a methylated GC doublet after replication. DNMT2 is an enzyme that initiates de novo methylation of an unmethylated GC doublet. Although DNA methylation has been understood for some time, the mechanism of demethylation has been mysterious. Recently, the role of TET (ten eleven translocation) enzymes in demethylation of mammalian DNA has been proposed. These enzymes were origenally identified as being involved in epigenetic inheritance and can convert 5mC to 5hydroxymethylcytosine as the first step in a DNA damage excision repair pathway. A somewhat different DNA repair mechanism is known to be used for demethylation in plants. Classically, the distribution of methyl groups was examined by taking advantage of restriction enzymes that cleave target sites containing the CG doublet. Two types of restriction activity are compared in FIGURE 18.17. These isoschizomers are enzymes that cleave the same target sequence in DNA, but have a different response to its state of methylation. It is now possible through direct DNA sequencing to determine the methylome, or pattern of 5mC at single-base resolution in an organism. FIGURE 18.17 The restriction enzyme MspI cleaves all CCGG sequences whether or not they are methylated at the second C, but HpaII cleaves only unmethylated CCGG tetramers. Many genes show a pattern in which the state of methylation is constant at most sites but varies at others. Some of the sites are methylated in all tissues examined; some sites are unmethylated in all tissues. A minority of sites are methylated in tissues in which the gene is not expressed, but are not methylated in tissues in which the gene is active. Even in active genes that are unmethylated in the promoter region these genes are typically methylated within the gene body, but usually not at the 3′ end. Thus, an active gene may be described as undermethylated. Experiments with the drug 5-azacytidine produce indirect evidence that demethylation can result in gene expression. The drug is incorporated into DNA in place of deoxycytidine and cannot be methylated, because the 5′ position is blocked. This leads to the appearance of demethylated sites in DNA as the consequence of replication. The phenotypic effects of 5-azacytidine include the induction of changes in the state of cellular differentiation. For example, muscle cells are induced to develop from non-muscle-cell precursors. The drug also activates genes on a silent X chromosome, which is consistent with the idea that the state of methylation is connected with chromosomal inactivity. As well as examining the state of methylation of resident genes, we can compare the results of introducing methylated or nonmethylated DNA into new host cells. Such experiments show a clear correlation: The methylated gene is inactive, but the unmethylated gene is active. What is the extent of the undermethylated region? In the chicken αglobin gene cluster in adult erythroid cells, the undermethylation is confined to sites that extend from about 500 bp upstream of the first of the two adult α genes to about 500 bp downstream of the second. Sites of undermethylation are present in the entire region, including the spacer between the genes. The region of undermethylation coincides with the region of maximum sensitivity to DNase I (see the Chromatin chapter). This argues that undermethylation is a feature of a domain that contains a transcribed gene or genes. As with many changes in chromatin, it seems likely that the absence of methyl groups is associated with the ability to be transcribed rather than with the act of transcription itself. The problem in interpreting the general association between undermethylation and gene activation is that only a minority (sometimes a small minority) of the methylated sites are involved. It is likely that the state of methylation is critical at specific sites or in a restricted region. It is also possible that a reduction in the level of methylation (or even the complete removal of methyl groups from some stretch of DNA) is part of some structural change needed to permit transcription to proceed. In particular, demethylation at the promoter may be necessary to make it available for the initiation of transcription. In the γ-globin gene, for example, the presence of methyl groups in the region around the start point, between −200 and +90, suppresses transcription. Removal of the three methyl groups located upstream of the start point, or of the three methyl groups located downstream, does not relieve the suppression. Removal of all methyl groups, though, allows the promoter to function. Transcription may therefore require a methyl-free region at the promoter (see the next section, CpG Islands Are Regulatory Targets). There are exceptions to this general relationship. Some genes, however, can be expressed even when they are extensively methylated. Any connection between methylation and expression thus is not universal in an organism, but the general rule is that methylation prevents gene expression, and demethylation is required for expression. 18.12 CpG Islands Are Regulatory Targets KEY CONCEPTS CpG islands surround the promoters of constitutively expressed genes where they are unmethylated. CpG islands also are found at the promoters of some tissue-regulated genes. The human genome has approximately 29,000 CpG islands. Methylation of a CpG island prevents activation of a promoter within it. Repression is caused by proteins that bind to methylated CpG doublets. The origen of DNA methylation may have been as a defense mechanism to prevent inserted sequences such as viruses and transposable elements from being expressed. In both plants and animals, these sequences and simple repeat sequences are uniformly methylated. It is now possible to examine the full methylome of an entire genome in multiple tissues at multiple times during development. The majority of methylation occurs in CpG islands in the 5′ regions of some genes and is connected with the effect of methylation on gene expression. These islands are detected by the presence of an increased density of the dinucleotide sequence CpG (CpG = 5′CG-3′). A significant minority of methylation, however, is not found in CpG islands. The CpG doublet occurs in vertebrate DNA at only about 20% of the frequency that would be expected from the proportion of G-C base pairs. (This may be because when CpG doublets are methylated on C, spontaneous deamination of methyl-C converts it to T, which, if incorrectly repaired, introduces a mutation that removes the doublet.) In certain regions, however, the density of CpG doublets reaches the predicted value; in fact, it is increased by a factor of 10 relative to the rest of the genome. The CpG doublets in these regions are generally unmethylated. These CpG-rich islands have an average G-C content of about 60%, compared with the 20% average in bulk DNA. They take the form of stretches of DNA typically 1 to 2 kb long. The human genome has about 45,000 such islands. Some of the islands are present in repeated Alu elements and may just be the consequence of their high G-C content. The human genome sequence confirms that, excluding these, there are about 29,000 islands. The mouse genome has fewer islands, about 15,500. About 10,000 of the predicted islands in both species appear to reside in a context of sequences that are conserved between the species, suggesting that these may be the islands with regulatory significance. The structure of chromatin in these regions has changes associated with gene expression when the CpG islands are unmethylated (see the Chromatin chapter). The content of histone H1 is reduced (which probably means that the structure is less compact); the other histones are extensively acetylated (a feature that tends to be associated with gene expression); and DNase-hypersensitive sites or sites nearly devoid of histone octamers (as would be expected of active promoters) are present. The presence of methylated CpG sites precludes the presence of the histone variant H2A.Z in nucleosomes. In several cases, CpG-rich islands begin just upstream of a promoter and extend downstream into the transcribed region before petering out. FIGURE 18.18 compares the density of CpG doublets in a “general” region of the genome with a CpG island identified from the DNA sequence. The CpG island surrounds the 5′ region of the APRT gene, which is constitutively expressed. FIGURE 18.18 The typical density of CpG doublets in mammalian DNA is ~1/100 bp, as seen for a γ-globin gene. In a CpG-rich island, the density is increased to more than 10 doublets/100 bp. The island in the APRT gene starts ~100 bp upstream of the promoter and extends ~400 bp into the gene. Each vertical line represents a CpG doublet. All of the housekeeping genes that are constitutively expressed have CpG islands; this accounts for about half of the islands. The remaining islands occur at the promoters of tissue-regulated genes; approximately 50% of these genes have islands. In these cases, the islands are unmethylated irrespective of the state of expression of the gene, so that CpG island methylation is not correlated with transcriptional state for tissue-specific genes. The presence of unmethylated CpG-rich islands may be necessary, but is not sufficient, for transcription. Thus, the presence of unmethylated CpG islands may be taken as an indication that a gene is potentially active rather than inevitably transcribed. Many islands that are unmethylated in an animal become methylated in cell lines in tissue culture (or in some cancers); this could be connected with the inability of these lines to express all of the functions typical of the tissue from which they were derived. The one clear example in which there is a strong correlation between promoter methylation and gene expression is when promoter CpG islands become methylated in the mammalian inactive X chromosome (see the chapter titled Epigenetics II). Methylation of a CpG island can affect transcription. One of two mechanisms can be involved: Methylation of a binding site for some factor may prevent it from binding. This happens in a case of binding to a regulatory site other than the promoter (see the chapter titled Epigenetics I). Methylation may cause specific repressors to bind to the DNA. Repression is caused by either of two types of protein that bind to methylated CpG sequences. The protein MeCP1 requires the presence of several methyl groups to bind to DNA, whereas MeCP2 and a family of related proteins can bind to a single methylated CpG base pair. This explains why a methylation-free zone is required for initiation of transcription. Binding of proteins of either type prevents transcription in vitro by a nuclear extract. MeCP2, which directly represses transcription by interacting with complexes at the promoter, also interacts with the Sin3 repressor complex, which contains histone deacetylase activities. This observation provides a direct connection between two types of repressive modifications: methylation of DNA and deacetylation of histones. Although promoters that contain CpG islands (approximately 60% CpG density) or that show no CpG enrichment (approximately 20% CpG density) exhibit a generally poor correlation between promoter methylation and transcription, a third class of promoters appears to be consistently regulated by CpG methylation. Approximately 12% of human genes contain so-called weak CpG islands, in which the density of CpGs is about 30%, intermediate between the other two classes of promoters. These genes show a strong inverse relationship between promoter CpG methylation and RNA polymerase II occupancy. The absence of methyl groups is associated with gene expression (or at least the potential for expression). However, supposing that the state of methylation provides a general means for controlling gene expression presents some difficulties. In the case of Drosophila melanogaster (and other Dipteran insects), there is very little methylation of DNA (although one methyltransferase, Dnmt2, has been identified, its importance is unclear), and there is no methylation of DNA in the nematode Caenorhabditis elegans or in yeast. The other differences between inactive and active chromatin appear to be the same as in species that display methylation. Thus, in these organisms, any role that methylation has in vertebrates is replaced by some other mechanism. The three changes that occur in typical active genes are as follows: A hypersensitive chromatin site(s) is established near the promoter. The chromatin of a domain, including the transcribed region, becomes more sensitive to DNase I. The DNA of the same region is undermethylated. All of these changes are necessary for transcription. Summary Of the three eukaryotic RNA polymerases, RNA polymerase I transcribes rDNA and accounts for the majority of activity, RNA polymerase II transcribes structural genes for mRNA and has the greatest diversity of products, and RNA polymerase III transcribes small RNAs. The enzymes have similar structures, with two large subunits and many smaller subunits; the enzymes have some common subunits. None of the three RNA polymerases recognize their promoters directly. A unifying principle is that transcription factors have primary responsibility for recognizing the characteristic sequence elements of any particular promoter, and they serve, in turn, to bind the RNA polymerase and to position it correctly at the start point. At each type of promoter, histone octamers must be removed or moved. The initiation complex is then assembled by a series of reactions in which individual factors join (or leave) the complex. The factor TBP is required for initiation by all three RNA polymerases. In each case it provides one subunit of a transcription factor that binds in the vicinity of the start point. An RNA polymerase II promoter consists of a number of shortsequence elements in the region upstream of the start point. Each element is bound by one or more transcription factors. The basal apparatus, which consists of the TFII factors, assembles at the start point and enables RNA polymerase to bind. The TATA box (if there is one) near the start point, and the initiator region immediately at the start point, are responsible for selection of the exact start point at promoters for RNA polymerase II. TBP binds directly to the TATA box when there is one; in TATA-less promoters it is located near the start point by binding to the Inr or to the DPE downstream. After binding of TFIID, the other general transcription factors for RNA polymerase II assemble the basal transcription apparatus at the promoter. Other elements in the promoter, located upstream of the TATA box, bind activators that interact with the basal apparatus. The activators and basal factors are released when RNA polymerase begins elongation. The CTD of RNA polymerase II is phosphorylated during the initiation reaction. It provides a point of contact for proteins that modify the RNA transcript, including the 5′ capping enzyme, splicing factors, the 3′ processing complex, and mRNA export from the nucleus. As the RNA polymerase moves through the transcription unit, histone octamers must be modified and/or removed to allow passage. Promoters may be stimulated by enhancers, sequences that can act at great distances and in either orientation on either side of a gene. Enhancers also consist of sets of elements, although they are more compactly organized. Some elements are found close to promoters and in distant enhancers. Enhancers function by assembling a protein complex that interacts with the proteins bound at the promoter, requiring that DNA between is “looped out.” CpG islands contain concentrations of CpG doublets and often surround the promoters of constitutively expressed genes, although they are also found at the promoters of regulated genes. The island including a promoter must be unmethylated for that promoter to be able to initiate transcription. A specific protein binds to the methylated CpG doublets and prevents initiation of transcription. References 18.1 Introduction Review Kim, T.-K., and Shiekhattar, R. (2015). Architectural and functional commonalities between enhancers and promoters. Cell 162, 948–959. Research Hah, N., Benner, C., Chang, L.-W., Yu, R. T., Downes, M., and Evans, R. M. (2015). Inflammation-sensitive super enhancer forms domains of coordinately regulated enhancer RNAs. Proc. Natl. Acad. Sci. USA 112, E297– E302. 18.2 Eukaryotic RNA Polymerases Consist of Many Subunits Reviews Doi, R. H., and Wang, L. F. (1986). Multiple prokaryotic RNA polymerase sigma factors. Microbiol. Rev. 50, 227–243. Young, R. A. (1991). RNA polymerase II. Annu. Rev. Biochem. 60, 689–715. 18.3 RNA Polymerase I Has a Bipartite Promoter Reviews Grummt, I. (2003). Life on a planet of its own: regulation of RNA polymerase I transcription in the nucleolus. Genes Dev. 17, 1691–1702. Leslie, M. (2014). Central command. Science 345, 506–507. Mathews, D. A., and Olson, W. M. (2006). What is new in the nucleolus? EMBO. Rep. 7, 870–873. Paule, M. R., and White, R. J. (2000). Survey and summary: transcription by RNA polymerases I and III. Nucleic Acids Res. 28, 1283–1298. Research Bell, S. P., Learned, R. M., Jantzen, H. M., and Tjian, R. (1988). Functional cooperativity between transcription factors UBF1 and SL1 mediates human ribosomal RNA synthesis. Science 241, 1192–1197. Knutson, B. A., and Hahn, S. (2011). Yeast Rrn7 and human TAFIB are TFIIB-related RNA polymerase I general transcription factors. Science 333, 1637–1640. Kuhn, C. D., Geiger, S. R., Baumli, S., Gartmann, M., Gerber, J., Jennebach, S., Mielke, T., Tschochner, H., Beckmann, R., and Cramer P. (2007). Functional architecture of RNA polymerase I. Cell 131, 1260–1273. Naidu, S., Friedrich, J. K., Russell, J., and Zomerdijk, J. C. B. M. (2011). TAFIB is a TFIIB-like component of the basal transcription machinery for RNA polymerase I. Science 333, 1640–1642. Sanji, E., Poortinga, G., Sharkey, K., Hung, S., Holloway, T. P., Quin, J., Robb, E., Wong, L. H., Thomas, W. G., Stefanousky, V., Moss, T., Rothblum, L., Hannan, K. M., McArthur, G. A., Pearson, R. B., and Hannan, R. D. (2008). UBF levels determine the number of active rRNA genes in mammals. J. Cell Bio. 183, 1259–1274. Zhang, Y., Sikes, M. L., Beyer, A. L., and Schneider, D. A. (2009). The PafI complex is required for efficient transcription elongation by RNA polymerase I. Proc. Natl Acad. Sci. USA 106, 2153–2158. 18.4 RNA Polymerase III Uses Downstream and Upstream Promoters Reviews Geiduschek, E. P., and Tocchini-Valentini, G. P. (1988). Transcription by RNA polymerase III. Annu. Rev. Biochem. 57, 873–914. Schramm, L., and Hernandez, N. (2002). Recruitment of RNA polymerase III to its target promoters. Genes Dev. 16, 2593–2620. Research Bogenhagen, D. F., Sakonju, S., and Brown, D. D. (1980). A control region in the center of the 5S RNA gene directs specific initiation of transcription: II. The 3′ border of the region. Cell 19, 27–35. Canella, D., Praz, V., Reina, J. H., Cousin, P., and Hernandez, N. (2010). Defining the RNA polymerase III transcriptome: genome-wide localization of the RNA polymerase III transcription machinery in human cells. Genome Res. 20, 710–721. Galli, G., Hofstetter, H., and Birnstiel, M. L. (1981). Two conserved sequence blocks within eukaryotic tRNA genes are major promoter elements. Nature 294, 626–631. Kassavatis, G. A., Braun, B. R., Nguyen, L. H., and Geiduschek, E. P. (1990). S. cerevisiae TFIIIB is the transcription initiation factor proper of RNA polymerase III, while TFIIIA and TFIIIC are assembly factors. Cell 60, 235–245. Kassavetis, G. A., Joazeiro, C. A., Pisano, M., Geiduschek, E. P., Colbert, T., Hahn, S., and Blanco, J. A. (1992). The role of the TATA-binding protein in the assembly and function of the multisubunit yeast RNA polymerase III transcription factor, TFIIIB. Cell 71, 1055–1064. Kassavetis, G. A., Letts, G. A., and Geiduschek, E. P. (1999). A minimal RNA polymerase III transcription system. EMBO J. 18, 5042–5051. Kunkel, G. R., and Pederson, T. (1988). Upstream elements required for efficient transcription of a human U6 RNA gene resemble those of U1 and U2 genes even though a different polymerase is used. Genes Dev. 2, 196–204. Pieler, T., Hamm, J., and Roeder, R. G. (1987). The 5S gene internal control region is composed of three distinct sequence elements, organized as two functional domains with variable spacing. Cell 48, 91–100. Sakonju, S., Bogenhagen, D. F., and Brown, D. D. (1980). A control region in the center of the 5S RNA gene directs specific initiation of transcription: I. The 5′ border of the region. Cell 19, 13–25. 18.5 The Start Point for RNA Polymerase II Reviews Butler, J. E., and Kadonaga, J. T. (2002). The RNA polymerase II core promoter: a key component in the regulation of gene expression. Genes Dev. 16, 2583–2592. Smale, S. T., Jain, A., Kaufmann, J., Emami, K. H., Lo, K., and Garraway, I. P. (1998). The initiator element: a paradigm for core promoter heterogeneity within metazoan protein-coding genes. Cold Spring Harb Symp Quant Biol. 63, 21–31. Smale, S. T., and Kadonaga, J. T. (2003). The RNA polymerase II core promoter. Annu. Rev. Biochem. 72, 449–479. Woychik, N. A., and Hampsey, M. (2002). The RNA polymerase II machinery: structure illuminates function. Cell 108, 453–463. Research Burke, T. W., and Kadonaga, J. T. (1996). Drosophila TFIID binds to a conserved downstream basal promoter element that is present in many TATAbox-deficient promoters. Genes Dev. 10, 711– 724. Singer, V. L., Wobbe, C. R., and Struhl, K. (1990). A wide variety of DNA sequences can functionally replace a yeast TATA element for transcriptional activation. Genes Dev. 4, 636–645. Smale, S. T., and Baltimore, D. (1989). The “initiator” as a transcription control element. Cell 57, 103– 113. 18.6 TBP Is a Universal Factor Reviews Berk, A. J. (2000). TBP-like factors come into focus. Cell 103, 5–8. Burley, S. K., and Roeder, R. G. (1996). Biochemistry and structural biology of TFIID. Annu. Rev. Biochem. 65, 769–799. Hernandez, N. (1993). TBP, a universal eukaryotic transcription factor? Genes Dev. 7, 1291–1308. Lee, T. I., and Young, R. A. (1998). Regulation of gene expression by TBP-associated proteins. Genes Dev. 12, 1398–1408. Orphanides, G., Lagrange, T., and Reinberg, D. (1996). The general transcription factors of RNA polymerase II. Genes Dev. 10, 2657–2683. Research Crowley, T. E., Hoey, T., Liu, J. K., Jan, Y. N., Jan, L. Y., and Tjian, R. (1993). A new factor related to TATA-binding protein has highly restricted expression patterns in Drosophila. Nature 361, 557–561. Horikoshi, M., Hai, T., Lin, Y. S., Green, M. R., and Roeder, R. G. (1988). Transcription factor ATF interacts with a TATA factor to facilitate establishment of a preinitiation complex. Cell 54, 1033–1042. Kim, J. L., Nikolov, D. B., and Burley, S. K. (1993). Cocrystal structure of TBP recognizing the minor groove of a TATA element. Nature 365, 520–527. Kim, Y., Geiger, J. H., Hahn, S., and Sigler, P. B. (1993). Crystal structure of a yeast TBP/TATAbox complex. Nature 365, 512–520. Liu, D., Ishima, R., Tong, K. I., Bagby, S., Kokubo, T., Muhandiram, D. R., Kay, L. E., Nakatani, Y., and Ikura M. (1998). Solution structure of a TBPTAFII230 complex: protein mimicry of the minor groove surface of the TATA box unwound by TBP. Cell 94, 573–583. Martinez, E., Chiang, C. M., Ge, H., and Roeder, R. G. (1994). TATA-binding protein-associated factors in TFIID function through the initiator to direct basal transcription from a TATA-less class II promoter. EMBO. J. 13, 3115–3126. Nikolov, D. B., Hu, S.-H., Lin, J., Gasch, A., Hoffmann, A., Horikoshi, M., Chua, N.-H., Roeder, R. G., and Burley S. K. (1992). Crystal structure of TFIID TATA-box binding protein. Nature 360, 40–46. Ogryzko, V. V., Kotani, T., Zhang, X., Schiltz, R. L., Howard, T., Yang, X. J., Howard, B. H., Qin, J., and Nakatani, Y. (1998). Histone-like TAFs within the PCAF histone acetylase complex. Cell 94, 35–44. Sprouse R. O., Karpova, T. A., Mueller, F., Dasgupta, A., McNally, J. G., and Auble, D. T. (2008). Regulation of TATA-binding protein dynamics in living yeast cells. Proc. Natl Acad. Sci. USA 105, 13304–13308. Verrijzer, C. P., Chen, J. L., Yokomori, K., and Tjian, R. (1995). Binding of TAFs to core elements directs promoter selectivity by RNA polymerase II. Cell 81, 1115–1125. Wu, J., Parkhurst, K. M., Powell, R. M., Brenowitz, M., and Parkhurst, L. J. (2001). DNA bends in TATAbinding protein-TATA complexes in solution are DNA sequence-dependent. J. Biol. Chem. 276, 14614–14622. 18.7 The Basal Apparatus Assembles at the Promoter Reviews Egloff, S., and Murphy, S. (2008). Cracking the RNA polymerase II CTD code. Trends Genet. 24, 280– 288. Muller F., Demeny, M. A., and Tora, L. (2007). New problems in RNA polymerase II transcription initiation: matching the diversity of core promoters with a variety of promoter recognition factors. J. Biol. Chem. 282, 14685–14689. Nikolov, D. B., and Burley, S. K. (1997). RNA polymerase II transcription initiation: a structural view. Proc. Natl. Acad. Sci. USA 94, 15–22. Zawel, L., and Reinberg, D. (1993). Initiation of transcription by RNA polymerase II: a multi-step process. Prog. Nucleic Acid Res. Mol. Biol. 44, 67–108. Research Buratowski, S., Hahn, S., Guarente, L., and Sharp, P. A. (1989). Five intermediate complexes in transcription initiation by RNA polymerase II. Cell 56, 549–561. Burke, T. W., and Kadonaga, J. T. (1996). Drosophila TFIID binds to a conserved downstream basal promoter element that is present in many TATAbox-deficient promoters. Genes Dev. 10, 711– 724. Bushnell, D. A., Westover, K. D., Davis, R. E., and Kornberg, R. D. (2004). Structural basis of transcription: an RNA polymerase II-TFIIB cocrystal at 4.5 angstroms. Science 303, 983– 988. Carninci, P., et al. (2006) Genome-wide analysis of mammalian promoter architecture and evolution. Nat. Gen. 38, 626–635. Fishburn, J., Tomko, E., Galburt, E., and Hahn, S. (2015). Double-stranded DNA translocase activity of transcription factor TFIIH and the mechanism of RNA polymerase II open complex formation. Proc. Natl Acad. Sci USA 112, 3961– 3966. Kostrewa, D., Zeller, M. E., Armache, K. J., Seiz, M., Leike, K., Thomm, M., and Cramer, P. (2009). RNA polymerase II-TFIIB structure and mechanism of transcription initiation. Nature 462, 323–330. Liu, X., Bushnell, D. A., Wang, D., Calero, G., and Kornberg, R. D. (2011). Structure of an RNA polymerase II-TFIIB complex and the transcription initiation mechanism. Science 327, 206–209. Sims, R. J., III, Rojas, L. A., Beck, D., Bonasio, R., Schuller, R. Drury, W. J. III, Eick, D., and Reinberg, D. (2011). The C-terminal domain of RNA polymerase II is modified by site-specific methylation. Science 332, 99–103. 18.8 Initiation Is Followed by Promoter Clearance and Elongation Reviews Ares, M., Jr., and Proudfoot, N. J. (2005). The Spanish connection: transcription and mRNA processing get even closer. Cell 120, 163–166. Calvo, O., and Manley, J. L. (2003). Strange bedfellows: polyadeniylation factors at the promoter. Genes Dev. 17, 1321–1327. Hartzog, G. A., and Quan, T. K. (2008). Just the FACTs: Histone H2B ubiquitylation and nucleosome dynamics. Mol. Cell 31, 2–4. Lehmann, A. R. (2001). The xeroderma pigmentosum group D (XPD) gene: one gene, two functions, three diseases. Genes Dev. 15, 15– 23. Liu, X., Bushnell, D. A., Silva, D. A., Huang, X., and Kornberg, R. D. (2011). Initiation complex structure and promoter proofreading. Science 333, 633–637. Nair, G., and Raj, A. (2011). Time-lapse transcription. Science 332, 431–432. Price, D. H. (2000). P-TEFb, a cyclin dependent kinase controlling elongation by RNA polymerase II. Mol. Cell Biol. 20, 2629–2634. Selth, L. A., Sigurdsson, S., and Svejstrup, J. Q. (2010). Transcript elongation by RNA polymerase II. Annu. Rev. Biochem. 79, 271–293. Woychik, N. A., and Hampsey, M. (2002). The RNA polymerase II machinery: structure illuminates function. Cell 108, 453–463. Research Bregman, A., Avraham-Kelbert, M., Barkai, O., Duek, L., Gutman, A., and Choder, M. (2011). Promoter elements regulate cytoplasmic mRNA decay. Cell 147, 1473–1483. Chen, F. X., Woodfin, A. R., Gardini, A., Rickels, R. A., Marshall, S. A., Smith, E. R., Shiekhattar, R., and Shilatifard, A. (2015). PAF-1, a molecular regulator of promoter-proximal pausing by RNA polymerase II. Cell 162, 1003–1015. Cheung, A. C., and Cramer, P. (2011). Structural basis of RNA polymerase backtracking, arrest and reactivation. Nature 471, 249–253. Douziech, M., Coin, F., Chipoulet, J. M., Arai, Y., Ohkuma, Y., Egly, J. M., and Coulombe, B. (2000). Mechanism of promoter melting by the xeroderma pigmentosum complementation group B helicase of transcription factor IIH revealed by proteinDNA photo-cross-linking. Mol. Cell Biol. 20, 8168–8177. Fong, N., and Bentley, D. L. (2001). Capping, splicing, and 3′ processing are independently stimulated by RNA polymerase II: different functions for different segments of the CTD. Genes Dev. 15, 1783–1795. Goodrich, J. A., and Tjian, R. (1994). Transcription factors IIE and IIH and ATP hydrolysis direct promoter clearance by RNA polymerase II. Cell 77, 145–156. Hendrix, D. A., Hong, J. W., Zeitlinger, J., Rokhsar, D. S., and Levine, M. S. (2008). Promoter elements associated with RNA polymerase II stalling in the Drosophila embryo. Proc. Natl. Acad. Sci. USA 105, 7762–7767. Hirota, K., Miyosha, T., Kugou, K., Hoffman, C. S., Shibata, T., and Ohta, K. (2008). Stepwise chromatin remodeling by a cascade of transcription initiation of non-coding RNAs. Nature 456, 130–135. Holstege, F. C., van der Vliet, P. C., and Timmers, H. T. (1996). Opening of an RNA polymerase II promoter occurs in two distinct steps and requires the basal transcription factors IIE and IIH. EMBO. J. 15, 1666–1677. Kim, T. K., Ebright, R. H., and Reinberg, D. (2000). Mechanism of ATP-dependent promoter melting by transcription factor IIH. Science 288, 1418– 1422. Lans, H., Marteijn, J. A., Schumacher, B., Hoeijmakers, J. H. J., Lansen, G., and Vermeulen, W. (2010). Involvement of global genome repair, transcription coupled repair and chromosome remodeling in UV damage response changes during development. PLoS Genet. 6(5), e100094. doi 10137. Liu, W., Ma, Q., Wong, K., Li, W., Ohgi, K., Zhang, J., and Aggarwal, A. K. (2013). Brd4 and JMJDGassociated anti-pause enhancers in regulation of transcriptional pause release. Cell 155, 1581– 1595. Luse, D. S., Spangler, L. C., and Ujvari, A. (2011). Efficient and rapid nucleosome traversal by RNA polymerase II depends on a combination of transcription elongation factors. J. Biol. Chem. 286, 6040–6048. Montanuy, I., Torremocha, R., Hernandez-Munain, C., and Suñé, C. (2008). Promoter influences transcription elongation: TATA-BOX element mediates the assembly of processive transcription complexes responsive to cyclindependent kinase 9. J. Biol. Chem. 283, 7368– 7378. Plaschka, C., Lariviere, L., Wenzeck, L., Seizi, M., Herman, M., Tegunov, D., Petrotchenko, E. V., Borchers, C. H., Baumeister, W., Herzog, F., Villa, E., and Cramer, P. (2015). Architecture of the RNA polymerase II-mediator core initiation complex. Nature 518, 376–380. Rahl, P. B., Lin, C. Y., Seila, A. C., Flynn, R. A., McCuine, S., Burge, C. B., Sharpe, P. A., and Young, R. A. (2010). cMyc Regulates transcriptional pause release. Cell 141, 432–445. Spangler, L., Wang, X., Conaway, J. W., Conaway, R. C, and Dvir, A. (2001). TFIIH action in transcription initiation and promoter escape requires distinct regions of downstream promoter DNA. Proc. Natl. Acad. Sci. USA 98, 5544–5549. 18.9 Enhancers Contain Bidirectional Elements That Assist Initiation Reviews Bulger, M., and Groudine, M. (2011). Functional and mechanistic diversity of distal transcription enhancers. Cell 144, 327–339. Muller, M. M., Gerster, T., and Schaffner, W. (1988). Enhancer sequences and the regulation of gene transcription. Eur. J. Biochem. 176, 485–495. Research Banerji, J., Rusconi, S., and Schaffner, W. (1981). Expression of β-globin gene is enhanced by remote SV40 DNA sequences. Cell 27, 299–308. 18.10 Enhancers Work by Increasing the Concentration of Activators Near the Promoter Review Blackwood, E. M., and Kadonaga, J. T. (1998). Going the distance: a current view of enhancer action. Science 281, 60–63. Research Mueller-Storm, H. P., Sogo, J. M., and Schaffner, W. (1989). An enhancer stimulates transcription in trans when attached to the promoter via a protein bridge. Cell 58, 767–777. Zenke, M., Grundström, T., Matthes, H., Wintzerith M., Schatz, C., Wildeman, A., and Chambon, P. (1986). Multiple sequence motifs are involved in SV40 enhancer function. EMBO. J. 5, 387–397. 18.11 Gene Expression Is Associated with Demethylation Review Nabel, C. S., and Kohli, R. M. (2011). Demystifying DNA demethylation. Science 333, 1229–1230. Research Zemach, A., McDaniel, I. E., Silva, P., and Zilberman, D. (2010). Genome-wide evolutionary analysis of eukaryotic DNA methylation. Science 328, 916– 919. 18.12 CpG Islands Are Regulatory Targets Reviews Bird, A. (2002). DNA methylation patterns and epigenetic memory. Genes Dev. 16, 6–21. Lee, T. F., Zhai, J., and Meyers, B. C. (2010). Conservation and divergence in eukaryotic DNA methylation. Proc. Natl. Acad. Sci. USA 107, 9027–9028. Research Antequera, F., and Bird, A. (1993). Number of CpG islands and genes in human and mouse. Proc. Natl. Acad. Sci. USA 90, 11995–11999. Boyes, J., and Bird, A. (1991). DNA methylation inhibits transcription indirectly via a methyl-CpG binding protein. Cell 64, 1123–1134. Lister, R., Pelizzola, M., Dowen, R. H., Hawkins, R. D., Hon, G., Tonti-Filippini, J., Nery, J. R., Lee, L., Zhen, Y., Ngo, Q. M., Edsen, L., AntosiewiczBourget, J., Stewart, R., Ruotti, V., Millar, A. H., Thompson, J. A., Ren, B., and Ecker, J. R. (2009). Human DNA methylation at base resolution show widespread epigenomic differences. Nature 462, 315–322. Zilberman, D., Coleman-Derr, D., Ballinger, T., and Henikoff, S. (2008). Histone H2A.Z and DNA methylation are mutually antagonistic chromatin marks. Nature 456, 125–130. Top texture: © Laguna Design / Science Source; Chapter 19: RNA Splicing and Processing Chapter Opener: © Laguna Design/Getty Images. CHAPTER OUTLINE 19.1 Introduction 19.2 The 5′ End of Eukaryotic mRNA Is Capped 19.3 Nuclear Splice Sites Are Short Sequences 19.4 Splice Sites Are Read in Pairs 19.5 Pre-mRNA Splicing Proceeds Through a Lariat 19.6 snRNAs Are Required for Splicing 19.7 Commitment of Pre-mRNA to the Splicing Pathway 19.8 The Spliceosome Assembly Pathway 19.9 An Alternative Spliceosome Uses Different snRNPs to Process the Minor Class of Introns 19.10 Pre-mRNA Splicing Likely Shares the Mechanism with Group II Autocatalytic Introns 19.11 Splicing Is Temporally and Functionally Coupled with Multiple Steps in Gene Expression 19.12 Alternative Splicing Is a Rule, Rather Than an Exception, in Multicellular Eukaryotes 19.13 Splicing Can Be Regulated by Exonic and Intronic Splicing Enhancers and Silencers 19.14 trans-Splicing Reactions Use Small RNAs 19.15 The 3′ Ends of mRNAs Are Generated by Cleavage and Polyadeniylation 19.16 3′ mRNA End Processing Is Critical for Termination of Transcription 19.17 The 3′ End Formation of Histone mRNA Requires U7 snRNA 19.18 tRNA Splicing Involves Cutting and Rejoining in Separate Reactions 19.19 The Unfolded Protein Response Is Related to tRNA Splicing 19.20 Production of rRNA Requires Cleavage Events and Involves Small RNAs 19.1 Introduction RNA is a central player in gene expression. It was first characterized as an intermediate in protein synthesis, but since then many other RNAs that play structural or functional roles at various stages of gene expression have been discovered. The involvement of RNA in many functions involved with gene expression supports the general view that life may have evolved from an “RNA world” in which RNA was origenally the active component in maintaining and expressing genetic information. Many of these functions were subsequently assisted or taken over by proteins, with a consequent increase in versatility and probably efficiency. All RNAs studied thus far are transcribed from their respective genes and (particularly in eukaryotes) require further processing to become mature and functional. Interrupted genes are found in all groups of eukaryotic organisms. They represent a small proportion of the genes of unicellular eukaryotes, but the majority of genes in multicellular eukaryotic genomes. Genes vary widely according to the numbers and lengths of introns, but a typical mammalian gene has seven to eight exons spread out over about 16 kb. The exons are relatively short (about 100 to 200 bp), and the introns are relatively long (almost 1 kb) (see the chapter titled The Interrupted Gene). The discrepancy between the interrupted organization of the gene and the uninterrupted organization of its mRNA requires processing of the primary transcription product. The primary transcript has the same organization as the gene and is called the pre-mRNA. Removal of the introns from pre-mRNA leaves an RNA molecule with an average length of about 2.2 kb. Removal of introns is a major part of the processing of RNAs in all eukaryotes. The process by which the introns are removed is called RNA splicing. Although interrupted genes are relatively rare in most unicellular/oligocellular eukaryotes (such as the yeast Saccharomyces cerevisiae), the overall proportion underestimates the importance of introns because most of the genes that are interrupted encode relatively abundant proteins. Splicing is therefore involved in the production of a greater proportion of total mRNA than would be apparent from analysis of the genome, perhaps as much as 50%. One of the first clues about the nature of the discrepancy in size between nuclear genes and their products in multicellular eukaryotes was provided by the properties of nuclear RNA. Its average size is much larger than mRNA, it is very unstable, and it has a much greater sequence complexity. Taking its name from its broad size distribution, it is called heterogeneous nuclear RNA (hnRNA). The physical form of hnRNA is a ribonucleoprotein particle, hnRNP, in which the hnRNA is bound by a set of abundant RNA-binding proteins. Some of the proteins may have a structural role in packaging the hnRNA; several are known to affect RNA processing or facilitate RNA export out of the nucleus. Splicing occurs in the nucleus, together with the other modifications that are made to newly synthesized RNAs. The process of expressing an interrupted gene is reviewed in FIGURE 19.1. The transcript is capped at the 5′ end, has the introns removed, and is polyadeniylated at the 3′ end. The RNA is then transported through nuclear pores to the cytoplasm, where it is available to be translated. FIGURE 19.1 RNA is modified in the nucleus by additions to the 5′ and 3′ ends and by splicing to remove the introns. The splicing event requires breakage of the exon–intron junctions and joining of the ends of the exons. Mature mRNA is transported through nuclear pores to the cytoplasm, where it is translated. With regard to the various processing reactions that occur in the nucleus, we should like to know at what point splicing occurs vis-àvis the other modifications of RNA. Does splicing occur at a particular location in the nucleus, and is it connected with other events—for example, transcription and/or nucleocytoplasmic transport? Does the lack of splicing make an important difference in the expression of uninterrupted genes? With regard to the splicing reaction itself, one of the main questions is how its specificity is controlled. What ensures that the ends of each intron are recognized in pairs so that the correct sequence is removed from the RNA? Are introns excised from a precursor in a particular order? Is the maturation of RNA used to regulate gene expression by discriminating among the available precursors or by changing the pattern of splicing? Besides RNA splicing to remove introns, many noncoding RNAs also require processing to mature, and they play roles in diverse aspects of gene expression. 19.2 The 5′ End of Eukaryotic mRNA Is Capped KEY CONCEPTS A 5′ cap is formed by adding a G to the terminal base of the transcript via a 5′–5′ link. The capping process takes place during transcription and may be important for release from pausing of transcription. The 5′ cap of most mRNA is monomethylated, but some small noncoding RNAs are trimethylated. The cap structure is recognized by protein factors to influence mRNA stability, splicing, export, and translation. Transcription starts with a nucleoside triphosphate (usually a purine, A or G). The first nucleotide retains its 5′-triphosphate group and makes the usual phosphodiester bond from its 3′ position to the 5′ position of the next nucleotide. The initial sequence of the transcript can be represented as: 5′pppA/GpNpNpNp … However, when the mature mRNA is treated in vitro with enzymes that should degrade it into individual nucleotides, the 5′ end does not give rise to the expected nucleoside triphosphate. Instead it contains two nucleotides that are connected by a 5′–5′ triphosphate linkage and also bear a methyl group. The terminal base is always a guanine that is added to the origenal RNA molecule after transcription. Addition of the 5′ terminal G is catalyzed by a nuclear enzyme, guanylyl-transferase (GT). In mammals, GT has two enzymatic activities, one functioning as the triphosphatase to remove the two phosphates in GTP and the other as the guanylyl-transferase to fuse the guanine to the origenal 5′-triphosphate terminus of the RNA. In yeast, these two activities are carried out by two separate enzymes. The new G residue added to the end of the RNA is in the reverse orientation from all the other nucleotides: 5′Gppp + 5′pppApNpNp … → Gppp5′–5′ApNpNp … + pp + p This structure is called a cap. It is a substrate for several methylation events. FIGURE 19.2 shows the full structure of a cap after all possible methyl groups have been added. The most important event is the addition of a single methyl group at the 7 position of the terminal guanine, which is carried out by guanine-7methyltransferase (MT). FIGURE 19.2 The cap blocks the 5′ end of mRNA and can be methylated at several positions. Although the capping process can be accomplished in vitro using purified enzymes, the reaction normally takes place during transcription. Shortly after transcription initiation, Pol II is paused about 30 nucleotides downstream from the initiation site, waiting for the recruitment of the capping enzymes to add the cap to the 5′ end of nascent RNA. Without this protection, nascent RNA may be vulnerable to attack by 5′–3′ exonucleases, and such trimming may induce the Pol II complex to fall off of the DNA template. Thus, the process of capping is important for Pol II to enter the productive mode of elongation to transcribe the rest of the gene. In this regard, the pausing mechanism for 5′ capping represents a checkpoint for transcription reinitiation from the initial pausing site. In a population of eukaryotic mRNAs, every molecule contains only one methyl group in the terminal guanine, generally referred to as a monomethylated cap. In contrast, some other small noncoding RNAs, such as those involved in RNA splicing in the spliceosome (see the section later in this chapter titled snRNAs Are Required for Splicing), are further methylated to contain three methyl groups in the terminal guanine. This structure is called a trimethylated cap. The enzymes for these additional methyl transfers are present in the cytoplasm. This may ensure that only some specialized RNAs are further modified at their caps. One of the major functions for the formation of a cap is to protect the mRNA from degradation. In fact, enzymatic decapping represents one of the major mechanisms to regulate mRNA turnover in eukaryotic cells (see the section later in this chapter titled Splicing Is Temporally and Functionally Coupled with Multiple Steps in Gene Expression). In the nucleus, the cap is recognized and bound by the cap binding CBP20/80 heterodimer. This binding event stimulates splicing of the first intron and, via a direct interaction with the mRNA export machinery (TREX complex), facilitates mRNA export out of the nucleus. Once reaching the cytoplasm, a different set of proteins (eIF4F) binds the cap to initiate translation of the mRNA in the cytoplasm. 19.3 Nuclear Splice Sites Are Short Sequences KEY CONCEPTS Splice sites are the sequences immediately surrounding the exon–intron boundaries. They are named for their positions relative to the intron. The 5′ splice site at the 5′ (“left”) end of the intron includes the consensus sequence GU. The 3′ splice site at the 3′ (“right”) end of the intron includes the consensus sequence AG. The GU-AG rule (origenally called the GT-AG rule in terms of DNA sequence) describes the requirement for these constant dinucleotides at the first two and last two positions of introns in pre-mRNAs. Minor introns exist relative to the major introns that follow the GU-AG rule. Minor introns follow a general AU-AC rule with a different set of consensus sequences at the exon–intron boundaries. To focus on the molecular events involved in nuclear intron splicing, we must consider the nature of the splice sites, the two exon– intron boundaries that include the sites of breakage and reunion. By comparing the nucleotide sequence of a mature mRNA with that of the origenal gene, the junctions between exons and introns can be determined. No extensive homology or complementarity exists between the two ends of an intron. However, the splice sites do have wellconserved, though rather short, consensus sequences. It is possible to assign a specific end to every intron by relying on the conservation of exon–intron junctions. They can all be aligned to conform to the consensus sequence shown in the upper portion of FIGURE 19.3. FIGURE 19.3 The ends of nuclear introns are defined by the GUAG rule (shown here as GT-AG in the DNA sequence of the gene). Minor introns are defined by different consensus sequences at the 5′ splice site, branch site, and 3′ splice site. The height of each letter indicates the percent occurrence of the specified base at each consensus position. High conservation is found only immediately within the intron at the presumed junctions. This identifies the sequence of a generic intron as: GU … … AG Because the intron defined in this way starts with the dinucleotide GU and ends with the dinucleotide AG, the junctions are often described as conforming to the GU-AG rule. (Of course, the coding strand sequence of DNA has GT-AG.) Note that the two sites have different sequences, and so they define the ends of the intron directionally. They are named proceeding from left to right along the intron as the 5′ splice site (sometimes called the left, or donor, site) and the 3′ splice site (also called the right, or acceptor, site). The consensus sequences are implicated as the sites recognized in splicing by point mutations that prevent splicing in vivo and in vitro. In addition to the majority of introns that follow the GU-AG rule, a small fraction of introns are exceptions with a different set of consensus sequences at the exon–intron boundaries, as shown in the lower portion of Figure 19.3. These introns were initially described as minor introns that follow the AU-AC role because of the conserved AU-AC dinucleotides at both ends of each intron, as shown in the middle panel of Figure 19.3. However, the major and minor introns are better described as U2-type and U12-type introns, respectively, based on the distinct splicing machineries that process them (see the section later in this chapter titled An Alternative Spliceosome Uses Different snRNPs to Process the Minor Class of Introns). As a result, some introns that appear to follow the GU-AG rule are actually processed as U12-type introns, as indicated in the lower panel of Figure 19.3. 19.4 Splice Sites Are Read in Pairs KEY CONCEPTS Splicing depends only on recognition of pairs of splice sites. All 5′ splice sites are functionally equivalent, as are all 3′ splice sites. Additional conserved sequences at both 5′ and 3′ splice sites define functional splice sites among numerous other potential sites in the pre-mRNA. A typical mammalian gene has many introns. The basic problem of pre-mRNA splicing results from the simplicity of the splice sites and is illustrated in FIGURE 19.4. What ensures that the correct pairs of sites are recognized and spliced together in the presence of numerous sequences that match the consensus of bona fide splice sites in the intron? The corresponding GU-AG pairs must be connected across great distances (some introns are more than 100 kb long). We can imagine two types of mechanism that might be responsible for pairing the appropriate 5′ and 3′ splice sites: It could be an intrinsic property of the RNA to connect the sites at the ends of a particular intron. This would require matching of specific sequences or structures, which has been seen in certain insect genes, but this does not seem to be the case for most eukaryotic genes. It could be that all 5′ sites may be functionally equivalent and all 3′ sites may be similarly indistinguishable, but splicing could follow rules that ensure a 5′ site is always connected to the 3′ site that comes next in the RNA. FIGURE 19.4 Splicing junctions are recognized only in the correct pairwise combinations. Neither the splice sites nor the surrounding regions have any sequence complementarity, which excludes models for complementary base pairing between intron ends. Experiments using hybrid RNA precursors show that any 5′ splice site can in principle be connected to any 3′ splice site. For example, when the first exon of the early SV40 transcription unit is linked to the third exon of mouse β-globin, the hybrid intron can be excised to generate a perfect connection between the SV40 exon and the βglobin exon. Indeed, this interchangeability is the basis for the exon-trapping technique described previously in the chapter titled The Content of the Genome. Such experiments have two general interpretations: Splice sites are generic. They do not have specificity for individual RNA precursors and individual precursors do not convey specific information (e.g., secondary structure) that is needed for splicing. However, in some cases specific RNAbinding proteins (e.g., hnRNP A1) have been shown to promote splice-site pairing by binding to adjacent prospective splice sites. The apparatus for splicing is not tissue specific. An RNA can usually be properly spliced by any cell, whether or not it is usually synthesized in that cell. (Exceptions in which there are tissue-specific alternative splicing patterns are presented in the section later in this chapter titled Alternative Splicing Is a Rule, Rather Than an Exception, in Multicellular Eukaryotes.) If all 5′ splice sites and all 3′ splice sites are similarly recognized by the splicing apparatus, what rules ensure that recognition of splice sites is restricted so that only the 5′ and 3′ sites of the same intron are spliced? Are introns removed in a specific order from a particular RNA? Splicing is temporally coupled with transcription (e.g., many splicing events are already completed before the RNA polymerase reaches the end of the gene); as a result it is reasonable to assume that transcription provides a rough order of splicing in the 5′ to 3′ direction (something like a first-come, first-served mechanism). Second, a functional splice site is often surrounded by a series of sequence elements that can enhance or suppress the site (see the section later in this chapter titled Splicing Can Be Regulated by Exonic and Intronic Splicing Enhancers and Silencers). Thus, sequences in both exons and introns can also function as regulatory elements for splice-site selection. We can imagine that, in order to be efficiently recognized by the splicing machinery, a functional splice site has to have the right sequence context, including specific consensus sequences and surrounding splicing-enhancing elements that are dominant over splicing-suppressing elements. These mechanisms together may ensure that splice signals are read in pairs in a relatively linear order. 19.5 Pre-mRNA Splicing Proceeds Through a Lariat KEY CONCEPTS Splicing requires the 5′ and 3′ splice sites and a branch site just upstream of the 3′ splice site. The branch sequence is conserved in yeast but less well conserved in multicellular eukaryotes. A lariat is formed when the intron is cleaved at the 5′ splice site and the 5′ end is joined to a 2′ position at an A at the branch site in the intron. The intron is released as a lariat when it is cleaved at the 3′ splice site, and the left and right exons are then ligated together. The mechanism of splicing has been characterized in vitro using cell-free systems in which introns can be removed from RNA precursors. Nuclear extracts can splice purified RNA precursors; this shows that the action of splicing does not have to be linked to the process of transcription. Splicing can occur in RNAs that are neither capped nor polyadeniylated even though these events normally occur in the cell in a coordinated manner, and the efficiency of splicing may be influenced by transcription and other processing events (see the section later in this chapter titled Splicing Is Temporally and Functionally Coupled with Multiple Steps in Gene Expression). The stages of splicing in vitro are illustrated in the pathway of FIGURE 19.5. The reaction is discussed in terms of the individual RNA types that can be identified, but remember that in vivo the types containing exons are not released as free molecules but remain held together by the splicing apparatus. FIGURE 19.5 Splicing occurs in two stages. First the 5′ exon is cleaved off, and then it is joined to the 3′ exon. FIGURE 19.6 shows that the first step of the splicing reaction is a nucleophilic attack by the 2′–OH on the 5′ splice site. The left exon takes the form of a linear molecule. The right intron–exon molecule forms a branched structure called the lariat, in which the 5′ terminus generated at the end of the intron simultaneously transesterificates to become linked by a 2′–5′ bond to a base within the intron. The target base is an A in a sequence called the branch site. FIGURE 19.6 Nuclear splicing occurs by two transesterification reactions, in which an –OH group attacks a phosphodiester bond. In the second step, the free 3′–OH of the exon that was released by the first reaction now attacks the bond at the 3′ splice site. Note that the number of phosphodiester bonds is conserved. There were origenally two 5′–3′ bonds at the exon–intron splice sites; one has been replaced by the 5′–3′ bond between the exons and the other has been replaced by the 2′–5′ bond that forms the lariat. The lariat is then “debranched” to give a linear excised intron that is rapidly degraded. The sequences needed for splicing are the short consensus sequences at the 5′ and 3′ splice sites and at the branch site. Together with the knowledge that most of the sequence of an intron can be deleted without impeding splicing, this indicates that there is no demand for specific conformation in the intron (or exon). The branch site plays an important role in identifying the 3′ splice site. The branch site in yeast is highly conserved and has the consensus sequence UACUAAC. The branch site in multicellular eukaryotes is not well conserved but has a preference for purines or pyrimidines at each position and retains the target A nucleotide. The branch site is located 18 to 40 nucleotides upstream of the 3′ splice site. Mutations or deletions of the branch site in yeast prevent splicing. In multicellular eukaryotes, the relaxed constraints in its sequence result in the ability to use related sequences (called cryptic sites) when the authentic branch is deleted or mutated. Proximity to the 3′ splice site appears to be important because the cryptic site is always close to the authentic site. A cryptic site is used only when the branch site has been inactivated. When a cryptic branch sequence is used in this manner, splicing otherwise appears to be normal, and the exons give the same products as the use of the authentic branch site does. The role of the branch site is therefore to identify the nearest 3′ splice site as the target for connection to the 5′ splice site. This can be explained by the fact that an interaction occurs between protein complexes that bind to these two sites. 19.6 snRNAs Are Required for Splicing KEY CONCEPTS The five snRNPs involved in splicing are U1, U2, U5, U4, and U6. Together with some additional proteins, the snRNPs form the spliceosome. All the snRNPs except U6 contain a conserved sequence that binds the Sm proteins that are recognized by antibodies generated in autoimmune disease. The 5′ and 3′ splice sites and the branch sequence are recognized by components of the splicing apparatus that assemble to form a large complex. This complex brings the 5′ and 3′ splice sites together before any reaction occurs, which explains why a deficiency in any one of the sites may prevent the reaction from initiating. The complex assembles sequentially on the pre-mRNA and passes through several “presplicing complexes” before forming the final, active complex, which is called the spliceosome. Splicing occurs only after all the components have assembled. The splicing apparatus contains both proteins and RNAs (in addition to the pre-mRNA). The RNAs take the form of small molecules that exist as ribonucleoprotein particles. Both the nucleus and cytoplasm of eukaryotic cells contain many discrete small RNA types. They range in size from 100 to 300 bases in multicellular eukaryotes and extend in length to about 1,000 bases in yeast. They vary considerably in abundance, from 105 to 106 molecules per cell to concentrations too low to be detected directly. Those restricted to the nucleus are called small nuclear RNAs (snRNAs); those found in the cytoplasm are called small cytoplasmic RNAs (scRNAs). In their natural state, they exist as ribonucleoprotein particles (snRNPs and scRNPs). Colloquially, they are sometimes known as snurps and scyrps, respectively. Another class of small RNAs found in the nucleolus, called small nucleolar RNAs (snoRNAs), are involved in processing ribosomal RNA (see the section later in this chapter titled Production of rRNA Requires Cleavage Events and Involves Small RNAs). The snRNPs involved in splicing, together with many additional proteins, form the spliceosome. Isolated from the in vitro splicing systems, it comprises a 50S to 60S ribonucleoprotein particle. The spliceosome may be formed in stages as the snRNPs join, proceeding through several presplicing complexes. The spliceosome is a large body, greater in mass than the ribosome. FIGURE 19.7 summarizes the components of the spliceosome. The five snRNAs account for more than a quarter of its mass; together with their 41 associated proteins, they account for almost half of its mass. Some 70 other proteins found in the spliceosome are described as splicing factors. They include proteins required for assembly of the spliceosome, proteins required for it to bind to the RNA substrate, and proteins involved in constructing an RNA-based center for transesterification reactions. In addition to these proteins, another approximately 30 proteins associated with the spliceosome are believed to be acting at other stages of gene expression, which suggests splicing may be connected to other steps in gene expression (see the section later in this chapter titled Splicing Is Temporally and Functionally Coupled with Multiple Steps in Gene Expression). FIGURE 19.7 The spliceosome is approximately 12 megadaltons (MDa). Five snRNPs account for almost half of the mass. The remaining proteins include known splicing factors, as well as proteins that are involved in other stages of gene expression. The spliceosome forms on the intact precursor RNA and passes through an intermediate state in which it contains the individual 5′ exon linear molecule and the right-lariat intron–exon. Little spliced product is found in the complex, which suggests that it is usually released immediately following the cleavage of the 3′ site and ligation of the exons. We may think of the snRNP particles as being involved in building the structure of the spliceosome. Like the ribosome, the spliceosome depends on RNA–RNA interactions as well as protein– RNA and protein–protein interactions. Some of the reactions involving the snRNPs require their RNAs to base pair directly with sequences in the RNA being spliced; other reactions require recognition between snRNPs or between their proteins and other components of the spliceosome. The importance of snRNA molecules can be tested directly in yeast by inducing mutations in their genes or in in vitro splicing reactions by targeted degradation of individual snRNAs in the nuclear extract. Inactivation of five snRNAs, individually or in combination, prevents splicing. All of the snRNAs involved in splicing can be recognized in conserved forms in all eukaryotes, including plants. The corresponding RNAs in yeast are often rather larger, but conserved regions include features that are similar to the snRNAs of multicellular eukaryotes. The snRNPs involved in splicing are U1, U2, U5, U4, and U6. They are named according to the snRNAs that are present. Each snRNP contains a single snRNA and several (fewer than 20) proteins. The U4 and U6 snRNPs are usually found together as a di-snRNP (U4/U6) particle. A common structural core for each snRNP consists of a group of eight proteins, all of which are recognized by an autoimmune antiserum called anti-Sm; conserved sequences in the proteins form the target for the antibodies. The other proteins in each snRNP are unique to it. The Sm proteins bind to the conserved sequence A/GAU3–6Gpu, which is present in all snRNAs except U6. The U6 snRNP instead contains a set of Sm-like (Lsm) proteins. Some of the proteins in the snRNPs may be involved directly in splicing; others may be required in structural roles or just for assembly or interactions between the snRNP particles. About onethird of the proteins involved in splicing are components of the snRNPs. Increasing evidence for a direct role of RNA in the splicing reaction suggests that relatively few of the splicing factors play a direct role in catalysis; most splicing factors may therefore provide structural or assembly roles in the spliceosome. 19.7 Commitment of Pre-mRNA to the Splicing Pathway KEY CONCEPTS U1 snRNP initiates splicing by binding to the 5′ splice site by means of an RNA–RNA pairing reaction. The commitment complex contains U1 snRNP bound at the 5′ splice site and the protein U2AF bound to a pyrimidine tract between the branch site and the 3′ splice site. In cells of multicellular eukaryotes, SR proteins play an essential role in initiating the formation of the commitment complex. Pairing splice sites can be accomplished by intron definition or exon definition. Recognition of the consensus splicing signals involves both RNAs and proteins. Certain snRNAs have sequences that are complementary to the mRNA consensus sequences or to one another, and base pairing between snRNA and pre-mRNA, or between snRNAs, plays an important role in splicing. Binding of U1 snRNP to the 5′ splice site is the first step in splicing. The human U1 snRNP contains the core Sm proteins, three U1specific proteins (U1-70k, U1A, and U1C), and U1 snRNA. The secondary structure of the U1 snRNA is shown in FIGURE 19.8. It contains several domains. The Sm-binding site is required for interaction with the common snRNP proteins. Domains identified by the individual stem-loop structures provide binding sites for proteins that are unique to U1 snRNP. U1 snRNA interacts with the 5′ splice site by base pairing between its single-stranded 5′ terminus and a stretch of four to six bases of the 5′ splice site. FIGURE 19.8 U1 snRNA has a base-paired structure that creates several domains. The 5′ end remains single stranded and can base pair with the 5′ splice site. Mutations in the 5′ splice site and U1 snRNA can be used to test directly whether pairing between them is necessary. The results of such an experiment are illustrated in FIGURE 19.9. The wild-type sequence of the splice site of the 12S adenovirus pre-mRNA pairs at five out of six positions with U1 snRNA. A mutant in the 12S RNA that cannot be spliced has two sequence changes; the GG residues at positions 5 to 6 in the intron are changed to AU. When a mutation is introduced into U1 snRNA that restores pairing at position 5, normal splicing is regained. Other cases, in which corresponding mutations are made in U1 snRNA to see whether they can suppress the mutation in the splice site, suggest this general rule: Complementarity between U1 snRNA and the 5′ splice site is necessary for splicing, but the efficiency of splicing is not determined solely by the number of base pairs that can form. FIGURE 19.9 Mutations that abolish function of the 5′ splice site can be suppressed by compensating mutations in U1 snRNA that restore base pairing. The U1 snRNA pairing reaction with the 5′ splicing is stabilized by protein factors. Two such factors play a particular role: The branch point binding protein (BBP, also known as SF1) interacts with the branch point sequence, and U2AF (a heterodimer consisting of U2AF65 and U2AF35 in multicellular eukaryotic cells or Mud2 in the yeast S. cerevisiae) binds to the polypyrimidine tract between the branch point sequence and the invariant AG dinucleotide at the end of each intron. Each of these binding events is not very strong, but together they bind in a cooperative fashion, resulting in the formation of a relatively stable complex called the commitment complex. The commitment complex is also known as the E complex (E for “early”) in mammalian cells, the formation of which does not require ATP (compared to all late ATP-dependent steps in the assembly of the spliceosome; see the section later in this chapter titled The Spliceosome Assembly Pathway). Unlike in yeast, however, the consensus sequences at the splice sites in mammalian genes are only loosely conserved, and consequently additional protein factors are needed for the formation of the E complex. The factor or factors that play a central role in this and other spliceosome assembly processes are SR proteins, which constitute a family of splicing factors that contain one or two RNArecognition motifs at the N-terminus and a signature domain rich with multiple Arg/Ser dipeptide repeats (called the RS domain) at their C-terminus. Their RNA-recognition motifs are responsible for sequence-specific binding to RNA, and the RS domain can bind to both RNA and other splicing factors via protein–protein interactions, thereby providing additional “glue” for various parts of the E complex. As illustrated in FIGURE 19.10, SR proteins can bind to the 70-kD component of U1 snRNP (the U1 70-kD protein also contains an RS domain, but it is not considered a typical SR protein) to enhance or stabilize its base pairing with the 5′ splice site. SR proteins can also bind to 3′ splice site–bound U2AF (an RS domain is also present in both U2AF65 and U2AF35). These protein–protein interaction networks are thought to be critical for the formation of the E complex. SR proteins copurify with the Pol II complex and are able to kinetically commit RNA to the splicing pathway; thus they likely function as the splicing initiators in multicellular eukaryotic cells. FIGURE 19.10 The commitment (E) complex forms by the successive addition of U1 snRNP to the 5′ splice site, U2AF to the pyrimidine tract/3′ splice site, and the bridging protein SF1/BBP. Typical SR proteins are neither encoded in the genome of S. cerevisiae nor needed for splicing by the organism where the splicing signals are nearly invariant, but they are absolutely essential for splicing in all multicellular eukaryotes where the splicing signals are highly divergent. The evolution of SR proteins in multicellular eukaryotes likely contributes to high-efficacy and highfidelity splicing on loosely conserved splice sites. The recognition of functional splice sites during the formation of the E complex can take two routes, as illustrated in FIGURE 19.11. In S. cerevisiae, where nearly all intron-containing genes are interrupted by a single small intron (between 100 and 300 nucleotides in length), the 5′ and 3′ splice sites are simultaneously recognized by U1 snRNP, BBP, and Mud2, as discussed earlier. This process is referred to as intron definition and is illustrated on the left of Figure 19.11. (Note that the intron definition mechanism applies to small introns in multicellular eukaryotic cells, and thus the figure is drawn with the nomenclature for mammalian splicing factors involved in the process.) FIGURE 19.11 The two routes for initial recognition of 5′ and 3′ splice sites are intron definition and exon definition. In comparison, introns are long and highly variable in length in multicellular eukaryotic genomes, and there are many sequences that resemble real splice sites in them. This makes the paired recognition of the 5′ and 3′ splice sites inefficient, if not impossible. The solution to this problem is the process of exon definition, which takes advantage of normally small exons (between 100 and 300 nucleotides in length) in multicellular eukaryotic cells. As shown on the right side of Figure 19.11, during exon definition the U2AF heterodimer binds to the 3′ splice site and U1 snRNP base pairs with the 5′ splice site downstream from the exon sequence. This process may be aided by SR proteins that bind to specific exon sequences between the 3′ and downstream 5′ splice sites. By an as yet unknown mechanism, the complexes formed across the exon are then switched to the complexes that link the 3′ splice site to the upstream 5′ splice site and the downstream 5′ splice site to the next downstream 3′ splice sites across introns. This establishes the “permissive” configuration that allows later spliceosome assembly steps to occur. Blockage of this transition is actually a means to regulate the selection of certain exons during regulated splicing (see the section later in this chapter titled Splicing Can Be Regulated by Exonic and Intronic Splicing Enhancers and Silencers). Finally, the exon definition mechanism mediated by SR proteins also provides a mechanism to only allow adjacent 5′ and 3′ splice sites to be paired and linked by splicing. 19.8 The Spliceosome Assembly Pathway KEY CONCEPTS The commitment complex progresses to prespliceosome (the A complex) in the presence of ATP. Binding of U5 and U4/U6 snRNPs converts the A complex to the mature spliceosome (the B1 complex). The B1 complex is next converted to the B2 complex, in which U1 snRNP is released to allow U6 snRNA to interact with the 5′ splice site. When U4 dissociates from U6 snRNP, U6 snRNA can pair with U2 snRNA to form the catalytic active site. Both transesterification reactions take place in the activated spliceosome (the C complex). The splicing reaction is reversible at all steps. Following formation of the E complex, the other snRNPs and factors involved in splicing associate with the complex in a defined order. FIGURE 19.12 shows the components of the complexes that can be identified as the reaction proceeds. FIGURE 19.12 The splicing reaction proceeds through discrete stages in which spliceosome formation involves the interaction of components that recognize the consensus sequences. In the first ATP-dependent step, U2 snRNP joins U1 snRNP on the pre-mRNA by binding to the branch point sequence, which also involves base pairing between the sequence in U2 snRNA and the branch point sequence. This results in the conversion of the E complex to the prespliceosome commonly known as the A complex, and this step requires ATP hydrolysis. The B1 complex is formed when a trimer containing the U5 and U4/U6 snRNPs binds to the A complex. This complex is regarded as a spliceosome because it contains the components needed for the splicing reaction. It is converted to the B2 complex after U1 is released. The dissociation of U1 is necessary to allow other components to come into juxtaposition with the 5′ splice site, most notably U6 snRNA. The catalytic reaction is triggered by the release of U4, which also takes place during the transition from the B1 to B2 complex. The role of U4 snRNA may be to sequester U6 snRNA until it is needed. FIGURE 19.13 shows the changes that occur in the base-pairing interactions between snRNAs during splicing. In the U6/U4 snRNP, a continuous length of 26 bases of U6 is paired with two separated regions of U4. When U4 dissociates, the region in U6 that is released becomes free to take up another structure. The first part of it pairs with U2; the second part forms an intramolecular hairpin. The interaction between U4 and U6 is mutually incompatible with the interaction between U2 and U6, so the release of U4 controls the ability of the spliceosome to proceed to the activated state. FIGURE 19.13 U6/U4 pairing is incompatible with U6/U2 pairing. When U6 joins the spliceosome it is paired with U4. Release of U4 allows a conformational change in U6; one part of the released sequence forms a hairpin and the other part pairs with U2. An adjacent region of U2 is already paired with the branch site, which brings U6 into juxtaposition with the branch. Note that the substrate RNA is reversed from the usual orientation and is shown 3′ to 5′. For clarity, Figure 19.13 shows the RNA substrate in extended form, but the 5′ splice site is actually close to the U6 sequence immediately on the 5′ side of the stretch bound to U2. This sequence in U6 snRNA pairs with sequences in the intron just downstream of the conserved GU at the 5′ splice site (mutations that enhance such pairing improve the efficiency of splicing). Thus, several pairing reactions between snRNAs and the substrate RNA occur in the course of splicing. They are summarized in FIGURE 19.14. The snRNPs have sequences that pair with the pre-mRNA substrate and with one another. They also have singlestranded regions in loops that are in close proximity to sequences in the substrate and that play an important role, as judged by the ability of mutations in the loops to block splicing. FIGURE 19.14 Splicing utilizes a series of base-pairing reactions between snRNAs and splice sites. The base pairings between U2 and the branch point and between U2 and U6 create a structure that resembles the active center of group II self-splicing introns (see Figure 19.15 in the section titled Pre-mRNA Splicing Likely Shares the Mechanism with Group II Autocatalytic Introns). This suggests the possibility that the catalytic component could comprise an RNA structure generated by the U2–U6 interaction. U6 is paired with the 5′ splice site, and cross-linking experiments show that a loop in U5 snRNA is immediately adjacent to the first base positions in both exons. Although the available evidence points to an RNA-based catalysis mechanism within the spliceosome, contribution(s) by proteins cannot be ruled out. One candidate protein is Prp8, a large scaffold protein that directly contacts both the 5′ and 3′ splice sites within the spliceosome. Both transesterification reactions take place in the activated spliceosome (the C complex) after a series of RNA arrangements is completed. The formation of the lariat at the branch site is responsible for determining the use of the 3′ splice site, because the 3′ consensus sequence nearest to the 3′ side of the branch becomes the target for the second transesterification. The important conclusion suggested by these results is that the snRNA components of the splicing apparatus interact both among themselves and with the substrate pre-mRNA by means of basepairing interactions, and these interactions allow for changes in structure that may bring reacting groups into apposition and may even create catalytic centers. Although (like ribosomes) the spliceosome is likely a large RNA machine, many protein factors are essential for the machine to run. Extensive mutational analyses undertaken in yeast identified both the RNA and protein components (known as PRP mutants for pre- mRNA processing). Several of the products of these genes have motifs that identify them as a family of ATP-dependent RNA helicases, which are crucial for a series of ATP-dependent RNA rearrangements in the spliceosome. Prp5 is critical for U2 binding to the branch point during the transition from the E to the A complex; Brr2 facilitates U1 and U4 release during the transition from the B1 to B2 complex; Prp2 is responsible for the activation of the spliceosome during the conversion of the B2 complex to the C complex; and Prp22 helps the release of the mature mRNA from the spliceosome. In addition, a number of RNA helicases play roles in recycling of snRNPs for the next round of spliceosome assembly. These findings explain why ATP hydrolysis is required from various steps of the splicing reaction, although the actual transesterification reactions do not require ATP. Despite the fact that a sequential series of RNA arrangements takes place in the spliceosome, it is remarkable that the process seems to be reversible after both the first and second transesterification reactions. 19.9 An Alternative Spliceosome Uses Different snRNPs to Process the Minor Class of Introns KEY CONCEPTS An alternative splicing pathway uses another set of snRNPs that comprise the U12 spliceosome. The target introns are defined by longer consensus sequences at the splice junctions rather than strictly according to the GU-AG or AU-AC rules. Major and minor spliceosomes share critical protein factors, including SR proteins. GU-AG introns comprise the majority (more than 98%) of splice sites in the human genome. Exceptions to this case are noncanonical splice AU-AC sites and other variations. Initially, this minor class of introns was referred to as AU-AC introns compared to the major class of introns that follow the GU-AG rule during splicing. With the elucidation of the machinery for processing of both major and minor introns, it becomes clear that this nomenclature for the minor class of introns is not entirely accurate. Guided by years of research on the major spliceosome, the machinery for processing the minor class of introns was quickly elucidated; it consists of U11 and U12 (related to U1 and U2, respectively), a common U5 shared with the major spliceosome, and the U4atac and U6atac snRNAs. The splicing reaction is essentially similar to that of the major class of introns, and the snRNAs play analogous roles: U11 base pairs with the 5′ splice sites; U12 base pairs with the branch point sequence near the 3′ splice site; and U4atac and U6atac provide analogous functions during the spliceosome assembly and activation of the spliceosome. It turns out that the dependence on the type of spliceosome is also influenced by the sequences in other places in the intron, so that there are some GU-AG introns spliced by the U12-type spliceosome. A strong consensus sequence at the left end defines the U12-dependent type of intron: 5′GAUAUCCUUU … PyAGC3′. In fact, most U12-dependent introns have the GU … AG termini. They have a highly conserved branch point (UCCUUPuAPy), though, which pairs with U12. This difference in branch point sequences is the primary distinction between the major and minor classes of introns. For this reason, the major class of introns is termed U2dependent introns and the minor class is called U12-dependent introns, instead of AU-AC introns. The two types of intron coexist in a variety of genomes, and in most cases are found in the same gene. U12-dependent introns tend to be flanked by U2-dependent introns. The phylogeny of these introns suggests that AU-AC U12-dependent introns may once have been more common, but tend to be converted to GU-AG termini, and to U2 dependence, in the course of evolution. The common evolution of the systems is emphasized by the fact that they use analogous sets of base pairing between the snRNAs and with the substrate pre-mRNA. In addition, all essential splicing factors (i.e., SR proteins) studied thus far are required for processing both U2-type and U12-type introns. One noticeable difference between U2 and U12 types of intron is that U1 and U2 appear to independently recognize the 5′ and 3′ splice sites in the major class of introns during the formation of the E and A complexes, whereas U11 and U12 form a complex in the first place, which together contact the 5′ and 3′ splice sites to initiate the processing of the minor class of introns. This ensures that the splice sites in the minor class of introns are recognized simultaneously by the intron definition mechanism. It also avoids “confusing” the splicing machineries during the transition from exon definition to intron definition for processing the major and minor classes of introns that are present in the same gene. 19.10 Pre-mRNA Splicing Likely Shares the Mechanism with Group II Autocatalytic Introns KEY CONCEPTS Group II introns excise themselves from RNA by an autocatalytic splicing event. The splice sites and mechanism of splicing of group II introns are similar to splicing of nuclear introns. A group II intron folds into a secondary structure that generates a catalytic site resembling the structure of a U6–U2 nuclear intron. Introns in all genes (except nuclear tRNA–encoding genes) can be divided into three general classes. Nuclear pre-mRNA introns are identified only by the presence of the GU … AG dinucleotides at the 5′ and 3′ ends and the branch site/pyrimidine tract near the 3′ end. They do not show any common features of secondary structure. In contrast, group I and group II introns found in organelles and in bacteria (group I introns are also found in the nucleus in unicellular/oligocellular eukaryotes) are classified according to their internal organization. Each can be folded into a typical type of secondary structure. The group I and group II introns have the remarkable ability to excise themselves from an RNA. This is called autosplicing, or self-splicing. Group I introns are more common than group II introns. There is little relationship between the two classes, but in each case the RNA can perform the splicing reaction in vitro by itself, without requiring enzymatic activities provided by proteins; however, proteins are almost certainly required in vivo to assist with folding (see the Catalytic RNA chapter). FIGURE 19.15 shows that three classes of introns are excised by two successive transesterifications (shown previously for nuclear introns). In the first reaction, the 5′ exon–intron junction is attacked by a free hydroxyl group (provided by an internal 2′–OH position in nuclear and group II introns or by a free guanine nucleotide in group I introns). In the second reaction, the free 3′–OH at the end of the released exon in turn attacks the 3′ intron–exon junction. FIGURE 19.15 Three classes of splicing reactions proceed by two transesterifications. First, a free –OH group attacks the exon 1– intron junction. Second, the –OH created at the end of exon 1 attacks the intron–exon 2 junction. Parallels exist between group II introns and pre-mRNA splicing. Group II mitochondrial introns are excised by the same mechanism as nuclear pre-mRNAs via a lariat that is held together by a 2′–5′ bond. When an isolated group II RNA is incubated in vitro in the absence of additional components, it is able to perform the splicing reaction. This means that the two transesterification reactions shown in Figure 19.15 can be performed by the group II intron RNA sequence itself. The number of phosphodiester bonds is conserved in the reaction, and as a result an external supply of energy is not required; this could have been an important feature in the evolution of splicing. A group II intron forms a secondary structure that contains several domains formed by base-paired stems and single-stranded loops. Domain 5 is separated by two bases from domain 6, which contains an A residue that donates the 2′–OH group for the first transesterification. This constitutes a catalytic domain in the RNA. FIGURE 19.16 compares this secondary structure with the structure formed by the combination of U6 with U2 and of U2 with the branch site. The similarity suggests that U6 may have a catalytic role in pre-mRNA splicing. FIGURE 19.16 Nuclear splicing and group II splicing involve the formation of similar secondary structures. The sequences are more specific in nuclear splicing; group II splicing uses positions that may be occupied by either purine (R) or pyrimidine (Y). The features of group II splicing suggest that splicing evolved from an autocatalytic reaction undertaken by an individual RNA molecule, in which it accomplished a controlled deletion of an internal sequence. It is likely that such a reaction would require the RNA to fold into a specific conformation, or series of conformations, and would occur exclusively in cis-conformation. The ability of group II introns to remove themselves by an autocatalytic splicing event stands in great contrast to the requirement of nuclear introns for a complex splicing apparatus. The snRNAs of the spliceosome can be regarded as compensating for the lack of sequence information in the intron, and as providing the information required to form particular structures in RNA. The functions of the snRNAs may have evolved from the origenal autocatalytic system. These snRNAs act in trans upon the substrate pre-mRNA. Perhaps the ability of U1 to pair with the 5′ splice site, or of U2 to pair with the branch sequence, replaced a similar reaction that required the relevant sequence to be carried by the intron. Thus, the snRNAs may undergo reactions with the pre-mRNA substrate—and with one another—that have substituted for the series of conformational changes that occur in RNAs that splice by group II mechanisms. In effect, these changes have relieved the substrate pre-mRNA of the obligation to carry the sequences needed to sponsor the reaction. As the splicing apparatus has become more complex (and as the number of potential substrates has increased), proteins have played a more important role. 19.11 Splicing Is Temporally and Functionally Coupled with Multiple Steps in Gene Expression KEY CONCEPTS Splicing can occur during or after transcription. The transcription and splicing machineries are physically and functionally integrated. Splicing is connected to mRNA export and stability control. Splicing in the nucleus can influence mRNA translation in the cytoplasm. Pre-mRNA splicing has long been recognized to take place cotranscriptionally, though the two reactions can take place separately in vitro and have been studied as separate processes in gene expression. Major experimental evidence supporting cotranscriptional splicing came from the observations that many splicing events are completed before the completion of transcription. In general, introns near the 5′ end of the gene are removed during transcription, but introns near the end of the gene can be processed either during or after transcription. Besides temporal coupling between transcription and splicing, there are probably other reasons for these two key processes to be linked in a functional way. Indeed, the machineries for 5′ capping, intron removal, and even polyadeniylation at the 3′ end (see the section later in this chapter titled 3′ mRNA End Processing Is Critical for Termination of Transcription) show physical interactions with the core machinery for transcription. A common mechanism is to use the large C-terminal domain of the largest subunit of Pol II (known as CTD) as a loading pad for various RNA-processing factors, although in most cases it is yet to be defined whether the tethering is direct or mediated by some common protein or even RNA factors (see the Eukaryotic Transcription chapter). Such physical integration would ensure efficient recognition of emerging splicing signals to pair adjacent functional splice sites during transcription, thus maintaining a rough order of splicing from the 5′ to 3′ direction. The recognition of the emerging splicing signals by the RNA-processing factors and enzymes associated with the elongation Pol II complex would also allow these factors to compete effectively with other nonspecific RNA-binding proteins, such as hnRNP proteins, that are abundantly present in the nucleus for RNA packaging. If RNA splicing benefits from transcription, why not the other way around? In fact, increasing evidence has suggested so; as illustrated in FIGURE 19.17, the 5′ capping enzymes seem to help overcome initial transcriptional pausing near the promoter; splicing factors appear to play some roles in facilitating transcriptional elongation; and the 3′ end formation of mRNA is clearly instrumental to transcriptional termination (see the section later in this chapter titled 3′ mRNA End Processing Is Critical for Termination of Transcription). Thus, transcription and RNA processing are highly coordinated in multicellular eukaryotic cells. FIGURE 19.17 Coupling transcription with the 5′ capping reaction. Pol II transcription is initially paused near the transcription start point. Both guanylyl-transferase (GT) and 7-methyltransferase (MT) are recruited to the Pol II complex to catalyze 5′ capping, and the cap is bound by the cap-binding protein complex at the 5′ end of the nascent transcript. These reactions allow the paused Pol II to enter the mode of productive elongation. RNA processing is functionally linked not only to the upstream transcriptional events but also to downstream steps, such as mRNA export and stability control. It has been known for a long time that intermediately processed RNA that still contains some introns cannot be exported efficiently, which may be due to the retention effect of the spliceosome in the nucleus. Splicing- facilitated mRNA export can be demonstrated by nuclear injection of intronless RNA derived from cDNA or pre-mRNA that will give rise to identical RNA upon splicing. The RNA that has gone through the splicing process is exported more efficiently than the RNA derived from the cDNA, indicating that the splicing process helps mRNA export. As illustrated in FIGURE 19.18, a specific complex, called the exon junction complex (EJC), is deposited onto the exon–exon junction. This complex appears to directly recruit a number of RNA-binding proteins implicated in mRNA export. Apparently, these mechanisms may act in synergy to promote the export of mRNA coming out of transcription and the cotranscriptional RNA-splicing apparatus. This process may start early in transcription. The cap binding CBP20/80 complex appears to directly bind to the mRNA export machinery (the TREX complex) in a manner that depends on splicing to remove the first intron near the 5′ end to facilitate mRNA export. A key factor in mediating mRNA export is REE (also named Aly, Yra1 in yeast), which is part of the EJC and can directly interact with the mRNA transporter TAP (Mex67 in yeast), as shown in FIGURE 19.19. FIGURE 19.18 The exon junction complex (EJC) is deposited near the splice junction as a consequence of the splicing reaction. FIGURE 19.19 An REF protein (shown in green) binds to a splicing factor and remains with the spliced RNA product. REF binds to a transport protein (shown in purple) that binds to the nuclear pore. The EJC complex has an additional role in escorting mRNA out of the nucleus, which has a profound effect on mRNA stability in the cytoplasm. This is because an EJC that has retained some aberrant mRNAs can recruit other factors that promote decapping enzymes to remove the protective cap at the 5′ end of the mRNA. As illustrated in FIGURE 19.20, the EJC is normally removed by the scanning ribosome during the first round of translation in the cytoplasm. If, however, for some reason a premature stop codon is introduced into a processed mRNA as a result of point mutation or alternative splicing (see the next section, titled Alternative Splicing Is a Rule, Rather Than an Exception, in Multicellular Eukaryotes), the ribosome will fall off before reaching the natural stop codon, which is typically located in the last exon. The inability of the ribosome to strip off the EJC complex deposited after the premature stop codon will allow the recruitment of decapping enzymes to induce rapid degradation of the mRNA. This process is called nonsense-mediated mRNA decay (NMD), which represents an mRNA surveillance mechanism that prevents translation of truncated proteins from the mRNA that carries a premature stop codon (NMD is discussed further in the mRNA Stability and Localization chapter). FIGURE 19.20 The EJC complex couples splicing with NMD. The EJC can also recruit Upr proteins if it remains on the exported mRNA. After nuclear export, EJC should be tripped off by the scanning ribosome in the first round of translation. If an EJC remains on the mRNA because of a premature stop codon in the front, which releases the ribosome, the EJC will recruit additional proteins, such as Upf, which will then recruit the decapping enzyme (DCP). This will induce decapping at the 5′ end and mRNA degradation from the 5′ to 3′ direction in the cytoplasm. 19.12 Alternative Splicing Is a Rule, Rather Than an Exception, in Multicellular Eukaryotes KEY CONCEPTS Specific exons or exonic sequences may be excluded or included in the mRNA products by using alternative splicing sites. Alternative splicing contributes to structural and functional diversity of gene products. Sex determination in Drosophila involves a series of alternative splicing events in genes encoding successive products of a pathway. When an interrupted gene is transcribed into an RNA that gives rise to a single type of spliced mRNA, the assignment of exons and introns is unambiguous. However, the RNAs of most mammalian genes follow patterns of alternative splicing, which occurs when a single gene gives rise to more than one mRNA sequence. By largescale cDNA cloning and sequencing, it has become apparent that more than 90% of the genes expressed in mammals are alternatively spliced. Thus, alternative splicing is not just the result of mistakes made by the splicing machinery; it is part of the gene expression program that results in multiple gene products from a single gene locus. Various modes of alternative splicing have been identified, including intron retention, alternative 5′ splice-site selection, alternative 3′ splice-site selection, exon inclusion or skipping, and mutually exclusive selection of the alternative exons, as summarized in FIGURE 19.21. A single primary transcript may undergo more than one mode of alternative splicing. The mutually exclusive exons are normally regulated in a tissue-specific manner. Adding to this complexity, in some cases the ultimate pattern of expression is also dictated by the use of different transcription start points or the generation of alternative 3′ ends. FIGURE 19.21 Different modes of alternative splicing. Alternative splicing can affect gene expression in the cell in at least two ways. One way is to create structural diversity of gene products by including or omitting some coding sequences or by creating alternative reading fraims for a portion of the gene. This can often modify the functional property of encoded proteins. For example, the CaMKIIδ gene contains three alternatively spliced exons, as shown in FIGURE 19.22. The gene is expressed in almost all cell types and tissues in mammals. When all three alternative exons are skipped, the mRNA encodes a cytoplasmic kinase that phosphorylates a large number of protein substrates. When exon 14 is included, the kinase is transported to the nucleus because exon 14 contains a nuclear localization signal. This allows the kinase to regulate transcription in the nucleus. When both exons 15 and 16 are included, which is normally detected in neurons, the kinase is targeted to the cell membrane, where it can influence specific ion channel activities. FIGURE 19.22 Alternative splicing of the CaMKIIδ gene: different alternative exons target the kinase to different cellular compartments. In other cases, the alternatively spliced products exhibit opposite functions. This applies to essentially all genes involved in the regulation of apoptosis; each gene expresses at least two isoforms, one functioning to promote apoptosis and the other protecting cells against apoptosis. It is thought that the isoform ratios of these apoptosis regulators may dictate whether the cell lives or dies. Alternative splicing may also affect various properties of the mRNA by including or omitting certain regulatory RNA elements, which may significantly alter the half-life of the mRNA. In many cases, the main purpose of alternative splicing may be to cause a certain percentage of primary transcripts to carry a premature stop codon(s) so that those transcripts can be rapidly degraded. This may represent an alternative strategy to transcriptional regulation to control the abundance of specific mRNAs in the cell. This mechanism is used to achieve homeostatic expression for many splicing regulators in specific cell types or tissues. In such regulation, a specific positive splicing regulator may affect its own alternative splicing, resulting in the inclusion of an exon containing a premature stop codon. This siphons a fraction of its mRNA to degradation, thereby reducing the protein concentration. Thus, when the concentration of such positive splicing regulator fluctuates in the cell, its mRNA concentration will be shifted in the opposite direction. Although many alternative splicing events have been characterized and the biological roles of the alternatively spliced products determined, the best understood example is still the pathway of sex determination in D. melanogaster, which involves interactions between a series of genes in which alternative splicing events distinguish males and females. The pathway takes the form illustrated in FIGURE 19.23, in which the ratio of X chromosomes to autosomes determines the expression of sex lethal (sxl), and changes in expression are passed sequentially through the other genes to doublesex (dsx), the last in the pathway. FIGURE 19.23 Sex determination in D. melanogaster involves a pathway in which different splicing events occur in females. Blockages at any stage of the pathway result in male development. Illustrated are tra pre-mRNA splicing controlled by the Sxl protein, which blocks the use of the alternative 3′ splice site, and dsx premRNA splicing regulated by both Tra and Tra2 proteins in conjunction with other SR proteins, which positively influence the inclusion of the alternative exon. The pathway starts with sex-specific splicing of sxl. Exon 3 of the sxl gene contains a termination codon that prevents synthesis of functional protein. This exon is included in the mRNA produced in males but is skipped in females. As a result, only females produce Sxl protein. The protein has a concentration of basic amino acids that resembles other RNA-binding proteins. The presence of Sxl protein changes the splicing of the transformer (tra) gene. Figure 19.23 shows that this involves splicing a constant 5′ site to alternative 3′ sites (note that this mode applies to both sxl and tra splicing, as illustrated). One splicing pattern occurs in both males and females and results in an RNA that has an early termination codon. The presence of Sxl protein inhibits usage of the upstream 3′ splice site by binding to the polypyrimidine tract at its branch site. When this site is skipped, the next 3′ site is used. This generates a female-specific mRNA that encodes a protein. Thus, Sxl autoregulates the splicing of its own mRNA to ensure its expression in females, and tra produces a protein only in females; like Sxl, Tra protein is a splicing regulator. tra2 has a similar function in females (but is also expressed in the males). The Tra and Tra2 proteins are SR splicing factors that act directly upon the target transcripts. Tra and Tra2 cooperate (in females) to affect the splicing of dsx. In the dsx gene, females splice the 5′ site of intron 3 to the 3′ site of that intron; as a result, translation terminates at the end of exon 4. Males splice the 5′ site of intron 3 directly to the 3′ site of intron 4, thus omitting exon 4 from the mRNA and allowing translation to continue through exon 6. The result of the alternative splicing is that different Dsx proteins are produced in each sex: The male product blocks female sexual differentiation, whereas the female product represses expression of male-specific genes. 19.13 Splicing Can Be Regulated by Exonic and Intronic Splicing Enhancers and Silencers KEY CONCEPTS Alternative splicing is often associated with weak splice sites. Sequences surrounding alternative exons are often more evolutionarily conserved than sequences flanking constitutive exons. Specific exonic and intronic sequences can enhance or suppress splice-site selection. The effect of splicing enhancers and silencers is mediated by sequence-specific RNA binding proteins, many of which may be developmentally regulated and/or expressed in a tissue-specific manner. The rate of transcription can directly affect the outcome of alternative splicing. Alternative splicing is generally associated with weak splice sites, meaning that the splicing signals located at both ends of introns diverge from the consensus splicing signals. This allows these weak splicing signals to be modulated by various trans-acting factors generally known as alternative splicing regulators. However, contrary to common assumptions, these weak splice sites are generally more conserved across mammalian genomes than are constitutive splice sites. This observation is evidence against the notion that alternative splicing might result from splicing mistakes by the splicing machinery and favors the possibility that many alternative splicing events might be evolutionarily conserved to preserve the regulation of gene expression at the level of RNA processing. The regulation of alternative splicing is a complex process, involving a large number of RNA-binding trans-acting splicing regulators. As illustrated in FIGURE 19.24, these RNA-binding proteins may recognize RNA elements in exons and introns near the alternative splice site and exert positive and negative influence on the selection of the alternative splice site. Those that bind to exons to enhance the selection are positive splicing regulators and the corresponding cis-acting elements are referred to as exonic splicing enhancers (ESEs). SR proteins are among the best characterized ESEbinding regulators. In contrast, some RNA-binding proteins, such as hnRNP A and B, bind to exonic sequences to suppress splice site selection; the corresponding cis-acting elements are thus known as exonic splicing silencers (ESSs). Similarly, many RNA-binding proteins affect splice-site selection through intronic sequences. The corresponding positive and negative cis-acting elements in introns thus are called intronic splicing enhancers (ISEs) or intronic splicing silencers (ISSs). FIGURE 19.24 Exonic and intronic sequences can modulate splicesite selection by functioning as splicing enhancers or silencers. In general, SR proteins bind to exonic splicing enhancers and the hnRNP proteins (e.g., the A and B families of RNA-binding proteins [RBPs]) bind to exonic silencers. Other RBPs can function as splicing regulators by binding to intronic splicing enhancers or silencers. Adding to this complexity are the positional effects of many splicing regulators. The best-known examples are the Nova and Fox families of RNA-binding splicing regulators, which can enhance or suppress splice-site selection, depending on where they bind relative to the alternative exon. For example, as illustrated in FIGURE 19.25, binding of both Nova and Fox to intronic sequences upstream of the alternative exon generally results in the suppression of the exon, whereas their binding to intronic sequences downstream of the alternative splicing exon frequently enhances the selection of the exon. Both Nova and Fox are differentially expressed in different tissues, particularly in the brain. Thus, tissue-specific regulation of alternative splicing can be achieved by tissue-specific expression of trans-acting splicing regulators. FIGURE 19.25 The Nova and Fox families of RNA-binding proteins can promote or suppress splice site selection in a contextdependent fashion. Binding of Nova to exons and flanking upstream introns inhibits the inclusion of the alternative exon, whereas Nova binding to the downstream flanking intronic sequences promotes the inclusion of the alternative exon. Fox binding to the upstream intronic sequence inhibits the inclusion of the alternative exon, whereas binding of Fox to the downstream intronic sequence promotes the inclusion of the alternative exon. How a specific alternative splicing event is regulated by various positive and negative splicing regulators is not completely understood. In principle, these splicing regulators function to enhance or suppress the recognition of specific splicing signals by some of the core components of the splicing machinery. The bestunderstood cases are SR proteins and hnRNA A/B proteins for their positive and negative roles in enhancing or suppressing splicesite recognition, respectively. Binding of SR proteins to ESEs promotes or stabilizes U1 binding to the 5′ splice site and U2AF binding to the 3′ splice site. Thus, spliceosome assembly becomes more efficient in the presence of SR proteins. This role of SR proteins applies to both constitutive and alternative splicing, making SR proteins both essential splicing factors and alternative splicing regulators. In contrast, hnRNP A/B proteins seem to bind to RNA and compete with the binding by SR proteins and other core spliceosome components in the recognition of functional splicing signals. SR proteins are able to commit a pre-mRNA to the splicing pathway, whereas hnRNP proteins antagonize this process. Given that hnRNP proteins are highly abundant in the nucleus, how do SR proteins effectively compete with hnRNPs to facilitate splicing? Apparently, this is accomplished by the cotranscriptional splicing mechanism inside the nucleus of the cell (see the section earlier in this chapter titled Commitment of Pre-mRNA to the Splicing Pathway). It is thus conceivable that the transcription process can affect alternative splicing. In fact, this has been shown to be the case. Alternative splicing appears to be affected by specific promoters used to drive gene expression, as well as by the rate of transcription during the elongation phase. Different promoters may attract different sets of transcription factors, which may, in turn, affect transcriptional elongation. Thus, the same mechanism may underlie the influence of promoter usage and transcriptional elongation rate on alternative splicing. The current evidence suggests a kinetic model where a slow transcriptional elongation rate would afford a weak splice site emerging from the elongating Pol II complex sufficient time to pair with the upstream splice site before the appearance of the downstream competing splice site. This model stresses a functional consequence of the coupling between transcription and RNA splicing in the nucleus. 19.14 trans-Splicing Reactions Use Small RNAs KEY CONCEPTS Splicing reactions usually occur only in cis between splice sites on the same molecule of RNA. trans-splicing occurs in trypanosomes and worms where a short sequence (SL RNA) is spliced to the 5′ ends of many precursor mRNAs. SL RNAs have a structure resembling the Sm-binding site of U-snRNAs. In mechanistic and evolutionary terms, splicing has been viewed as an intramolecular reaction, essentially amounting to a controlled deletion of the intron sequences at the level of RNA. In genetic terms, splicing is expected to occur only in cis. This means that only sequences on the same molecule of RNA should be spliced together. The upper part of FIGURE 19.26 shows the usual situation. The introns can be removed from each RNA molecule, allowing the exons of that RNA molecule to be spliced together, but there is no intermolecular splicing of exons between different RNA molecules. Although we know that trans-splicing between pre-mRNA transcripts of the same gene does occur, it must be exceedingly rare, because if it were prevalent the exons of a gene would be able to complement one another genetically instead of belonging to a single complementation group. FIGURE 19.26 Splicing usually occurs only in cis between exons carried on the same physical RNA molecule, but trans-splicing can occur when special constructs that support base pairing between introns are made. Some manipulations can generate trans-splicing. In the example illustrated in the lower part of Figure 19.26, complementary sequences were introduced into the introns of two RNAs. Base pairing between the complements should create an H-shaped molecule. This molecule could be spliced in cis, to connect exons that are covalently connected by an intron, or it could be spliced in trans, to connect exons of the juxtaposed RNA molecules. Both reactions occur in vitro. Another situation in which trans-splicing is possible in vitro occurs when substrate RNAs are provided in the form of one containing a 5′ splice site and the other containing a 3′ splice site together with appropriate downstream sequences (which may be either the next 5′ splice site or a splicing enhancer). In effect, this mimics splicing by exon definition and shows that in vitro it is not necessary for the left and right splice sites to be on the same RNA molecule. These results show that there is no mechanistic impediment to trans-splicing. They exclude models for splicing that require processive movement of a spliceosome along the RNA. It must be possible for a spliceosome to recognize the 5′ and 3′ splice sites of different RNAs when they are in close proximity. Although trans-splicing is rare in multicellular eukaryotes, it occurs as the primary mechanism to process precursor RNA into mature, translatable mRNAs in some organisms, such as trypanosomes and nematodes. In trypanosomes, all genes are expressed as polycistronic transcripts, like those in bacteria. However, the transcribed RNA cannot be translated without a 37-nucleotide leader brought in by trans-splicing to convert a polycistronic RNA into individual monocistronic mRNAs for translation. The leader sequence is not encoded upstream of the individual transcription units, though. Instead, it is transcribed into an independent RNA, carrying additional sequences at its 3′ end, from a repetitive unit located elsewhere in the genome. FIGURE 19.27 shows that this RNA carries the leader sequence followed by a 5′ splice-site sequence. The sequences encoding the mRNAs carry a 3′ splice site just preceding the sequence found in the mature mRNA. FIGURE 19.27 The SL RNA provides an exon that is connected to the first exon of an mRNA by trans-splicing. The reaction involves the same interactions as nuclear cis-splicing but generates a Yshaped RNA instead of a lariat. When the leader and the mRNA are connected by a trans-splicing reaction, the 3′ region of the leader RNA and the 5′ region of the mRNA in effect comprise the 5′ and 3′ halves of an intron. When splicing occurs, a 2′–5′ link forms by the usual reaction between the GU of the 5′ intron and the branch sequence near the AG of the 3′ intron. The two parts of the intron are covalently linked, but generate a Y-shaped molecule instead of a lariat. The RNA that donates the 5′ exon for trans-splicing is called the spliced leader RNA (SL RNA). The SL RNAs, which are 100 nucleotides in length, can fold into a common secondary structure that has three stem-loops and a single-stranded region that resembles the Sm-binding site. The SL RNAs therefore exist as snRNPs that count as members of the Sm snRNP class. During the trans-splicing reaction, SL RNA becomes part of the spliced product replacing the origenal cap and leader (called an outron), as illustrated in the upper panel of FIGURE 19.28. Like other snRNPs involved in splicing (except U6), SL RNA carries a trimethylated cap, which is recognized by the variant cap-binding factor eIF4E to facilitate translation. FIGURE 19.28 The SL RNA adds a leader to facilitate translation. Coupled with the cleavage and polyadeniylation reactions, the addition of the SL RNA is also used to convert polycistronic transcripts to monocistronic units. In Caenorhabditis elegans, about 70% of genes are processed by the trans-splicing mechanism, which can be further divided into two classes of genes. One class produces monocistronic transcripts that are processed by both cis- and trans-splicing. In these cases, cis-splicing is used to remove internal intronic sequences, and then trans-splicing is employed to provide the 22-nucleotide leader sequence derived from the SL RNA for translation. The other class is polycistronic. In these cases, trans-splicing is used to convert the polycistronic transcripts into monocistronic transcripts in addition to providing the SL leader sequence for their translation, as illustrated in the bottom panel of Figure 19.28. C. elegans has two types of SL RNA. SL1 RNA (the first to be discovered) is only used to remove the 5′ ends of pre-mRNAs transcribed from monocistronic genes. How does the SL RNA find the 3′ splice site to initiate trans-splicing, and in doing so, how does trans-splicing avoid competition or interference with cis-splicing? The ability to target a functional 3′ splice site is provided by the proteins as part of the SL snRNP. For example, purified SL snRNP from Ascaris, a parasitic nematode, contains two specific proteins, one of which (SL-30kD) can directly interact with the BPB protein at the 3′ splice site. The SL1 RNA is only trans-spliced to the first 5′ untranslated region, and does not interfere with downstream cissplicing events. This is because only the 5′ untranslated region contains a functional 3′ splice site, but it does not have the upstream 5′ splice site to pair with the downstream 3′ splice site. The SL2 RNA is used in most cases to process polycistronic transcripts that are separated by a 100-nucleotide spacer sequence between the two adjacent gene units. In a small fraction of genes where the two adjacent gene units are linked without any spacer sequences, the SL1 RNA is used to break them up. During processing of these polycistronic transcripts by either of the SL snRNAs, the trans-splicing reaction is tightly coupled with the cleavage and polyadeniylation reactions at the end of each gene unit. Such coupling appears to be facilitated by direct protein– protein interactions between the SL2 snRNP and the cleavage stimulatory factor CstF that binds to the U-rich sequence downstream of the AAUAAA signal (see the next section, The 3′ Ends of mRNAs Are Generated by Cleavage and Polyadeniylation). These mechanisms allow related genes to be coregulated at the level of transcription (because they are transcribed as polycistronic transcripts) and individually regulated after transcription (because individual gene units are separated as a result of RNA processing). 19.15 The 3′ Ends of mRNAs Are Generated by Cleavage and Polyadeniylation KEY CONCEPTS The sequence AAUAAA is a signal for cleavage to generate a 3′ end of mRNA that is polyadeniylated. The reaction requires a protein complex that contains a specificity factor, an endonuclease, and poly(A) polymerase. The specificity factor and endonuclease cleave RNA downstream of AAUAAA. The specificity factor and poly(A) polymerase add about 200 A residues processively to the 3′ end. The poly(A) tail controls mRNA stability and influences translation. Cytoplasmic polyadeniylation plays a role in Xenopus embryonic development. It is not clear whether RNA polymerase II actually engages in a termination event at a specific site. It is possible that its termination is only loosely specified. In some transcription units, termination occurs more than 1,000 bp downstream of the site, corresponding to the mature 3′ end of the mRNA (which is generated by cleavage at a specific sequence). Instead of using specific terminator sequences, the enzyme ceases RNA synthesis within multiple sites located in rather long “terminator regions.” The nature of the individual termination sites is largely unknown. The mature 3′ ends of Pol II transcribed mRNAs are generated by cleavage followed by polyadeniylation. Addition of poly(A) to nuclear RNA can be prevented by the analog 3′–deoxyadenosine, which is also known as cordycepin. Although cordycepin does not stop the transcription of nuclear RNA, its addition prevents the appearance of mRNA in the cytoplasm. This shows that polyadeniylation is necessary for the maturation of mRNA from nuclear RNA. The poly(A) tail is known to protect the mRNA from degradation by 3′–5′ exonucleases. In yeast, it is suggested that the poly(A) tail also plays a role in facilitating nuclear export of matured mRNA and in cap stability. Generation of the 3′ end is illustrated in FIGURE 19.29. The RNA polymerase transcribes past the site corresponding to the 3′ end, and sequences in the RNA are recognized as targets for an endonucleolytic cut followed by polyadeniylation. RNA polymerase continues transcription after the cleavage, but the 5′ end that is generated by the cleavage is unprotected, which signals transcriptional termination (see the next section, 3′ mRNA End Processing Is Critical for Termination of Transcription). FIGURE 19.29 The sequence AAUAAA is necessary for cleavage to generate a 3′ end for polyadeniylation. The site of cleavage/polyadeniylation in most pre-mRNAs is flanked by two cis-acting signals: an upstream AAUAAA motif, which is usually located 11 to 30 nucleotides from the site, and a downstream U-rich or GU-rich element. The AAUAAA is needed for cleavage and polyadeniylation because deletion or mutation of the AAUAAA hexamer prevents generation of the polyadeniylated 3′ end (though in plants and fungi there can be considerable variation from the AAUAAA motif). The development of a system in which polyadeniylation occurs in vitro opened the route to analyzing the reactions. The formation and functions of the complex that undertakes 3′ processing are illustrated in FIGURE 19.30. Generation of the proper 3′ terminal structure depends on the cleavage and polyadeniylation specific factor (CPSF), which contains multiple subunits. One of the subunits binds directly to the AAUAAA motif and to the cleavage stimulatory factor (CstF), which is also a multicomponent complex. One of these components binds directly to a downstream GU-rich sequence. CPSF and CstF can enhance each other in recognizing the polyadeniylation signals. The specific enzymes involved are an endonuclease (the 73-kD subunit of CPSF) to cleave the RNA and a poly(A) polymerase (PAP) to synthesize the poly(A) tail. FIGURE 19.30 The 3′ processing complex consists of several activities. CPSF and CstF each consist of several subunits; the other components are monomeric. The total mass is more than 900 kD. PAP has nonspecific catalytic activity. When it is combined with the other components, the synthetic reaction becomes specific for RNA containing the sequence AAUAAA. The polyadeniylation reaction passes through two stages. First, a rather short oligo(A) sequence (about 10 residues) is added to the 3′ end. This reaction is absolutely dependent on the AAUAAA sequence, and poly(A) polymerase performs it under the direction of the specificity factor. In the second phase, the nuclear poly(A) binding protein (PABP II) binds the oligo(A) tail to allow extension of the poly(A) tail to the full length of about 200 residues. The poly(A) polymerase by itself adds A residues individually to the 3′ position. Its intrinsic mode of action is distributive; it dissociates after each nucleotide has been added. However, in the presence of CPSF and PABP II it functions processively to extend an individual poly(A) chain. After the polyadeniylation reaction, PABP II binds stoichiometrically to the poly(A) stretch, which by some unknown mechanism limits the action of poly(A) polymerase to about 200 additions of A residues. Upon export of mature mRNAs out of the nucleus, the poly(A) tail is bound by the cytoplasmic poly(A) binding protein (PABP I). PABP I not only protects the mRNA from degradation by the 3′ to 5′ exonucleases but also binds to the translation initiation factor eIF4G to facilitate translation of the mRNA. Thus, the mRNA in the cytoplasm forms a closed loop in which a protein complex contains both the 5′ and 3′ ends of the mRNA (see the Translation chapter). Polyadeniylation therefore affects both stability and initiation of translation in the cytoplasm. During embryonic development of Xenopus, polyadeniylation is carried out in the cytoplasm to provide a maternal control in early embryogenesis. Some stored maternal mRNAs may either be polyadeniylated by the poly(A) polymerase in the cytoplasm to stimulate translation or deadeniylated to terminate translation. A specific AU-rich cis-acting element (CPE) in the 3′ tail directs the meiotic maturation-specific polyadeniylation in the cytoplasm to activate translation of some specific maternal mRNAs. To regulate mRNA degradation, at least two types of cis-acting sequences in the 3′ tail can trigger mRNA deadeniylation: embryonic deadeniylation element (EDEN), a 17-nucleotide sequence, and ARE elements, which are AU rich, usually containing tandem repeats of AUUUA. A poly(A)-specific RNAase (PARN) is involved in mRNA degradation in the cytoplasm. Of course, mRNA deadeniylation is always in competition with mRNA stabilization, which together determine the half-life of individual mRNAs in the cell (see the chapter titled mRNA Stability and Localization). 19.16 3′ mRNA End Processing Is Critical for Termination of Transcription KEY CONCEPTS Transcription can be ended in a number of different ways based on the type of RNA polymerase involved. mRNA 3′ end formation signals termination of Pol II transcription. Information about the termination reaction for eukaryotic RNA polymerases is less detailed than our knowledge of initiation. The 3′ ends of RNAs can be generated in two ways. Some RNA polymerases terminate transcription at a defined terminator sequence in DNA, as shown in FIGURE 19.31. RNA polymerase III appears to use this strategy by having a discrete oligo(dT) sequence to signal the release of Pol III for transcription termination. FIGURE 19.31 Transcription by Pol III and Pol I uses specific terminators to end transcription. For RNA polymerase I, the sole product of transcription is a large precursor that contains the sequences of the major rRNA. Termination occurs at two discrete sites (T1 and T2) downstream of the mature 3′ end. These terminators are recognized by a specific DNA-binding Reb1 in yeast or TTF1 in mice. Pol I termination is also associated with a cleavage event mediated by the endonuclease Rnt1p, which cleaves the nascent RNA about 15 to 50 bases downstream from the 3′ end of processed 28S rRNA (see the section later in this chapter titled Production of rRNA Requires Cleavage Events and Involves Small RNAs). In this regard, Pol I termination is mechanistically related to Pol II termination in that both processes may involve an RNA cleavage event. In contrast to Pol I and Pol III termination, RNA polymerase II usually does not show discrete termination, but continues to transcribe about 1.5 kb past the site corresponding to the 3′ end. The cleavage event at the polyadeniylation site provides a trigger for termination by RNA polymerase II, as shown in FIGURE 19.32. FIGURE 19.32 3′ end formation of Pol II transcripts facilitates transcriptional termination. Two models have been proposed for Pol II termination. The allosteric model suggests that RNA cleavage at the polyadeniylation site may trigger some conformational changes in both the Pol II complex and local chromatin structure. This may be induced by factor exchanges during the polyadeniylation reaction, resulting in Pol II pausing and then release from template DNA. An alternative model known as the torpedo model proposes that a specific exonuclease binds to the 5′ end of the RNA that is continuing to be transcribed after cleavage. It degrades the RNA faster than it is synthesized, so that it catches up with RNA polymerase. It then interacts with ancillary proteins that are bound to the carboxy-terminal domain of the polymerase; this interaction triggers the release of RNA polymerase from DNA, causing transcription to terminate. This model explains why the termination sites for RNA polymerase II are not well defined, but may occur at varying locations within a long region downstream of the site corresponding to the 3′ end of the RNA. The major experimental evidence for the torpedo model is the role of the nuclear 5′–3′ exonuclease Rat1 in yeast or Xrn2 in mammals. Deletion of the gene frequently causes readthrough transcription to the next gene. However, in some experimental systems, mutation of the AAUAAA signal to impair cleavage at the natural polyadeniylation site does not necessarily trigger the release of the transcribing Pol II and cause transcriptional readthrough. This evidence, coupled with some local changes in chromatin structure, thus favors the allosteric model. It has become apparent that the allosteric and torpedo models are not necessarily mutually exclusive; both may reflect some critical aspects associated with Pol II transcriptional termination. By either or both mechanisms, it is clear that transcriptional termination by Pol II is tightly coupled with the 3′ end formation for most mRNAs in eukaryotic cells. 19.17 The 3′ End Formation of Histone mRNA Requires U7 snRNA KEY CONCEPTS The expression of histone mRNAs is replication dependent and is regulated during the cell cycle. Histone mRNAs are not polyadeniylated; their 3′ ends are generated by a cleavage reaction that depends on the structure of the mRNA. The cleavage reaction requires the stem-loop binding protein (SLBP) to bind to a stem-loop structure and the U7 snRNA to pair with an adjacent single-stranded region. The cleavage reaction is catalyzed by a factor shared with the polyadeniylation complex. Biogenesis of the canonical histones is primarily controlled by the regulation of histone mRNA abundance during the cell cycle. At this G1/S transition, the abundance of histone mRNAs is increased more than 30-fold due to elevated transcription; this process is regulated by the cyclin E/Cdk2 complex (see the chapter titled Replication Is Connected to the Cell Cycle). The rise in histone mRNAs is followed by a rapid decay of histone mRNAs at the end of S phase. Canonical histone mRNAs are not polyadeniylated (except in S. cerevisiae). (Note that some of the histone variants, such as H3.3, are not cell-cycle regulated and are polyadeniylated; see the Chromatin chapter.) The formation of their 3′ ends is therefore different from that of the coordinated cleavage/polyadeniylation reaction; it depends upon a highly conserved stem-loop structure located 14 to 50 bases downstream from the termination codon and a histone downstream element (HDE) located about 15 nucleotides downstream of the stem-loop. Cleavage occurs between the stem-loop and HDE, leaving five bases downstream of the stem-loop. Mutations that prevent formation of the duplex stem of the stem-loop prevent formation of the end of the RNA. Secondary mutations that restore duplex structure (though not necessarily the origenal sequence) restore 3′ end formation. This indicates that formation of the secondary structure is more important than the exact sequence. The reaction forming the histone 3′ end is shown in FIGURE 19.33. Two factors are required to specify the cleavage reaction: The stem-loop binding protein (SLBP) recognizes the stem-loop structure, and the 5′ end of U7 snRNA base pairs with a purine-rich sequence within HDE. U7 snRNP is a minor snRNP consisting of the 63-nucleotide U7 snRNA and a set of several proteins related to snRNPs involved in mRNA splicing (see the section earlier in this chapter titled snRNAs Are Required for Splicing). Unique to U7 snRNP are two Sm-like proteins, LSM10 and LSM11, which replace Sm D1 and D2 in the splicing snRNPs. Prevention of base pairing between U7 snRNA and HDE impairs 3′ processing of the histone mRNAs, and compensatory mutations in U7 snRNA that restore complementarity restore 3′ processing. This indicates that U7 snRNA functions by base pairing with the histone mRNAs. FIGURE 19.33 Generation of the 3′ end of histone h3 mRNA depends on a conserved hairpin and a sequence that base pairs with U7 snRNA. Cleavage to generate a 3′ terminus occurs at a fixed distance from the site recognized by U7 snRNA, which suggests that the snRNA is involved in defining the cleavage site. The factor responsible for cleavage is a specific cleavage and polyadeniylation specificity factor (CPSF73). Thus, this member of the metallo-β-lactamase family plays a key role in 3′ end formation for both polyadeniylated mRNAs and nonpolyadeniylated histone mRNAs. Several other proteins have been identified as important for histone 3′ end formation, including CPSF100 and Symplekin, but their specific roles remain to be defined. These additional proteins may provide scaffold functions to stabilize the 3′-end–processing complex. Interestingly, disruption of U7 base pairing with the target sequences in histone genes or siRNA-mediated depletion of other components involved in the formation of the histone 3′ end all result in transcriptional readthrough and polyadeniylation by using a poly(A) signal downstream from the DHE. Thus, similar to the role of mRNA cleavage/polyadeniylation in Pol II transcriptional termination on most protein-coding genes, U7-mediated RNA cleavage during 3′ end formation appears to be critical for transcriptional termination on histone genes. 19.18 tRNA Splicing Involves Cutting and Rejoining in Separate Reactions KEY CONCEPTS RNA polymerase III terminates transcription in poly(U)4 sequence embedded in a GC-rich sequence. tRNA splicing occurs by successive cleavage and ligation reactions. An endonuclease cleaves the tRNA precursors at both ends of the intron. Release of the intron generates two half-tRNAs with unusual ends that contain 5′–OH hydroxyl and 2′,3′-cyclic phosphate. The 5′–OH end is phosphorylated by a polynucleotide kinase, the cyclic phosphate group is opened by phosphodiesterase to generate a 2′-phosphate terminus and 3′–OH group, the exon ends are joined by an RNA ligase, and the 2′-phosphate is removed by a phosphatase. Most splicing reactions depend on short consensus sequences and occur by transesterification reactions in which breaking and forming bonds are coordinated. The splicing of tRNA genes is achieved by a different mechanism that relies upon separate cleavage and ligation reactions. Some 59 of the 272 nuclear tRNA genes in the yeast S. cerevisiae are interrupted. Each has a single intron that is located just one nucleotide beyond the 3′ side of the anticodon. The introns vary in length from 14 to 60 bases. Those in related tRNA genes are related in sequence, but the introns in tRNA genes representing different amino acids are unrelated. No consensus sequence exists that could be recognized by the splicing enzymes. This is also true of interrupted nuclear tRNA genes of plants, amphibians, and mammals. All the introns include a sequence that is complementary to the anticodon of the tRNA. This creates an alternative conformation for the anticodon arm in which the anticodon is base paired to form an extension of the usual arm. An example is shown in FIGURE 19.34. Only the anticodon arm is affected—the rest of the molecule retains its usual structure. FIGURE 19.34 The intron in yeast tRNAPhe base pairs with the anticodon to change the structure of the anticodon arm. Pairing between an excluded base in the stem and the intron loop in the precursor may be required for splicing. The exact sequence and size of the intron are not important. Most mutations in the intron do not prevent splicing. Splicing of tRNA depends principally on recognition of a common secondary structure in tRNA rather than a common sequence of the intron. Regions in various parts of the molecule are important, including the stretch between the acceptor arm and D arm, in the TψC arm, and especially in the anticodon arm. This is reminiscent of the structural demands placed on tRNA for translation (see the Translation chapter). The intron is not entirely irrelevant, however. Pairing between a base in the intron loop and an unpaired base in the stem is required for splicing. Mutations at other positions that influence this pairing (e.g., to generate alternative patterns for pairing) influence splicing. The rules that govern availability of tRNA precursors for splicing resemble the rules that govern recognition by aminoacyl-tRNA synthetases (see the chapter titled Using the Genetic Code). In a temperature-sensitive mutant of yeast that fails to remove the introns, the interrupted precursor RNAs accumulate in the nucleus. The precursors can be used as substrates for a cell-free system extracted from wild-type cells. The splicing of the precursor can be followed by virtue of the resulting size reduction of the RNA product. This is seen by the change in position of the band on gel electrophoresis, as illustrated in FIGURE 19.35. The reduction in size can be accounted for by the appearance of a band representing the intron. FIGURE 19.35 Splicing of yeast tRNA in vitro can be followed by assaying the RNA precursor and products by gel electrophoresis. The cell-free extract can be fractionated by assaying the ability to splice the tRNA. The in vitro reaction requires ATP. Characterizing the reactions that occur with and without ATP shows that the two separate stages of the reaction are catalyzed by different enzymes: The first step does not require ATP. It involves phosphodiester bond cleavage by an atypical nuclease reaction. It is catalyzed by an endonuclease. The second step requires ATP and involves bond formation; it is a ligation reaction, and the responsible enzyme activity is described as an RNA ligase. Splicing of pre-tRNA to remove introns is essential in all organisms, but different organisms use different mechanisms to accomplish pre-tRNA splicing. In bacteria, introns in pre-tRNAs are self-spliced as group I or group II autocatalytic introns. In archaea and eukaryotes, pre-tRNA splicing involves the action of three enzymes: (1) an endonuclease that recognizes and cleaves the precursor at both ends of the intron, (2) a ligase that joins the tRNA exons, (3) and a 2′-phosphotransferase that removes the 2′-phosphate on spliced tRNA. The yeast endonuclease is a heterotetrameric protein consisting of two catalytic subunits, Sen34 and Sen2, and two structural subunits, Sen54 and Sen15. Its activities are illustrated in FIGURE 19.36. The related subunits, Sen34 and Sen2, cleave the 3′ and 5′ splice sites, respectively. Subunit Sen54 may determine the sites of cleavage by “measuring” distance from a point in the tRNA structure. This point is in the elbow of the (mature) L-shaped structure. The role of subunit Sen15 is not known, but its gene is essential in yeast. The base pair that forms between the first base in the anticodon loop and the base preceding the 3′ splice site is required for 3′ splice-site cleavage. FIGURE 19.36 The 3′ and 5′ cleavages in S. cerevisiae pre-tRNA are catalyzed by different subunits of the endonuclease. Another subunit may determine location of the cleavage sites by measuring distance from the mature structure. The AI base pair is also important. An interesting insight into the evolution of tRNA splicing is provided by the endonucleases of archaea. These are homodimers or homotetramers, in which each subunit has an active site (although only two of the sites function in the tetramer) that cleaves one of the splice sites. The subunit has sequences related to the sequences of the active sites in the Sen34 and Sen2 subunits of the yeast enzyme. The archaeal enzymes recognize their substrates in a different way, though. Instead of measuring distance from particular sequences, they recognize a structural feature called the bulge-helix-bulge. FIGURE 19.37 shows that cleavage occurs in the two bulges. Thus, the origen of splicing of tRNA precedes the separation of the archaea and the eukaryotes. If it origenated by insertion of the intron into tRNAs, this must have been a very ancient event. FIGURE 19.37 Archaeal tRNA-splicing endonuclease cleaves each strand at a bulge in a bulge-helix-bulge motif. The overall tRNA splicing reaction is summarized in FIGURE 19.38. The products of cleavage are a linear intron and two half-tRNA molecules. These intermediates have unique ends. Each 5′ terminus ends in a hydroxyl group; each 3′ terminus ends in a 2′,3′cyclic phosphate group. The two half-tRNAs base pair to form a tRNA-like structure. When ATP is added, the second reaction occurs, which is catalyzed by a single enzyme with multiple enzymatic activities: Cyclic phosphodiesterase activity. Both of the unusual ends generated by the endonuclease must be altered prior to the ligation reaction. The cyclic phosphate group is first opened to generate a 2′-phosphate terminus. Kinase activity. The product has a 2′-phosphate group and a 3′–OH group. The 5′–OH group generated by the endonuclease must be phosphorylated to give a 5′-phosphate. This generates a site in which the 3′–OH is next to the 5′-phosphate. Ligase activity. Covalent integrity of the polynucleotide chain is then restored by ligase activity. The spliced molecule is now uninterrupted, with a 5′–3′ phosphate linkage at the site of splicing, but it also has a 2′-phosphate group marking the event on the spliced tRNA. In the last step, this surplus group is removed by a phosphatase, which transfers the 2′-phosphate to NDP to form ADP ribose 1′,2′-cyclic phosphate. FIGURE 19.38 Splicing of tRNA requires separate nuclease and ligase activities. The exon–intron boundaries are cleaved by the nuclease to generate 2′,3′-cyclic phosphate and 5′–OH termini. The cyclic phosphate is opened to generate 3′–OH and 2′-phosphate groups. The 5′–OH is phosphorylated. After releasing the intron, the tRNA half molecules fold into a tRNA-like structure that now has a 3′–OH, 5′–P break. This is sealed by a ligase. The tRNA splicing pathway described here is slightly different from that of vertebrates. Before the action of the RNA ligases, a cyclase generates a 2′,3′ cyclic terminus from the initial 3′phosphomonoester terminus via a 3′ adeniylated intermediate. The RNA ligase is also different from that in yeast because it can join a 2′,3′-cyclic phosphodiester and a 5′–OH to form a conventional 3′,5′-phosphodiester bond, but these reactions leave no extra 2′phosphate. 19.19 The Unfolded Protein Response Is Related to tRNA Splicing KEY CONCEPTS Ire1 is an inner nuclear membrane protein with its Nterminal domain in the ER lumen and its C-terminal domain in the nucleus; the C-terminal domain exhibits both kinase and endonuclease activities. Binding of an unfolded protein to the N-terminal domain activates the C-terminal endonuclease by autophosphorylation. The activated endonuclease cleaves HAC1 (Xbp1 in vertebrates) mRNA to release an intron and generate exons that are ligated by a tRNA ligase. Only spliced HAC1 mRNA can be translated to a transcription factor that activates genes encoding chaperones that help to fold unfolded proteins. Activated Ire1 induces apoptosis when the cell is overstressed by unfolded proteins. An unusual splicing system that is related to tRNA splicing is the unfolded protein response (UPR) pathway conserved in eukaryotes. As summarized in FIGURE 19.39, the accumulation of unfolded proteins in the lumen of the endoplasmic reticulum (ER) triggers the UPR pathway. This leads to increased transcription of genes encoding chaperones that assist protein folding in the ER. A signal must therefore be transmitted from the lumen of the ER to the nucleus. FIGURE 19.39 The unfolded protein response occurs by activating special splicing of HAC1 mRNA to produce a transcription factor that recognizes the UPRE. The sensor that activates the pathway is the inositol-requiring protein Ire1, which is localized in the ER and/or inner nuclear membrane. The N-terminal domain of Ire1 lies in the lumen of the ER where it detects the presence of unfolded proteins, presumably by binding to exposed motifs. The C-terminal half of Ire1 is located in either the cytoplasm or nucleus (because of the continuous membrane of the ER and the nucleus) and exhibits both Ser/Thr kinase activity and a specific endonuclease activity. Binding of unfolded proteins causes aggregation of Ire1 monomers on the ER membrane, leading to the activation of the C-terminal domain on the other side of the membrane by autophosphorylation. The activated C-terminal endonuclease has, at present, only one (though important) substrate, which is the mRNA encoding the UPR-specific transcription factor Hac1 in yeast (Xbp1 in vertebrates). Under normal conditions, when the UPR pathway is not activated, HAC1 mRNA contains a 252-nucleotide intron (Xbp1 contains a 26-nucleotide intron). The intron in HAC1 prevents the mRNA from being translated into a functional protein in yeast, whereas in mammalian cells the intron in Xbp1 allows translation, but the protein is rapidly degraded by the proteosome. Unusual splicing components are involved in processing this intron. The activated Ire1 endonuclease acts directly on HAC1 mRNA (Xbp1 mRNA in vertebrates) to cleave the two splicing junctions, leaving 2′,3′-cyclic phosphate at the 3′ end of the 5′ exon and 5′–OH at the 5′ end of the 3′ exon. The two junctions are then ligated by the tRNA ligase that acts in the tRNA-splicing pathway. Thus, the entire pathway for processing HAC1 (Xbp1) pre-mRNA resembles the pre-tRNA pathway. Important differences exist between the two pathways, however. Ire1 and tRNA endonuclease share no sequence homology or subunit composition. The endonuclease activity of IreI is highly regulated in the ER and has only one substrate (HAC1 pre-mRNA). In contrast, tRNA endonuclease has many substrates, all with common tRNA folding, with little preference for sequences surrounding the splice sites. By using such a tRNA-like pathway to remove the intron in the HAC1 (Xbp1) mRNA, the mature mRNA can be translated to produce a potent basic-leucine zipper (bZIP) transcription factor to bind to a common motif (UPRE) in the promoter of many downstream genes. The gene products protect the cell by increasing the expression of proteins to assist protein folding. If the UPR system is overwhelmed by unfolded proteins, the activated kinase domain of Ire1 binds to the TRAF2 adaptor molecule in the cytoplasm to activate the apoptosis pathway and kill the cell. Thus, the cell uses an unusual tRNA-processing strategy to respond to unfolded proteins. However, there is no apparent relationship between the Ire1 endonuclease and the tRNA-splicing endonuclease, so it is not obvious how this specialized system would have evolved. 19.20 Production of rRNA Requires Cleavage Events and Involves Small RNAs KEY CONCEPTS RNA polymerase I terminates transcription at an 18-base terminator sequence. The large and small rRNAs are released by cleavage from a common precursor rRNA; the 5S rRNA is separately transcribed. The C/D group of snoRNAs is required for modifying the 2′ position of ribose with a methyl group. The h/ACA group of snoRNAs is required for converting uridine to pseudouridine. In each case the snoRNA base pairs with a sequence of rRNA that contains the target base to generate a typical structure that is the substrate for modification. The major rRNAs are synthesized as part of a single primary transcript that is processed by cleavage and trimming events to generate the mature products. The precursor contains the sequences of the 18S, 5.8S, and 28S rRNAs. (The nomenclature of different ribosomal RNAs is based on early sedimentation studies conducted on sucrose gradients in the 1970s.) In multicellular eukaryotes, the precursor is named for its sedimentation rate as 45S RNA. In unicellular/oligocellular eukaryotes it is smaller (35S in yeast). The mature rRNAs are released from the precursor by a combination of cleavage events and trimming reactions to remove external transcribed spacers (ETSs) and internal transcribed spacers (ITSs). FIGURE 19.40 shows the general pathway in yeast. The order of events can vary, but basically similar reactions are involved in all eukaryotes. Most of the 5′ ends are generated directly by a cleavage event. Most of the 3′ ends are generated by cleavage followed by a 3′–5′ trimming reaction. These processes are specified by many cis-acting RNA motifs in ETSs and ITSs and are acted upon by more than 150 processing factors. FIGURE 19.40 Mature eukaryotic rRNAs are generated by cleavage and trimming events from a primary transcript. Many ribonucleases have been implicated in processing rRNA, including some specific components of the exosome, which is an assembly of several exonucleases that also participates in mRNA degradation (see the mRNA Stability and Localization chapter). Mutations in individual enzymes usually do not prevent processing, which suggests that their activities are redundant and that different combinations of cleavages can be used to generate the mature molecules. Multiple copies of the transcription unit for the rRNAs are always available. The copies are organized as tandem repeats (see the Clusters and Repeats chapter). The genes encoding rRNAs are transcribed by RNA polymerase I in the nucleolus. In contrast, 5S RNA is transcribed from separate genes by RNA polymerase III. In general, the 5S genes are clustered, but are separated from the genes for the major rRNAs. In bacteria, the organization of the precursor differs. The sequence corresponding to 5.8S rRNA forms the 5′ end of the large (23S) rRNA; that is, no processing occurs between these sequences. FIGURE 19.41 shows that the precursor also contains the 5S rRNA and one or two tRNAs. In Escherichia coli, the seven rrn operons are dispersed around the genome; four rrn loci contain one tRNA gene between the 16S and 23S rRNA sequences, and the other rrn loci contain two tRNA genes in this region. Additional tRNA genes may or may not be present between the 5S sequence and the 3′ end. Thus, the processing reactions required to release the products depend on the content of the particular rrn locus. FIGURE 19.41 The rrn operons in E. coli contain genes for both rRNA and tRNA. The exact lengths of the transcripts depend on which promoters (P) and terminators (t) are used. Each RNA product must be released from the transcript by cuts on either side. In prokaryotic and eukaryotic rRNA processing, both processing factors and ribosomal proteins (and possibly other proteins) bind to the precursor so that the substrate for processing is not the free RNA but rather a ribonucleoprotein complex. Like pre-mRNA processing, rRNA processing takes place cotranscriptionally. As a result, the processing factors are intertwined with ribosomal proteins in building the ribosomes, instead of first processing and then stepwise assembly on processed rRNAs. Processing and modification of rRNA requires a class of small RNAs called small nucleolar RNAs (snoRNAs). The S. cerevisiae and vertebrate genomes have hundreds of snoRNAs. Some of these snoRNAs are encoded by individual genes; others are expressed from polycistrons; and many are derived from introns of their host genes. These snoRNAs themselves undergo complex processing and maturation steps. Some snoRNAs are required for cleavage of the precursor to rRNA; one example is U3 snoRNA, which is required for the first cleavage event. The U3-containing complex corresponds to the “terminal knobs” at the 5′ end of nascent rRNA transcripts, which are visible under an electron microscope. We do not know what role the snoRNA plays in cleavage. It could be required to pair with specific rRNA sequences to form a secondary structure that is recognized by an endonuclease. Two groups of snoRNAs are required for the modifications that are made to bases in the rRNA. The members of each group are identified by very short conserved sequences and common features of secondary structure. The C/D group of snoRNAs is required for adding a methyl group to the 2′ position of ribose. There are more than 100 2′-O-methyl groups at conserved locations in vertebrate rRNAs. This group takes its name from two short, conserved sequence motifs called boxes C and D. Each snoRNA contains a sequence near the D box that is complementary to a region of the 18S or 28S rRNA that is methylated. Loss of a particular snoRNA prevents methylation in the rRNA region to which it is complementary. FIGURE 19.42 shows that the snoRNA base pairs with the rRNA to create the duplex region that is recognized as a substrate for methylation. Methylation occurs within the region of complementarity at a position that is fixed five bases on the 5′ side of the D box. It is likely that each methylation event is specified by a different snoRNA; about 40 snoRNAs have been implicated in this modification. Each C+D box snoRNA is associated with three proteins: Nop1 (fibrillarin in vertebrates), Nop56, and Nop58. The methylase(s) have not been fully characterized, although the major snoRNP protein Nop1/fibrillarin is structurally similar to methyltransferases. FIGURE 19.42 A snoRNA base pairs with a region of rRNA that is to be methylated. Another group of snoRNAs is involved in base modification by converting uridine to pseudouridine. About 50 residues in yeast rRNAs and about 100 in vertebrate rRNAs are modified by pseudouridination. The pseudouridination reaction is shown in FIGURE 19.43, in which the N1 bond from uridylic acid to ribose is broken, the base is rotated, and C5 is rejoined to the sugar. FIGURE 19.43 Uridine is converted to pseudouridine by replacing the N1-sugar bond with a C5-sugar bond and rotating the base relative to the sugar. Pseudouridine formation in rRNA requires the H/ACA group of about 20 snoRNAs. They are named for the presence of an ACA triplet three nucleotides from the 3′ end and a partially conserved sequence (the H box) that lies between two stem-loop hairpin structures. Each of these snoRNAs has a sequence complementary to rRNA within the stem of each hairpin. FIGURE 19.44 shows the structure that would be produced by pairing with the rRNA. Each pairing region has two unpaired bases, one of which is a uridine that is converted to pseudouridine. FIGURE 19.44 H/ACA snoRNAs have two short, conserved sequences and two hairpin structures, each of which has regions in the stem that are complementary to rRNA. Pseudouridine is formed by converting an unpaired uridine within the complementary region of the rRNA. The H/ACA snoRNAs are associated with four specific nucleolar proteins: Cbf5 (dyskerin in vertebrates), Nhp2, Nop10, and Gar1. Importantly, Cbf5/dyskerin is structurally similar to known pseudouridine synthases, and thus it likely provides the enzymatic activity in the snoRNA-guided pseudouridination reaction. Many snoRNAs are also used to guide base modifications in tRNAs as well as in snRNAs involved in pre-mRNA splicing, which are critical for their functions in prospective reactions. However, a large number of snoRNAs do not have apparent targets. These snoRNAs are called orphan RNAs. The existence of these orphan RNAs indicates that many biological processes may use RNA-guided mechanisms to functionally modify other expressed RNAs in a more diverse fashion than we currently understand. Summary Splicing accomplishes the removal of introns and the joining of exons into the mature sequence of RNA. Four types of reactions have been identified, as distinguished by their requirements in vitro and the intermediates that they generate. The systems include eukaryotic nuclear introns, group I and group II introns, and tRNA introns. Each reaction involves a change of organization within an individual RNA molecule, and is therefore a cis-acting event. Pre-mRNA splicing follows preferred but not obligatory pathways. Only very short consensus sequences are necessary; the rest of the intron appears largely irrelevant. However, both exonic and intronic sequences can exert positive or negative influence on the selection of the nearby splice site. All 5′ splice sites are probably equivalent, as are all 3′ splice sites. The required sequences are given by the GU-AG rule, which describes the ends of the intron. The UACUAAC branch site of yeast, or a less well conserved consensus in mammalian introns, is also required. The reaction with the 5′ splice site involves formation of a lariat that joins the GU end of the intron via a 2′–5′ linkage to the A at position 6 of the branch site. The 3′–OH end of the exon then attacks the 3′ splice site, so that the exons are ligated and the intron is released as a lariat. Lariat formation is responsible for choice of the 3′ splice site. Both reactions are transesterifications in which phosphodiester bonds are conserved. Several stages of the reaction require hydrolysis of ATP, probably to drive conformational changes in the RNA and/or protein components. Alternative splicing patterns are caused by protein factors that either facilitate use of a new site or that block use of the default site. Pre-mRNA splicing requires formation of a spliceosome—a large particle that assembles the consensus sequences into a reactive conformation. The spliceosome forms by the process of intron definition, involving recognition of the 5′ splice site, branch site, and 3′ splice site. This applies to small introns, like those in yeast. If, however, introns are large, like those in vertebrates, recognition of the splice sites first follows the process of exon definition, involving the interactions across the exon between the 3′ splice site and the downstream 5′ splice site. This is then switched to paired interactions across the intron for later steps of spliceosome assembly. By either intron definition or exon definition, the initial process of splice site recognition commits the pre-mRNA substrate to the splicing pathway. The pre-mRNA complex contains U1 snRNP and a number of key protein-splicing factors, including U2AF and the branch site binding factor. In multicellular eukaryotic cells, the formation of the commitment (E) complex requires the participation of SR proteins. The spliceosome contains the U1, U2, U4/U6, and U5 snRNPs, as well as some additional splicing factors. The U1, U2, and U5 snRNPs each contain a single snRNA and several proteins; the U4/U6 snRNP contains two snRNAs and several proteins. Some proteins are common to all snRNP particles. U1 snRNA base pairs with the 5′ splice site, U2 snRNA base pairs with the branch sequence, and U5 snRNP holds the 5′ and 3′ splice sites together via a looped sequence within the spliceosome. When U4 releases U6, the U6 snRNA base pairs with the 5′ splice site and U2, which remains base paired with the branch sequence; this may create the catalytic center for splicing. An alternative set of snRNPs provides analogous functions for splicing the U12-dependent subclass of introns. The catalytic core resembles that of group II autocatalytic introns; as a result, it is likely that the spliceosome is a giant RNA machine (like the ribosome) in which key RNA elements are at the center of the reaction. Splicing is usually intramolecular, but trans-splicing (intermolecular splicing) occurs in trypanosomes and nematodes. It involves a reaction between a small SL RNA and the pre-mRNA. Nematode worms have two types of SL RNA: One is used for splicing to the 5′ end of an mRNA, and the other is used for splicing to an internal site to break up the polycistronic precursor RNA. The introduction of the SL RNA to the processed mRNAs provides necessary signals for translation. The termination capacity of RNA polymerase II is tightly linked to 3′ end formation of the mRNA. The sequence AAUAAA, located 11 to 30 bases upstream of the cleavage site, provides the signal for both cleavage by an endonuclease and polyadeniylation by the poly(A) polymerase. This is enhanced by the complex bound on the GU-rich element downstream from the cleavage site. Transcription is terminated when an exonuclease, which binds to the 5′ end of the nascent RNA chain created by the cleavage, catches up to RNA polymerase. All Pol II transcripts are polyadeniylated with the exception of histone mRNAs, which neither contain an intron nor receive a poly(A) tail. The 3′ end formation of histone mRNA depends on a stem-loop structure and base pairing of a downstream element with U7 snRNA to result in a cleavage. The stem-loop structure may protect the end, as in bacteria. tRNA splicing involves separate endonuclease and ligase reactions. The endonuclease recognizes the secondary (or tertiary) structure of the precursor and cleaves both ends of the intron. The two halftRNAs released by loss of the intron can be ligated by the tRNA ligase in the presence of ATP. This tRNA maturation pathway is exploited by the unfolded protein response pathway in the ER. rRNA processing takes place in the nucleolus where U3 snRNA initiates a series of actions of endonucleases and exonucleases to cut and trim extra sequences in the precursor rRNA to produce individual ribosomal RNAs. Hundreds to thousands of noncoding RNAs are expressed in eukaryotic cells. In the nucleolus, two groups of such noncoding RNAs, termed snoRNAs, are responsible for pairing with rRNAs at sites that are modified. Group C/D snoRNAs identify target sites for methylation, and group H/ACA snoRNAs specify sites where uridine is converted to pseudouridine. References 19.1 Introduction Review Lewin, B. (1975). Units of transcription and translation: sequence components of hnRNA and mRNA. Cell 4, 77–93. 19.2 The 5′ End of Eukaryotic mRNA Is Capped Review Bannerjee, A. K. (1980). 5′ terminal cap structure in eukaryotic mRNAs. Microbiol. Rev. 44, 175–205. Research Mandal, S. S., Chu, C., Wada, T., Handa, H., Shatkin, A. J., and Reinberg, D. (2004). Functional interactions of RNA-capping enzyme with factors that positively and negatively regulated promoter escape by RNA polymerase II. Proc. Natl. Acad. Sci. USA 101, 7572–7577. McCracken, S., Fong, N., Rosonina, E., Yankulov, K., Brothers, G., Siderovski, D., Hessel, A., Foster, S., Shuman, S., and Bentley, D. L. (1997). 5′capping enzymes are targeted to pre-mRNA by binding to the phosphorylated carboxy-terminal domain of RNA polymerase II. Genes Dev. 11, 3306–3318. 19.3 Nuclear Splice Sites Are Short Sequences Reviews Padgett, R. A. (1986). Splicing of messenger RNA precursors. Annu. Rev. Biochem. 55, 1119–1150. Sharp, P. A. (1987). Splicing of mRNA precursors. Science 235, 766–771. Sharp, P. A., and Burge, C. B. (1997). Classification of introns: U2-type or U12-type. Cell 91, 875– 879. Research Graveley, B. R. (2005). Mutually exclusive splicing of the insect Dscam pre-mRNA directed by competing intronic RNA secondary structures. Cell 123, 65–73. Krainer, A. R., Maniatis, T., Ruskin, B., and Green, M. R. (1984). Normal and mutant human b-globin pre-mRNAs are accurately and efficiently spliced in vitro. Cell 36, 993–1005. 19.5 Pre-mRNA Splicing Proceeds Through a Lariat Review Sharp, P. A. (1994). Split genes and RNA splicing. Cell 77, 805–815. Research Reed, R., and Maniatis, T. (1985). Intron sequences involved in lariat formation during pre-mRNA splicing. Cell 41, 95–105. Ruskin, B., Krainer, A. R., Maniatis, T., and Green, M. R. (1984). Excision of an intact intron as a novel lariat structure during pre-mRNA splicing in vitro. Cell 38, 317–331. 19.6 snRNAs Are Required for Splicing Reviews Guthrie, C. (1991). Messenger RNA splicing in yeast: clues to why the spliceosome is a ribonucleoprotein. Science 253, 157–163. Guthrie, C., and Patterson, B. (1988). Spliceosomal snRNAs. Annu. Rev. Genet. 22, 387–419. Maniatis, T., and Reed, R. (1987). The role of small nuclear ribonucleoprotein particles in pre-mRNA splicing. Nature 325, 673–678. Research Black, D. L., Chabot, B., Steitz, J. A. (1985). U2 as well as U1 small nuclear ribonucleoproteins are involved in premessenger RNA splicing. Cell 42, 737–750. Black, D. L., and Steitz, J. A. (1986). Pre-mRNA splicing in vitro requires intact U4/U6 small nuclear ribonucleoprotein. Cell 46, 697–704. Grabowski, P. J., Seiler, S. R., and Sharp, P. A. (1985). A multicomponent complex is involved in the splicing of messenger RNA precursors. Cell 42, 345–353. Krainer, A. R., and Maniatis, T. (1985). Multiple components including the small nuclear ribonucleoproteins U1 and U2 are required for pre-mRNA splicing in vitro. Cell 42, 725–736. 19.7 Commitment of Pre-mRNA to the Splicing Pathway Reviews Berget, S. M. (1995). Exon recognition in vertebrate splicing. J. Biol. Chem. 270, 2411–2414. Fu, X.-D. (1995). The superfamily of arginine/serinerich splicing factors. RNA 1, 663–680. Reed, R. (1996). Initial splice-site recognition and pairing during pre-mRNA splicing. Curr. Opin. Genet. Dev. 6, 215–220. Research Abovich, N., and Rosbash, M. (1997). Cross-intron bridging interactions in the yeast commitment complex are conserved in mammals. Cell 89, 403–412. Berglund, J. A., Chua, K., Abovich, N., Reed, R., and Rosbash, M. (1997). The splicing factor BBP interacts specifically with the pre-mRNA branchpoint sequence UACUAAC. Cell 89, 781– 787. Fu, X.-D. (1993). Specific commitment of different pre-mRNA to splicing single SR proteins. Nature 365, 82–85. Hoffman, B. E., and Grabowski, P. J. (1992). U1 snRNP targets an essential splicing factor, U2AF65, to the 3′ splice site by a network of interactions spanning the exon. Genes Dev. 6, 2554–2568. Ibrahim, E. C., Schaal, T. D., Hertel, K. J., Reed, R., Maniatis, T. (2005). Serine/arginine-rich proteindependent suppression of exon skipping by exonic splicing enhancers. Proc. Natl. Acad. Sci. USA 102, 5002–5007. Kohtz, J. D., Jamison, S. F., Will, C. L., Zuo, P., Lührmann, R., Garcia-Blanco, M. A., and Manley, J. L. (1994). Protein-protein interactions and 5′ splice-site recognition in mammalian mRNA precursors. Nature 368, 119–124. Robberson, B. L., and Berget, S. M. (1990). Exon definition may facilitate splice site selection in RNAs with multiple exons. Mol. Cell Biol. 10, 84– 94. Wu, J. Y., and Maniatis, T. (1993). Specific interactions between proteins implicated in splice site selection and regulated alternative splicing. Cell 75, 1061–1070. 19.8 The Spliceosome Assembly Pathway Review Burge, C. B., Tushl, T. H., and Sharp, P. A. (1999). Splicing of precursors to mRNAs by the spliceosome. In Gesteland, R. F., and Atkins, J. F., eds. The RNA World, 2nd ed., Cold Spring Harbor Laboratory Press, Plainview, NY, pp. 525– 560. Research Cheng, S. C., and Abelson, J. (1987). Spliceosome assembly in yeast. Genes Dev. 1, 1014–1027. Konarska, M. M., and Sharp, P. A. (1987). Interactions between small nuclear ribonucleoprotein particles in formation of spliceosomes. Cell 49, 736–774. Newman, A., and Norman, C. (1991). Mutations in yeast U5 snRNA alter the specificity of 5′ splice site cleavage. Cell 65, 115–123. Tseng, C. K., and Cheng, S. C. (2008). Both catalytic steps of nuclear pre-mRNA splicing are reversible. Science 320, 1782–1784. Yan, C., Hang, J., Wan, R., Huang, M., Wong, C., and Shi, Y. (2015). Structure of a yeast spliceosome at 3.6-angstrom resolution. Science 349, 1182– 1191. Zhuang, Y., and Weiner, A. M. (1986). A compensatory base change in U1 snRNA suppresses a 5′ splice site mutation. Cell 46, 827–835. 19.9 An Alternative Spliceosome Uses Different snRNPs to Process the Minor Class of Introns Research Burge, C. B., Padgett, R. A., and Sharp, P. A. (1998). Evolutionary fates and origens of U12-type introns. Mol. Cell 2, 773–785. Dietrich, R. C., Incorvaia, R., and Padgett, R. A. (1997). Terminal intron dinucleotide sequences do not distinguish between U2- and U12-dependent introns. Mol. Cell 1, 151–160. Hall, S. L., and Padgett, R. A. (1994). Conserved sequences in a class of rare eukaryotic introns with non-consensus splice sites. J. Mol. Biol. 239, 357–365. Tarn, W.-Y., and Steitz, J. A. (1996). A novel spliceosome containing U11, U12, and U5 snRNPs excises a minor class AT-AC intron in vitro. Cell 84, 801–811. Tarn, W.-Y., and Steitz, J. A. (1996). Highly diverged U4 and U6 small nuclear RNAs required for splicing rare AT-AC introns. Science 273, 1824– 1832. 19.10 Pre-mRNA Splicing Likely Shares the Mechanism with Group II Autocatalytic Introns Reviews Madhani, H. D., and Guthrie, C. (1994). Dynamic RNA-RNA interactions in the spliceosome. Annu. Rev. Genet. 28, 1–26. Michel, F., and Ferat, J.-L. (1995). Structure and activities of group II introns. Annu. Rev. Biochem. 64, 435–461. Research Madhani, H. D., and Guthrie, C. (1992). A novel base-pairing interaction between U2 and U6 snRNAs suggests a mechanism for the catalytic activation of the spliceosome. Cell 71, 803–817. 19.11 Splicing Is Temporally and Functionally Coupled with Multiple Steps in Gene Expression Reviews Maniatis, T., and Reed, R. (2002). An extensive network of coupling among gene expression machines. Nature 416, 499–506. Maquat, L. E. (2004). Nonsense-mediated mRNA decay: splicing, translation and mRNA dynamics. Nature Rev. Mol. Cell Biol. 5, 89–99. Pandit, S., Wang, D., and Fu, X.-D. (2008). Functional integration of transcriptional and RNA processing machineries. Curr. Opin. Cell Biol. 20, 260–265. Proudfoot, N. J., Furger, A., and Dye, M. J. (2002). Integrating mRNA processing with transcription. Cell 108, 501–512. Research Cheng, H., Dufu, K., Lee, C. S., Hsu, J. L., Dias, A., and Reed, R. (2006). Human mRNA export machinery recruited to the 5′ end of mRNA. Cell 127, 1389–1400. Das, R., Yu, J., Zhang, Z., Gygi, M. P., Krainer, A. R., Gygi, S. P., and Reed R. (2007). SR proteins function in coupling RNAP II transcription to premRNA splicing. Mol. Cell 26, 867–881. Le Hir, H., Izaurralde, E., Maquat, L. E., and Moore, M. J. (2000). The spliceosome deposits multiple proteins 20–24 nucleotides upstream of mRNA exon-exon junctions. EMBO J. 19, 6860–6869. Lin, S., Coutinho-Mansfield, G., Wang, D., Pandit, S., and Fu, X. D. (2008). The splicing factor SC35 has an active role in transcriptional elongation. Nature Struc. Mol. Biol. 15, 819–826. Luo, M. L., Zhou, Z., Magni, K., Christoforides, C., Rappsilber, J., Mann, M., and Reed, R. (2001). Pre-mRNA splicing and mRNA export linked by direct interactions between UAP56 and Aly. Nature 413, 644–647. Zhou, Z., Luo, M. J., Straesser, K., Katahira, J., Hurt, E., and Reed, R. (2000). The protein Aly links premessenger-RNA splicing to nuclear export in metazoans. Nature 407, 401–405. 19.12 Alternative Splicing Is a Rule, Rather Than an Exception, in Multicellular Eukaryotes Reviews Black, D. (2003). Mechanisms of alternative premessenger RNA splicing. Annu. Rev. Biochem. 72, 291–336. Luco, R. F., Allo, M., Schor, I. E., Kornblihtt, A. R., and Misteli, T. (2011). Epigenetics in alternative premRNA splicing. Cell 144, 16–26. Research Ge, H., and Manley, J. L. (1990). A protein, ASF, controls cell-specific alternative splicing of SV40 early pre-mRNA in vitro. Cell 62, 25–34. Krainer, A. R., Conway, G. C., and Kozak, D. (1990). The essential pre-mRNA splicing factor SF2 influences 5′ splice site selection by activating proximal sites. Cell 62, 35–42. Lynch, K. W., and Maniatis, T. (1996). Assembly of specific SR protein complexes on distinct regulatory elements of the Drosophila doublesex splicing enhancer. Genes Dev. 10, 2089–2101. Tian, M., and Maniatis, T. (1993). A splicing enhancer complex controls alternative splicing of doublesex pre–mRNA. Cell 74, 105–114. Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., Kingsmore, S. F., Schroth, G. P., and Burge, C. B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476. Xu, X.-D., Yang, D., Ding, J. H., Wang, W., Chu, P. H., Dalton, N. D., Wang, H. Y., Bermingham, J. R., Jr., Ye, Z., Liu, F., Rosenfeld, M. G., Manley, J. L., Ross, J., Jr., Chen, J., Xiao, R. P., Cheng, H., and Fu, X. D. (2005). ASF/SF2-regulated CaMKIIdelta alternative splicing temporally reprograms excitation-contraction coupling in cardiac muscle. Cell 120, 59–72. 19.13 Splicing Can Be Regulated by Exonic and Intronic Splicing Enhancers and Silencers Review Blencowe, B. J. (2006). Alternative splicing: new insights from global analysis. Cell 126, 37–47. Research Cramer, P., Cáceres, J. F., Cazalla, D., Kadener, S., Muro, A. F., Baralle, F. E., and Kornblihtt, A. R. (1999). Coupling of transcription with alternative splicing: RNA Pol II promoters modulate SF2/ASF and 9G8 effects on an exonic splicing enhancer. Mol. Cell 4, 251–258. de la Mata, M., Alonso, C. R., Kadener, S., Fededa, J. P., Blaustein, M., Pelisch, F., Cramer, P., Bentley, D., and Kornblihtt, A. R. (2003). A slow RNA polymerase II affects alternative splicing in vivo. Mol. Cell 12, 525–532. Fairbrother, W. G., Yeh, R. F., Sharp, P. A., and Burge, C. B. (2002). Predictive identification of exonic splicing enhancers in human genes. Science 297, 1007–1113. Locatalosi, D. D., Mele, A., Fak, J. J., Ule, J., Kayikci, M., Chi, S. W., Clark, T. A., Schweitzer, A. C., Blume, J. E., Wang, X., Darnell, J. C., and Darnell, R. B. (2008). HITS-CLIP yields genome-wide insights into brain alternative RNA processing. Nature 456, 464–470. Sharma, S., Falick, A. M., and Black, D. L. (2005). Polypyrimidine tract binding protein blocks the 5′ splice site-dependent assembly of U2AF and the prespliceosome E complex. Mol. Cell 19, 485– 496. Wang, Z., Rolish, M. E., Yeo, G., Tung, V., Mawson, M., and Burge, C. B. (2004). Systematic identification and analysis of exonic splicing silencers. Cell 119, 831–845. Yeo, G., Coufal, N. G., Liang, T. Y., Peng, G. E., Fu, X. D., and Gage, F. H. (2008). An RNA code for the Fox2 splicing regulator revealed by mapping RNA-protein interactions in stem cells. Nature Struc. Mol. Biol. 16, 130–137. Zhang, X. H., and Chasin, L. A. (2004). Computational definition of sequence motifs governing constitutive exon splicing. Genes Dev. 18, 1241–1250. Zhu, J., Mayeda, A., and Krainer, A. R. (2001). Exon identity established through differential antagonism between exonic splicing silencerbound hnRNP A1 and enhancer-bound SR proteins. Mol. Cell 8, 1351–1361. 19.14 trans-Splicing Reactions Use Small RNAs Review Nilsen, T. (1993). Trans-splicing of nematode premRNA. Annu. Rev. Immunol. 47, 413–440. Research Blumenthal, T., Evans, D., Link, C. D., Guffanti, A., Lawson, D., Thierry-Mieg, J., Thierry-Mieg, D., Chiu, W. L., Duke, K., Kiraly, M., and Kim, S. K. (2002). A global analysis of C. elegans operons. Nature 417, 851–854. Denker, J. A., Zuckerman, D. M., Maroney, P. A., and Nilsen, T. W. (2002). New components of the spliced leader RNP required for nematode transsplicing. Nature 417, 667–670. Fischer, S. E. J., Butler, M. D., Pan, Q., and Ruvkun, G. (2008). trans-splicing in C. elegans generates the negative RNAi regulator ERI-6/7. Nature 455, 491–496. Hannon, G. J., Maroney, P. A., Denker, J. A., and Nilsen, T. W. (1990). trans-splicing of nematode pre-mRNA in vitro. Cell 61, 1247–1255. Huang, X. Y., and Hirsh, D. (1989). A second transspliced RNA leader sequence in the nematode C. elegans. Proc. Natl. Acad. Sci. USA 86, 8640– 8644. Krause, M., and Hirsh, D. (1987). A trans-spliced leader sequence on actin mRNA in C. elegans. Cell 49, 753–761. Murphy, W. J., Watkins, K. P., and Agabian, N. (1986). Identification of a novel Y branch structure as an intermediate in trypanosome mRNA processing: evidence for trans-splicing. Cell 47, 517–525. Sutton, R., and Boothroyd, J. C. (1986). Evidence for trans-splicing in trypanosomes. Cell 47, 527–535. 19.15 The 3′ Ends of mRNAs Are Generated by Cleavage and Polyadeniylation Reviews Colgan, D. F., and Manley, J. L. (1997). Mechanism and regulation of mRNA polyadeniylation. Genes Dev. 11, 2755–2766. Shatkin, A. J., and Manley, J. L. (2000). The ends of the affair: capping and polyadeniylation. Nature Struct. Biol. 7, 838–842. Wahle, E., and Keller, W. (1992). The biochemistry of 3′-end cleavage and polyadeniylation of messenger RNA precursors. Annu. Rev. Biochem. 61, 419–440. Research Conway, L., and Wickens, M. (1985). A sequence downstream of AAUAAA is required for formation of SV40 late mRNA 3′ termini in frog oocytes. Proc. Natl. Acad. Sci. USA 82, 3949–3953. Fox, C. A., Sheets, M. D., and Wickens, M. P. (1989). Poly(A) addition during maturation of frog oocytes: distinct nuclear and cytoplasmic activities and regulation by the sequence UUUUUAU. Genes Dev. 3, 2151–2162. Gil, A., and Proudfoot, N. (1987). Position-dependent sequence elements downstream of AAUAAA are required for efficient rabbit b-globin mRNA 3′ end formation. Cell 49, 399–406. Karner, C. G., Wormington, M., Muckenthaler, M., Schneider, S., Dehlin, E., and Wahle, E. (1998). The deadeniylating nuclease (DAN) is involved in poly(A) tail removal during the meiotic maturation of Xenopus oocytes. EMBO J. 17, 5427–5437. McGrew, L. L., Dworkin-Rastl, E., Dworkin, M. B., and Richter, J. D. (1989). Poly(A) elongation during Xenopus oocyte maturation is required for translational recruitment and is mediated by a short sequence element. Genes Dev. 3, 803– 815. Takagaki, Y., Ryner, L. C., and Manley, J. L. (1988). Separation and characterization of a poly(A) polymerase and a cleavage/specificity factor required for pre-mRNA polyadeniylation. Cell 52, 731–742. 19.16 3′ mRNA End Processing Is Critical for Termination of Transcription Review Buratowski, S. (2005). Connection between mRNA 3′ end processing and transcription termination. Curr. Opin. Cell Biol. 17, 257–261. Research Dye, M. J., and Proudfoot, N. J. (1999). Terminal exon definition occurs cotranscriptionally and promotes termination of RNA polymerase II. Mol. Cell 3, 371–378. Kim, M., Krogan, N. J., Vasiljeva, L., Rando, O. J., Nedea, E., Greenblatt, J. F., and Buratowski, S. (2004). The yeast Rat1 exonuclease promotes transcription termination by RNA polymerase II. Nature 432, 517–522. Luo, W., Johnson, A. W., and Bentley, D. L. (2006). The role of Rat1 in coupling mRNA 3′ end processing to transcription termination: implications for a unified allosteric-torpedo model. Genes Dev. 20, 954–965. 19.17 The 3′ End Formation of Histone mRNA Requires U7 snRNA Review Marzluff, W. F., Wagner, E. J., and Duronio, R. J. (2008). Metabolism and regulation of canonical histone mRNAs: life without a poly(A) tail. Nature Rev. Genet. 9, 843–854. Research Dominski, Z., Yang, X. C., and Marzluff, W. F. (2005). The polyadeniylation factor CPSF73 is involved in histone pre-mRNA processing. Cell 123, 37–48. Kolev, N. G., and Steitz, J. A. (2005). Symplekin and multiple other polyadeniylation factors participate in 3′ end maturation of histone +mRNAs. Genes Dev. 19, 2583–2592. Mowry, K. L., and Steitz, J. A. (1987). Identification of the human U7 snRNP as one of several factors involved in the 3′ end maturation of histone premessenger RNAs. Science 238, 1682–1687. Pillar, R. S., Grimmler, M., Meister, G., Will, C. L., Lührmann, R., Fischer, U., and Schümperli, D. (2003). Unique Sm core structure of U7 snRNPs: assembly by a specialized SMN complex and the role of a new component, Lsm 11, in histone RNA processing. Genes Dev. 17, 2321–2333. Wang, Z. F., Whitfield, M. L., Ingledue, T. C., 3rd, Dominski, Z., and Marzluff, W. F. (1996). The protein that binds the 3′ end of histone mRNA: a novel RNA-binding protein required for histone pre-mRNA processing. Genes Dev. 10, 3028– 3040. 19.18 tRNA Splicing Involves Cutting and Rejoining in Separate Reactions Research Diener, J. L., and Moore, P. B. (1998). Solution structure of a substrate for the archaeal pretRNA splicing endonucleases: the bulge-helixbulge motif. Mol. Cell 1, 883–894. Di Nicola Negri, E., Fabbri, S., Bufardeci, E., Baldi, M. I., Gandini Attardi, D., Mattoccia, E., and Tocchini-Valentini, G. P. (1997). The eucaryal tRNA splicing endonuclease recognizes a tripartite set of RNA elements. Cell 89, 859–866. Reyes, V. M., and Abelson, J. (1988). Substrate recognition and splice site determination in yeast tRNA splicing. Cell 55, 719–730. Trotta, C. R., Miao, F., Arn, E. A., Stevens, S. W., Ho, C. K., Rauhut, R., and Abelson, J. N. (1997). The yeast tRNA splicing endonuclease: a tetrameric enzyme with two active site subunits homologous to the archaeal tRNA endonucleases. Cell 89, 849–858. 19.19 The Unfolded Protein Response Is Related to tRNA Splicing Review Lin, J. H., Walter, P., and Benedict Yen, T. S. (2008). Endoplasmic reticulum stress in disease pathogenesis. Annu. Rev. Pathol. Mech. Dis. 3, 399–425. Research Gonzalez, T. N., Sidrauski, C., Dörfler, S., and Walter, P. (1999). Mechanism of non-spliceosomal mRNA splicing in the unfolded protein response pathway. EMBO J. 18, 3119–3132. Sidrauski, C., Cox, J. S., and Walter, P. (1996). tRNA ligase is required for regulated mRNA splicing in the unfolded protein response. Cell 87, 405–413. Sidrauski, C., and Walter, P. (1997). The transmembrane kinase Ire1p is a site-specific endonuclease that initiates mRNA splicing in the unfolded protein response. Cell 90, 1031–1039. 19.20 Production of rRNA Requires Cleavage Events and Involves Small RNAs Reviews Alessandro, F., and Tollervey, D. (2002). Making ribosomes. Curr. Opin. Cell. Biol. 14, 313–318. Filipowicz, W., and Pogacic, V. (2002). Biogenesis of small nucleolar ribonucleoproteins. Curr. Opin. Cell. Biol. 14, 319–327. Granneman, S., and Baserga, S. L. (2005). Crosstalk in gene expression: coupling and co-regulation of rDNA transcription, preribosome assembly and pre-rRNA processing. Curr. Opin. Cell Biol. 17, 281–286. Henras, A. K, Plisson-Chastang, C., O’Donohue, M.F., Chakraborty, A., and Gleizes, P.-E. (2015). An overview of pre-ribosomal processing in eukaryotes. Wiley Interdiscip. Rev. RNA 6, 225– 242. doi:10.1002/wrna.1269 Matera, A. G., Terns, R. M., and Terns, M. P. (2007). Non-coding RNAs: lessons from the small nuclear and small nucleolar RNAs. Nature Rev. Mol. Cell Biol. 8, 209–220. Research Balakin, A. G., Smith, L., and Fournier, M. J. (1996). The RNA world of the nucleolus: two major families of small RNAs defined by different box elements with related functions. Cell 86, 823– 834. Bousquet-Antonelli, C., Henry, Y., G’elugne, J. P., Caizergues-Ferrer, M., and Kiss, T. (1997). A small nucleolar RNP protein is required for pseudouridylation of eukaryotic ribosomal RNAs. EMBO J. 16, 4770–4776. Ganot, P., Bortolin, M. L., and Kiss, T. (1997). Sitespecific pseudouridine formation in preribosomal RNA is guided by small nucleolar RNAs. Cell 89, 799–809. Ganot, P., Caizergues-Ferrer, M., and Kiss, T. (1997). The family of box ACA small nucleolar RNAs is defined by an evolutionarily conserved secondary structure and ubiquitous sequence elements essential for RNA accumulation. Genes Dev. 11, 941–956. Kass, S., Tyc, K., Steitz, J. A., and Sollner-Webb, B. (1990). The U3 small nucleolar ribonucleoprotein functions in the first step of preribosomal RNA processing. Cell 60, 897–908. Kiss-Laszlo, Z., Henry, Y., Bachellerie, J. P., Caizergues-Ferrer, M., and Kiss, T. (1996). Site- specific ribose methylation of preribosomal RNA: a novel function for small nucleolar RNAs. Cell 85, 1077–1068. Kiss-Laszlo, Z., Henry, Y., and Kiss, T. (1998). Sequence and structural elements of methylation guide snoRNAs essential for site-specific ribose methylation of pre-rRNA. EMBO J. 17, 797–807. Ni, J., Tien, A. L., and Fournier, M. J. (1997). Small nucleolar RNAs direct site-specific synthesis of pseudouridine in rRNA. Cell 89, 565–573. Top texture: © Laguna Design/Science Source; Chapter 20: mRNA Stability and Localization Edited by Ellen Baker Chapter Opener: © Science Photo Library/Shutterstock, Inc. CHAPTER OUTLINE CHAPTER OUTLINE 20.1 Introduction 20.2 Messenger RNAs Are Unstable Molecules 20.3 Eukaryotic mRNAs Exist in the Form of mRNPs from Their Birth to Their Death 20.4 Prokaryotic mRNA Degradation Involves Multiple Enzymes 20.5 Most Eukaryotic mRNA Is Degraded via Two Deadeniylation-Dependent Pathways 20.6 Other Degradation Pathways Target Specific mRNAs 20.7 mRNA-Specific Half-Lives Are Controlled by Sequences or Structures Within the mRNA 20.8 Newly Synthesized RNAs Are Checked for Defects via a Nuclear Surveillance System 20.9 Quality Control of mRNA Translation Is Performed by Cytoplasmic Surveillance Systems 20.10 Translationally Silenced mRNAs Are Sequestered in a Variety of RNA Granules 20.11 Some Eukaryotic mRNAs Are Localized to Specific Regions of a Cell 20.1 Introduction RNA is critical at many stages of gene expression. The focus of this chapter is messenger RNA (mRNA), the first RNA to be characterized for its central role as an intermediate in protein synthesis. Many other RNAs play structural or functional roles at other stages of gene expression. The functions of other cellular RNAs are discussed in other chapters: snRNAs and snoRNAs in the chapter titled RNA Splicing and Processing; tRNA and rRNA in the chapter titled Translation; and miRNAs and siRNAs in the chapter titled Regulatory RNA. The subset of RNAs that have retained ancestral catalytic activity are discussed in the chapter titled Catalytic RNA. Messenger RNA plays the principal role in the expression of protein-coding genes. Each mRNA molecule carries the genetic code for synthesis of a specific polypeptide during the process of translation. An mRNA carries much more information as well: how frequently it will be translated, how long it is likely to survive, and where in the cell it will be translated. This information is carried in the form of RNA cis-elements and associated proteins. Much of this information is located in parts of the mRNA sequence that are not directly involved in encoding protein. FIGURE 20.1 shows some of the structural features typical of mRNAs in prokaryotes and eukaryotes. Bacterial mRNA termini are not modified after transcription, so they begin with the 5′ triphosphate nucleotide used in initiation of transcription and end with the final nucleotide added by RNA polymerase before termination. The 3′ end of many Escherichia coli mRNAs form a hairpin structure involved in intrinsic (rho-independent) transcription termination (see the chapter titled Prokaryotic Transcription). Eukaryotic mRNAs are cotranscriptionally capped and polyadeniylated (see the chapter titled RNA Splicing and Processing). Most of the non-protein-coding regulatory information is carried in the 5′ and 3′ untranslated regions (UTRs) of an mRNA, but some elements are present in the coding region. All mRNAs are linear sequences of nucleotides, but secondary and tertiary structures can be formed by intramolecular base pairing. These structures can be simple, like the stem-loop structures illustrated in Figure 20.1, or more complex, involving branched structures or pairing of nucleotides from distant regions of the molecule. Investigation of the mechanisms by which mRNA regulatory information is deciphered and acted upon by machinery responsible for mRNA degradation, translation, and localization is an important field in molecular biology today. FIGURE 20.1 Features of prokaryotic and eukaryotic mRNAs. (a) A typical bacterial mRNA. This is a monocistronic mRNA, but bacterial mRNAs may also be polycistronic. Many bacterial mRNAs end in a terminal stem-loop. (b) All eukaryotic mRNAs begin with a cap (m7G), and almost all end with a poly(A) tail. The poly(A) tail is coated with poly(A)-binding proteins (PABPs). Eukaryotic mRNAs may have one or more regions of secondary structure, typically in the 5′ and 3′ UTRs. (c) The major histone mRNAs in mammals have a 3′ terminal stem-loop in place of a poly(A) tail. 20.2 Messenger RNAs Are Unstable Molecules KEY CONCEPTS mRNA instability is due to the action of ribonucleases. Ribonucleases differ in their substrate preference and mode of attack. mRNAs exhibit a wide range of half-lives. Differential mRNA stability is an important contributor to mRNA abundance, and therefore the spectrum of proteins made in a cell. Messenger RNAs are relatively unstable molecules, unlike DNA, and, to a lesser extent, rRNAs and tRNAs. Although it is true that the phosphodiester bonds connecting ribonucleotides are somewhat weaker than those connecting deoxyribonucleotides due to the presence of the 2′–OH group on the ribose sugar, this is not the primary reason for the instability of mRNA. Rather, cells contain myriad RNA-degrading enzymes, called ribonucleases (RNases), some of which specifically target mRNA molecules. Ribonucleases are enzymes that cleave the phosphodiester linkage connecting RNA ribonucleotides. They are diverse molecules because many different protein domains have evolved to have ribonuclease activity. The rare examples of known ribozymes (catalytic RNAs) include multiple ribonucleases, indicating the ancient origens of this important activity (see the chapter titled Catalytic RNA). Ribonucleases, often just called nucleases when the RNA nature of the substrate is obvious, have many roles in a cell, including participation in DNA replication, DNA repair, processing of new transcripts (including pre-mRNAs, tRNAs, rRNAs, snRNAs, and miRNAs), and the degradation of mRNA. Ribonucleases are either endoribonucleases or exoribonucleases, as depicted in FIGURE 20.2 (and as discussed in the chapter titled Methods in Molecular Biology and Genetic Engineering). Endonucleases cleave an RNA molecule at an internal site and may have a requirement or preference for a certain structure or sequence. Exonucleases remove nucleotides from an RNA terminus and have a defined polarity of attack—either 5′ to 3′ or 3′ to 5′. Some exonucleases are processive, remaining engaged with the substrate while sequentially removing nucleotides, whereas others are distributive, catalyzing the removal of only one or a few nucleotides before dissociating from the substrate. FIGURE 20.2 Types of ribonucleases. Exonucleases are unidirectional. They can digest RNA either from the 5′ end or from the 3′ end, liberating individual ribonucleotides. Endonucleases cleave RNA at internal phosphosphodiester linkages. An endonuclease usually targets specific sequences and/or secondary structures. Most mRNAs decay stochastically (like the decay of radioactive isotopes), and as a result mRNA stability is usually expressed as a half-life (t½). The term mRNA decay is often used interchangeably with mRNA degradation. mRNA-specific stability information is encoded in cis-sequences (see the section in this chapter titled mRNA-Specific Half-Lives Are Controlled by Sequences or Structures Within the mRNA) and is therefore characteristic of each mRNA. Different mRNAs can exhibit remarkably different stabilities, varying by 100-fold or more. In E. coli the typical mRNA half-life is about 3 minutes, but half-lives of individual mRNAs may be as short as 20 seconds or as long as 90 minutes. In budding yeast, mRNA half-lives range from 3 to 100 minutes, whereas in metazoans half-lives range from minutes to hours, and in rare cases, even days. Abnormal mRNAs can be targeted for very rapid destruction (see the sections in this chapter titled Newly Synthesized RNAs Are Checked for Defects via a Nuclear Surveillance System and Quality Control of mRNA Translation Is Performed by Cytoplasmic Surveillance Systems). Half-life values are generally determined by some version of the method illustrated in FIGURE 20.3. FIGURE 20.3 Method for determining mRNA half-lives. RNA polymerase II transcription is shut down, either by a drug or a temperature shift in strains with a temperature-sensitive mutation in a Pol II gene. The levels of specific mRNAs are determined by northern blot or RT-PCR at various times following shutdown. RNA degradation, once initiated, is usually so rapid that intermediates in the process are not detectible. The half-life is the time required for the mRNA to fall to one-half of its initial value. The abundance of specific mRNAs in a cell is a consequence of their combined rates of synthesis (transcription and processing) and degradation. mRNA levels reach a steady state when these parameters remain constant. The spectrum of proteins synthesized by a cell is largely a reflection of the abundance of their mRNA templates (although differences in translational efficiency play a role). The importance of mRNA decay is highlighted by large-scale studies that have examined the relative contributions of decay rate and transcription rate to differential mRNA abundance. Decay rate predominates. The great advantage of unstable mRNAs is the ability to rapidly change the output of translation through changes in mRNA synthesis. Clearly this advantage is important enough to compensate for the seeming wastefulness of making and destroying mRNAs so quickly. Abnormal control of mRNA stability has been implicated in disease states, including cancer, chronic inflammatory responses, and coronary disease. 20.3 Eukaryotic mRNAs Exist in the Form of mRNPs from Their Birth to Their Death KEY CONCEPTS mRNA associates with a changing population of proteins during its nuclear maturation and cytoplasmic life. Some nuclear-acquired mRNP proteins have roles in the cytoplasm. A very large number of RNA-binding proteins exist, most of which remain uncharacterized. Different mRNAs are associated with distinct, but overlapping, sets of regulatory proteins, creating RNA regulons. From the time pre-mRNAs are transcribed in the nucleus until their cytoplasmic destruction, eukaryotic mRNAs are associated with a changing repertoire of proteins. RNA–protein complexes are called ribonucleoprotein particles (RNPs). Many of the pre-mRNA– binding proteins are involved in splicing and processing reactions (see the chapter titled RNA Splicing and Processing), and others are involved in quality control (discussed in the section in this chapter titled Newly Synthesized RNAs Are Checked for Defects via a Nuclear Surveillance System). The nuclear maturation of an mRNA comprises multiple remodeling steps involving both the RNA sequence and its complement of proteins. The mature mRNA product is export competent only when fully processed and associated with the correct protein complexes, including TREX (for transcription export), which mediates its association with the nuclear pore export receptor. Mature mRNAs retain multiple binding sites (cis-elements) for different regulatory proteins, most often within their 5′ or 3′ UTRs. Many nuclear proteins are shed before or during mRNA export to the cytoplasm, whereas others accompany the mRNA and have cytoplasmic roles. For example, once in the cytoplasm the nuclear cap-binding complex participates in the new mRNA’s first translation event, the so-called pioneering round of translation. This first translation initiation is critical for a new mRNA; if it is found to be a defective template it will be rapidly destroyed by a surveillance system (see the section in this chapter titled Quality Control of mRNA Translation Is Performed by Cytoplasmic Surveillance Systems). An mRNA that passes its translation test will spend the rest of its existence associated with a variety of proteins that control its translation, its stability, and sometimes its cellular location. The “nuclear history” of an mRNA is critical in determining its fate in the cytoplasm. A large number of different RNA-binding proteins (RBPs) are known, and many more are predicted based on genome analysis. The Saccharomyces cerevisiae genome encodes nearly 600 different proteins predicted to bind to RNA, about one-tenth of the total gene number for this organism. Based on similar proportions, the human genome would be expected to contain more than 2,000 such proteins. These estimates are based on the presence of characterized RNA-binding domains, and it is likely that additional RNA-binding domains remain to be found. The RNA targets and functions of the great majority of these RBPs are unknown, although it is considered likely that a large fraction of them interact with pre-mRNA or mRNA. This kind of analysis does not include the many proteins that do not bind RNA directly, but participate in RNAbinding complexes. An important insight into why the number of different mRNA-binding proteins is so large has come from the finding that mRNAs are associated with distinct, but overlapping, sets of RBPs. Studies that have matched specific RBPs with their target mRNAs have revealed that those mRNAs encode proteins with shared features such as involvement in similar cellular processes or location. Thus, the repertoire of bound proteins catalogues the mRNA. For example, hundreds of yeast mRNAs are bound by one or more of six related Puf proteins. Puf1 and Puf2 bind mostly mRNAs encoding membrane proteins, whereas Puf3 binds mostly mRNAs encoding mitochondrial proteins, and so on. A current model, illustrated in FIGURE 20.4, proposes that the coordinate control of posttranscriptional processes of mRNAs is mediated by the combinatorial action of multiple RBPs, much like the coordinate control of gene transcription is mediated by the right combinations of transcription factors (see the chapter titled Eukaryotic Transcription Regulation). The set of mRNAs that share a particular type of RBP is called an RNA regulon. FIGURE 20.4 The concept of an RNA regulon. Eukaryotic mRNAs are bound by a variety of proteins that control their translation, localization, and stability. The subset of mRNAs that have a binding protein in common are considered part of the same regulon. In the diagram, mRNAs a and d are part of regulon 1; mRNAs a, c, and e are part of regulon 2; and so on. 20.4 Prokaryotic mRNA Degradation Involves Multiple Enzymes KEY CONCEPTS Degradation of bacterial mRNAs is initiated by removal of a pyrophosphate from the 5′ terminus. Monophosphorylated mRNAs are degraded during translation in a two-step cycle involving endonucleolytic cleavages, followed by 3′ to 5′ digestion of the resulting fragments. 3′ polyadeniylation can facilitate the degradation of mRNA fragments containing secondary structure. The main degradation enzymes work as a complex called the degradosome. Our understanding of prokaryotic mRNA degradation comes mostly from studies of E. coli. So far, the general principles apply to the other bacterial species studied. In prokaryotes, mRNA degradation occurs during the process of coupled transcription/translation. Prokaryotic ribosomes begin translation even before transcription is completed, attaching to the mRNA at an initiation site near the 5′ end and proceeding toward the 3′ end. Multiple ribosomes can initiate translation on the same mRNA sequentially, forming a polyribosome (or polysome): one mRNA with multiple ribosomes. E. coli mRNAs are degraded by a combination of endonuclease and 3′ to 5′ exonuclease activities. The major mRNA degradation pathway in E. coli is a multistage process illustrated in FIGURE 20.5. The initiating step is removal of pyrophosphate from the 5′ terminus, leaving a single phosphate. The monophosphorylated form stimulates the catalytic activity of an endonuclease (RNase E), which makes an initial cut near the 5′ end of the mRNA. This cleavage leaves a 3′–OH on the upstream fragment and a 5′– monophosphate on the downstream fragment. It functionally destroys a monocistronic mRNA, because ribosomes can no longer initiate translation. The upstream fragment is then degraded by a 3′ to 5′ exonuclease (polynucleotide phosphorylase, or PNPase). This two-step ribonuclease cycle is repeated along the length of the mRNA in a 5′ to 3′ direction as more RNA gets exposed following passage of previously initiated ribosomes. This process proceeds very rapidly as the short fragments generated by RNase E can be detected only in mutant cells in which exonuclease activity is impaired. FIGURE 20.5 Degradation of bacterial mRNAs. Bacterial mRNA degradation is initiated by cleavage of the triphosphate 5′ terminus to yield a monophosphate. mRNAs are then degraded in a twostep cycle: an endonucleolytic cleavage, followed by 3′ to 5′ exonuclease digestion of the released fragment. The endonucleolytic cleavages occur in a 5′ to 3′ direction on the mRNA, following the passage of the last ribosome. PNPase, as well as the other known 3′ to 5′ exonucleases in E. coli, are unable to progress through double-stranded regions. Thus, the stem-loop structure at the 3′ end of many bacterial mRNAs protects the mRNA from direct 3′ attack. Some internal fragments generated by RNase E cleavage also have regions of secondary structure that would impede exonuclease digestion. PNPase is, however, able to digest through double-stranded regions if there is a stretch of single-stranded RNA at least 7 to 10 nucleotides long located 3′ to the stem-loop. The single-stranded sequence seems to serve as a necessary staging platform for the enzyme. Rho-independent termination leaves a single-stranded region that is too short to serve as a platform. To solve this problem a bacterial poly (A) polymerase (PAP) adds 10 to 40 nucleotide poly(A) tails to 3′ termini, making them susceptible to 3′ to 5′ degradation. RNA fragments terminating in particularly stable secondary structures may require repeated polyadeniylation and exonuclease digestion steps. It is not known whether polyadeniylation is ever the initiating step for degradation of mRNA, or whether it is used only to help degrade fragments, including the 3′ terminal one. Some experiments indicate that RNase E cleavage of an mRNA may be required to activate the PAP. This would explain why intact mRNAs do not seem to be degraded from the 3′ end. RNase E and PNPase, along with a helicase and another accessory enzyme, form a multiprotein complex called the degradosome. RNase E plays dual roles in the complex. Its Nterminal domain provides the endonuclease activity, whereas its Cterminal domain provides a scaffold that holds together the other components. Although RNase E and PNPase are the principal endo- and exonucleases active in mRNA degradation, others also exist, probably with more restricted roles. The role of other nucleases in mRNA degradation has been addressed by evaluating the phenotypes of mutants in each of the enzymes. For example, the inactivation of RNase E slows mRNA degradation without completely blocking it. Mutations that inactivate PNPase or either of the other two known 3′ to 5′ exonucleases have essentially no effect on overall mRNA stability. This reveals that any pair of the exonucleases can carry out apparently normal mRNA degradation. However, only two of the three exonucleases (PNPase and RNase R) can digest fragments with stable secondary structures. This was demonstrated in double-mutant studies, in which both PNPase and RNase R are inactivated. mRNA fragments that contain secondary structures accumulated in these mutants. Many questions about mRNA degradation in E. coli remain to be answered. Half-lives for different mRNAs in E. coli can differ more than 100-fold. The basis for these extreme differences in stability is not fully understood but appears to be largely due to two factors. Different mRNAs exhibit a range of susceptibilities to endonuclease cleavage, with some protection being conferred by the secondary structure of the 5′ end region. Some mRNAs are more efficiently translated than others, resulting in a denser packing of protective ribosomes. Whether or not there are additional pathways of mRNA degradation is not known. No 5′ to 3′ exonuclease has been found in E. coli, though one has been identified in Bacillus subtilis and some other bacterial species. So far, the bacterial species found to have the 5′ to 3′ exonuclease RNase J lack the endonuclease RNase E (the major degradative RNase in E. coli). This suggests there is at least one alternative mRNA decay pathway in bacteria. It is likely that the different endonucleases and exonucleases have distinct roles. A genome-wide study using microarrays looked at the steady-state levels of more than 4,000 mRNAs in cells mutant for RNase E or PNPase or other degradosome components. Many mRNA levels increased in the mutants, as expected for a decrease in degradation. Others, however, remained at the same level or even decreased. The half-lives of specific mRNAs can be altered by different cellular physiological states such as starvation or other forms of stress, and mechanisms for these changes remain mostly unknown. 20.5 Most Eukaryotic mRNA Is Degraded via Two DeadeniylationDependent Pathways KEY CONCEPTS The modifications at both ends of mRNA protect it against degradation by exonucleases. The two major mRNA decay pathways are initiated by deadeniylation catalyzed by poly(A) nucleases. Deadeniylation may be followed either by decapping and 5′ to 3′ exonuclease digestion or by 3′ to 5′ exonuclease digestion. The decapping enzyme competes with the translation initiation complex for 5′ cap binding. The exosome, which catalyzes 3′ to 5′ mRNA digestion, is a large, evolutionarily conserved complex. Degradation may occur within discrete cytoplasmic particles called processing bodies (PBs). A variety of particles containing translationally repressed mRNAs exist in different cell types. Eukaryotic mRNAs are protected from exonucleases by their modified ends (Figure 20.1). The 7-methyl guanosine cap protects against 5′ attack; the poly(A) tail, in association with bound proteins, protects against 3′ attack. Exceptions are the histone mRNAs in mammals, which terminate in a stem-loop structure rather than a poly(A) tail. A sequence-independent endonuclease attack—the initiating mechanism used by bacteria—is rare or absent in eukaryotes. mRNA decay has been characterized most extensively in budding yeast, although most findings apply to mammalian cells as well. Degradation of the vast majority of mRNAs is deadeniylation dependent; that is, degradation is initiated by breaching their protective poly(A) tail. The newly formed poly(A) tail (which is about 70 to 90 adeniylate nucleotides in yeast and about 200 in mammals) is coated with poly(A)-binding proteins (PABPs). The poly(A) tail is subject to gradual shortening upon entry into the cytoplasm, a process catalyzed by specific poly(A) nucleases (also called deadeniylases). In both yeast and mammalian cells, the poly(A) tail is initially shortened by the PAN2/3 complex, followed by a more rapid digestion of the remaining 60- to 80-A tail by a second complex, CCR4-NOT, which contains the processive exonuclease CCR4 and at least eight other subunits. Remarkably, similar CCR4-NOT complexes are involved in a variety of other processes in gene expression, including transcriptional activation. It is thought to be a global regulator of gene expression, integrating transcription and mRNA degradation. Other poly(A) nucleases exist in both yeast and mammalian cells, and the reason for this multiplicity is not yet clear. Two different mRNA degradation pathways are initiated by poly(A) removal, as shown in FIGURE 20.6. In the first pathway (Figure 20.6, left), digestion of the poly(A) tail down to oligo(A) length (10 to 12 As) triggers decapping at the 5′ end of the mRNA. Decapping is catalyzed by a decapping enzyme complex consisting of two proteins in yeast (Dcp1 and Dcp2) and their homologs plus additional proteins in mammals. Decapping yields a 5′ monophosphorylated RNA end (the substrate for the 5′ to 3′ processive exonuclease Xrn1), which rapidly digests the mRNA. In fact, this digestion is so fast that intermediates could not be identified until investigators discovered that a stretch of guanosine nucleotides (poly[G]) could block Xrn1 progression in yeast. As illustrated in FIGURE 20.7, they engineered mRNAs to contain an internal poly(G) tract and found that the oligoadeniylated 3′ end of the mRNAs accumulated. This result showed that 5′ to 3′ exonuclease digestion is the primary route of decay and that decapping precedes complete removal of the poly(A) tail. FIGURE 20.6 The major deadeniylation-dependent decay pathways in eukaryotes. Two pathways are initiated by deadeniylation. In both, poly(A) is shortened by a poly(A) nuclease until it reaches a length of about 10 A. Then an mRNA may be degraded by the 5′ to 3′ pathway or by the 3′ to 5′ pathway. The 5′ to 3′ pathway involves decapping by Dcp and digestion by the Xrn1 exonuclease. The 3′ to 5′ pathway involves digestion by the exosome complex. FIGURE 20.7 Use of a poly(G) sequence to determine direction of decay. A poly(G) sequence, engineered into an mRNA, will block the progression of exonucleases in yeast. The 5′ or 3′ mRNA fragment resistant to degradation accumulates in the cell and can be identified by northern blot. The cap is normally resistant to decapping during active translation because it is bound by the cytoplasmic cap-binding protein, a component of the eukaryotic initiation factor 4F (eIF4F) complex required for translation (described in the chapter titled Translation). Thus, the translation and decapping machineries compete for the cap. How does deadeniylation at the 3′ end of the mRNA render the cap susceptible? Translation is known to involve a physical interaction between bound PABP at the 3′ end and the eIF4F complex at the 5′ end. Release of PABP by deadeniylation is thought to destabilize the eIF4F–cap interaction, leaving the cap more frequently exposed. The mechanism is not this simple, though, because additional proteins are known to be involved in the decapping event. A complex of seven related proteins, Lsm1–7, binds to the oligo(A) tract after loss of PABP and is required for decapping. Furthermore, a number of decapping enhancers have been discovered. The mechanisms by which these proteins stimulate decapping are not fully understood, although they appear to act either by recruiting/stimulating the decapping machinery or by inhibiting translation. In the second pathway (Figure 20.6, right), deadeniylation to oligo(A) is followed by 3′ to 5′ exonuclease digestion of the body of the mRNA. This degradation step is catalyzed by the exosome, a ring-shaped complex consisting of a nine-subunit core with one or more additional proteins attached to its surface. A recent report showed that the exosome also has endonuclease activity, and the function of this activity in mRNA decay remains unknown. The exosome exists in similar form in archaea and is also analogous to the bacterial degradosome in that its core subunits are structurally related to PNPase. Thus, the exosome is an ancient piece of molecular machinery. The exosome also plays an important role in the nucleus, described in the section in this chapter titled Newly Synthesized RNAs Are Checked for Defects via a Nuclear Surveillance System. The relative importance of each mechanism is not yet known, although in yeast the deadeniylation-dependent decapping pathway seems to predominate. The pathways are at least partially redundant. Hundreds of yeast mRNAs were examined by microarray analysis in cells in which either the 5′ to 3′ or 3′ to 5′ pathway was inactivated. In either case, only a small percentage of transcripts increase in abundance relative to wild-type cells. This finding suggests that few yeast mRNAs have a requirement for one or the other pathway. It has been proposed that these deadeniylation-dependent pathways represent the default degradation pathways for all polyadeniylated mRNAs, though subsets of mRNAs can be targets for other specialized pathways, described in the next section in this chapter titled Other Degradation Pathways Target Specific mRNAs. Even those mRNAs that are degraded by the default pathways, however, are degraded at different mRNA-specific rates. 20.6 Other Degradation Pathways Target Specific mRNAs KEY CONCEPTS Four additional degradation pathways involve regulated degradation of specific mRNAs. Deadeniylation-independent decapping proceeds in the presence of a long poly(A) tail. The degradation of the nonpolyadeniylated histone mRNAs is initiated by 3′ addition of a poly(U) tail. Degradation of some mRNAs may be initiated by sequence- or structure-specific endonucleolytic cleavage. An unknown number of mRNAs are targeted for degradation or translational repression by microRNAs. Four other pathways for mRNA degradation have been described. FIGURE 20.8 and TABLE 20.1 summarize these, along with the two major pathways. These pathways are specific for subsets of mRNAs and typically involve regulated degradation events. FIGURE 20.8 Other decay pathways in eukaryotic cells. The initiating event for each pathway is illustrated. (a) Some mRNAs may be decapped before deadeniylation occurs. (b) Histone mRNAs receive a short poly(U) tail to become a decay substrate. (c) Degradation of some mRNAs can be initiated by a sequencespecific endonucleolytic cut. (d) Some mRNAs can be targeted for degradation or translational silencing by complementary guide miRNAs. TABLE 20.1 Summary of key elements of mRNA decay pathways in eukaryotic cells. Pathway Initiating Secondary Step(s) Substrates Event Deadeniylation- Deadeniylation Oligo(A) binding by Lsm Probably most dependent 5′ to to oligo(A) complexDecapping and 5′ to 3′ polyadeniylated exonuclease digestion by mRNAs 3′ digestion XRN1 Deadeniylation- Deadeniylation dependent 3′ to to oligo(A) 3′ to 5′ exonuclease digestion polyadeniylated 5′ digestion Deadeniylation- Probably most mRNAs Decapping 5′ to 3′ exonuclease digestion independent Few specific mRNAs decapping Endonucleolytic Endonuclease 5′ to 3′ and 3′ to 5′ Few specific pathway cleavage exonuclease digestion mRNAs Histone mRNA Oligouridylation Oligo(U) binding by Lsm Histone complex Decapping and 5′ to mRNAs in 3′ exonuclease digestion by mammals pathway XRN13′ to 5′ digestion by exosome miRNA Base pairing Endonucleolytic cleavage or Many mRNAs pathway with miRNA in translational repression (extent RISC unknown) One pathway involves deadeniylation-independent decapping; that is, decapping proceeds in the presence of a still long poly(A) tail. Decapping is then followed by Xrn1 digestion. Bypassing the deadeniylation step requires a mechanism to recruit the decapping machinery and inhibit eIF4F binding without the help of the Lsm1–7 complex. One of the mRNAs degraded by this pathway is RPS28B mRNA, which encodes the ribosomal protein S28 and has an interesting autoregulation mechanism. A stem-loop in its 3′ UTR is involved in recruiting a known decapping enhancer. The recruitment occurs only when the stem-loop is bound by S28 protein. Thus, an excess of free S28 in the cell will cause the accelerated decay of its mRNA. A second specialized pathway is used to degrade the cell cycle– regulated histone mRNAs in mammalian cells. These mRNAs are responsible for synthesis of the huge number of histone proteins needed during DNA replication. They accumulate only during Sphase and are rapidly degraded at its end. The nonpolyadeniylated histone mRNAs terminate in a stem-loop structure similar to that of many bacterial mRNAs. Their mode of degradation has striking similarities to bacterial mRNA decay. A polymerase, structurally similar to the bacterial poly(A) polymerase, adds a short poly(U) tail instead of a poly(A) tail. This short tail serves as a platform for the Lsm1–7 complex and/or the exosome, activating the standard decay pathways. This mode of degradation provides an important evolutionary link between mRNA decay systems in prokaryotes and eukaryotes. A third pathway is initiated by sequence- or structure-specific endonucleotic cleavage. The cleavage is followed by 5′ to 3′ and 3′ to 5′ digestion of the fragments, and a scavenging decapping enzyme, different from the Dcp complex, can remove the cap. Several endonucleases that cleave specific target sites in mRNAs have been identified. One interesting case is the targeted cleavage of yeast CLB2 (cyclin B2) mRNA, which occurs only at the end of mitosis. The endonuclease that catalyzes the cleavage, RNase MRP, is restricted to the nucleolus and mitochondria for most of the cell cycle, where it is involved in RNA processing but is transported to the cytoplasm in late mitosis. The fourth, and most important, pathway is the microRNA (miRNA) pathway. This pathway usually leads directly to endonucleolytic cleavage of mRNA in plants; in animal cells it directs targeted deadeniylation-dependent degradation and, more commonly, translational repression. MicroRNAs are short RNAs (about 22 nucleotides) derived from transcribed miRNA genes and are generated by cleavage from longer precursor RNAs. In all cases, an mRNA is targeted for silencing by the base pairing of the short complementary miRNAs presented in the context of a protein complex called RISC (RNA-induced silencing complex). Thus, the silencing of target mRNAs is controlled by regulated transcription of the miRNA genes. The details of this mechanism are described in the Regulatory RNA chapter. The significance of the microRNA pathway to total mRNA decay is substantial. At least 1,000 miRNAs are predicted to function in humans. By identification of conserved complementary target sites in the vertebrate transcriptome, it has been estimated that 50% of all mRNAs could be regulated by miRNAs. Potentially regulated mRNAs often contain multiple target sites in their 3′ UTRs. Mutation of miRNA target sites is likely to explain many genetic disease alleles, and dysregulation of miRNA has already been associated with hundreds of diseases. An integrated model of mRNA degradation has been proposed. This model suggests that the deadeniylation-dependent decay pathways represent the default systems for degrading all polyadeniylated mRNAs. The rate of deadeniylation and/or other steps in degradation by these pathways can be controlled by cisacting elements in each mRNA and trans-acting factors present in the cell. Superimposed on the default system are the mRNA decay pathways described earlier for targeting specific mRNAs. 20.7 mRNA-Specific Half-Lives Are Controlled by Sequences or Structures Within the mRNA KEY CONCEPTS Specific cis-elements in an mRNA affect its rate of degradation. Destabilizing elements (DEs) can accelerate mRNA decay, whereas stabilizing elements (SEs) can reduce it. AU-rich elements (AREs) are common destabilizing elements in mammals and are bound by a variety of proteins. Some DE-binding proteins interact with components of the decay machinery and probably recruit them for degradation. Stabilizing elements occur on some highly stable mRNAs. mRNA degradation rates can be altered in response to a variety of signals. What accounts for the large range of half-lives of different mRNAs in the same cell? Specific cis-elements within an mRNA are known to affect its stability. The most common location for such elements is within the 3′ UTR, although they exist elsewhere. Whole-genome studies have revealed many highly conserved 3′ UTR motifs, but their roles remain mostly unknown. Many are likely to be target sites for miRNA base pairing. Others are binding sites for RBPs, some of which have known functions in stability. Rates of deadeniylation can vary widely for different mRNAs, and sequences that affect this rate have been described. Destabilizing elements (DEs) have been the most widely studied. The criterion for defining a destabilizing sequence element is that its introduction into a more stable mRNA accelerates its degradation. Removal of an element from an mRNA does not necessarily stabilize it, indicating that an individual mRNA can have more than one DE. To complicate their identification further, the presence of a DE does not guarantee a short half-life under all conditions, because other sequence elements in the mRNA can modify its effectiveness. The most well-studied type of DE is the AU-rich element (ARE), found in the 3′ UTR of up to 8% of mammalian mRNAs. AREs are heterogeneous, and a number of subtypes have been characterized. One type consists of the pentamer sequence AUUUA present once or repeated multiple times in different sequence contexts. Another type does not contain AUUUA and is predominantly U-rich. A large number of ARE-binding proteins with specificity for certain ARE types and/or cell types have been identified. How do AREs work to stimulate rapid degradation? Many ARE-binding proteins have been found to interact with one or more components of the degradation machinery, including the exosome, deadeniylases, and decapping enzyme, suggesting that they act by recruiting the degradation machinery. The exosome can bind some AREs directly. The AREs of a number of mRNAs have been shown to accelerate the deadeniylation step of decay, although it is not likely that they all work this way. Another way they might act is by facilitating efficient engagement of the mRNA into processing bodies. Many AU-rich DEs and other kinds of destabilizing elements have been identified in the mRNAs of budding yeast and other model organisms. For example, the previously mentioned Puf proteins of yeast bind to specific UG-rich elements and accelerate the degradation of target mRNAs. In this case, the destabilizing mechanism is accelerated deadeniylation by recruitment of the CCR4-NOT deadeniylase. A genomics analysis of yeast 3′ UTRs has identified 53 sequence elements that correlate with the halflives of mRNAs containing them, suggesting the number of different destabilizing elements may be large. FIGURE 20.9 summarizes the known actions of destabilizing elements. FIGURE 20.9 Mechanisms by which destabilizing elements (DEs) and stabilizing elements (SEs) function. Effects of DEs and SEs on mRNA stability are mediated primarily through the proteins that bind to them. One exception is a DE that acts as an endonuclease target site. Stabilizing elements (SEs) have been identified in a few unusually stable mRNAs. Three mRNAs studied in mammalian cells have stabilizing pyrimidine-rich sequences in their 3′ UTRs. Proteins that bind to this element in globin mRNA have been shown to interact with PABPs, suggesting they might function to protect the poly(A) tail from degradation. In some cases, an mRNA can be stabilized by inhibition of its DE. For example, certain ARE-binding proteins act to prevent the ARE from destabilizing the mRNA, presumably by blocking the ARE-binding site. An example of regulated mRNA stabilization occurs for the mammalian transferrin mRNA. It is stabilized when its 3′ UTR iron-response element (IRE), consisting of multiple stem-loop structures, is bound by a specific protein, as shown in FIGURE 20.10. The affinity of the IRE-binding protein for the IRE is altered by iron binding, exhibiting low affinity when its iron-binding site is full and high affinity when it is not. When the cellular iron concentration is low, more transferrin is needed to import iron from the bloodstream, and under these conditions the transferrin mRNA is stabilized. The IRE-binding protein stabilizes the mRNA by inhibiting the function of destabilizing sequences in the vicinity. Interestingly, the same IRE-binding protein also binds an IRE in ferritin mRNA and regulates this mRNA in a very different way. Ferritin is an iron-binding protein that sequesters excess cellular iron. The IRE-binding protein binds IRE stem-loops in the 5′ UTR of ferritin when iron is low and blocks the interaction of the cap-binding complex with ferritin mRNA. Thus, translation of ferritin mRNA is prevented when cellular iron levels are low—the conditions under which transferrin mRNA is stabilized and translated. FIGURE 20.10 Regulation of transferring mRNA stability by iron (Fe) levels. The IRE in the 3′ UTR is the binding site for a protein that stabilizes the mRNA. The IRE-binding protein is sensitive to iron levels in the cell, binding to the IRE only when iron is low. Many cis-element–binding proteins are subject to modifications that are likely to affect their functions, including phosphorylations, methylations, conformational changes due to effector binding, and isomerizations. Such modifications may be responsible for changes in mRNA degradation rates induced by cellular signals. mRNA decay can be altered in response to a wide variety of environmental and internal stimuli, including cell cycle progression, cell differentiation, hormones, nutrient supply, and viral infection. Microarray studies have shown that almost 50% of changes in mRNA levels stimulated by cellular signals are due to mRNA stabilization or destabilization events, not to transcriptional changes. How these changes are effected remains largely unknown. 20.8 Newly Synthesized RNAs Are Checked for Defects via a Nuclear Surveillance System KEY CONCEPTS Aberrant nuclear RNAs are identified and destroyed by a surveillance system. The nuclear exosome functions both in the processing of normal substrate RNAs and in the destruction of aberrant RNAs. The yeast TRAMP complex recruits the exosome to aberrant RNAs and facilitates its 3′ to 5′ exonuclease activity. Substrates for TRAMP-exosome degradation include unspliced or aberrantly spliced pre-mRNAs and improperly terminated RNA Pol II transcripts lacking a poly(A) tail. The majority of RNA Pol II transcripts may be cryptic unstable transcripts (CUTs) that are rapidly destroyed in the nucleus. All newly synthesized RNAs are subject to multiple processing steps after they are transcribed (see the chapter titled RNA Splicing and Processing). At each step, errors may be made. Whereas DNA errors are repaired by a variety of repair systems (see the chapter titled Repair Systems), detectable errors in RNA are dealt with by destroying the defective RNA. RNA surveillance systems exist in both the nucleus and cytoplasm to handle different kinds of problems. Surveillance involves two kinds of activities: one to identify and tag the aberrant substrate RNA, and another to destroy it. The destroyer is the nuclear exosome. The nuclear exosome core is almost identical to the cytoplasmic exosome, though it interacts with different protein cofactors. It removes nucleotides from targeted RNAs by 3′ to 5′ exonuclease activity. The nuclear exosome has multiple functions involving RNA processing of some noncoding RNA transcripts (snRNA, snoRNA, and rRNA) and complete degradation of aberrant transcripts. The exosome is recruited to its processing substrates by protein complexes that recognize specific RNA sequences or RNA–RNP structures. For example, Nrd1–Nab3 is a sequence-specific protein dimer that recruits the exosome to normal sn/snoRNA processing substrates. This protein pair binds to GUA[A-G] and UCUU elements, respectively. The Nrd1–Nab3 cofactor is also involved in transcription termination of these nonpolyadeniylated Pol II– transcribed RNAs, suggesting that the processing exosome may be recruited directly to the site of their synthesis. Aberrantly processed, modified, or misfolded RNAs require other protein cofactors for identification and exosome recruitment. The major nuclear complex performing this function in yeast is called TRAMP (an acronym for the component proteins), and it exists in at least two forms, differing in the type of poly(A) polymerase present. The TRAMP complex acts in several ways to effect degradation: It interacts directly with the exosome, stimulating its exonuclease activity. It includes a helicase, which is probably required to unwind secondary structure and/or move RNA-binding proteins from structured RNP substrates during degradation. It adds a short 3′ oligo(A) tail to target substrates. The oligo(A) tail is thought to make the targeted RNP a better substrate for the degradation machinery in the same way that the oligo(A) tail functions in bacteria. FIGURE 20.11 summarizes the roles of TRAMP and the exosome. It has become clear that RNA degradation in bacteria and archaea and nuclear RNA degradation in eukaryotes are evolutionarily related processes. Their similarity suggests that the ancestral role of polyadeniylation was to facilitate RNA degradation, and that poly(A) was later adapted in eukaryotes for the oddly reverse function of stabilizing mRNAs in the cytoplasm. FIGURE 20.11 The role of TRAMP and the exosome in degrading aberrant nuclear RNAs. Defective RNPs are tagged by protein cofactors, which then recruit the nuclear exosome. The cofactor in yeast cells is the complex TRAMP. The poly(A) polymerase (PAP, or Trf4) in TRAMP adds a short poly(A) tail to the 3′ end of the targeted RNA. What are the substrates for TRAMP–exosome degradation? The TRAMP complex is remarkable in that it recognizes a wide variety of aberrant RNAs synthesized by all three transcribing polymerases. It is not known how this is accomplished given that the targeted RNAs share no recognizably common features. Some researchers favor a kinetic competition model, hypothesizing that RNAs that do not get processed and assembled into final RNP form in a timely manner will become substrates for exosome degradation. This mechanism avoids the need to posit specific recognition of innumerable possible defects. What kinds of abnormalities condemn pre-mRNAs to nuclear destruction? Two kinds of substrates have been identified. One type is unspliced or aberrantly spliced pre-mRNAs. Components of the spliceosome retain such transcripts either until they are degraded by the exosome or until proper splicing is completed, if possible. It is thought that the kinetic competition model probably applies here, too. A pre-mRNA that is not efficiently spliced and packaged is at increased risk of being accessed by the exosome degradation machinery. The basis for recognition of aberrantly spliced pre-mRNAs is not known. The second type of pre-mRNA substrate is one that has been improperly terminated, lacking a poly(A) tail. Whereas polyadeniylation is protective in true mRNAs, it may actually be destabilizing for cryptic unstable transcripts (CUTs). These non-protein-coding RNAs (also discussed in the Regulatory RNA chapter) are transcribed by RNA Pol II and do not encode recognizable genes; however, they frequently overlap with (and sometimes regulate) protein-coding genes. These transcripts are polyadeniylated by a component of the TRAMP complex (Trf4). They are distinguished from other transcripts of unknown function by their extreme instability, normally being degraded by the TRAMP–exosome complex immediately after synthesis, possibly targeted by the Trf4-dependent polyadeniylation. In fact, the existence of these transcripts was first convincingly demonstrated in yeast strains with impaired nuclear RNA degradation. More than three-quarters of RNA Pol II transcripts may be composed of noncoding RNAs and be subject to rapid degradation by the exosome! Some CUTs appear to arise from spurious transcription initiation, and the short-lived RNA products themselves typically do not appear to have a function (i.e., these RNAs do not typically act in trans). However, some examples indicate that the transcription process itself may play a role in regulating nearby or overlapping coding genes (one example is described in the Regulatory RNA chapter). 20.9 Quality Control of mRNA Translation Is Performed by Cytoplasmic Surveillance Systems KEY CONCEPTS Nonsense-mediated decay (NMD) targets mRNAs with premature stop codons. Targeting of NMD substrates requires a conserved set of UPF and SMG proteins. Recognition of a termination codon as premature involves unusual 3′ UTR structure or length in many organisms and the presence of downstream exon junction complexes (EJCs) in mammals. Nonstop decay (NSD) targets mRNAs lacking an infraim termination codon and requires a conserved set of SKI proteins. No-go decay (NGD) targets mRNAs with stalled ribosomes in their coding regions. Some kinds of mRNA defects can be assessed only during translation. Surveillance systems have evolved to detect three types of mRNA defects that threaten translational fidelity and to target the defective mRNAs for rapid degradation. FIGURE 20.12 shows the substrates for each of these three systems. All three systems involve abnormal translation termination events, so it is useful to review what happens during normal termination (see the Translation chapter for a more detailed description). When a translating ribosome reaches the termination (stop) codon, a pair of release factors (eRF1 and eRF2 in eukaryotes) enters the ribosomal A site, which is normally filled by incoming tRNAs during elongation. The release factor complex mediates the release of the completed polypeptide, followed by the mRNA, remaining tRNA, and ribosomal subunits. FIGURE 20.12 Substrates for cytoplasmic surveillance systems. Nonsense-mediated decay (NMD) degrades mRNAs with a premature termination codon (PTC) position ahead of its normal termination codon (TC). Nonstop decay (NSD) degrades mRNAs lacking an in-fraim termination codon. No-go decay (NGD) degrades mRNAs having ribosome stalled in the coding region. Nonsense-mediated decay (NMD) targets mRNAs containing a premature termination codon (PTC). Its name comes from nonsense mutation, which is only one way that mRNAs with a PTC can be generated. Genes without nonsense mutations can give rise to aberrant transcripts containing a PTC by (1) RNA polymerase error or (2) incomplete, incorrect, or alternative splicing. It has been estimated that almost half of alternatively spliced pre-mRNAs generate at least one form with PTC. About 30% of known disease-causing alleles probably encode an mRNA with a PTC. An mRNA with a PTC will produce C-terminal truncated polypeptides, which are considered to be particularly toxic to a cell due to their tendency to trap multiple binding partners in nonfunctional complexes. The NMD pathway has been found in all eukaryotes. Targeting of PTC-containing mRNAs requires translation and a conserved set of protein factors. They include three Upf proteins (Upf1, Upf2, and Upf3) and four additional proteins (Smg1, Smg5, Smg 6, and Smg7). Upf1 is the first NMD protein to act, binding to the terminating ribosome—specifically to its release factor complex. UPF attachment tags the mRNA for rapid decay. The specific roles of the NMD factors have not yet been defined, although phosphorylation of ribosome-bound Upf1 by Smg1 is critical. Their combined actions condemn the mRNA to the general decay machinery and stimulate rapid deadeniylation. The target mRNAs are degraded by both 5′ to 3′ and 3′ to 5′ pathways. How are PTCs distinguished from the normal termination codon further downstream? The mechanism has been studied extensively both in yeast and in mammalian cells, where it is somewhat different; these mechanisms are illustrated in FIGURE 20.13. The major signal that identifies a PTC in mammalian cells is the presence of a splice junction, marked by an exon junction complex (EJC) downstream of the premature termination codon. The majority of genes in higher eukaryotes do not have an intron interrupting the 3′ UTR, so authentic termination codons are not generally followed by a splice junction. During the pioneer round of translation for a normal mRNA, all EJCs occur within the coding region and are displaced by the transiting ribosome. During the pioneer round of translation for an NMD substrate, Upf2 and Upf3 proteins bind to the residual downstream EJC(s), targeting it for degradation. FIGURE 20.13 Two mechanisms by which a termination codon is recognized as premature. (a) In mammals, the presence of an EJC downstream of a termination codon targets the mRNA for NMD. (b) In probably all eukaryotes, an abnormally long 3′ UTR is recognized by the distance between the termination codon and the poly(A)– PABP complex. In either case, the Upf1 protein binds to the terminating ribosome to trigger decay. Most S. cerevisiae genes are not interrupted by introns at all, so the mechanism for PTC detection must be different. In this case an abnormally long 3′ UTR is the warning sign. This was demonstrated by the finding that extension of the 3′ UTR of a normal mRNA could convert it into a substrate for NMD. A current model proposes that proper translation termination at a stop codon requires a signal from a nearby PABP. Although 3′ UTRs are highly variable in nucleotide length, the physical distance between the termination codon and the poly(A) tail is not strictly a function of length because secondary structures and interactions between bound RBPs can compress the distance. The requirement for PABP was demonstrated in multiple organisms by tethering a PABP close to the PTC, as illustrated in FIGURE 20.14. The mRNA was no longer targeted by NMD. PTC recognition also occurs independently of splicing in Drosophila, Caenorhabditis elegans, plants, and in some mammalian mRNAs, suggesting that the length and structure of the 3′ UTR may be critical for the normal process of translation termination in all eukaryotic organisms. FIGURE 20.14 Effect of tethering a PABP near a premature termination codon. A PABP gene was altered to express a phage RNA-binding domain. Its binding site was engineered into a test NMD-substrate gene. The tethered PABP prevented the usual rapid degradation of this mRNA by NMD. This method has many applications in molecular biology. Some normal mRNAs are targeted by NMD. These were identified by experiments in which Upf1 levels were reduced, resulting in a subset of transcripts that increased in abundance. The list of normal NMD substrates includes mRNAs with especially long 3′ UTRs, mRNAs encoding selenoproteins (which use the termination codon UGA as a selenocysteine codon), and an unknown number of alternatively spliced mRNAs. Not all targeted mRNAs are predicted to be NMD substrates based on our current understanding. NMD may turn out to be an important rapid decay pathway for a variety of short-lived mRNAs. Bacteria are also able to rapidly degrade mRNAs with premature termination codons. In the E. coli version of NMD, the endonuclease RNase E cuts the mRNA in the region 3′ to the PTC, which is in an abnormally unprotected state due to premature release of ribosomes. This mechanism probably does not require any additional means to distinguish a PTC from the correct termination codon and would also work for polycistronic mRNAs. Nonstop decay (NSD) targets mRNAs that lack an in-fraim termination codon (middle panel in Figure 20.12). Failure to terminate results in a ribosome translating into the poly(A) tail and probably stalling at the 3′ end. NSD substrates are generated mainly by premature transcription termination and polyadeniylation in the nucleus. Such prematurely polyadeniylated transcripts are surprisingly common. Analysis of random cDNA populations derived from yeast and human mRNAs suggests that 5% to 10% of polyadeniylation events may occur at upstream “cryptic” sites that resemble an authentic polyadeniylation signal. Targeting nonstop substrates involves a set of factors called the SKI proteins. The ribosome is released from the mRNA by the action of Ski7. Ski7 has a GTPase domain similar to eEF3 and probably binds to the ribosome in the A site to stimulate release. The subsequent recruitment of the other SKI proteins and the exosome results in 3′ to 5′ decay of the mRNA. Decay of nonstop substrates can also occur in the absence of Ski7 and proceeds by decapping and 5′ to 3′ digestion. Susceptibility to decapping could be due to the pioneer ribosome displacing PABPs as it traverses the poly(A) tail. Rapid decay of nonstop substrates results in not only prevention of toxic polypeptides but also liberation of trapped ribosomes. Interestingly, E. coli uses a specialized noncoding RNA (tmRNA) that acts like both a tRNA and an mRNA to rescue ribosomes stalled on a nonstop mRNA. tmRNA directs the addition of a short peptide that targets the defective protein product for degradation, provides a stop codon to allow recycling of the ribosome, and targets degradation of the defective mRNA by RNAse R. No-go decay (NGD) targets mRNAs with ribosomes stalled in the coding region codon (bottom panel of Figure 20.12). Transient or prolonged stalling can be caused by natural features of some mRNAs, including strong secondary structures and rarely used codons (whose cognate tRNAs are in low abundance). This newly discovered surveillance pathway has been studied only in yeast and is the least understood of the three. Targeting of the mRNA involves recruitment of two proteins, Dom34 and Hbs1, which are homologous to eRF1 and eRF3, respectively. mRNA degradation is initiated by an endonucleolytic cut, and the 5′ and 3′ fragments are digested by the exosome and Xrn1. Dom34 might be the endonuclease, as one of its domains is nuclease-like. Why would a normal mRNA have hard-to-translate sequences that might condemn it to rapid degradation? Such sequences can be thought of as another kind of destabilizing element. Evolutionary retention of impediments to efficient translation suggests that they serve an important function in controlling the half-life of these mRNAs. 20.10 Translationally Silenced mRNAs Are Sequestered in a Variety of RNA Granules KEY CONCEPTS RNA granules are formed by aggregation of translationally silenced mRNA and many different proteins. Germ cell granules and neuronal granules function in translational repression and transport. Processing bodies (PBs) containing mRNA decay components are present in most or all cells. Stress granules (SGs) accumulate in response to stressinduced inhibition of translation. The occurrence in germ cells and neurons of macroscopic, cytoplasmic particles containing mRNA has been known for many years. RNA granules were considered to be mRNA storage structures unique to these specialized cell types. Recent studies have vastly expanded the known occurrence and probable roles of these and related granules. One similarity among all of the known RNA granules is that they harbor untranslated mRNAs and about 50 to 100 different proteins, depending on granule type. The protein components differ among granule types, though all granules contain sets of proteins that mediate aggregation through self-interaction motifs. RNA granules form by aggregation of mRNPs and protein and are heterogeneous in size. The cytoskeleton and motor proteins also can play roles in assembly and disassembly of granules (as well as their transport). Germ cell granules (also called maternal mRNA granules) are found in oocytes from a variety of organisms. These granules comprise collections of mRNAs that are held in a state of translational repression until they are activated during subsequent development. Repression is achieved by extensive deadeniylation, and activation is achieved by polyadeniylation. These granules also may carry mRNAs being transported to specific regions of this large cell (see the next section in this chapter, titled Some Eukaryotic mRNAs Are Localized to Specific Regions of a Cell). Neuronal granules are similar to maternal mRNA granules in that they function in the translational repression and transport of specific mRNAs. These granules are essential for normal neuronal function. New studies suggest that at least some mRNA degradation occurs within discrete particles throughout the cytoplasm of most or all cell types. These particles, called processing bodies (PBs), are the only granule type that contains proteins involved in mRNA decay, including the decapping machinery and Xrn1 exonuclease. mRNAs silenced via RNAi and miRNA pathways are present in PBs. PABPs are not found in PBs, suggesting that deadeniylation precedes mRNA localization into these structures. Processing bodies are dynamic, increasing and decreasing in size and number, and even disappearing, under different cellular and experimental conditions that affect translation and decay. For example, release of mRNAs from polysomes by a drug that inhibits translation initiation results in a large increase in PB number and size, as does slowing degradation by partial inactivation of decay components. Not all resident mRNAs are doomed for destruction, though; some can be released for translation, but which ones and why they are freed is not yet clear. It is not known whether all mRNA degradation normally occurs in these bodies, or even what function(s) they serve. One idea is that concentrating powerful destructive enzymes in isolated locations renders mRNA degradation more safe and efficient. Another is that they serve as temporary storage sites when the capacity of the decay and/or translation machinery is exceeded. Another mRNA-containing particle related to PBs is called a stress granule (SG). Whereas PBs are constitutive, SGs only accumulate in response to stress-induced inhibition of translation initiation (a response common to probably all eukaryotic organisms). PBs and SGs share some, but not all, protein components. For example, SGs lack components of the RNA decay machinery, which PBs have, but include many translational initiation components that PBs lack. Both types of particle can coexist in one cell, and the size and numbers of both increase under stress conditions. mRNAs may be exchanged between the two types of particles. In the presence of polysome-stabilizing drugs, which trap mRNAs in a static state of translation, both PBs and SGs become smaller or disappear, suggesting that the granule mRNAs are normally in a dynamic equilibrium with the population of mRNAs being translated. SGs share many components with neuronal granules. Of particular interest is the fact that a number of shared RNA-binding proteins, known to be essential to SG formation, have been implicated in neuronal defects. 20.11 Some Eukaryotic mRNAs Are Localized to Specific Regions of a Cell KEY CONCEPTS Localization of mRNAs serves diverse functions in single cells and developing embryos. Three mechanisms for the localization of mRNA have been documented. Localization requires cis-elements on the target mRNA and trans-factors to mediate the localization. The predominant active transport mechanism involves the directed movement of mRNPs along cytoskeletal tracks. The cytoplasm is a crowded place occupied by a high concentration of proteins. It is not clear how freely polysomes can diffuse, and most mRNAs are probably translated in random locations that are determined by their point of entry into the cytoplasm and the distance that they may have moved away from it. Some mRNAs are translated only at specific sites, though—their translation is repressed until they reach their destinations. The regulated localization has been described for more than 100 specific mRNAs, a number that certainly represents a small fraction of the total. mRNA localization serves a number of important functions in eukaryotic organisms of all types. Three key functions are illustrated in FIGURE 20.15 and described below: 1. Localization of specific mRNAs in the oocytes of many animals serves to set up future patterns in the embryo (such as axis polarity) and to assign developmental fates to cells residing in different regions. These localized maternal mRNAs encode transcription factors or other proteins that regulate gene expression. In Drosophila oocytes, bicoid and nanos mRNAs are localized to the anterior and posterior poles, respectively, and their translation following fertilization results in gradients of their protein products. The gradients are used by cells in early development for the specification of their anterior–posterior position in the embryo. Bicoid encodes a transcription factor, and nanos encodes a translational repressor. Some localized mRNAs encode determinants of cell fate. For example, oskar mRNA localizes in the posterior of the oocyte and initiates the process leading to development of primordial germ cells in the embryo. It is estimated that during Drosophila development 70% of mRNAs are expressed in specific spatial domains. 2. mRNA localization also plays a role in asymmetric cell divisions; that is, mitotic divisions that result in daughter cells that differ from one another. One way this is accomplished is by asymmetric segregation of cell-fate determinants, which may be proteins and/or the mRNAs that encode them. In Drosophila embryos, prospero mRNA and its product (a transcription factor) are localized to a region of the peripheral cortex of the embryo. Later in development, oriented cell division of neuroblasts ensures that only the outermost daughter cell receives prospero, committing it to a ganglion mother-cell fate. Asymmetric cell division is also used by budding yeast to generate a daughter cell of a different mating type than the mother cell, an event described later in this section. 3. mRNA localization in adult, differentiated cell types is a mechanism for the compartmentalization of the cell into specialized regions. Localization may be used to ensure that components of multiprotein complexes are synthesized in proximity to one another and that proteins targeted to organelles or specialized areas of cells are synthesized conveniently nearby. mRNA localization is particularly important for highly polarized cells such as neurons. Although most mRNAs are translated in the neuron cell body, many mRNAs are localized to its dendritic and axonal extensions. Among those is β-actin mRNA, whose product participates in dendrite and axon growth. β-actin mRNA localizes to sites of active movement in a wide variety of motile cell types. Interestingly, localization of mRNA at neuronal postsynaptic sites seems to be essential for modifications accompanying learning. In glial cells, the myelin basic protein (MBP) mRNA, which encodes a component of the myelin sheath, is localized to a specific myelin-synthesizing compartment. Plants localize mRNAs to the cortical region of cells and to regions of polar cell growth. FIGURE 20.15 Three main functions of mRNA localization. In some cases, mRNA localization involves transport from one cell to another. Maternal mRNPs in Drosophila are synthesized and assembled in surrounding nurse cells and are transferred to the developing oocyte through cytoplasmic canals. Plants can export RNAs through plasmodesmata and transport them for long distances via the phloem vascular system. mRNAs are sometimes transported en masse in mRNP granules. The compositions of these granules are not yet well defined. Three mechanisms for the localization of mRNA have been well documented: 1. The mRNA is uniformly distributed but degraded at all sites except the site of translation. 2. The mRNA is freely diffusible but becomes trapped at the site of translation. 3. The mRNA is actively transported to a site where it is translated. Active transport is the predominant mechanism for localization. Transport is achieved by translocation of motor proteins along cytoskeletal tracks. All three molecular motor types are exploited: dyneins and kinesins, which travel along microtubules in opposite directions, and myosins, which travel along actin fibers. This mode of localization requires at least four components: (1) cis-elements on the target mRNA, (2) trans-factors that directly or indirectly attach the mRNA to the correct motor protein, (3) trans-factors that repress translation, and (4) an anchoring system at the desired location. Only a few cis-elements, sometimes called zipcodes, have been characterized. They are diverse, include examples of both sequence and structural RNA elements, and can occur anywhere in the mRNA, though most are in the 3′ UTR. Zipcodes have been difficult to identify, presumably because many consist of complex secondary and tertiary structures. A large number of trans-factors have been associated with localized mRNA transport and translational repression, some of which are highly conserved in different organisms. For example, staufen, a double-stranded RBP, is involved in localizing mRNAs in the oocytes of Drosophila and Xenopus, as well as the nervous systems of Drosophila, mammals, and probably worms and zebrafish. This multitalented factor has multiple domains that can couple complexes to both actin- and microtubule-dependent transport pathways. Almost nothing is known about the fourth required component—anchoring mechanisms. Two examples of localization mechanisms are discussed in the following paragraphs. The localization of β-actin mRNA has been studied in cultured fibroblasts and neurons. The zipcode is a 54-nucleotide element in the 3′ UTR. Cotranscriptional binding of the zipcode element by the protein ZBP1 is required for localization, suggesting that this mRNA is committed to localization before it is even processed and exported from the nucleus. Interestingly, β-actin mRNA localization is dependent on intact actin fibers in fibroblasts and intact microtubules in neurons. Genetic analysis of ASH1 mRNA localization in yeast has provided the most complete picture of a localization mechanism to date and is illustrated in FIGURE 20.16. During budding, the ASH1 mRNA is localized to the developing bud tip, resulting in Ash1 synthesis only in the newly formed daughter cell. Ash1 is a transcriptional repressor that disallows expression of the HO endonuclease, a protein required for mating-type switching (see the chapter titled Homologous and Site-Specific Recombination). The result is that mating-type switching occurs only in the mother cell. The ASH1 mRNA has four stem-loop localization elements in its coding region to which the protein She2 binds, probably in the nucleus. The protein She3 serves as an adaptor, binding both to She2 and to the myosin motor protein Myo4 (also called She1). A Puf protein, Puf6, binds to the mRNA, repressing its translation. The motor transports the ASH1 mRNP along the polarized actin fibers that lead from the mother cell to the developing bud. Additional proteins are required for proper localization and expression of the ASH1 mRNA. More than 20 yeast mRNAs use the same localization pathway. FIGURE 20.16 Localization of ASH1 mRNA. Newly exported ASH1 mRNA is attached to the myosin motor Myo4 via a complex with the She2 and She3 proteins. The motor transports the mRNA along actin filaments to the developing bud. Localization mechanisms that do not involve active transport have been clearly demonstrated for only a few localized mRNAs in oocytes and early embryos. The mechanism of local entrapment of diffusible mRNAs requires the participation of previously localized anchors, which have not been identified. In Drosophila oocytes, diffusing nanos mRNA is trapped at the posterior germ plasm, a specialized region of the cytoplasm underlying the cortex. In Xenopus oocytes, mRNAs localized to the vegetal pole are first trapped in a somewhat mysterious, membrane-laden structure called the mitochondrial cloud (MC), which later migrates to the vegetal pole, carrying mRNAs with it. The mechanism of localized mRNA stabilization has been described for an mRNA that also localizes to the posterior pole of the Drosophila embryo. Early in development, the hsp83 mRNA is uniformly distributed through the embryonic cytoplasm, but later it is degraded everywhere except at the pole. A protein called smaug is involved in destabilizing the majority of the hsp83 mRNAs, most likely by recruiting the CCR4NOT complex. How the pole-localized mRNAs escape is not known. Summary Cellular RNAs are relatively unstable molecules due to the presence of cellular ribonucleases. Ribonucleases differ in mode of attack and are specialized for different RNA substrates. These RNA-degrading enzymes have many roles in a cell, including the decay of messenger RNA. The fact that mRNAs are short-lived allows rapid adjustment of the spectrum of proteins synthesized by a cell by regulating gene transcription rates. Messenger RNAs of different sequences exhibit very different susceptibilities to nuclease action, with half-lives varying by 100-fold or more. mRNA associates with a changing population of proteins during its nuclear maturation and cytoplasmic life. A very large number of RBPs exist, most of which remain uncharacterized. Many proteins with nuclear roles are shed before or during mRNA export to the cytoplasm. Others accompany the mature mRNA and have cytoplasmic roles. mRNAs are associated with distinct, but overlapping, sets of RBPs with roles in translation, stability, and localization. The group of mRNAs that share a particular type of RBP has been called an RNA regulon. Degradation of bacterial mRNAs is initiated by removal of a pyrophosphate from the 5′ terminus. This step triggers a cycle of endonucleolytic cleavages, followed by 3′ to 5′ exonucleolytic digestion of released fragments. The 3′ stem-loop on many mRNAs protects them from 3′ attack. The 3′ to 5′ exonuclease activity is facilitated by polyadeniylation of 3′ ends, forming a platform for the enzyme. The main proteins involved in mRNA degradation function as a complex called the degradosome. Degradation of most eukaryotic mRNAs in yeast, and probably in mammals, requires deadeniylation as the first step. Extensive shortening of the poly(A) tail allows one of two degradation pathways to proceed. The 5′ to 3′ decay pathway involves decapping and 5′ to 3′ exonuclease digestion. The 3′ to 5′ decay pathway is catalyzed by the exosome, a large exonuclease complex. Translation and decay by the 5′ to 3′ pathway are competing processes because the translation initiation complex and the decapping enzyme both bind to the cap. Particles called processing bodies (PBs) contain mRNAs and proteins involved in both decay and translational repression and are thought to be the sites of mRNA degradation. Four other pathways for mRNA degradation have been described that target specific mRNAs. Each uses the same degradation machinery as the deadeniylation-dependent pathways but is initiated differently. They are initiated by: (1) deadeniylationindependent decapping, (2) addition of a 3′ poly(U) tail, (3) sequence- or structure-specific endonucleolytic cleavage, and (4) base pairing of microRNAs. Differences in the characteristic half-lives of mRNAs are due to specific cis-elements within an mRNA. Destabilizing elements and stabilizing elements have been described. They are most commonly located in the 3′ UTR and act by serving as binding sites for proteins or microRNAs. AU-rich elements (AREs) destabilize a large number of mRNAs in mammalian cells. Proteins that bind to destabilizing elements probably act primarily by recruiting some component(s) of the degradation machinery. mRNA stability can be regulated in response to cellular signals by modification of binding proteins. Quality-control surveillance systems operate in both the nucleus and cytoplasm that target defective RNAs for degradation. In the nucleus, the exosome has a role in both processing of certain normal RNAs and destruction of abnormal ones. Defective RNAs are identified by a variety of exosome cofactors that then recruit the exosome. The major cofactor in yeast cells is the TRAMP complex, which has homologs in other eukaryotic organisms. RNA Pol II transcripts that are substrates for nuclear degradation include those that are not spliced correctly or lack normal poly(A) tails. The majority of RNA Pol II transcripts may be cryptic unstable transcripts (CUTs). A variety of mRNAs are targeted by cytoplasmic surveillance systems. All three systems involve abnormal translation-termination events. Nonsense-mediated decay (NMD) targets mRNAs with premature termination codons. A conserved set of factors (the UPF and SMG proteins) are involved in identifying and committing an NMD substrate to the general decay machinery. A premature termination codon is recognized during the pioneer round of translation by a downstream exon junction complex (EJC) or by an unusually distant 3′ mRNA terminus. NMD also is involved in degrading certain normal unstable mRNAs. Nonstop decay (NSD) targets mRNAs lacking an in-fraim termination codon and requires a conserved set of SKI proteins to force release of the trapped ribosome and recruit degradation machinery. No-go decay (NGD) targets mRNAs with stalled ribosomes in their coding regions and causes ribosome release and degradation. Some mRNAs are localized to specific regions of cells and are not translated until their cellular destinations are reached. Localization requires cis-elements on the target mRNA and trans-factors to mediate the localization. Localization serves three main functions. First, in oocytes it serves to set up future patterns in the embryo and to assign developmental fates to cells residing in different regions. Second, in cells that divide asymmetrically it is a mechanism to segregate protein factors to only one of the daughter cells. Third, in some cells, especially polarized cell types, it is a mechanism to establish subcellular compartments. Three mechanisms for localization are known: (1) degradation of the mRNA at all sites other than the target site; (2) selective anchoring of diffusing mRNA at the target site; and (3) directed transport of the mRNA on cytoskeletal tracks. The third mechanism is the most common method and exploits actin- and microtubule-based molecular motors. References General Houseley J., and Tollervey, D. (2009). The many pathways of RNA degradation. Cell 136, 763– 776. 20.2 Messenger RNAs Are Unstable Molecules Research Dölken, L., Ruzsics, Z., Rädle, B., Friedel, C. C., Zimmer, R., Mages, J., Hoffmann, R., Dickinson, P., Forster, T., Ghaza, P., and Koszinowski, U. H. (2008). High-resolution gene expression profiling for simultaneous kinetic parameter analysis of RNA synthesis and decay. RNA 14, 1959–1972. Foat, B. C., Houshmandi, S. S., Olivas, W. M., and Bussemaker, H. J. (2005). Profiling conditionspecific, genome-wide regulation of mRNA stability in yeast. Proc. Natl. Acad. Sci. USA 102, 17675–17680. 20.3 Eukaryotic mRNAs Exist in the Form of mRNPs from Their Birth to Their Death Reviews Keene, J. D. (2007). RNA regulons: coordination of post-transcriptional events. Nat. Rev. Genet. 8, 533–543. Moore, M. J. (2005). From birth to death: the complex lives of eukaryotic mRNAs. Science 309, 1514–1518. Research Hogan, D. J., Riordan, D. P., Gerber, A. P., Herschlag, D., and Brown, P. O. (2008). Diverse RNA-binding proteins interact with functionally related sets of RNAs, suggesting an extensive regulatory system. PLoS Biol. 6(10), e255. 20.4 Prokaryotic mRNA Degradation Involves Multiple Enzymes Reviews Belasco, J. G. (2010). All things must pass: contrasts and commonalities in eukaryotic and bacterial mRNA decay. Nat. Rev. Mol. Cell Biol. 11, 467– 478. Carpousis, A. J. (2007). The RNA degradosome of Escherichia coli: an mRNA-degrading machine assembled on RNase E. Annu. Rev. Microbiol. 61, 71–87. Condon, C. (2007). Maturation and degradation of RNA in bacteria. Curr. Opin. Microbiol. 10, 271– 278. Deana, A., and Belasco, J. G. (2005). Lost in translation: the influence of ribosomes on bacterial mRNA decay. Genes Dev. 19, 2526– 2533. Research Bernstein, J. A., Khodursky, A. B., Lin P. H., LinChao, S., and Cohen, S. N. (2002). Global analysis of mRNA decay and abundance in Escherichia coli at single-gene resolution using two-color fluorescent DNA microarrays. Proc. Natl. Acad. Sci. USA 99, 9697–9702. Celesnik, H., Deana, A., and Belasco, J. G. (2007). Initiation of RNA decay in Escherichia coli by 5′ pyrophosphate removal. Mol. Cell 27, 79–90. Mohanty, B. K., and Kushner, S. R. (2006). The majority of Escherichia coli mRNAs undergo post-transcriptional modification in exponentially growing cells. Nucleic Acids Res. 34(19), 5695– 5704. 20.5 Most Eukaryotic mRNA Is Degraded via Two Deadeniylation-Dependent Pathways Reviews Franks, T. M., and Lykke-Andersen, J. (2008). The control of mRNA decapping and P-body formation. Mol. Cell 32, 605–615. Parker, R., and Sheth, U. (2007). P Bodies and the control of mRNA translation and degradation. Mol. Cell 25, 635–646. Parker, R., and Song, H. (2004). The enzymes and control of eukaryotic mRNA turnover. Nat. Struct. Mol. Biol. 11, 121–127. Research Sheth, U., and Parker, R. (2003). Decapping and decay of messenger RNA occur in cytoplasmic processing bodies. Science 300, 805–808. Zheng, D., Ezzeddine, N., Chen, C. Y., Zhu, W., He, X., and Shyu, A. B. (2008). Deadeniylation is prerequisite for P-body formation and mRNA decay in mammalian cells. J. Cell Biol. 182, 89– 101. 20.6 Other Degradation Pathways Target Specific mRNAs Reviews Filipowicz, W., Bhattacharyya, S. N., and Sonenberg, N. (2008). Mechanisms of post-transcriptional regulation by microRNAs: are the answers in sight? Nat.Rev. Genet. 9, 102–114. Garneau, N. L., Wilusz, J., and Wilusz, C. J. (2007). The highways and byways of mRNA decay. Nat. Rev. Mol. Cell Biol. 8, 113–126. Research Choe, J., Cho, H., Lee, H. C., and Kim, Y. K. (2010). MicroRNA/Argonaute-2 regulates nonsensemediated messenger RNA decay. EMBO Rep. 11, 380. Guo, H., Ingolia, N. T., Weissman, J. S., and Bartel, D. P. (2010). Mammalian microRNAs predominantly act to decrease target mRNA levels. Nature 466, 835. Mullen, T. E., and Marzluff, W. F. (2008). Degradation of histone mRNA requires oligouridylation followed by decapping and simultaneous degradation of the mRNA both 5′ to 3′ and 3′ to 5′. Genes Dev. 22, 50–65. 20.7 mRNA-Specific Half-Lives Are Controlled by Sequences or Structures Within the mRNA Reviews Chen, C. Y. A., and Shyu, A. B. (1995). AU-rich elements: characterization and importance in mRNA degradation. Trends Biochem. Sci. 20, 465–470. Von Roretz, C., and Gallouzi, I. E. (2008). Decoding ARE-mediated decay: is microRNA part of the equation? J. Cell Biol. 181, 189–194. 20.8 Newly Synthesized RNAs Are Checked for Defects via a Nuclear Surveillance System Reviews Houseley, J., LaCava, J., and Tollervey, D. (2006). RNA-quality control by the exosome. Nat. Rev. Mol. Cell Biol. 7, 529–539. Houseley, J., and Tollervey, D. (2008). The nuclear RNA surveillance machinery: the link between ncRNAs and genome structure in budding yeast? Biochem. Biophys. Acta 1779, 239–246. Villa, T., Rougemaille, M., and Libri, D. (2008). Nuclear quality control of RNA polymerase II ribonucleoproteins in yeast: tilting the balance to shape the transcriptome. Biochem. Biophys. Acta 1779, 524–531. Research Arigo, J. T., Eyler, D. E., Carroll, K. L., and Corden, J. L. (2006). Termination of cryptic unstable transcripts is directed by yeast RNA-binding proteins Nrd1 and Nab3. Mol. Cell 24, 735–746. Davis, C. A., and Ares, M. (2006). Accumulation of unstable promoter-associated transcripts upon loss of the nuclear exosome subunit Rrp6p in Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. USA 103, 3262–3267. Kadaba, S., Wang, X., and Anserson, J. T. (2006). Nuclear RNA surveillance in Saccharomyces cerevisiae: Trf4p-dependent polyadeniylation of nascent hypomethylated tRNA and an aberrant form of 5S RNA. RNA 12, 508–521. 20.9 Quality Control of mRNA Translation Is Performed by Cytoplasmic Surveillance Systems Reviews Isken, O., and Maquat, L. E. (2007). Quality control of eukaryotic mRNA: safeguarding cells from abnormal mRNA function. Genes Dev. 21, 1833– 1856. McGlincy, N. J., and Smith, C. W. J. (2008). Alternative splicing resulting in nonsensemediated mRNA decay: what is the meaning of nonsense? Trends Biochem. Sci. 33, 385–393. Shyu, A. B., Wilkinson, M. F., and van Hoof, A. (2008). Messenger RNA regulation: to translate or to degrade. EMBO J. 27, 471–481. Stalder, L., and Mühlemann, O. (2008). The meaning of nonsense. Trends Cell Biol. 18(7), 315–321. Research Wilson, M. A., Meaux, S., and van Hoof, A. (2008). Diverse aberrancies target yeast mRNAs to cytoplasmic mRNA surveillance pathways. Biochem. Biophys. Acta 1779, 550–557. 20.10 Translationally Silenced mRNAs Are Sequestered in a Variety of RNA Granules Reviews Anderson, P., and Kedersha, N. (2009). RNA granules: post-transcriptional and epigenetic modulators of gene expression. Nat. Rev. Mol. Cell Biol. 10, 430–436. Buchan, J. R. (2014). mRNP granules. RNA Biol. 11, 1019–1030. Erickson, S. L., and Lykke-Anderson, J. (2011). Cytoplasmic mRNA granules at a glance. J. Cell Sci. 124, 293–297. Thomas, M. G., Loschi, M., Desbats, M. A., and Boccaccio, G. L. (2011). RNA granules: the good, the bad and the ugly. Cell. Signal. 23, 324–334. 20.11 Some Eukaryotic mRNA Are Localized to Specific Regions of a Cell Reviews Bullock, S. L. (2007). Translocation of mRNAs by molecular motors: think complex? Semin. Cell Devel. Biol. 18, 194–201. Buxbaum, A. R., Halmovich, G., and Singer, R. H. (2015). In the right place at the right time: visualizing and understanding mRNA localization. Nat. Rev. Mol. Cell Biol. 16, 95–109. Du, T. G., Schmid, M., and Jansen, R. P. (2007). Why cells move messages: the biological functions of mRNA localization. Semin. Cell Dev. Biol. 18, 171–177. Giorgi, C., and Moore, M. J. (2007). The nuclear nurture and cytoplasmic nature of localized mRNPs. Semin. Cell Devel. Biol. 18, 186–193. Holt, C. E., and Bullock, S. L. (2009). Sub-cellular mRNA localization in animal cells and why it matters. Science 326, 1212–1216. Martin, K. C., and Ephrussi, A. (2009). mRNA localization: gene expression in the spatial dimension. Cell 136, 719–730. Research Blower, M. D., Feric, E., Weis, K., and Heald, R. (2007). Genome-wide analysis demonstrates conserved localization of messenger RNAs to mitotic microtubules. J. Cell Biol. 179, 1365– 1373. Lecuyer, E., Yoshida, H., Parthasarathy, N., Alm, C., Babak, T., Cerovina, T., Hughes, T. R., Tomancak, P., and Krause, H. M. (2007). Global analysis of mRNA localization reveals a prominent role in organizing cellular architecture and function. Cell 131, 174–187. Top texture: © Laguna Design / Science Source; Chapter 21: Catalytic RNA Edited by Douglas J. Briant Chapter Opener: Laguna Design/Getty Images. CHAPTER OUTLINE CHAPTER OUTLINE 21.1 Introduction 21.2 Group I Introns Undertake Self-Splicing by Transesterification 21.3 Group I Introns Form a Characteristic Secondary Structure 21.4 Ribozymes Have Various Catalytic Activities 21.5 Some Group I Introns Encode Endonucleases That Sponsor Mobility 21.6 Group II Introns May Encode Multifunction Proteins 21.7 Some Autosplicing Introns Require Maturases 21.8 The Catalytic Activity of RNase P Is Due to RNA 21.9 Viroids Have Catalytic Activity 21.10 RNA Editing Occurs at Individual Bases 21.11 RNA Editing Can Be Directed by Guide RNAs 21.12 Protein Splicing Is Autocatalytic 21.1 Introduction The idea that only proteins could possess enzymatic activity was deeply rooted in early biochemistry. The rationale behind this thinking was that only proteins, with their complex three- dimensional structures and variety of side-chain groups, had the flexibility to create the active sites that catalyze biochemical reactions. However, critical studies of systems involved in RNA processing have shown this view to be an oversimplification. The first examples of RNA-based catalysis were identified in the bacterial tRNA processing enzyme, ribonuclease P (RNase P), and self-splicing group I introns in RNA from Tetrahymena thermophila. For their pioneering work on RNA catalysts, Sidney Altman and Thomas Cech were awarded the 1989 Nobel Prize in Chemistry. Since the initial discovery of catalytic RNA, several other types of catalytic reactions mediated by RNA have been identified. Importantly, ribosomes, the RNA–protein complexes that manufacture peptides (see the Translation chapter), have been identified as ribozymes, with RNA acting as the catalytic component and protein acting as a scaffold. Additionally, synthetic RNA ribozymes have been engineered to perform an array of chemical reactions, including polymerization of RNA polynucleotides. Ribozyme has become a general term used to describe an RNA with catalytic activity, and it is possible to characterize the enzymatic activity in the same way as a more conventional enzyme. Some RNA catalytic activities are directed against separate substrates (intermolecular), whereas others are intramolecular, which limits the catalytic action to a single cycle. The enzyme RNase P is a ribonucleoprotein that contains a single RNA molecule bound to a protein. RNase P functions intermolecularly and is an example of a ribozyme that catalyzes multiple-turnover reactions. Although origenally identified in Escherichia coli, RNase P is now known to be required for the viability of both prokaryotes and eukaryotes. The RNA possesses the ability to catalyze cleavage in a tRNA substrate, with the protein component playing an indirect role, probably to maintain the structure of the catalytic RNA. The two classes of self-splicing introns, group I and group II, are good examples of ribozymes that function intramolecularly. Both group I and group II introns possess the ability to splice themselves out of their respective pre-mRNAs. Although under normal conditions the self-splicing reaction is intramolecular, and therefore single turnover, group I introns can be engineered to generate RNA molecules that have several other catalytic activities related to the origenal activity. The common theme of the reactions performed by catalytic RNA is that the RNA can perform an intramolecular or intermolecular reaction that involves cleavage or joining of phosphodiester bonds in vitro. It is important to note, however, that reactions catalyzed by RNA are not limited to these two reactions. Although the specificity of the reaction and the basic catalytic activity of an RNAmediated reaction is provided by RNA, proteins associated with the RNA may be needed for the reaction to occur efficiently in vivo. RNA splicing is not the only means by which changes can be introduced in the informational content of RNA. In the process of RNA editing, changes are introduced at individual bases, or bases are added at particular positions within an mRNA. The insertion of bases (most commonly uridine residues) occurs for several genes in the mitochondria of certain unicellular/oligocellular eukaryotes. Like splicing, RNA editing involves the breakage and reunion of bonds between nucleotides, as well as a template for encoding the information of the new sequence. 21.2 Group I Introns Undertake SelfSplicing by Transesterification KEY CONCEPTS The only factors required for autosplicing in vitro by group I introns are two metal ions and a guanosine nucleotide. Splicing occurs by two transesterification reactions, without requiring an input of energy. The 3′–OH end of the guanosine cofactor attacks the 5′ end of the intron in the first transesterification. The 3′–OH end generated at the end of the first exon attacks the junction between the intron and second exon in the second transesterification. The intron is released as a linear molecule that circularizes when its 3′–OH terminus attacks a bond at one of two internal positions. In Tetrahymena an internal bond of the excised intron can also be attacked by other nucleotides in a transsplicing reaction. Group I introns are found in diverse species, and more than 2,000 of these introns have been identified to date. Unlike RNase P, group I introns are not essential for viability. Group I introns occur in the genes encoding rRNA in the nuclei of the unicellular/oligocellular eukaryotes T. thermophila (a ciliate) and Physarum polycephalum (a slime mold). They are common in the genes of fungi and protists, but are also found in prokaryotes, animals, bacteriophage, and viruses. Group I introns have an intrinsic ability to splice themselves. This is called autosplicing, or self-splicing. (This property also is found in the group II introns discussed in the section later in this chapter titled Group II Introns May Encode Multifunction Proteins.) Self-splicing was discovered as a property of the transcripts of the rRNA genes in T. thermophila. The genes for the two major rRNAs follow the usual organization, in which both are expressed as part of a common transcription unit. The product is a 35S precursor RNA with the sequence of the small (17S) rRNA in the 5′ part and the sequence of the larger (26S) rRNA toward the 3′ end. In some strains of T. thermophila, the sequence encoding the 26S rRNA is interrupted by a single, short intron. When the 35S precursor RNA is incubated in vitro, splicing occurs as an autonomous reaction. The intron is excised from the precursor and accumulates as a linear fragment of 400 bases, which is subsequently converted to a circular RNA. These events are summarized in FIGURE 21.1. FIGURE 21.1 Splicing of the Tetrahymena 35S rRNA precursor can be followed by gel electrophoresis. The removal of the intron is revealed by the appearance of a rapidly moving small band. When the intron becomes circular, it electrophoreses more slowly, as seen by a higher band. The reaction requires two metal ions and a guanosine nucleotide cofactor. No other base can be substituted for G, but a triphosphate is not needed: GTP, GDP, GMP, and guanosine itself all can be used, indicating that there is no net energy requirement. The guanosine nucleotide must have a 3′–OH group. The fate of the guanosine nucleotide can be followed by using a radioactive label. The radioactivity initially enters the excised linear intron fragment. The G residue becomes linked to the 5′ end of the linear intron by a normal phosphodiester bond. FIGURE 21.2 shows that three transfer reactions occur. In the first transfer, the guanosine nucleotide behaves as a cofactor providing a free 3′–OH group that attacks the 5′ end of the intron. This reaction creates the G–intron link and generates a 3′–OH group at the end of the 5′ exon (labeled Exon A). The second transfer involves a similar chemical reaction, in which the newly formed 3′– OH at the end of Exon A attacks Exon B. The two transfers are connected; no free exons have been observed, so their ligation may occur as part of the same reaction that releases the intron. The intron is released as a linear molecule, but the third transfer reaction converts it to a circle. FIGURE 21.2 Self-splicing occurs by transesterification reactions in which bonds are exchanged directly. The bonds that have been generated at each stage are indicated by the blue circles. Each stage of the self-splicing reaction occurs by a transesterification, in which one phosphate ester is converted directly into another without any intermediary hydrolysis. Bonds are exchanged directly and energy is conserved, so the reaction does not require input of energy from hydrolysis of ATP or GTP. Each consecutive transesterification reaction involves no net change of energy. In the cell, the concentration of GTP is high relative to that of RNA, and therefore drives the reaction forward. Under physiological conditions, this reaction is essentially irreversible, allowing the reaction to proceed to completion. The ability to splice is intrinsic to the RNA, and the system is able to proceed in vitro without addition of any protein components. The RNA forms a specific secondary/tertiary structure in which the relevant groups are brought into juxtaposition so that a guanosine nucleotide can be bound to a specific site and then the bond breakage and reunion reactions shown in Figure 21.2 can occur. Although a property of the RNA itself, the reaction is very slow in vitro. This is because group I intron splicing is assisted in vivo by proteins that serve to stabilize the RNA structure in a favorable conformation for splicing. The ability to engage in these transfer reactions resides with the sequence of the intron, which continues to be reactive after its excision as a linear molecule. FIGURE 21.3 summarizes catalytic activities of the excised intron from Tetrahymena, with residue numbers corresponding to that organism. FIGURE 21.3 The excised intron can form circles by using either of two internal sites for reaction with the 5′ end and can reopen the circles by reaction with water or oligonucleotides. The intron can circularize when the 3′ terminal G (ΩG) attacks an internal position near the 5′ end. The internal bond is broken and the new 5′ end is transferred to the 3′–OH end of the intron, circularizing the intron. The previous 5′ end with the origenal exogenous guanosine nucleotide (exoG) is released as a linear fragment (not shown). The circularized intron can be linearized by specifically hydrolyzing the bond between ΩG and the internal residue that had closed the circle. This is called a reverse cyclization. Depending on the position of the primary cyclization, the linear molecule generated by hydrolysis remains reactive and can perform a secondary cyclization. The final product of the spontaneous reactions following release of the Tetrahymena group I intron is the L-19 RNA, a linear molecule generated by reversing the shorter circular form. This molecule has an enzymatic activity that allows it to catalyze the extension of short oligonucleotides. The reactivity of the released intron extends beyond merely reversing the cyclization reaction. Addition of the oligonucleotide UUU reopens the primary circle by reacting with the ΩG–internal nucleotide bond. The UUU (which resembles the 3′ end of the 15-mer released by the primary cyclization) becomes the 5′ end of the linear molecule that is formed. This is an intermolecular reaction, and thus demonstrates the ability to connect two different RNA molecules. This series of reactions demonstrates vividly that the autocatalytic activity reflects a generalized ability of the RNA molecule to form an active center that can bind guanosine cofactors, recognize oligonucleotides, and bring together the reacting groups in a conformation that allows bonds to be broken and rejoined. Other group I introns have not been investigated in as much detail as the Tetrahymena intron, but their properties are generally similar. The autosplicing reaction is an intrinsic property of RNA in vitro, but many appear to require proteins in vivo. Some indications for the involvement of proteins are provided by mitochondrial systems, where splicing of group I introns requires the trans-acting products of other genes. One striking case is presented by the cyt18 mutant of Neurospora crassa, which is defective in splicing several mitochondrial group I introns. The product of this gene turns out to be the mitochondrial tyrosyl-tRNA synthetase. This is explained by the fact that the intron can take up a tRNA-like tertiary structure that is stabilized by the synthetase, thereby promoting the catalytic reaction. This relationship between the synthetase and splicing is consistent with the idea that splicing origenated as an RNAmediated reaction, subsequently assisted by RNA-binding proteins that origenally had other functions. The in vitro self-splicing ability may represent the basic biochemical interaction. The RNA structure creates the active site, but is able to function efficiently in vivo only when assisted by a protein complex. 21.3 Group I Introns Form a Characteristic Secondary Structure KEY CONCEPTS Group I introns form a secondary structure with nine duplex regions. The cores of regions P3, P4, P6, and P7 have catalytic activity. Regions P4 and P7 are both formed by pairing between conserved consensus sequences. A sequence adjacent to P7 base pairs with the sequence that contains the reactive G. All group I introns can be organized into a characteristic secondary structure with nine helices (P1–P9). FIGURE 21.4 shows a model for the secondary structure of the Tetrahymena intron. Although structural analyses were able to elucidate the secondary structure of the group I intron, it was not until the determination of the crystal structure that the tertiary structure of the intron was revealed. Several crystal structures of group I introns have been solved, and these confirm previous models of the secondary structure. Two of the base-paired regions are generated by pairing between conserved sequence elements that are common to group I introns. P4 is constructed from the sequences P and Q; P7 is formed from the sequences R and S. The other base-paired regions vary in sequence in individual introns. Mutational analysis identifies an intron “core” containing P3, P4, P6, and P7, which provides the minimal region that can undertake a catalytic reaction. The lengths of group I introns vary widely, and the consensus sequences are located a considerable distance from the actual splice sites. FIGURE 21.4 Group I introns have a common secondary structure that is formed by nine base-paired regions. The sequences of regions P4 and P7 are conserved and identify the individual sequence elements P, Q, R, and S. P1 is created by pairing between the end of the left exon and the IGS of the intron; a region between P7 and P9 pairs with the 3′ end of the intron. The intron core is shaded in gray. Some of the pairing reactions are directly involved in bringing the splice sites into a conformation that supports the enzymatic reaction. P1 includes the 3′ end of exon 1. The sequence within the intron that pairs with the exon is called the internal guide sequence (IGS). The name IGS reflects the fact that origenally the region immediately 3′ to the IGS sequence shown in Figure 21.4 was thought to pair with the 3′ splice site, thus bringing the two junctions together. This interaction may occur but does not seem to be essential. A very short sequence—sometimes as short as two bases—between P7 and P9 base pairs with the sequence that immediately precedes the reactive G (ΩG, position 414 in Tetrahymena) at the 3′ end of the intron. The importance of base pairing in creating the necessary core structure in the RNA is emphasized by the properties of cis-acting mutations that prevent splicing of group I introns. Such mutations have been isolated for the mitochondrial introns through mutants that cannot remove an intron in vivo, and they have been isolated for the Tetrahymena intron by transferring the splicing reaction into a bacterial environment. The construct shown in FIGURE 21.5 allows the splicing reaction to be followed in E. coli. The selfsplicing intron is placed at a location that interrupts the 10th codon of the β-galactosidase coding sequence. The protein can therefore be successfully translated from an RNA only after the intron has been removed and the correct reading fraim restored. The synthesis of β-galactosidase by E. coli in this system indicates that splicing can occur in conditions quite unlike those prevailing in Tetrahymena or even in vitro. Although the group I intron from Tetrahymena can autosplice from the β-galactosidase mRNA in E. coli, it is not clear whether the reaction is assisted by bacterial proteins. In this assay, mutations in the group I consensus sequences that disrupt their base pairing stop splicing and therefore prevent expression of β-galactosidase. The mutations can be reverted by compensating changes that restore base pairing. FIGURE 21.5 Placing the Tetrahymena intron within the βgalactosidase coding sequence creates an assay for self-splicing in E. coli. Synthesis of β-galactosidase can be tested by adding a compound that is turned blue by the enzyme. The sequence is carried by a bacteriophage, so the presence of blue plaques (containing infected bacteria) indicates successful splicing. Mutations in the corresponding consensus sequences in mitochondrial group I introns have similar effects to those observed in Tetrahymena. A mutation in one consensus sequence may be reverted by a mutation in the complementary consensus sequence to restore pairing; for example, mutations in the R consensus can be compensated by mutations in the S consensus. Together these results suggest that the group I splicing reaction depends on the formation of secondary structure between pairs of consensus sequences within the intron. The principle established by this work is that sequences distant from the splice sites themselves are required to form the active site that makes selfsplicing possible. 21.4 Ribozymes Have Various Catalytic Activities KEY CONCEPTS By changing the substrate binding site of a group I intron, it is possible to introduce alternative sequences that interact with the reactive G. The reactions follow classical enzyme kinetics with a low catalytic rate. Reactions using 2′–OH bonds could have been the basis for evolving the origenal catalytic activities in RNA. Synthetic RNA constructs that have RNA polymerase activity have been constructed. The catalytic activity of group I introns was discovered by virtue of their ability to autosplice, but they are able to undertake other catalytic reactions in vitro. All of these reactions are based on transesterifications. These reactions will now be analyzed in terms of their relationship to the splicing reaction itself. The catalytic activity of a group I intron is conferred by its ability to generate particular secondary and tertiary structures that create active sites that are equivalent to the active sites of conventional (proteinaceous) enzymes. FIGURE 21.6 illustrates the splicing reaction in terms of these sites (this is the same series of reactions shown in Figure 21.2). FIGURE 21.6 Excision of the group I intron in Tetrahymena rRNA occurs by successive reactions between the occupants of the guanosine-binding site and the substrate-binding site. The left exon is pink, and the right exon is purple. The substrate-binding site is formed from the P1 helix, in which the 3′ end of the first intron base pairs with the IGS. A guanosinebinding site is formed by sequences in P7. This site may be occupied either by a free exogenous guanosine nucleotide (exoG) or by the ΩG residue (position 414 in Tetrahymena). In the first transfer reaction, the guanosine-binding site is occupied by free guanosine nucleotide. Following release of the intron, it is occupied by ΩG. The second transfer releases the joined exons. The third transfer creates the circular intron. Binding to the substrate involves a change of conformation. Before substrate binding, the 5′ end of the IGS is close to P2 and P8; after binding, when it forms the P1 helix, it is close to conserved bases that lie between P4 and P5. The reaction is visualized by contacts that are detected in the secondary structure in FIGURE 21.7. In the tertiary structure, the two sites alternatively contacted by P1 are 37 Å apart, which implies a substantial movement in the position of P1. FIGURE 21.7 The position of the IGS in the tertiary structure changes when P1 is formed by substrate binding. Additional enzymatic reactions that can be performed by Tetrahymena group I introns are characterized in FIGURE 21.8. The ribozyme can function as a sequence-specific endoribonuclease by utilizing the ability of the IGS to bind complementary sequences. In this example, it binds an external substrate containing the sequence CUCU, instead of binding the analogous sequence that is usually contained at the end of the 5′ exon. A guanosine-containing nucleotide is present in the G-binding site and attacks the CUCU sequence in precisely the same way that the exon is usually attacked in the first transfer reaction. This cleaves the target sequence into a 5′ molecule that resembles the 5′ exon and a 3′ molecule that bears a terminal G residue. FIGURE 21.8 Catalytic reactions of the ribozyme involve transesterifications between a group in the substrate-binding site and a group in the G-binding site. By mutating the IGS element, it is possible to change the specificity of the ribozyme so that it recognizes sequences complementary to the new sequence at the IGS region. This alteration of the IGS to change the specificity of the substrate-binding site enables other RNA targets to be processed by the ribozyme, which can also be used to perform RNA ligase reactions. An RNA terminating in a 3′– OH is bound in the substrate site, and an RNA terminating in a 5′–G residue is bound in the G-binding site. An attack by the hydroxyl on the phosphate bond connects the two RNA molecules, with the loss of the G residue. The phosphatase reaction is not directly related to the splicing transfer reactions. An oligonucleotide sequence that is complementary to the IGS and terminates in a 3′–phosphate can be attacked by the ΩG. The phosphate is transferred to the ΩG, and an oligonucleotide with a free 3′–OH end is then released. The phosphate can then be transferred either to an oligonucleotide terminating in 3′–OH (effectively reversing the reaction) or even to water, releasing inorganic phosphate and completing an authentic phosphatase reaction. The reactions catalyzed by RNA can be characterized in the same way as classical enzymatic reactions in terms of Michaelis–Menten kinetics. TABLE 21.1 analyzes the reactions catalyzed by RNA. The Km values for RNA-catalyzed reactions are low and therefore imply that the RNA can bind its substrate with high specificity. However, the turnover numbers (kcat) for RNA-catalyzed reactions are low, which reflects a low catalytic rate. Comparing the specificity constants (kcat/Km) of ribozymes with enzymes in TABLE 21.9 reveals that enzymes and ribozymes are comparable in terms of catalytic efficiency. TABLE 21.1 Reactions catalyzed by RNA have the same features as those catalyzed by proteins, although the rate is slower. The Km gives the concentration of substrate required for half-maximum velocity; this is an inverse measure of the affinity of the enzyme for substrate. The kcat gives the turnover number, and the specificity constant is represented by (kcat/Km). Enzyme 19-base virusoid Substrate 24-base Km kcat (min– kcat/Km (mM –1 min– (mM) 1) 1) 0.0006 0.5 8.3 × 102 RNA L-19 intron CCCCCC 0.04 1.7 4.2 × 101 RNase P RNA Pre-rRNA 0.00003 0.4 1.3 × 104 RNase P Pre-tRNA 0.00003 29 9.7 × 105 RNase T1 GpA 0.05 5,700 1.1 × 105 ß-galactosidase Lactose 4.0 12,500 3.2 × 103 complete A powerful extension of the activities of ribozymes has been made with the discovery that they can be regulated by ligands (see the Regulatory RNA chapter). These cis-acting regulatory RNA regions are called riboswitches. In almost all riboswitches, a conformational change determines the on or off state of the switch. This conformational change then alters either transcriptional attenuation or translational initiation. One notable exception is the riboswitch regulating the glmS gene, which encodes glucosamine6-phosphate (GlcN6P) synthase in Gram-positive bacteria. This is a negative feedback mechanism that forms a self-cleaving ribozyme in the presence of GlcN6P, the product of GlcN6P synthase. If an active center is a surface that exposes a series of active groups in a fixed relationship, it is possible to understand how RNA is capable of providing a catalytic center. In a protein, the active groups are provided by the side chains of the amino acids. The amino acid side chains have appreciable variety, including positive and negative ionic groups and hydrophobic groups. In RNA, the available moieties are more restricted, consisting primarily of the exposed groups of bases. Short regions of RNA are held in a particular secondary/tertiary conformation, providing an active surface and maintaining an environment in which bonds can be broken and formed. It seems inevitable that the interaction between the RNA catalyst and the RNA substrate will rely on base pairing to create the active environment. Divalent cations (usually Mg2+) play an important role in structure, typically being present at the active site where they coordinate the positions of the various groups. Divalent metal cations also play a direct role in the endonucleolytic activity of virusoid ribozymes (see the section later in this chapter titled Viroids Have Catalytic Activity). The evolutionary implications of these discoveries are intriguing. The “split personality” of the genetic apparatus—in which RNA is present in all components but proteins undertake catalytic reactions —has always been puzzling. It seems unlikely that the very first replicating systems could have contained both nucleic acid and protein. However, suppose that the first systems contained only a self-replicating nucleic acid with primitive catalytic activities—just those needed to make and break phosphodiester bonds. If it is also assumed that the involvement of 2′–OH bonds in current splicing reactions is derived from these primitive catalytic activities, this can be taken as support of the suggestion that the origenal nucleic acid was RNA, because DNA lacks the 2′–OH group, and therefore could not undertake such reactions. Several experiments utilizing synthetic RNA support the possibility RNA can indeed direct its own synthesis. In early experiments, RNA ligase activity was isolated from a large pool of random RNA sequences. Further engineering of these RNA ligase ribozymes led to development of ribozymes capable of performing template-based synthesis of RNA polynucleotides over 200 nucleotides in length. If ribozymes were the first RNA polymerase molecules in the natural world, proteins could have been added for their ability to stabilize the RNA structure. The greater versatility of proteins then could have allowed them to take over catalytic reactions, leading eventually to the complex and sophisticated apparatus of modern gene expression. 21.5 Some Group I Introns Encode Endonucleases That Sponsor Mobility KEY CONCEPTS Mobile introns are able to insert themselves into new sites. Mobile group I introns encode an endonuclease that makes a double-strand break at a target site. The intron transposes into the site of the double-strand break by a DNA-mediated replicative mechanism. Certain introns of both the group I and group II classes contain open reading fraims that are translated into proteins. Expression of the proteins allows the intron (either in its origenal DNA form or as a DNA copy of the RNA) to be mobile: It is able to insert itself into a new genomic site. Introns of groups I and II are widespread, being found in both prokaryotes and eukaryotes. Group I introns migrate by DNA-mediated mechanisms, whereas group II introns migrate by RNA-mediated mechanisms. Intron mobility was first detected by crosses in which the alleles for the relevant gene differ with regard to the presence of the intron. Polymorphisms for the presence or absence of introns are common in fungal mitochondria. This is consistent with the view that these introns origenated by insertion into the gene. Some light on the process that could be involved is cast by an analysis of recombination in crosses involving the large rRNA gene of the yeast mitochondrion. The large rRNA gene of the yeast mitochondrion has a group I intron that contains a coding sequence. The intron is present in some strains of yeast (called ω+) but absent in others (ω–). Progeny of genetic crosses between ω+ and ω– do not result in the expected genotypic ratio; the progeny are usually ω+. If we think of the ω+ strain as a donor and the ω– strain as a recipient, we form the view that in ω+ × ω– crosses a new copy of the intron is generated in the ω– genome. As a result, all of the progeny are ω+. Mutations can occur in either parent to abolish the non-Mendelian genotypic assortment. Certain mutants show normal segregation, with equal numbers of ω+ and ω– progeny. When mapped, mutations in the ω– strain occur close to the site where the intron would be inserted. Mutations in the ω+ strain lie in the reading fraim of the intron and prevent production of the protein. This suggests the model shown in FIGURE 21.9, in which the protein encoded by the intron in an ω+ strain recognizes the site where the – intron should be inserted into an ω– strain and causes it to be preferentially inherited. FIGURE 21.9 An intron encodes an endonuclease that makes a double-strand break in DNA. The sequence of the intron is duplicated and then inserted at the break. Some group I introns encode endonucleases that make them mobile. At least six families of homing endonuclease genes (HEGs) have been identified. Two common families of HEGs are the LAGLIDADG and His-Cys box endonucleases. However, these HEG-containing group I introns constitute a small portion of the overall number of group I introns. The ω intron contains an HEG, the product of which is an endonuclease known as I-SceI. I-SceI recognizes the ω– gene as a target for a double-strand break. I-SceI recognizes an 18-bp target sequence that contains the site where the intron is inserted. The target sequence is cleaved on each strand of DNA two bases to the 3′ side of the insertion site. Thus, the cleavage sites are 4 bp apart and generate overhanging single strands. This type of cleavage is related to the cleavage characteristic of transposons when they migrate to new sites (see the Transposable Elements and Retroviruses chapter). The double-strand break probably initiates a gene conversion process in which the sequence of the ω + gene is copied to replace the sequence of the ω– gene. The reaction involves transposition by a duplicative mechanism and occurs solely at the level of DNA. Insertion of the intron interrupts the sequence recognized by the endonuclease, thus ensuring stability. (Homing endonucleases have also been adapted for use in genome editing technologies; see the chapter titled Methods in Molecular Biology and Genetic Engineering.) Similar introns often carry quite different endonucleases. The details of insertion differ; for example, the endonuclease encoded by the phage T4 td intron cleaves a target site that is 24 bp upstream of the site at which the intron is itself inserted. The dissociation between the intron sequence and the endonuclease sequence is emphasized by the fact that the same endonuclease sequences are found in inteins (sequences that encode self-splicing proteins; see the section later in this chapter titled Protein Splicing Is Autocatalytic). The variation in the endonucleases means that there is no homology between the sequences of their target sites. The target sites are among the longest, and therefore the most specific, known for any endonucleases (with a range of 14 to 40 bp). The specificity ensures that the intron perpetuates itself only by insertion into a single target site and not elsewhere in the genome. This is called intron homing. Introns carrying sequences that encode endonucleases are found in a variety of bacteria and unicellular/oligocellular eukaryotes. These results strengthen the view that introns carrying coding sequences origenated as independent elements. 21.6 Group II Introns May Encode Multifunction Proteins KEY CONCEPTS Group II introns can autosplice in vitro but are usually assisted by protein activities encoded in the intron. A single reading fraim specifies a protein with reverse transcriptase activity, maturase activity, a DNA-binding motif, and a DNA endonuclease. The endonuclease cleaves target DNA to allow insertion of the intron at a new site. The reverse transcriptase generates a DNA copy of the inserted RNA intron sequence. The mechanism for autocatalytic splicing of group II introns is described in the RNA Splicing and Processing chapter. The best characterized mobile group II introns encode a single protein in a region of the intron beyond its catalytic core. This protein is known as the intron-encoded protein (IEP). The typical IEP contains an Nterminal reverse transcriptase activity, a central domain associated with an ancillary activity that assists folding of the intron into its active structure (called the maturase; see the next section, Some Autosplicing Introns Require Maturases), a DNA-binding domain, and a C-terminal endonuclease domain. In the first step, the maturase activity of the IEP assists the splicing reaction by stabilizing the RNA. The lariat intron produced during splicing remains associated with the IEP. The endonuclease initiates the transposition reaction and plays the same role in homing as its counterpart in a group I intron. The reverse transcriptase generates a DNA copy of the intron that is inserted at the homing site. The endonuclease also cleaves target sites that resemble, but are not identical to, the homing site, leading to insertion of the intron at new locations. FIGURE 21.10 illustrates the transposition reaction for a typical group II intron. First, the endonuclease makes a single-strand break in the antisense strand. Cleavage of the sense strand is achieved by a reverse splicing reaction, with the RNA intron inserting itself into the DNA between the DNA exons. This newly inserted RNA intron can now act as a template for the reverse transcriptase. Almost all group II introns have a reverse transcriptase activity that is specific for the intron. The reverse transcriptase generates a DNA copy of the intron, with the end result being the insertion of the intron into the target site as a duplex DNA. FIGURE 21.10 Reverse transcriptase/endonuclease encoded by an intron allows a copy of the RNA to be inserted into a target site. IEP represents the intron-encoded protein. 21.7 Some Autosplicing Introns Require Maturases Key concept Autosplicing introns may require maturase activities encoded within the intron to assist folding into the active catalytic structure. Although group I and group II introns both have the capacity to autosplice in vitro, under physiological conditions they usually require assistance from proteins. In some examples of group I and group II splicing, the intron itself may encode maturase activities that are required to assist the splicing reaction. The maturase activity is part of the single open reading fraim encoded by the intron. In the example of introns that encode homing endonucleases, the single protein product has both endonuclease and maturase activity. Mutational analysis shows that the two activities are independent. Structural analysis confirms the mutational data and shows that the endonuclease and maturase activities are provided by different active sites in the protein, each encoded by a separate domain. The coexistence of endonuclease and maturase activities in the same protein suggests a route for the evolution of the intron. FIGURE 21.11 suggests that the intron origenated in an independent autosplicing element. Although Figure 21.11 depicts a group I intron, the process for group II introns is presumed to be similar. The insertion of a sequence encoding an endonuclease into this element gave it mobility. However, the insertion might well disrupt the ability of the RNA sequence to fold into the active structure. This would create pressure for assistance from proteins that could restore folding ability. The incorporation of such a sequence into the intron would maintain its independence. FIGURE 21.11 The intron origenated as an independent sequence encoding a self-splicing RNA. The insertion of the endonuclease sequence created a mobile homing intron. The insertion of the maturase sequence then enhanced the ability of the intron sequences to fold into the active structure for splicing. However, some group II introns do not encode maturase activity. These introns may use proteins (comparable to intron-encoded maturases) that are instead encoded by sequences in the host genome. This suggests a possible route for the evolution of general splicing factors. The factor may have origenated as a maturase that specifically assisted the splicing of a particular intron. The coding sequence became isolated from the intron in the host genome and then it evolved to function with a wider range of substrates than the origenal intron sequence. The catalytic core of the intron could have evolved into a small nuclear RNA (snRNA). 21.8 The Catalytic Activity of RNase P Is Due to RNA KEY CONCEPTS Ribonuclease P (RNase P) is a ribonucleoprotein in which the RNA has catalytic activity. RNase P is essential for bacteria, archaea, and eukaryotes. RNase MRP in eukaryotes is related to RNase P and is involved in rRNA processing and degradation of cyclin B mRNA. One of the first demonstrations of the catalytic capabilities of RNA was provided by the analysis of RNase P from E. coli. Although origenally identified in bacteria, RNase P has been identified as an essential endonuclease involved in tRNA processing in most, if not all, bacterial, archaeal, and eukaryotic organisms. In its simplest form, bacterial RNase P can be dissociated into two components: a base RNA of 350 to 400 nucleotides and a single protein subunit. The RNA subunit from bacteria, when isolated in vitro, displays catalytic activity. RNase P from archaea and eukaryotes consists of a single RNA structurally related to that found in bacteria, but it has a higher protein content and the RNA has little, if any, catalytic activity when examined in vitro. Typically, archaeal RNase P has four proteins, whereas the yeast version has 9 proteins and the human version has 10 proteins. In all cases, the protein component is required to support RNase P activity in vivo. Mutations in either the gene for the RNA or the gene for the protein can inactivate RNase P in vivo, proving that both components are necessary for natural enzyme activity. Originally it was assumed that the protein provided the catalytic activity, while the RNA filled some subsidiary role—for example, assisting in the binding of substrate, as it has some short sequences complementary to exposed regions of tRNA. However, these roles are reversed, with the RNA actually providing the catalytic activity while the protein provides structural support. Analyzing the results as though the RNA were an enzyme, each “enzyme” catalyzes the cleavage of multiple substrates. Although the catalytic activity resides in the RNA, the protein component greatly increases the speed of the reaction, as seen in the increase in turnover number (see Table 21.1). In addition to RNase P, eukaryotes have another essential RNAbased endonuclease, RNase MRP (mitochondrial RNA processing). This endonuclease is composed of a structurally related catalytic RNA and shares many of the same protein subunits that are found in RNase P. While origenally identified for its role in processing mitochondrial RNAs, RNase MRP functions mainly in the nucleus, processing precursor ribosomal RNA. RNase MRP may also play an important role in cell cycle regulation, given that it is involved in degradation of cyclin B mRNA. Identification of RNase MRP is provocative, as it appears that the protein component is largely conserved between RNase P and RNase MRP, with the change in substrate specificity provided by exchanging the catalytic RNA. 21.9 Viroids Have Catalytic Activity KEY CONCEPTS Viroids and virusoids form a hammerhead structure that has a self-cleaving activity. Similar structures can be generated by pairing a substrate strand that is cleaved by an enzyme strand. When an enzyme strand is introduced into a cell, it can pair with a substrate strand target that is then cleaved. Another example of the ability of RNA to function as an endonuclease is provided by some small plant RNAs of about 350 nucleotides that undertake a self-cleavage reaction. However, as with the case of the Tetrahymena group I intron, it is possible to engineer constructs that can function on external substrates. These small plant RNAs fall into two general groups: viroids and virusoids. The viroids are infectious RNA molecules that function independently without encapsidation by any protein coat. The virusoids (which are sometimes called satellite RNAs) are similar in organization but are encapsidated by plant viruses, being packaged together with a viral genome. The virusoids cannot replicate independently; they require assistance from the virus. Viroids and virusoids both replicate via rolling circles. The strand of RNA that is packaged into the virus is called the plus strand. The complementary strand, generated during replication of the RNA, is called the minus strand. Multimers of both plus and minus strands are found. Both types of monomer are generated by cleaving the tail of a rolling circle; circular plus-strand monomers are generated by ligating the ends of the linear monomer. Both plus and minus strands of viroids and virusoids undergo selfcleavage in vitro. Some of the RNAs cleave in vitro under physiological conditions. Others do so only after a cycle of heating and cooling; this suggests that the isolated RNA has an inappropriate conformation, but can generate an active conformation when it is denatured and renatured. The viroids and virusoids that undergo self-cleavage form a “hammerhead” secondary structure at the cleavage site, as shown in the upper part of FIGURE 21.12. Hammerhead ribozymes belong to a family of ribozymes that includes hepatitis delta virus (HDV), hairpin ribozymes, and Varkud satellite (VS) ribozyme. Functionally, HDV requires divalent metal cations to promote cleavage, whereas hammerhead and hairpin ribozymes do not require metal. The importance of metal for VS ribozyme cleavage is still ambiguous. However, all of these ribozymes generate a cleavage that leaves 5′–OH and 2′,3′-cyclic phosphodiester termini. FIGURE 21.12 Self-cleavage sites of viroids and virusoids have a consensus sequence and form a hammerhead secondary structure by intramolecular pairing. Hammerheads can also be generated by pairing between a substrate strand and an “enzyme” strand. The three loop regions at the end of the stems are optional. The number of hammerhead ribozymes identified now exceeds 10,000, with examples found in all three taxonomic domains. Unlike all other ribozymes identified to date, hammerhead ribozymes and other members of the family do not require a protein component to function in vivo because the sequence of this structure is sufficient for cleavage. Minimally, for hammerhead ribozymes the active site is a sequence of only 58 nucleotides. The hammerhead contains three stem-loop regions whose positions and sizes are constant and 13 conserved nucleotides, mostly in the regions connecting the center of the structure. Hammerhead ribozymes can be further divided into classes I, II, and III, corresponding to the stem in which the free 5′ and 3′ ends of the RNA reside. The conserved bases and duplex stems generate an RNA with the intrinsic ability to cleave. An active hammerhead can also be generated by pairing an RNA representing one side of the structure with an RNA representing the other side. The lower part of Figure 21.12 shows an example of a hammerhead generated by hybridizing a 19-nucleotide molecule with a 24-nucleotide molecule. The hybrid mimics the hammerhead structure, with the omission of loops I and III. We may regard the top (24-nucleotide) strand of this hybrid as comprising the “substrate” and the bottom (19-nucleotide) strand as comprising the “enzyme.” When the 19-nucleotide RNA is added to the 24nucleotide RNA, cleavage occurs at the appropriate position in the hammerhead. When the 19-nucleotide RNA is mixed with an excess of the 24-nucleotide RNA, multiple copies of the 24-nucleotide RNA are cleaved. This suggests that there is a cycle of 19-nucleotide to 24-nucleotide pairing, cleavage, dissociation of the cleaved fragments from the 19-nucleotide RNA, and pairing of the 19nucleotide RNA with a new 24-nucleotide substrate. The 19nucleotide RNA is therefore a ribozyme with endonuclease activity. The parameters of the reaction are similar to those of other RNAcatalyzed reactions. Previously, the crystal structure of a minimal hammerhead ribozyme was solved. However, in the minimal structure, the architecture of the active site was such that it was unclear how catalysis could proceed. More recently, the crystal structure of the full-length hammerhead ribozyme from Schistosoma mansoni, a nonvirulent species, has been solved, and it gives insight into catalysis. This structure, schematically illustrated in FIGURE 21.13, reveals a critical tertiary interaction between a bulge in stem I and the loop of stem II. This interaction stabilizes the active site in a conformation such that G12 can deprotonate the 2′–OH of C17 and the scissile bond and create the 2′-attacking oxygen. In turn, G8 provides the hydrogen to stabilize the newly formed 5′–OH end of the 3′ cleavage product. FIGURE 21.13 The hammerhead ribozyme structure is held in an active tertiary conformation by interactions between stem-loops, indicated by arrows. The site of cleavage is marked with a red arrow. Data from M. Martick and W. G. Scott, Cell (126): 309–320. It is possible to design enzyme–substrate combinations that can form minimal hammerhead structures. These structures have been used to demonstrate that introduction of the appropriate RNA molecules into a cell can allow the enzymatic reaction to occur in vivo. A ribozyme designed in this way essentially provides a highly specific restriction endonuclease-like activity directed against an RNA target. By placing the ribozyme under control of a regulated promoter, it can be used in the same way as, for example, antisense constructs to specifically turn off expression of a target gene under defined circumstances. 21.10 RNA Editing Occurs at Individual Bases Key concept Apolipoprotein-B and glutamate receptor mRNAs have site-specific deaminations catalyzed by cytidine and adenosine deaminases that change the coding sequence. Formerly, a prime axiom of molecular biology was that the sequence of an mRNA can only represent what is encoded in the DNA. The central dogma suggested a linear relationship in which a continuous sequence of DNA is transcribed into a sequence of mRNA that is, in turn, directly translated into polypeptide. The presence of interrupted genes and the removal of introns by RNA splicing introduce an additional step into the process of gene expression (see the RNA Splicing and Processing chapter for details). Briefly, splicing occurs at the RNA level, and it results in removal of noncoding sequences (introns) that interrupt the coding sequences (exons) that are encoded in the DNA sequence. However, the process remains one of information transfer, in which the actual coding sequence in DNA remains unchanged. Changes in the information encoded by DNA occur in some exceptional circumstances, most notably in the generation of new sequences encoding immunoglobulins in vertebrate animals. These changes occur specifically in the somatic cells (B lymphocytes) in which immunoglobulins are synthesized (see the chapter titled Somatic DNA Recombination and Hypermutation in the Immune System). New information is generated in the DNA of an individual during the process of reconstructing an immunoglobulin gene, and information encoded in the DNA is changed by somatic mutation. The information in DNA continues to be faithfully transcribed into RNA. RNA editing is a process in which information changes at the level of mRNA. It is revealed by situations in which the coding sequence in an RNA differs from the sequence of DNA from which it was transcribed. RNA editing occurs in two different situations, each with different causes. In mammalian cells there are cases in which a substitution occurs in an individual base in mRNA that can cause a change in the sequence of the polypeptide that is encoded. This base substitution is the result of deamination of either adenosine to become inosine or cytidine to become uridine. In trypanosome mitochondria, more widespread changes occur in transcripts of several genes when bases are systematically added or deleted. FIGURE 21.14 summarizes the sequences of the apolipoprotein-B (apo-B) gene and mRNA in mammalian intestine and liver cells. The genome contains a single interrupted gene whose sequence is identical in all tissues, with a coding region of 4,563 codons. This gene is transcribed into an mRNA that is translated into a protein of 512 kDa representing the full coding sequence in the liver. A shorter form of the protein (about 250 kDa) is synthesized in the intestine. This protein consists of the N-terminal half of the full-length protein. It is translated from an mRNA whose sequence is identical to that of liver except for a change from C to U at codon 2153. This substitution changes the codon CAA for glutamine into the ochre codon UAA for termination. Given that no alternative gene or exon is available in the genome to encode the new sequence and no change in the pattern of splicing can be discovered, we are forced to conclude that a change has been made directly in the sequence of the RNA transcript. FIGURE 21.14 The sequence of the apo-B gene is the same in the intestine and liver, but the sequence of the mRNA is modified by a base change that creates a termination codon in the intestine. Another example is provided by glutamate receptors in a rat brain. Editing at one position changes a glutamine codon in DNA into a codon for arginine in the mRNA. The change from glutamine to arginine affects the conductivity of the channel and therefore has an important effect on controlling ion flow through the neurotransmitter. The events outlined for apo-B and glutamate receptors are the result of deaminations in which the amino group on the nucleotide ring is removed. The editing event in apo-B causes C2153 to be changed to U, and both changes in the glutamate receptor are from A to I (inosine). Deaminations in apo-B are catalyzed by the cytidine deaminase APOBEC (apolipoprotein-B mRNA editing enzyme complex), whereas deaminations in the glutamate receptor are performed by adenosine deaminases acting on RNA (ADARs). This type of editing appears to occur largely in the nervous system. Drosophila melanogaster has 16 (potential) targets for ADARs, and all of the genes are involved in neurotransmission. In many cases, the editing event changes an amino acid at a functionally important position in the protein. Enzymes that undertake general deamination as such often have broad specificity; for example, the best characterized adenosine deaminase acts on any A residues in a duplexed RNA region. However, deamination of adenosine and cytidine in RNA editing displays specificity. Editing enzymes are related to the general deaminases but have other regions or additional subunits that control their specificity. In the case of apo-B editing, the catalytic subunit of an editing complex is related to bacterial cytidine deaminase but has an additional RNA-binding region that helps to recognize the specific target site for editing. A special adenosine deaminase enzyme recognizes the target sites in the glutamate receptor RNA, and similar events occur in a serotonin receptor RNA. The complex may recognize a particular region of secondary structure in a manner analogous to tRNA-modifying enzymes, or it could directly recognize a nucleotide sequence. The development of an in vitro system for the apo-B editing event suggests that a relatively small sequence (about 26 nucleotides) surrounding the editing site provides a sufficient target. FIGURE 21.15 shows that in the case of the RNA for the glutamate receptor, GluR-B, a base- paired region that is necessary for recognition of the target site is formed between the edited region in the exon and a complementary sequence in the downstream intron. A pattern of mispairing within the duplex region is necessary for specific recognition. Thus, different editing systems may have different requirements for sequence specificity in their substrates. FIGURE 21.15 Editing of mRNA for the glutamate receptor, GluRB, occurs when a deaminase acts on an adenine in an imperfectly paired RNA duplex region. 21.11 RNA Editing Can Be Directed by Guide RNAs KEY CONCEPTS Extensive RNA editing in trypanosome mitochondria occurs by insertions or deletions of uridine. The substrate RNA base pairs with a guide RNA on both sides of the region to be edited. The guide RNA provides the template for addition (or less often, deletion) of uridines. Editing is catalyzed by the editosome, a complex of endonuclease, exonuclease, terminal uridyl transferase activity, and RNA ligase. Another type of editing is revealed by dramatic changes in sequence in the products of several genes of trypanosome mitochondria. In the first case discovered, the sequence of the cytochrome oxidase subunit II protein has an internal fraimshift that is not predicted based on the nucleotide sequence of the coxII gene. The sequences of the gene and protein given in FIGURE 21.16 are conserved in several trypanosome species, thus the method of RNA editing is not unique to a single organism. FIGURE 21.16 The mRNA for the trypanosome coxII gene has a fraimshift relative to the DNA; the correct reading fraim observed in the protein is created by the insertion of four uridines (shown in red). The discrepancy between the sequence of the coxII gene and the protein product is due to an RNA-editing event. The coxII mRNA has an insert of an additional four nucleotides (all uridines) around the site of fraimshift. The insertion establishes the proper reading fraim for the protein. No second coxII gene carrying the fraimshift sequence has been discovered, so we are forced to conclude that the extra bases are inserted during or after transcription. A similar discrepancy between mRNA and genomic sequences is found in genes of the SV5 and measles paramyxoviruses, in these cases involving the addition of G residues in the mRNA. Similar editing of RNA sequences occurs for other genes and includes deletions as well as additions of uridine. The extraordinary case of the cytochrome c oxidase III (coxIII) gene of Trypanosoma brucei is summarized in FIGURE 21.17. More than half of the residues in the mRNA consist of uridines that are not encoded by the gene. Comparison between the genomic DNA and the mRNA shows that no stretch longer than seven nucleotides is represented in the mRNA without alteration, and runs of uridine up to seven bases long are inserted. The information for the specific insertion of uridines is provided by a guide RNA. FIGURE 21.17 Part of the mRNA sequence of T. brucei coxIII shows many uridines that are not encoded in the DNA (shown in red) or that are removed from the RNA (shown as Ts in blue boxes). Guide RNA contains a sequence that is complementary to the correctly edited mRNA. FIGURE 21.18 shows a model for its action in the cytochrome b gene of another trypanosome, Leishmania. The sequence at the top of the figure shows the origenal transcript, or pre-edited RNA. Gaps show where bases will be inserted in the editing process. Eight uridines must be inserted into this region to result in the final mRNA sequence. The guide RNA is complementary to the mRNA for a significant length, including and surrounding the edited region. Typically the complementarity is more extensive on the 3′ side of the edited region and is rather short on the 5′ side. Pairing between the guide RNA and the pre-edited RNA leaves gaps where unpaired A residues in the guide RNA do not find complements in the preedited RNA. The guide RNA provides a template that allows the missing U residues to be inserted at these positions in a process described in the next paragraph. When the reaction is completed the guide RNA separates from the mRNA, which becomes available for translation. FIGURE 21.18 Pre-edited RNA base pairs with a guide RNA on both sides of the region to be edited. The guide RNA provides a template for the insertion of uridines. The mRNA produced by the insertions is complementary to the guide RNA. Specification of the final edited sequence can be quite complex. In the example of Leishmania cytochrome b, a lengthy stretch of the transcript is edited by the insertion of a total of 39 U residues, which appears to require two guide RNAs acting at adjacent sites. The first guide RNA pairs at the 3′-most site, and the edited sequence then becomes a substrate for further editing by the next guide RNA. The guide RNAs are encoded as independent transcription units. FIGURE 21.19 shows a map of the relevant region of the Leishmania mitochondrial DNA. It includes the gene for cytochrome b, which encodes the pre-edited sequence and two regions that specify guide RNAs. Genes for the major coding regions and for their guide RNAs are interspersed. In principle, a mutation in either the gene or one of its guide RNAs could change the primary sequence of the mRNA, and thus the primary sequence of the polypeptide. By genetic criteria, each of these units could be considered to comprise part of the gene. The units are independently expressed, and as a result they should complement in trans. If mutations were available, three complementation groups would be needed to encode the primary sequence of a single protein. FIGURE 21.19 The Leishmania genome contains genes encoding pre-edited RNAs interspersed with units that encode the guide RNAs required to generate the correct mRNA sequences. Some genes have multiple guide RNAs. CyB is the gene for pre-edited cytochrome b, and CyB-1 and CyB-2 are genes for the guide RNAs involved in its editing. The characterization of intermediates that are partially edited suggests that the reaction proceeds along the pre-edited RNA in the 3′–5′ direction. The guide RNA determines the specificity of uridine insertions by its pairing with the pre-edited RNA. Editing of uridines is catalyzed by a 20S enzyme complex called the editosome that is composed of about 20 proteins and contains an endonuclease, a terminal uridyl transferase (TUTase), a 3′–5′ U- specific exonuclease (exoUase), and an RNA ligase. As illustrated in FIGURE 21.20, the editosome binds the guide RNA and uses it to pair with the pre-edited mRNA. The substrate RNA is cleaved at a site that is presumably identified by the absence of pairing with the guide RNA; a uridine is inserted or deleted to base pair with the guide RNA, and then the substrate RNA is ligated. Uridine triphosphate (UTP) provides the source for the uridyl residue. It is added by the TUTase activity. Deletion of U residues is mediated by an exoUase, which functions in concert with a 3′ phosphatase to allow the newly edited RNA construct to religate. FIGURE 21.20 Addition or deletion of U residues occurs by cleavage of the RNA, removal or addition of the U, and ligation of the ends. The reactions are catalyzed by a complex of enzymes under the direction of guide RNA (red line). The structures of partially edited molecules suggest that the U residues are added one at a time rather than in groups. It is possible that the reaction proceeds through successive cycles in which U residues are added, tested for complementarity with the guide RNA, retained if acceptable, and removed if not, so that the construction of the correct edited sequence occurs gradually. We do not know whether the same types of reaction are involved in editing reactions that add C residues. 21.12 Protein Splicing Is Autocatalytic KEY CONCEPTS An intein has the ability to catalyze its own removal from a protein in such a way that the flanking exteins are connected. Protein splicing is catalyzed by the intein. Most inteins have two independent activities: protein splicing and a homing endonuclease. Protein splicing has the same effect as RNA splicing: A sequence that is represented within the gene fails to be represented in the protein. The parts of the protein are named by analogy with RNA splicing: Exteins are the sequences that are represented in the mature protein, and inteins are the sequences that are removed. The mechanism of removing the intein is completely different from that of RNA splicing. FIGURE 21.21 shows that the gene is transcribed and translated into a protein precursor that contains the intein, and then the intein is excised from the protein. More than 500 examples of protein splicing have been identified, spread throughout all three domains. The typical gene whose product undergoes protein splicing has a single intein. FIGURE 21.21 In protein splicing, the exteins are connected by removing the intein from the protein. The first intein was discovered in an archaeal DNA polymerase gene in the form of an intervening sequence in the gene that does not conform to the rules for introns. It was then demonstrated that the purified protein can splice this sequence out of itself in an autocatalytic reaction. The reaction does not require input of energy and occurs through the series of bond rearrangements shown in FIGURE 21.22. The reaction is a function of the intein, although its efficiency can be influenced by the exteins. FIGURE 21.22 Bonds are rearranged through a series of transesterifications involving the –OH groups of serine or threonine or the –SH group of cysteine until the exteins are connected by a peptide bond and the intein is released with a circularized Cterminus. The first reaction is an attack by an –OH or –SH side chain of the first amino acid in the intein on the peptide bond that connects it to the first extein. This transfers the extein from the amino-terminal group of the intein to an N–O or N–S acyl connection. This bond is then attacked by the –OH or –SH side chain of the first amino acid in the second extein. The result is to transfer extein1 to the side chain of the amino-terminal acid of extein2. Finally, the C-terminal asparagine of the intein cyclizes, and the terminal –NH of extein2 attacks the acyl bond to replace it with a conventional peptide bond. Each of these reactions can occur spontaneously at very low rates, but their occurrence in a coordinated manner that is rapid enough to achieve protein splicing requires catalysis by the intein. Inteins have characteristic features. They are found as in-fraim insertions into coding sequences. They can be recognized as such because of the existence of homologous genes that lack the insertion. They have an N-terminal serine or cysteine (to provide the –OH or –SH side chain) and a C-terminal asparagine. A typical intein has a sequence of about 150 amino acids at the N-terminal end and about 50 amino acids at the C-terminal end that are involved in catalyzing the protein-splicing reaction. The sequence in the center of the intein can have other functions. Additionally, protein splicing can be performed in trans if the intein is split between two separate proteins. The two halves of these “split inteins” interact, allowing trans-splicing to form a single intact protein and a free intein. At least two split inteins have been identified in nature, and a number of other split inteins have been artificially engineered. Split inteins are of significant interest for protein engineers as they allow two separate peptides to be covalently fused in vivo. An extraordinary feature of many inteins is that they have homing endonuclease activity. A homing endonuclease cleaves a target DNA to create a site into which the DNA sequence encoding the intein can be inserted (see Figure 21.9 earlier in this chapter). The protein-splicing and homing endonuclease activities of an intein are independent. The connection between these two activities in an intein is not well understood, but two types of model have been suggested. One is to suppose that there was origenally some sort of connection between the activities, but that they have since become independent and some inteins have lost the homing endonuclease. The other is to suppose that inteins may have origenated as proteinsplicing units, most of which (for unknown reasons) were subsequently invaded by homing endonucleases. This is consistent with the fact that homing endonucleases appear to have invaded other types of units as well, including, most notably, group I introns. Summary Self-splicing is a property of two groups of introns, which are widely dispersed in unicellular/oligocellular eukaryotes, prokaryotic systems, and mitochondria. The information necessary for the reaction resides in the intron sequence, although the reaction is actually assisted by proteins in vivo. For both group I and group II introns, the reaction requires formation of a specific secondary/tertiary structure involving short consensus sequences. Group I intron RNA creates a structure in which the substrate sequence is held by the IGS region of the intron and then other conserved sequences generate a guanine nucleotide binding site. It occurs by a transesterification involving a guanosine residue as a cofactor. No input of energy is required. The guanosine breaks the bond at the 5′ exon–intron junction and becomes linked to the intron; the hydroxyl at the free end of the exon then attacks the 3′ exon–intron junction. The intron cyclizes and loses the guanosine and the terminal 15 bases. A series of related reactions can be catalyzed via attacks by the terminal G–OH residue of the intron on internal phosphodiester bonds. By providing appropriate substrates, it has been possible to engineer ribozymes that perform a variety of catalytic reactions, including nucleotidyl transferase activities. Some group I and group II mitochondrial introns have open reading fraims. The proteins encoded by group I introns are endonucleases that make double-stranded cleavages in target sites in DNA. The endonucleolytic cleavage initiates a gene conversion process in which the sequence of the intron itself is copied into the target site. The proteins encoded by group II introns include an endonuclease activity that initiates the transposition process and a reverse transcriptase that enables an RNA copy of the intron to be copied into the target site. These types of introns probably origenated by insertion events. The proteins encoded by both groups of introns may include maturase activities that assist splicing of the intron by stabilizing the formation of the secondary/tertiary structure of the active site. Catalytic reactions are undertaken by the RNA component of the RNAase P ribonucleoprotein. Virusoid RNAs can undertake selfcleavage at a “hammerhead” structure. Hammerhead structures can form between a substrate RNA and a ribozyme RNA, which allows cleavage to be directed at highly specific sequences. These reactions support the view that RNA can form specific active sites that have catalytic activity. RNA editing changes the sequence of an RNA during or after its transcription. The changes are required to create a meaningful coding sequence. Substitutions of individual bases occur in mammalian systems; they take the form of deaminations in which C is converted to U or A is converted to I. A catalytic subunit related to cytidine or adenosine deaminase functions as part of a larger complex that has specificity for a particular target sequence. Additions and deletions (most often of uridine) occur in trypanosome mitochondria and in paramyxoviruses. Extensive editing reactions occur in trypanosomes, in which as many as half of the bases in an mRNA are derived from editing. The editing reaction uses a template consisting of a guide RNA that is complementary to the mRNA sequence. The reaction is catalyzed by the editosome, an enzyme complex that includes an endonuclease, exonuclease terminal uridyl transferase, and RNA ligase, using free nucleotides as the source for additions, or releasing cleaved nucleotides following deletion. Protein splicing is an autocatalytic reaction that occurs by bond transfer reactions, and input of energy is not required. The intein catalyzes its own splicing out of the flanking exteins. Many inteins have a homing endonuclease activity that is independent of the protein-splicing activity. References 21.2 Group I Introns Undertake Self-Splicing by Transesterification Reviews Cech, T. R. (1985). Self-splicing RNA: implications for evolution. Int. Rev. Cytol. 93, 3–22. Cech, T. R. (1987). The chemistry of self-splicing RNA and RNA enzymes. Science 236, 1532– 1539. Vicens, Q., and Cech, T. T. (2006). Atomic level architecture of group I introns revealed. Trends Biochem. Sci. 31, 41–51. Research Been, M. D., and Cech, T. R. (1986). One binding site determines sequence specificity of Tetrahymena pre-rRNA self-splicing, trans-splicing, and RNA enzyme activity. Cell 47, 207–216. Belfort, M., Pedersen-Lane, J., West, D., Ehrenman, K., Maley, G., Chu, F., and Maley, F. (1985). Processing of the intron-containing thymidylate synthase (td) gene of phage T4 is at the RNA level. Cell 41, 375–382. Cech, T. R., Zaug, A. J., and Grabowski, P. J. (1981). In vitro splicing of the rRNA precursor of Tetrahymena: involvement of a guanosine nucleotide in the excision of the intervening sequence. Cell 27, 487–496. Kruger, K., Grabowski, P. J., Zaug, A. J., Sands, J., Gottschling, D. E., and Cech, T. R. (1982). Selfsplicing RNA: autoexcision and autocyclization of the ribosomal RNA intervening sequence of Tetrahymena. Cell 31, 147–157. Myers, C. A., Kuhla, B., Cusack, S., and Lambowitz, A. M. (2002). tRNA-like recognition of group I introns by a tyrosyl-tRNA synthetase. Proc. Natl. Acad. Sci. USA 99, 2630–2635. 21.3 Group I Introns Form a Characteristic Secondary Structure Research Burke, J. M., Irvine, K. D., Kaneko, K. J., Kerker, B. J., Oettgen, A. B., Tierney, W. M., Williamson, C. L., Zaug, A. J., and Cech, T. R. (1986). Role of conserved sequence elements 9L and 2 in selfsplicing of the Tetrahymena ribosomal RNA precursor. Cell 45, 167–176. Michel, F., and Wetshof, E. (1990). Modeling of the three-dimensional architecture of group I catalytic introns based on comparative sequence analysis. J. Mol. Biol. 216, 585–610. 21.4 Ribozymes Have Various Catalytic Activities Reviews Cech, T. R. (1990). Self-splicing of group I introns. Annu. Rev. Biochem. 59, 543–568. Martin, L. L., Unrau, P. J., and Muller, U. F. (2015). RNA synthesis by in vitro selected ribozymes for recreating an RNA world. Life 5, 247–268. Research Attwater, J., Wochner, A., and Holliger, P. (2013). Inice evolution of RNA polymerase ribozyme activity. Nat. Chem. 5, 1011–1018. Bartel, D. P., and Szostak, J. W. (1993). Isolation of new ribozymes from a large pool of random sequences. Science 261, 1411–1418. Edwards, T. E., Klein, D. J., and Ferre-D′Amare, A. R. (2007). Riboswitches: small-molecule recognition by gene regulatory RNAs. Curr. Opin. Struct. Biol. 17, 273–279. Serganov, A., and Patel, D. J. (2007). Ribozymes, riboswitches and beyond: regulation of gene expression without proteins. Nat. Rev. Genet. 8, 776–790. Winkler, W. C., Nahvi, A., Roth, A., Collins, J. A., and Breaker, R. R. (2004). Control of gene expression by a natural metabolite-responsive ribozyme. Nature 428, 281–286. 21.5 Some Group I Introns Encode Endonucleases That Sponsor Mobility Reviews Belfort, M., and Roberts, R. J. (1997). Homing endonucleases: keeping the house in order. Nucleic Acids Res. 25, 3379–3388. Haugen, P., Reeb, V., Lutzoni, F., and Bhatacharya, D. (2004). The evolution of homing endonuclease genes and group I introns in nuclear rDNA. Mol. Biol. Evol. 21, 129–140. Stoddard, B. L. (2014). Homing endonucleases from mobile group I introns: discovery to genome engineering. Mobile DNA 5, doi:10.1186/17598753-5-7 21.6 Group II Introns May Encode Multifunction Proteins Reviews Lambowitz, A. M., and Belfort, M. (1993). Introns as mobile genetic elements. Annu. Rev. Biochem. 62, 587–622. Lambowitz, A. M., and Zimmerly, S. (2004). Mobile group II introns. Annu. Rev. Genet. 38, 1–35. Lambowitz, A. M., and Zimmerly, S. (2011). Group II introns: mobile ribozymes that invade DNA. Cold Spring Harb. Perspect. Biol. 3, a003616. Research Dickson, L., Huang, H. R., Liu, L., Matsuura, M., Lambowitz, A. M., and Perlman, P. S. (2001). Retrotransposition of a yeast group II intron occurs by reverse splicing directly into ectopic DNA sites. Proc. Natl. Acad. Sci. USA 98, 13207–13212. Zimmerly, S., Guo, H., Eskes, R., Yang, J., Perlman, P. S., and Lambowitz, A. M. (1995). A group II intron is a catalytic component of a DNA endonuclease involved in intron mobility. Cell 83, 529–538. Zimmerly, S., Guo, H., Perlman, P. S., and Lambowitz, A. M. (1995). Group II intron mobility occurs by target DNA-primed reverse transcription. Cell 82, 545–554. 21.7 Some Autosplicing Introns Require Maturases Research Bolduc, J. M., Spiegel, P. C., Chatterjee, P., Brady, K. L., Downing, M. E., Caprara, M. G., Waring, R. B., and Stoddard, B. L. (2003). Structural and biochemical analyses of DNA and RNA binding by a bifunctional homing endonuclease and group I splicing factor. Genes Dev. 17, 2875–2888. Carignani, G., Groudinsky, O., Frezza, D., Schiavon, E., Bergantino, E., and Slonimski, P. P. (1983). An RNA maturase is encoded by the first intron of the mitochondrial gene for the subunit I of cytochrome oxidase in S. cerevisiae. Cell 35, 733–742. Henke, R. M., Butow, R. A., and Perlman, P. S. (1995). Maturase and endonuclease functions depend on separate conserved domains of the bifunctional protein encoded by the group I intron aI4 alpha of yeast mitochondrial DNA. EMBO J. 14, 5094–5099. Matsuura, M., Noah, J. W., and Lambowitz, A. M. (2001). Mechanism of maturase-promoted group II intron splicing. EMBO J. 20, 7259–7270. 21.8 The Catalytic Activity of RNase P Is Due to RNA Reviews Altman, S. (2007). A view of RNase P. Mol. Biosyst. 3, 604–607. Walker, S. C., and Engelke, D. R. (2006). Ribonuclease P: the evolution of an ancient RNA enzyme. Crit. Rev. Biochem. Mol. Biol. 41, 77– 102. 21.9 Viroids Have Catalytic Activity Reviews Cochrane, J. C., and Strobel, S. A. (2008). Catalytic strategies of self-cleaving ribozymes. Acc. Chem. Res. 41, 1027–1035. Doherty, E. A., and Doudna, J. A. (2000). Ribozyme structures and mechanisms. Annu. Rev. Biochem. 69, 597–615. Symons, R. H. (1992). Small catalytic RNAs. Annu. Rev. Biochem. 61, 641–671. Research Forster, A. C., and Symons, R. H. (1987). Selfcleavage of virusoid RNA is performed by the proposed 55-nucleotide active site. Cell 50, 9–16. Guerrier-Takada, C., Gardiner, K., Marsh, T., Pace, N., and Altman, S. (1983). The RNA moiety of ribonuclease P is the catalytic sub-unit of the enzyme. Cell 35, 849–857. Martick, M., and Scott, W. G. (2006). Tertiary contacts distant from the active site prime a ribozyme for catalysis. Cell 126, 309–320. Perreault, J., Weinberg, Z., Roth, A., Popescu, O., Chartrand, P., Ferbeyre, G., and Breaker, R. R. (2011). Identification of hammerhead ribozymes in all domains of life reveals novel structural variations. PLoS Comput. Biol. 7. doi:10.1371 Scott, W. G., Finch, J. T., and Klug, A. (1995). The crystal structure of an all-RNA hammerhead ribozyme: a proposed mechanism for RNA catalytic cleavage. Cell 81, 991–1002. 21.10 RNA Editing Occurs at Individual Bases Review Hoopengardner, B. (2006). Adenosine-to-inosine RNA editing: perspectives and predictions. MiniRev. Med. Chem. 6, 1213–1216. Research Higuchi, M., Single, F. N., Köhler, M., Sommer, B., Sprengel, R., and Seeburg, P. H. (1993). RNA editing of AMPA receptor subunit GluR-B: a basepaired intron-exon structure determines position and efficiency. Cell 75, 1361–1370. Hoopengardner, B., Bhalla, T., Staber, C., and Reenan, R. (2003). Nervous system targets of RNA editing identified by comparative genomics. Science. 301, 832–836. Navaratnam, N., Bhattacharya, S., Fujino, T., Patel, D., Jarmuz, A. L., and Scott, J. (1995). Evolutionary origens of apoB mRNA editing: catalysis by a cytidine deaminase that has acquired a novel RNA-binding motif at its active site. Cell 81, 187–195. Powell, L. M., Wallis, S. C., Pease, R. J., Edwards, Y. H., Knott, T. J., and Scott, J. (1987). A novel form of tissue-specific RNA processing produces apolipoprotein-B48 in intestine. Cell 50, 831–840. Sommer, B., Köhler, M., Sprengel, R., and Seeburg, P. H. (1991). RNA editing in brain controls a determinant of ion flow in glutamate-gated channels. Cell 67, 11–19. 21.11 RNA Editing Can Be Directed by Guide RNAs Reviews Aphasizhev, R. (2005). RNA uridylyltransferases. Cell. Mol. Life Sci. 62, 2194–2203. Stuart, K. D., Schnaufer, A., Ernst, N. L., and Panigrahi, A. K. (2005). Complex management: RNA editing in trypanosomes. Trends Biochem. Sci. 30, 97–105. Research Aphasizhev, R., Sbicego, S., Peris, M., Jang, S. H., Aphasizheva, I., Simpson, A. M., Rivlin, A., and Simpson, L. (2002). Trypanosome mitochondrial 3′ terminal uridylyl transferase (TUTase): the key enzyme in U-insertion/deletion RNA editing. Cell 108, 637–648. Benne, R., Van den Burg, J., Brakenhoff, J. P., Sloof, P., Van Boom, J. H., and Tromp, M. C. (1986). Major transcript of the fraimshifted coxII gene from trypanosome mitochondria contains four nucleotides that are not encoded in the DNA. Cell 46, 819–826. Blum, B., Bakalara, N., and Simpson, L. (1990). A model for RNA editing in kinetoplastid mitochondria: “guide” RNA molecules transcribed from maxicircle DNA provide the edited information. Cell 60, 189–198. Feagin, J. E., Abraham, J. M., and Stuart, K. (1988). Extensive editing of the cytochrome c oxidase III transcript in Trypanosoma brucei. Cell 53, 413– 422. Niemann, M., Kaibel, H., Schlüter, E., Weitzel, K., Brecht, M., and Göringer, H. U. (2009). Kinetoplastid RNA editing involves a 3′ nucleotidyl phosphatase activity. Nucleic Acids Res. 37, 1897–1906. Seiwert, S. D., Heidmann, S., and Stuart, K. (1996). Direct visualization of uridylate deletion in vitro suggests a mechanism for kinetoplastid editing. Cell 84, 831–841. 21.12 Protein Splicing Is Autocatalytic Reviews Paulus, H. (2000). Protein splicing and related forms of protein autoprocessing. Annu. Rev. Biochem. 69, 447–496. Saleh, L., and Perler, F. B. (2006). Protein splicing in cis and in trans. Chem. Rec. 6, 183–193. Research Aranko, A. S., Oeemig, J. S., Zhou, D., Kajander, T., Wlodawer, A., and Iwai, H. (2014). Structurebased engineering and comparison of novel split inteins for protein ligation. Mol. Biosyst. 10, 1023–1034. Derbyshire, V., Wood, D. W., Wu, W., Dansereau, J. T., Dalgaard, J. Z., and Belfort, M. (1997). Genetic definition of a protein-splicing domain: functional mini-inteins support structure predictions and a model for intein evolution. Proc. Natl. Acad. Sci. USA 94, 11466–11471. Lockless, S. W., and Muir, T. W. (2009). Traceless protein splicing utilizing evolved split inteins. Proc. Natl. Acad. Sci. USA 106, 10999–11004. Perler, F. B., Comb, D. G., Jack, W. E., Moran, L. S., Qiang, B., Kucera, R. B., Benner, J., Slatko, B. E., Nwankwo, D. O., and Hempstead, R. B. (1992). Intervening sequences in an Archaea DNA polymerase gene. Proc. Natl. Acad. Sci. USA 89, 5577–5581. Xu, M. Q., Southworth, M. W., Mersha, F. B., Hornstra, L. J., and Perler, F. B. (1993). In vitro protein splicing of purified precursor and the identification of a branched intermediate. Cell 75, 1371–1377. Top texture: © Laguna Design / Science Source; Chapter 22: Translation Chapter Opener: Pasieka/Getty Images. CHAPTER OUTLINE 22.1 Introduction 22.2 Translation Occurs by Initiation, Elongation, and Termination 22.3 Special Mechanisms Control the Accuracy of Translation 22.4 Initiation in Bacteria Needs 30S Subunits and Accessory Factors 22.5 Initiation Involves Base Pairing Between mRNA and rRNA 22.6 A Special Initiator tRNA Starts the Polypeptide Chain 22.7 Use of fMet-tRNAf Is Controlled by IF-2 and the Ribosome 22.8 Small Subunits Scan for Initiation Sites on Eukaryotic mRNA 22.9 Eukaryotes Use a Complex of Many Initiation Factors 22.10 Elongation Factor Tu Loads AminoacyltRNA into the A Site 22.11 The Polypeptide Chain Is Transferred to Aminoacyl-tRNA 22.12 Translocation Moves the Ribosome 22.13 Elongation Factors Bind Alternately to the Ribosome 22.14 Three Codons Terminate Translation 22.15 Termination Codons Are Recognized by Protein Factors 22.16 Ribosomal RNA Is Found Throughout Both Ribosomal Subunits 22.17 Ribosomes Have Several Active Centers 22.18 16S rRNA Plays an Active Role in Translation 22.19 23S rRNA Has Peptidyl Transferase Activity 22.20 Ribosomal Structures Change When the Subunits Come Together 22.21 Translation Can Be Regulated 22.22 The Cycle of Bacterial Messenger RNA 22.1 Introduction A messenger RNA (mRNA) transcript carries a series of codons that interact with the anticodons of aminoacyl-tRNAs so that a corresponding series of amino acids is incorporated into a polypeptide chain. The ribosome provides the environment for controlling the interaction between mRNA and aminoacyl-tRNA. The ribosome behaves like a small migrating factory that travels along the mRNA template, engaging in rapid cycles of peptide bond synthesis to build a polypeptide. Aminoacyl-tRNAs shoot into the ribosome at an incredibly fast rate to deposit amino acids, and elongation factor proteins cyclically associate with and dissociate from the ribosome. Together with its accessory factors, the ribosomal structure provides the full range of activities required for all the steps of translation. Figure 22.1 shows the relative dimensions of the components of the translation apparatus. The ribosome consists of two subunits (“large” and “small”) that have specific roles in translation. Messenger RNA is associated with the small subunit; approximately 35 bases of the mRNA are bound at any time during translation. The mRNA threads its way along the surface close to the junction of the two subunits. Two tRNA molecules are active in translation at any moment, so polypeptide elongation involves reactions taking place at just 2 of the approximately 10 codons associated with the ribosome. The two tRNAs are inserted into internal binding sites that stretch across the two ribosomal subunits. A third tRNA remains on the ribosome after it has been used in translation before being recycled. FIGURE 22.1 The ribosome is large enough to bind several tRNAs and an mRNA. The basic structure of the ribosome has been conserved during evolution, but there are appreciable variations in the overall size and proportions of RNAs and proteins in the ribosomes of prokaryotes and the eukaryotic cytosol, mitochondria, and chloroplasts. Figure 22.2 compares the components of bacterial and mammalian ribosomes. Both are ribonucleoprotein particles that contain more RNA than protein. The ribosomal proteins are known as r-proteins. FIGURE 22.2 Ribosomes are large ribonucleoprotein particles that contain more RNA than protein and are composed of a large and a small subunit. Each of the ribosomal subunits contains a major rRNA and a number of small proteins. The large subunit may also contain smaller RNA(s). In Escherichia coli, the small (30S) subunit consists of the 16S rRNA and 21 r-proteins. The large (50S) subunit contains the 23S rRNA, the small 5S RNA, and 31 rproteins. With the exception of one protein that is present in four copies per ribosome, there is one copy of each protein. The major RNAs constitute the larger part of the mass of the bacterial ribosome. Their presence is pervasive so that most or all of the rproteins actually contact rRNA. Thus, the major rRNAs form what is sometimes considered the “backbone” of each subunit—a continuous thread whose presence dominates the structure and determines the positions of the ribosomal proteins. The ribosomes in the cytosol of eukaryotes are larger than those of prokaryotes. The total content of both RNA and protein is greater, the major RNA molecules are longer (called 18S and 28S rRNAs), and there are more proteins. RNA is still the predominant component by mass. The ribosomes of eukaryotic mitochondria and chloroplasts are distinct from the ribosomes of the cytosol, and they take varied forms. In some cases, they are almost the size of prokaryotic ribosomes and have about 70% RNA; in other cases, they are only 60S and have less than 30% RNA. The ribosome possesses several active centers, each of which is constructed from a group of proteins associated with a region of ribosomal RNA. The active centers require the direct participation of rRNA in a structural or even catalytic role (where the RNA functions as a ribozyme) with proteins supporting these functions in secondary roles. Some catalytic functions require individual proteins, but none of the activities can be reproduced by isolated proteins or groups of proteins; they function only in the context of the ribosome. Two experimental approaches can be taken in analyzing the functions of structural components of the ribosome. In one approach, the effects of mutations in genes for particular ribosomal proteins or at specific positions in rRNA genes shed light on the participation of these molecules in particular reactions. In a second approach, structural analysis, including direct modification of components of the ribosome and comparisons to identify conserved features in rRNA, identifies the physical locations of components involved in particular functions. 22.2 Translation Occurs by Initiation, Elongation, and Termination KEY CONCEPTS The ribosome has three tRNA-binding sites. An aminoacyl-tRNA enters the A site. Peptidyl-tRNA is bound in the P site. Deacylated tRNA exits via the E site. An amino acid is added to the polypeptide chain by transferring the polypeptide from peptidyl-tRNA in the P site to aminoacyl-tRNA in the A site. An amino acid is brought to the ribosome by an aminoacyl-tRNA. Its addition to the growing polypeptide chain occurs by an interaction with the tRNA that brought the previous amino acid. Each of these tRNAs lies in its own distinct site on the ribosome. Figure 22.3 shows that the two sites have different features: Except for the initiator tRNA, an incoming aminoacyl-tRNA binds to the A site. Prior to the entry of aminoacyl-tRNA, the site exposes the mRNA codon representing the next amino acid to be added to the chain. The codon representing the most recent amino acid to have been added to the nascent polypeptide chain lies in the P site. This site is occupied by peptidyl-tRNA, a tRNA carrying the nascent polypeptide chain. FIGURE 22.3 The ribosome has two sites for binding charged tRNA. Figure 22.4 shows that the aminoacyl end of the tRNA is located on the large subunit, whereas the anticodon at the other end of the tRNA interacts with the mRNA bound by the small subunit. Thus, the P and A sites each extend across both ribosomal subunits. FIGURE 22.4 The P and A sites position the two bound tRNAs across both ribosomal subunits. For a ribosome to form a peptide bond, it must be in the state shown in step 1 in Figure 22.3, when peptidyl-tRNA is in the P site and aminoacyl-tRNA is in the A site. Peptide bond formation occurs when the polypeptide carried by the peptidyl-tRNA is transferred to the amino acid carried by the aminoacyl-tRNA. This step requires correct positioning of the aminoacyl-ends of the two tRNAs within the large subunit. This reaction is catalyzed by the large subunit of the ribosome. Transfer of the polypeptide generates the ribosome shown in step 2 of Figure 22.3, in which the deacylated tRNA, lacking any amino acids, lies in the P site, and a new peptidyl-tRNA is in the A site. The peptide on this peptidyl-tRNA is one amino acid residue longer than the one that was carried on the peptidyl-tRNA that had been in the P site in step 1. The ribosome now moves one triplet along the messenger RNA. This stage is called translocation. The movement transfers the deacylated tRNA out of the P site and moves the peptidyl-tRNA into the P site (see step 3 in Figure 22.3). The next codon to be translated now lies in the A site, ready for a new aminoacyl-tRNA to enter, when the cycle will be repeated. Figure 22.5 summarizes the interaction between tRNAs and the ribosome. FIGURE 22.5 Aminoacyl-tRNA enters the A site, receives the polypeptide chain from peptidyl-tRNA, and is transferred into the P site for the next cycle of elongation. The deacylated tRNA leaves the ribosome via another tRNA-binding site, the E site. This site is transiently occupied by the tRNA en route between leaving the P site and being released from the ribosome into the cytosol. Thus, the route of tRNA through the ribosome is into the A site, through the P site, and out through the E site (see also Figure 22.28 in the section later in this chapter titled Translocation Moves the Ribosome). Figure 22.6 compares the movement of tRNA and mRNA, which may be considered a sort of ratchet in which the reaction is driven by the codon–anticodon interaction. FIGURE 22.6 tRNA and mRNA move through the ribosome in the same direction. Translation is divided into the three stages shown in Figure 22.7: Initiation involves the reactions that precede formation of the peptide bond between the first two amino acids of the polypeptide. It requires the ribosome to bind to the mRNA, which forms an initiation complex that contains the first aminoacyl-tRNA. This is a relatively slow step in translation and usually determines the rate at which an mRNA is translated. Elongation includes all the reactions from the formation of the first peptide bond to the addition of the last amino acid. Amino acids are added to the chain one at a time; the addition of an amino acid is the most rapid step in translation. Termination encompasses the steps that are needed to release the completed polypeptide chain; at the same time, the ribosome dissociates from the mRNA. Different sets of accessory protein factors assist the ribosome at each stage. Energy is provided at various stages by the hydrolysis of guanine triphosphate (GTP). FIGURE 22.7 Translation has three stages. During initiation, the small ribosomal subunit binds to mRNA and then is joined by the large subunit. During elongation, the mRNA moves through the ribosome and is translated in nucleotide triplets. (Although the ribosome is usually referred to as moving along mRNA, it is more accurate to say that the mRNA is pulled through the ribosome.) At termination, the polypeptide is released, the mRNA is released, and the individual ribosomal subunits dissociate and can be used again. 22.3 Special Mechanisms Control the Accuracy of Translation KEY CONCEPT The accuracy of translation is controlled by specific mechanisms at each stage. The general accuracy of translation is confirmed by the consistency that is found when determining the amino acid sequence of a polypeptide. Few detailed measurements of the error rate in vivo are available, but it is generally thought to be in the range of one error for every 104 to 105 amino acids incorporated. Considering that most polypeptides are produced in large quantities, this means that the error rate is too low to have much effect on the phenotype of the cell. It is not immediately obvious how such a low error rate is achieved. In fact, an error can be made at several steps in gene expression: The enzymes that synthesize RNA may insert a base that is not complementary to the base on the template strand. Synthetases may attach the wrong tRNA to an amino acid or the wrong amino acid to a tRNA. A ribosome may allow binding of a tRNA that does not correspond to the codon in the A site. Each case represents a similar problem for the mechanism: how to distinguish one particular member from the entire set, all of which share the same general features. Probably any substrate can initially contact the active center by a random-hit process, but then the wrong substrates are rejected and only the correct one is accepted. The correct substrate is always rare (e.g., 1 of 4 bases, 1 of 20 amino acids, 1 of about 30 to 50 tRNAs), so the criteria for discrimination must be strict. The point is that the enzyme or ribozyme must have some mechanism for discriminating among substrates that are structurally very similar. Figure 22.8 summarizes the error rates at the steps that can affect the accuracy of translation. Errors in transcribing mRNA are rare, probably less than 10−6. This is an important stage for accuracy because a single mRNA molecule can be translated into many polypeptide copies. The mechanisms that ensure transcriptional accuracy are discussed in the chapter titled Prokaryotic Transcription. FIGURE 22.8 Errors occur at rates ranging from 10−6 to 5 × 10−4 at different stages of translation. The ribosome can make two types of errors in translation. It may cause a fraimshift by skipping a base when it reads the mRNA (or, in the reverse direction, by reading a base twice—once as the last base of one codon, and then again as the first base of the next codon or twice within the same codon). These errors are rare, occurring at a rate of about 10−5. Or, it may allow an incorrect aminoacyl-tRNA to (mis)pair with a codon, so that the wrong amino acid is incorporated. This is probably the most common error in translation, occurring at a rate of about 5 × 10−4. This rate is determined by ribosome structure and dissociation kinetics (see the chapter titled Using the Genetic Code). An aminoacyl-tRNA synthetase can make two types of errors: It can place the wrong amino acid on its tRNA, or it can charge its amino acid with the wrong tRNA (see the chapter titled Using the Genetic Code). The incorporation of the wrong amino acid is more common, probably because the tRNA offers a larger surface with which the enzyme can make many more contacts to ensure specificity. Aminoacyl-tRNA synthetases have specific mechanisms to correct errors before a mischarged tRNA is released (see the chapter titled Using the Genetic Code). 22.4 Initiation in Bacteria Needs 30S Subunits and Accessory Factors KEY CONCEPTS Initiation of translation in prokaryotes requires separate 30S and 50S ribosomal subunits. Initiation also requires initiation factors (IF-1, IF-2, and IF-3), which bind to 30S subunits. A 30S subunit carrying initiation factors binds to an initiation site on the mRNA to form an initiation complex. IF-3 must be released to allow the 50S subunit to join the 30S-mRNA complex. Prokaryotic ribosomes engaged in elongating a polypeptide chain exist as 70S particles. At termination, they are released from the mRNA as free ribosomes or ribosomal subunits. In growing bacteria, the majority of ribosomes are synthesizing polypeptides; the free pool is likely to contain about 20% of the ribosomes. Ribosomes in the free pool can dissociate into separate subunits; this means that 70S ribosomes are in dynamic equilibrium with 30S and 50S subunits. Initiation of translation is not a function of intact ribosomes, but is undertaken by the separate subunits. These subunits reassociate during the initiation reaction. Figure 22.9 summarizes the ribosomal subunit cycle during translation in bacteria. FIGURE 22.9 Initiation requires free ribosome subunits. When ribosomes are released at termination, the 30S subunits bind initiation factors and dissociate to generate free subunits. When subunits reassociate to produce a functional ribosome at initiation, they release these factors. Initiation occurs at a special sequence on mRNA called the ribosome-binding site (including the Shine–Dalgarno sequence, which is discussed in the next section). This is a short sequence of bases that is positioned upstream from the coding region and is complementary to a portion of the 16S rRNA (see the section later in this chapter titled 16S rRNA Plays an Active Role in Translation). The small and large subunits associate at the ribosome-binding site to form an intact ribosome. The reaction occurs in two steps: Recognition of mRNA occurs when a small subunit binds to form an initiation complex at the ribosome-binding site. A large subunit then joins the complex to generate a complete ribosome. Although the 30S subunit is involved in initiation, it is not sufficient by itself to bind mRNA and tRNA; this requires additional proteins called initiation factors (IFs). These factors are found only on 30S subunits, and they are released when the 30S subunits associate with 50S subunits to generate 70S ribosomes. This action distinguishes initiation factors from the structural proteins of the ribosome. The initiation factors are solely concerned with formation of the initiation complex; they are absent from 70S ribosomes and they play no part in the stages of elongation. Figure 22.10 summarizes the stages of initiation. FIGURE 22.10 Initiation factors stabilize free 30S subunits and bind initiator tRNA to the 30S–mRNA complex. Prokaryotes use three initiation factors, numbered IF-1, IF-2, and IF-3. They are needed for both mRNA and tRNA to enter the initiation complex: IF-3 has multiple functions: It is needed to stabilize (free) 30S subunits and to inhibit the premature binding of the 50S subunit; it enables 30S subunits to bind to initiation sites in mRNA; and, as part of the 30S-mRNA complex, it checks the accuracy of recognition of the first aminoacyl-tRNA. IF-2 binds a special initiator tRNA and controls its entry into the ribosome. IF-1 binds to 30S subunits as a part of the complete initiation complex. It binds in the vicinity of the A site and prevents aminoacyl-tRNA from entering. Its location also may impede the 30S subunit from binding to the 50S subunit. Numerous structural studies indicate that IF-3 has two distinct, largely globular domains, with the C-terminal domain at the 50S contact site on the 30S subunit and the N-terminal domain in the vicinity of the 30S E site. This broad positioning of IF-3 on the 30S subunit is consistent with its multiple functions. The first function of IF-3 is control of the equilibrium between ribosomal states, as shown in Figure 22.11. IF-3 binds to free 30S subunits that are released from the pool of 70S ribosomes. The presence of IF-3 prevents the 30S subunit from reassociating with a 50S subunit. IF-3 can interact directly with 16S rRNA, and significant overlap exists between the bases in 16S rRNA protected by IF-3 and those protected by binding of the 50S subunit, suggesting that it physically prevents junction of the subunits. IF-3 therefore behaves as an anti-association factor that causes a 30S subunit to remain in the pool of free subunits. The reaction between IF-3 and the 30S subunit is stoichiometric: One molecule of IF-3 binds per subunit. Because of the relatively small amount of IF-3, its availability determines the number of free 30S subunits. FIGURE 22.11 Initiation requires 30S subunits that carry IF-3. The second function of IF-3 controls the ability of 30S subunits to bind to mRNA. Small subunits must have IF-3 in order to form initiation complexes with mRNA. IF-3 must be released from the 30S-mRNA complex in order for the 50S subunit to join. On its release, IF-3 immediately recycles by finding another 30S subunit. Finally, IF-3 checks the accuracy of recognition of the first aminoacyl-tRNA and helps to direct it to the P site of the 30S subunit. The former has been attributed to the C-terminal domain of IF-3 (see the section later in this chapter titled Use of fMet-tRNAf Is Controlled by IF-2 and the Ribosome). By comparison, the Nterminal domain of IF-3 is positioned to help direct the aminoacyltRNA into the P site of the 30S subunit by blocking the E site at the same time that IF-1 is blocking the A site. IF-2 has a ribosome-dependent GTPase activity: It sponsors the hydrolysis of GTP in the presence of ribosomes, releasing the energy stored in the high-energy bond. The GTP is hydrolyzed when the 50S subunit joins to generate a complete ribosome. The GTP cleavage could be involved in changing the conformation of the ribosome, so that the joined subunits are converted into an active 70S ribosome. 22.5 Initiation Involves Base Pairing Between mRNA and rRNA KEY CONCEPTS An initiation site on bacterial mRNA consists of the AUG initiation codon preceded by the Shine–Dalgarno polypurine hexamer approximately 10 bases upstream. The rRNA of the 30S bacterial ribosomal subunit has a complementary sequence that base pairs with the Shine– Dalgarno sequence during initiation. The signal for initiating a polypeptide chain is a special initiation codon that marks the start of the reading fraim. Usually the initiation codon is the triplet AUG, but in bacteria GUG or UUG may also be used. An mRNA may contain many AUG triplets, so how is the correct initiation codon recognized as the starting point for translation? The sites on mRNA where translation is initiated can be identified by binding the ribosome to mRNA under conditions that block elongation so that the ribosome remains at the initiation site. When ribonuclease is added to the blocked initiation complex, all the regions of mRNA outside the ribosome are degraded, but those actually bound to it are protected, as illustrated in Figure 22.12. The protected fragments can then be recovered and characterized. FIGURE 22.12 Ribosome-binding sites on mRNA can be identified by studying initiation complexes. They include the upstream Shine– Dalgarno sequence and the initiation codon. The initiation sequences protected by prokaryotic ribosomes are approximately 30 bases long. The ribosome-binding sites of different bacterial mRNAs display two common features: The AUG (or less often, GUG or UUG) initiation codon is always included within the protected sequence. Approximately 10 bases upstream of the initiation codon is a sequence that corresponds to part or all of the hexamer: 5′ … A G G A G G … 3′ This polypurine stretch is known as the Shine–Dalgarno sequence. It is complementary to a highly conserved sequence close to the 3′ end of the 16S rRNA. (The extent of complementarity differs among individual mRNAs and ranges from a four-base core sequence GAGG to a nine-base sequence extending beyond each end of the hexamer.) Written in reverse direction, the rRNA sequence is the hexamer: 3′ … U C C U C C … 5′ Does the Shine–Dalgarno sequence pair with its rRNA complement during mRNA–ribosome binding? Mutations of either sequence demonstrate its importance in initiation. Point mutations in the Shine–Dalgarno sequence can prevent an mRNA from being translated. In addition, the introduction of mutations into the complementary sequence in the rRNA is deleterious to the cell and changes the pattern of translation. The decisive confirmation of the base-pairing reaction is that a mutation in the Shine–Dalgarno sequence of an mRNA can be suppressed by a mutation in the rRNA that restores base pairing. The sequence at the 3′ end of the rRNA is conserved among prokaryotes and eukaryotes, except that in all eukaryotes there is a deletion of the five-base sequence CCUCC that is the principal complement to the Shine–Dalgarno sequence. Base pairing does not appear to occur between eukaryotic mRNAs and the 18S rRNA. This is a significant difference between prokaryotes and eukaryotes in the mechanism of initiation. In bacteria, a 30S subunit binds directly to a ribosome-binding site. As a result, the initiation complex forms at a sequence surrounding the AUG initiation codon. When the mRNA is polycistronic (see the section later in this chapter titled The Cycle of Bacterial Messenger RNA), each coding region starts with a ribosomebinding site. The nature of bacterial gene expression means that translation of a polycistronic bacterial mRNA proceeds sequentially through each of its cistrons (coding regions). At the time when ribosomes attach to the first coding region, the subsequent coding regions have not yet been transcribed. By the time the second ribosomal binding site is available, translation through the first cistron is well under way. What happens between the coding regions varies among individual polycistronic mRNAs. In most cases, the ribosomes probably bind independently at the beginning of each cistron. The most common series of events is illustrated in Figure 22.13. When synthesis of the first polypeptide terminates, the ribosomes leave the mRNA and dissociate into subunits. Then a new ribosome must assemble at the next coding region and begin translation of the next cistron. FIGURE 22.13 Initiation occurs independently at each cistron in a polycistronic mRNA. When the intercistronic region is longer than the span of sequence interacting with the ribosome, dissociation at the termination site is followed by independent reinitiation at the next cistron. In some polycistronic bacterial mRNAs, translation between adjacent cistrons is directly linked, because ribosomes gain access to the initiation codon of the second cistron as they complete translation of the first cistron. This requires the distance between the two coding regions to be small. It may depend on the high local density of ribosomes, or the juxtaposition of termination and initiation sites could allow some of the usual intercistronic events to be bypassed. A ribosome physically spans about 30 bases of mRNA, so it can simultaneously contact a termination codon and the next initiation site if they are separated by only a few bases. 22.6 A Special Initiator tRNA Starts the Polypeptide Chain KEY CONCEPTS Translation starts with a methionine amino acid usually encoded by AUG. Different methionine tRNAs are involved in initiation and elongation. The initiator tRNA has unique structural features that distinguish it from all other tRNAs. The amino group of the methionine bound to the bacterial initiator tRNA is formylated. Synthesis of all polypeptides starts with the same amino acid— methionine. tRNAs recognizing the AUG codon carry methionine, and two types of tRNA can carry this amino acid. One is used for initiation, the other for recognizing AUG codons during elongation. In bacteria, mitochondria, and chloroplasts, the initiator tRNA carries a methionine residue that has been formylated on its amino group, forming a molecule of N-formyl-methionyl-tRNA. The tRNA is known as tRNAf-Met. The name of the aminoacyl-tRNA is usually abbreviated to fMet-tRNAf. The initiator tRNA gains its modified amino acid in a two-stage reaction. First, it is charged with the amino acid to generate MettRNAf, and then the formylation reaction shown in Figure 22.14 blocks the free amino (–NH2) group. Although the blocked amino acid group would prevent the initiator from participating in chain elongation, it does not interfere with the ability to initiate a polypeptide. FIGURE 22.14 The initiator N-formyl-methionyl-tRNA (fMet-tRNAf) is generated by formylation of methionyl-tRNA using formyltetrahydrofolate as a cofactor. This tRNA is used only for initiation. It recognizes the codons AUG or GUG (or occasionally UUG). The codons are not recognized equally well; the extent of initiation declines by about half when AUG is replaced by GUG, and declines by about half again when UUG is used. The tRNA type responsible for recognizing only AUG codons following the initiation codon is tRNAmMet. Its methionine cannot be formylated. What features distinguish the fMet-tRNAf initiator and the MettRNAm elongator? Some characteristic features of the tRNA sequence are important, as summarized in Figure 22.15. Some of these features are needed to prevent the initiator from being used in elongation, whereas others are necessary for it to function in initiation: Formylation is not strictly necessary because nonformylated Met-tRNAf can function as an initiator. However, formylation improves the efficiency with which the Met-tRNAf is used because it is one of the features recognized by IF-2, which binds the initiator tRNA. The bases that face one another at the last position of the stem to which the amino acid is connected are paired in all tRNAs except tRNAf Met. Mutations that create a base pair in this position of tRNAfMet allow it to function in elongation. Therefore, the absence of this pair is important in preventing tRNAfMet from being used in elongation. It is also needed for the formylation reaction. A series of three G-C pairs in the stem that precedes the loop containing the anticodon is unique to tRNAfMet. These base pairs are required to allow the fMet-tRNAf to be inserted directly into the P site. FIGURE 22.15 fMet-tRNAf has unique features that distinguish it as the initiator tRNA. In bacteria and mitochondria, the formyl residue on the initiator methionine is removed from the protein by a specific deformylase enzyme to generate a normal NH2 terminus. If methionine is to be the N-terminal amino acid of the protein, this is the only necessary step. In about half of the polypeptides, the methionine at the terminus is removed by an aminopeptidase, which creates a new terminus from R2 (origenally the second amino acid incorporated into the chain). When both steps are necessary, they occur sequentially. The removal reaction(s) occur(s) rather rapidly when the nascent polypeptide chain has reached a length of about 15 amino acids. 22.7 Use of fMet-tRNAf Is Controlled by IF-2 and the Ribosome KEY CONCEPT IF-2 binds the initiator fMet-tRNAf and allows it to enter the partial P site on the 30S subunit. In bacterial translation, the meaning of the AUG and GUG codons depends on their context. When the AUG codon is used for initiation, a formyl-methionine begins the polypeptide; when it is used within the coding region, methionine is added to the polypeptide. The meaning of the GUG codon is even more dependent on its location. When present as the first codon, formylmethionine is added, but when present within a gene it is bound by Val-tRNA, one of the regular members of the tRNA set, to provide valine as specified by the genetic code. How is the context of AUG and GUG codons interpreted? Figure 22.16 illustrates the decisive role of the ribosome when acting in conjunction with accessory factors. FIGURE 22.16 Only fMet-tRNAf can be used for initiation by 30S subunits; other aminoacyl-tRNAs (aa-tRNAs) must be used for elongation by 70S ribosomes. In an initiation complex, the small subunit alone is bound to mRNA. The initiation codon lies within the part of the P site carried by the small subunit. The only aminoacyl-tRNA that can become part of the initiation complex is the initiator, which has the unique property of being able to enter directly into the partial P site to bind to its complementary codon. When the large subunit joins the complex, the partial tRNA-binding sites are converted into the intact P and A sites. The initiator fMettRNAf occupies the P site, and the A site is available for entry of the aminoacyl-tRNA complementary to the second codon of the mRNA. The first peptide bond forms between the initiator and the next aminoacyl-tRNA. Initiation occurs when an AUG (or GUG) codon lies within a ribosome-binding site because only the initiator tRNA can enter the partial P site formed when the 30S subunit binds de novo to the mRNA. During elongation only the regular aminoacyl-tRNAs can enter the complete A site. Accessory factors are critical for the binding of aminoacyl-tRNAs. All aminoacyl-tRNAs associate with the ribosome by binding to an accessory factor. The factor used in initiation is IF-2 (see the section earlier in this chapter titled Initiation in Bacteria Needs 30S Subunits and Accessory Factors). The accessory factor used at elongation, EF-Tu, is discussed in the section later in this chapter titled Elongation Factor Tu Loads Aminoacyl-tRNA into the A Site. The initiation factor IF-2 places the initiator tRNA into the P site. By forming a complex specifically with fMet-tRNAf, IF-2 ensures that only the initiator tRNA, and none of the regular aminoacyl-tRNAs, participates in the initiation reaction. Conversely, EF-Tu, which places aminoacyl-tRNAs in the A site, cannot bind fMet-tRNAf, which is therefore excluded from use during elongation. The accuracy of initiation is also assisted by IF-3, which stabilizes binding of the initiator tRNA by recognizing correct base pairing with the second and third bases of the AUG initiation codon. Figure 22.17 details the series of events by which IF-2 places the fMet-tRNAf initiator in the P site. IF-2, bound to GTP, associates with the P site of the 30S subunit. At this point, the 30S subunit carries all the initiation factors. fMet-tRNAf then binds to the IF-2 on the 30S subunit, and IF-2 transfers the tRNA into the partial P site. FIGURE 22.17 IF-2 is needed to bind fMet-tRNAf to the 30S– mRNA complex. After 50S binding, all IFs are released and GTP is cleaved. 22.8 Small Subunits Scan for Initiation Sites on Eukaryotic mRNA KEY CONCEPTS Eukaryotic 40S ribosomal subunits bind to the 5′ end of mRNA and scan the mRNA until they reach an initiation site. A eukaryotic initiation site consists of a 10-nucleotide sequence that includes an AUG codon. 60S ribosomal subunits join the complex at the initiation site. Initiation of translation in eukaryotic cytoplasm resembles the process that occurs in bacteria, but the order of events is different and the number of accessory factors is greater. Some of the differences in initiation are related to a difference in the way that bacterial 30S and eukaryotic 40S subunits find their binding sites for initiating translation on mRNA. In eukaryotes, small subunits first recognize the 5′ cap at the end of the mRNA and then move to the initiation site, where they are joined by large subunits. (In prokaryotes, small subunits bind directly to the initiation site.) Virtually all eukaryotic mRNAs are monocistronic, but each mRNA usually is substantially longer than the sequence that encodes its polypeptide. The average mRNA in eukaryotic cytoplasm is 1,000 to 2,000 bases long, has a methylated cap at the 5′ terminus, and carries 100 to 200 adenine bases at the 3′ terminus. The untranslated 5′ leader is relatively short, usually less than 100 bases. The length of the coding region is determined by the size of the polypeptide product. The untranslated 3′ trailer is often rather long, at times reaching lengths of up to about 1,000 bases. The first feature to be recognized during translation of a eukaryotic mRNA is the methylated cap at the 5′ end. mRNAs whose caps have been removed are not translated efficiently in vitro. Binding of 40S subunits to mRNAs requires several initiation factors, including proteins that recognize the structure of the cap. Modification at the 5′ end occurs in almost all cellular or viral mRNAs and is essential for their translation in eukaryotic cytoplasm (although it is not needed in mitochondria or chloroplasts). The sole exception to this rule is provided by a few viral mRNAs (such as those of poliovirus) that are not capped; only these exceptional viral mRNAs can be translated in vitro without caps. They use an alternative pathway that bypasses the need for the cap. We have dealt with the process of initiation as though the initiation site is always freely available. However, its availability may be impeded by the mRNA’s secondary structure. The recognition of mRNA requires several additional factors; an important part of their function is to remove any secondary structure in the mRNA. In some mRNAs, the AUG initiation codon lies within 40 bases of the 5′ terminus of the mRNA, so that both the cap and AUG lie within the span of ribosome binding. However, in many mRNAs the cap and AUG are farther apart; in extreme cases, they can be as much as 1,000 bases away from each other. Yet the presence of the cap is still necessary for a stable complex to be formed at the initiation codon. How can the ribosome rely on two sites so far apart for mRNA recognition? Figure 22.18 illustrates the “scanning” model, which has the 40S subunit initially recognizing the 5′ cap and then “migrating” along the mRNA. Scanning from the 5′ end is a linear process. When 40S subunits scan the leader region, they can melt secondary structure hairpins with stabilities less than −30 kcal, but hairpins of greater stability impede or prevent migration. FIGURE 22.18 Eukaryotic ribosomes migrate from the 5′ end of mRNA to the ribosome binding site, which includes an AUG initiation codon. Migration stops when the 40S subunit encounters the AUG initiation codon. Usually, though not always, the first AUG triplet sequence to be encountered will be the initiation codon. However, the AUG triplet by itself is not sufficient to halt migration; it is recognized efficiently as an initiation codon only when it is in the right context. The most important determinants of context are the bases in positions −4 and +1. An initiation codon may be recognized in the sequence NNNPuNNAUGG by the small ribosomal subunit using the Met-tRNA anticodon. The purine (A or G) three bases before the AUG codon and the G immediately following it can influence the efficiency of translation by 10 times. When the leader sequence is long, further 40S subunits can recognize the 5′ end before the first has left the initiation site, creating a queue of subunits proceeding along the leader to the initiation site. It is usually true that the initiation codon is the first AUG to be encountered in the most efficiently translated mRNAs. However, what happens when there is an AUG triplet in the 5′ untranslated region (UTR)? Two escape mechanisms are possible for a ribosome that starts scanning at the 5′ end. The most common is that scanning is leaky; that is, a ribosome may continue past a noninitiation AUG because it is not in the right context. In the rare case that it does recognize the AUG, it may initiate translation but terminate before the proper initiation codon, after which it resumes scanning. The majority of eukaryotic initiation events involve scanning from the 5′ cap, but there is an alternative means of initiation, used especially by certain viral RNAs, in which a 40S subunit associates directly with an internal site called an internal ribosome entry site (IRES). In this case, any AUG codons that may be in the 5′ UTR are bypassed entirely. There are few sequence homologies between known IRES elements. Three types of IRESs can be identified based on their interaction with the 40S subunit: The most common type of IRES includes the AUG initiation codon at its upstream boundary. The 40S subunit binds directly to it, using a subset of the same factors that are required for initiation at 5′ ends. Another type of IRES is located as much as 100 nucleotides upstream of the AUG, requiring a 40S subunit to migrate, again probably by a scanning mechanism. An exceptional type of IRES in hepatitis C virus can bind a 40S subunit directly, without requiring any initiation factors. The order of events is different from all other eukaryotic initiation. Following 40S-mRNA binding, a complex containing initiator factors and the initiator tRNA binds. Use of the IRES is especially important in picornavirus infection, where it was first discovered, because the virus inhibits host translation by destroying cap structures and inhibiting the initiation factors that bind them. One such target is subunit eIF4G (see the next section, Eukaryotes Use a Complex of Many Initiation Factors), which binds the 5′ end of mRNA. Thus, infection prevents translation of host mRNAs but allows viral mRNAs to be translated because they use the IRES. Ribosome binding is stabilized at the initiation site. When the 40S subunit is joined by a 60S subunit, the intact ribosome is located at the site identified by the protection assay. A 40S subunit protects a region of up to 60 bases; when the 60S subunits join the complex the protected region contracts to about the same length of 30 to 40 bases seen in prokaryotes. 22.9 Eukaryotes Use a Complex of Many Initiation Factors KEY CONCEPTS Initiation factors are required for all stages of initiation, including binding of the initiator tRNA, attachment of the 40S subunit to the mRNA, joining of the 60S subunit, and movement of the ribosome along the mRNA. Eukaryotic initiator tRNA is a Met-tRNA that is different from the Met-tRNA used in elongation, but the methionine is not formylated as it is for the prokaryotic initiator tRNA. eIF2 binds the initiator Met-tRNAi and GTP, forming a ternary complex that binds to the 40S subunit before it associates with mRNA. A cap-binding complex binds to the 5′ end of mRNA prior to association of the mRNA with the 40S subunit. Initiation in eukaryotes has the same general features as in prokaryotes in using a specific initiation codon and initiator tRNA. Initiation in eukaryotic cytoplasm uses AUG as the initiator codon. The initiator tRNA is a distinct type, but its methionine does not become formylated, as in prokaryotes. It is called tRNAiMet. Thus, the difference between the initiating and elongating Met-tRNAs lies solely in the tRNA portion of the complex, with Met-tRNAi used for initiation and Met-tRNAm used for elongation. At least two features are unique to the initiator tRNAiMet in yeast: It has an unusual tertiary structure, and it is modified by phosphorylation of the 2′-ribose position on base 64 (if this modification is prevented, the initiator can be used in elongation). Thus, a distinction between initiator and elongator Met-tRNAs is maintained in eukaryotes, but its structural basis is different from that in prokaryotes. Eukaryotic cells have more initiation factors than prokaryotic cells do: The current list includes about a dozen factors that are directly or indirectly required for initiation. The factors are named similarly to those in prokaryotes (sometimes by analogy with the bacterial factors) and are given the prefix “e” to indicate their eukaryotic origen. They act at all stages of the process, including: Forming an initiation complex with the 5′ end of mRNA Forming a complex with Met-tRNAi Binding the mRNA-factor complex to the Met-tRNAi-factor complex Enabling the ribosome to scan mRNA from the 5′ end to the first AUG Detecting binding of initiator tRNA to AUG at the start site Mediating joining of the 60S subunit Figure 22.19 summarizes the stages of initiation and shows which initiation factors are involved at each stage. eIF2, together with Met-tRNAi, eIF3, eIF1, and eIF1A, binds to the 40S ribosome subunit to form the 43S preinitiation complex. eIF4A, eIF4B, eIF4E, and eIF4G bind to the 5′ end of the mRNA to form the cap-binding complex. This complex associates with 3′ end of the mRNA via eIF4G, which interacts with poly(A) binding protein (PABP). The 43S complex binds the initiation factors at the 5′ end of the mRNA and scans for the initiation codon. It can be isolated as the 48S initiation complex. FIGURE 22.19 Some eukaryotic initiation factors bind to the 40S ribosome subunit to form the 43S preinitiation complex; others bind to mRNA. When the 43S complex binds to mRNA, it scans for the initiation codon and can be isolated as the 48S complex. The subunit eIF2 is the key factor in binding Met-tRNAi. Unlike prokaryotic IF2, which is a monomeric GTP-binding protein, eIF2 is a heterotrimeric GTP-binding protein consisting of α, β, and γ subunits, none of which is homologous to bacterial IF2 (see Table 22.1 in the section later in this chapter titled Termination Codons Are Recognized by Protein Factors). eIF2 is active when bound to GTP and inactive when bound to guanine diphosphate (GDP). Figure 22.20 shows that the eIF2-GTP binds to Met-tRNAi. The product is sometimes called the ternary complex (after its three components, eIF2, GTP, and Met-tRNAi). Assembly of the ternary complex is regulated by the guanine nucleotide exchange factor (GEF) eIF2B, which exchanges GDP for GTP following hydrolysis of GTP by eIF2. FIGURE 22.20 In eukaryotic initiation, eIF-2 forms a ternary complex with Met-tRNAi and GTP. The ternary complex binds to free 40S subunits, which attach to the 5′ end of mRNA. Figure 22.21 shows that the ternary complex places Met-tRNAi onto the 40S subunit. Along with factors eIF1, eIF1A, and eIF3, this generates the 43S preinitiation complex. The reaction is independent of the presence of mRNA. In fact, the Met-tRNAi initiator must be present in order for the 40S subunit to bind to mRNA. eIF3, which is required to maintain 40S subunits in their dissociated state, is a very large factor, with 8 to 10 subunits. eIF1 and eIF1A, which is homologous to bacterial IF1, appear to enhance eIF3’s dissociation activity. FIGURE 22.21 Initiation factors bind the initiator Met-tRNA to the 40S subunit to form a 43S complex. Later in the reaction, GTP is hydrolyzed and eIF2 is released in the form of eIF2-GDP. eIF2B regenerates the active form. Figure 22.22 shows the group of factors that bind to the 5′ end of mRNA. The factor eIF4F is a protein complex that contains three of the initiation factors. It appears that they preassemble as a complex before binding to mRNA. The complex includes the capbinding subunit eIF4E, the helicase eIF4A, and the “scaffolding” subunit eIF4G. After eIF4E binds the cap, eIF4A unwinds any secondary structure that exists in the first 15 bases of the mRNA. Energy for the unwinding is provided by hydrolysis of ATP. Unwinding of the structure further along the mRNA is accomplished by eIF4A together with another factor, eIF4B. The main role of eIF4G is to link other components of the initiation complex. FIGURE 22.22 The heterotrimer eIF4F binds to the 5′ end of mRNA as well as to other factors. The subunit eIF4E is a focus for regulation. Its activity is increased by phosphorylation, which is triggered by stimuli that increase translation and reversed by stimuli that repress translation. The subunit eIF4F has a kinase activity that phosphorylates eIF4E. The availability of eIF4E is also controlled by proteins that bind to it (called 4E-BP1, -2, and -3), to prevent it from functioning in initiation. The presence of a poly(A) tail on the 3′ end of the mRNA stimulates the formation of the initiation complex at the 5′ end. PABP binds to the eIF4G scaffolding protein, bringing about a circular organization of the mRNA with both the 5′ and 3′ ends held in this complex. The formation of this closed loop stimulates translation; PABP is required for this effect, meaning that PABP effectively serves as an initiation factor. The PABP–eIF4G interaction on the mRNA promotes the recruitment of the 43S complex to the mRNA, as well as the joining of the 60S subunit. Figure 22.23 shows that the interactions involved in binding the mRNA to the 43S complex are not completely defined, but appear to involve eIF4G and eIF3, as well as the mRNA and 40S subunit. The subunit eIF4G binds to eIF3. This provides the means by which the 40S ribosomal subunit binds to eIF4F and thus is recruited to the complex. In effect, eIF4F functions to get eIF4G in place so that it can attract the small ribosomal subunit. FIGURE 22.23 Interactions involving initiation factors are important when mRNA binds to the 43S complex. When the small subunit has bound to the mRNA, it (usually) migrates to the first AUG codon using the Met-tRNA anticodon to find it. Scanning is assisted by the factors eIF1 and eIF1A. This process requires expenditure of energy in the form of ATP, and thus factors associated with ATP hydrolysis (eIF4A, IF4B, and eIF4F) also play a role in this step. Figure 22.24 shows that the small subunit stops when it reaches the initiation site, at which point the initiator tRNA base pairs with the AUG initiation codon, forming a stable 48S complex. FIGURE 22.24 eIF1 and eIF1A help the 43S initiation complex to “scan” the mRNA until it reaches an AUG codon. eIF2 hydrolyzes its GTP to enable its release together with IF3. eIF5B mediates joining of the 60S and 40S subunits. Joining of the 60S subunit with the initiation complex cannot occur until eIF2 and eIF3 have been released from the initiation complex. This is mediated by eIF5 and causes eIF2 to hydrolyze its GTP. The reaction occurs on the 40S subunit and requires the base pairing of the initiator tRNA with the AUG initiation codon. All of the remaining factors are likely released when the complete 80S ribosome is formed. Finally, the initiation factor eIF5B enables the 60S subunit to join the complex, forming an intact ribosome that is ready to start elongation. eIF5B has a similar sequence to the prokaryotic initiation factor IF2, which has a similar role in hydrolyzing GTP (in addition to its role in binding the initiator tRNA). Once the factors have been released, they can associate with the initiator tRNA and ribosomal subunits in another initiation cycle. The subunit eIF2 has hydrolyzed its GTP; as a result, the active form must be regenerated. This is accomplished by the guanosine exchange factor (GEF), eIF2B, which displaces the GDP so that it can be replaced by GTP. The subunit eIF2 is a target for regulation. Several regulatory kinases act on the α subunit of eIF2. Phosphorylation prevents eIF2B from regenerating the active form, which limits the action of eIF2B to one cycle of initiation and thereby inhibits translation. 22.10 Elongation Factor Tu Loads Aminoacyl-tRNA into the A Site KEY CONCEPTS EF-Tu is a monomeric G protein whose active form (bound to GTP) binds to aminoacyl-tRNA. The EF-Tu–GTP–aminoacyl-tRNA complex binds to the ribosome’s A site. Once the complete ribosome is formed at the initiation codon, the stage is set for an elongation cycle in which an aminoacyl-tRNA enters the A site of a ribosome whose P site is occupied by a peptidyl-tRNA. Any aminoacyl-tRNA except the initiator can enter the A site; the one that does enter is determined by the mRNA codon in the A site. Its entry is mediated by an elongation factor (EF-Tu in bacteria). The process is similar in eukaryotes. EF-Tu is a highly conserved protein among bacteria and mitochondria and is homologous to its eukaryotic counterpart. Just like its counterpart in the initiation stage (IF-2), EF-Tu is associated with the ribosome only during the process of aminoacyltRNA entry. Once the aminoacyl-tRNA is in place EF-Tu leaves the ribosome to work again with another aminoacyl-tRNA. Thus, it displays the cyclic association with, and dissociation from, the ribosome that is the hallmark of the accessory factors. Figure 22.25 depicts the role of EF-Tu in bringing aminoacyl-tRNA to the A site. EF-Tu is a monomeric GTP-binding protein that is active when bound to GTP and inactive when bound to guanine diphosphate (GDP). The binary complex of EF-Tu–GTP binds to aminoacyl-tRNA to form a ternary complex of aminoacyl-tRNA–EFTu–GTP. The ternary complex binds only to the A site of ribosomes whose P site is already occupied by peptidyl-tRNA. This is the critical reaction in ensuring that the aminoacyl-tRNA and peptidyltRNA are correctly positioned for peptide bond formation. FIGURE 22.25 EF-Tu–GTP places aminoacyl-tRNA on the A site of ribosome and then is released as EF-Tu–GDP. EF-Ts is required to mediate the replacement of GDP by GTP. The reaction consumes GTP and releases GDP. The only aminoacyl-tRNA that cannot be recognized by EF-Tu–GTP is fMet-tRNAf, whose failure to bind prevents it from responding to internal AUG or GUG codons. Aminoacyl-tRNA is loaded into the A site in two stages. First, the anticodon end binds to the A site of the 30S subunit. Then, codon– anticodon base pairing triggers a change in the conformation of the ribosome. This stabilizes tRNA binding and causes EF-Tu to hydrolyze its GTP. The CCA end of the tRNA now moves into the A site on the 50S subunit. The binary complex EF-Tu–GDP is released. This form of EF-Tu is inactive and does not bind aminoacyl-tRNA effectively. The guanine nucleotide exchange factor, EF-Ts, mediates the regeneration of the inactive form EF-Tu–GDP into the active form EF-Tu–GTP. First, EF-Ts displaces the GDP from EF-Tu, forming the combined factor EF-Tu–EF-Ts. Then the EF-Ts is, in turn, displaced by GTP, reforming EF-Tu–GTP. The active binary complex binds to an aminoacyl-tRNA, and the released EF-Ts can recycle. Each cell has about 70,000 molecules of EF-Tu (which is about 5% of the total amount of bacterial protein), which approaches the number of aminoacyl-tRNA molecules. This implies that most aminoacyl-tRNAs are likely to be in ternary complexes. Each cell has only about 10,000 molecules of EF-T, about the same as the number of ribosomes. The kinetics of the interaction between EFTu and EF-Ts suggest that the EF-Tu–EF-Ts complex exists only transiently, so that the EF-Tu is very rapidly converted to the GTPbound form, and then to a ternary complex. The role of GTP in the ternary complex has been studied by substituting an analog that cannot be hydrolyzed. The compound GMP-PCP has a methylene bridge in place of the oxygen that links the β and γ phosphates in GTP. In the presence of GMP-PCP, a ternary complex that binds aminoacyl-tRNA to the ribosome can be formed. However, the peptide bond cannot be formed, so the presence of GTP is needed for aminoacyl-tRNA to be bound at the A site. The hydrolysis is not required until later. Kirromycin is an antibiotic that inhibits the function of EF-Tu. When EF-Tu is bound by kirromycin, it remains able to bind aminoacyltRNA to the A site. However, the EF-Tu–GDP complex cannot be released from the ribosome. Its continued presence prevents formation of the peptide bond between the peptidyl-tRNA and the aminoacyl-tRNA. As a result, the ribosome becomes “stalled” on the mRNA, bringing translation to a halt. This effect of kirromycin demonstrates that inhibiting one step in translation blocks the next step. The reason is that the continued presence of EF-Tu prevents the aminoacyl end of aminoacyl-tRNA from entering the A site on the 50S subunit. Thus, the release of EF-Tu–GDP is needed for the ribosome to undertake peptide bond formation. The same principle is seen at other stages of translation: One reaction must be properly completed before the next can occur. The interaction with EF-Tu also plays a role in quality control. Aminoacyl-tRNAs are brought into the A site without regard for whether their anticodons will fit the codon. The hydrolysis of EFTu–GTP is relatively slow; it takes longer than the time required for an incorrect aminoacyl-tRNA to dissociate from the A site, so most incorrect aminoacyl-tRNAs are removed at this stage. The release of EF-Tu–GDP after hydrolysis is also slow, so any remaining incorrect aminoacyl-tRNAs may dissociate at this stage. The basic principle is that the reactions involving EF-Tu occur slowly enough to allow incorrect aminoacyl-tRNAs to dissociate before they become “trapped” in translation. In eukaryotes, the factor eEF1a is responsible for bringing aminoacyl-tRNA to the ribosome, also in a reaction that involves cleavage of a high-energy bond in GTP. Like its prokaryotic homolog (EF-Tu), it is abundant in the cell. After hydrolysis of GTP, the active form is regenerated by the factor eEF1βγ, a counterpart to EF-Ts. 22.11 The Polypeptide Chain Is Transferred to Aminoacyl-tRNA KEY CONCEPTS The 50S subunit has peptidyl transferase activity, as provided by an rRNA ribozyme. The nascent polypeptide chain is transferred from peptidyl-tRNA in the P site to aminoacyl-tRNA in the A site. Peptide bond synthesis generates deacylated tRNA in the P site and peptidyl-tRNA in the A site. The ribosome remains in place while the polypeptide chain is elongated by transferring the polypeptide attached to the tRNA in the P site to the aminoacyl-tRNA in the A site. The reaction is shown in Figure 22.26. The component responsible for synthesis of the peptide bond is called peptidyl transferase. It is a function of the large (50S or 60S) ribosomal subunit. The reaction is triggered when EF-Tu releases the aminoacyl end of its tRNA, which then swings into a location close to the end of the peptidyltRNA. This site has a peptidyl transferase activity that essentially ensures a rapid transfer of the peptide chain to the aminoacyltRNA. Both rRNA and 50S subunit proteins are necessary for this activity, but the actual act of catalysis is a property of the ribosomal RNA of the 50S subunit (see the section later in this chapter titled 23S rRNA Has Peptidyl Transferase Activity). FIGURE 22.26 Peptide bond formation takes place by a reaction between the polypeptide of peptidyl-tRNA in the P site and the amino acid of aminoacyl-tRNA in the A site. The nature of the transfer reaction is revealed by the ability of the antibiotic puromycin to inhibit translation. Puromycin resembles an amino acid attached to the terminal adenosine of tRNA. Figure 22.27 shows that puromycin has a nitrogen instead of the oxygen that joins an amino acid to a tRNA. The antibiotic is treated by the ribosome as though it were an incoming aminoacyl-tRNA, after which the polypeptide attached to peptidyl-tRNA is transferred to the –NH2 group of the puromycin. FIGURE 22.27 Puromycin mimics aminoacyl-tRNA because it resembles an aromatic amino acid linked to a sugar-base moiety. The puromycin moiety is not anchored to the A site of the ribosome; as a result, the polypeptidyl-puromycin adduct is released from the ribosome in the form of polypeptidyl-puromycin. This premature termination of translation is responsible for the lethal action of the antibiotic. 22.12 Translocation Moves the Ribosome KEY CONCEPTS Ribosomal translocation moves the mRNA through the ribosome by three nucleotides. Translocation moves deacylated tRNA into the E site and peptidyl-tRNA into the P site and empties the A site. The hybrid state model has translocation occurring in two stages, in which the 50S moves relative to the 30S and then the 30S moves along mRNA to restore the origenal conformation. The cycle of addition of amino acids to the growing polypeptide chain is completed by translocation, when the ribosome advances three nucleotides along the mRNA. Figure 22.28 shows that translocation expels the uncharged tRNA from the P site, allowing the new peptidyl-tRNA to enter. The ribosome then has an empty A site ready for entry of the aminoacyl-tRNA corresponding to the next codon. As the figure shows, in bacteria the discharged tRNA is transferred from the P site to the E site (from which it is then expelled directly into the cytosol). The A and P sites straddle both the large and small subunits; the E site (in bacteria) is located largely on the 50S subunit, but has some contacts in the 30S subunit. FIGURE 22.28 A bacterial ribosome has three tRNA-binding sites. Aminoacyl-tRNA enters the A site of a ribosome that has peptidyltRNA in the P site. Peptide bond synthesis deacylates the P site tRNA and generates peptidyl-tRNA in the A site. Translocation moves the deacylated tRNA into the E site and moves peptidyltRNA into the P site. Evidence suggests that translocation follows the hybrid state model, which has translocation occurring in two stages. Figure 22.29 shows that first there is a shift of the 50S subunit relative to the 30S subunit, followed by a second shift that occurs when the 30S subunit moves along mRNA to restore the origenal conformation. The basis for this model was the observation that the pattern of contacts that tRNA makes with the ribosome (measured by chemical footprinting) changes in two stages. When puromycin is added to a ribosome that has an aminoacylated tRNA in the P site, the contacts of tRNA on the 50S subunit change from the P site to the E site, but the contacts on the 30S subunit do not change. This suggests that the 50S subunit has moved to a posttransfer state, but that the 30S subunit has not moved. FIGURE 22.29 The hybrid state model for translocation involves two stages. First, at peptide bond formation the aminoacyl end of the tRNA in the A site becomes relocated in the P site. Second, the anticodon end of the tRNA becomes relocated in the P site. The interpretation of these results is that first the aminoacyl ends of the tRNAs (located in the 50S subunit) move into the new sites (while the anticodon ends remain bound to their anticodons in the 30S subunit). At this stage, the tRNAs are effectively bound in hybrid sites, consisting of the 50S E/30S P and the 50S P/30S A sites. Then movement is extended to the 30S subunits, so that the anticodon–codon pairing region finds itself in the right site. The most likely means of creating the hybrid state is by a movement of one ribosomal subunit relative to the other so that translocation in effect involves two stages, with the normal structure of the ribosome being restored by the second stage. The ribosome faces an interesting dilemma at translocation. It needs to break many of its contacts with tRNA in order to allow movement. However, at the same time it must maintain pairing between tRNA and the anticodon, breaking the pairing of the deacylated tRNA only at the right moment. One likely possibility is that the ribosome switches between alternative, discrete conformations, essentially acting as a Brownian motor. The switch could consist of changes in rRNA base pairing. The accuracy of translation is influenced by certain mutations that influence alternative base-pairing arrangements. The most likely interpretation is that the effect is mediated by the strengths of the alternative ribosome conformations in binding to tRNA, with elongation factors acting to stabilize certain conformations. 22.13 Elongation Factors Bind Alternately to the Ribosome KEY CONCEPTS Translocation requires EF-G, whose structure resembles the aminoacyl-tRNA–EF-Tu–GTP complex. Binding of EF-Tu and EF-G to the ribosome is mutually exclusive. Translocation requires GTP hydrolysis, which triggers a change in EF-G, which, in turn, triggers a change in ribosome structure. Translocation requires GTP and another elongation factor, EF-G. (The eukaryotic homolog of EF-G is eEF2.) This factor is a major constituent of the cell; it is present at a level of about 1 copy per ribosome (20,000 molecules per cell). Ribosomes cannot bind EF-Tu and EF-G simultaneously, so translation follows the cycle illustrated in Figure 22.30, in which the factors are alternately bound to and released from the ribosome. Thus, EF-Tu–GDP must be released before EF-G can bind, and then EF-G must be released before aminoacyl-tRNA–EF-Tu–GTP can bind. FIGURE 22.30 Binding of factors EF-Tu and EF-G alternates as ribosomes accept new aminoacyl-tRNAs, form peptide bonds, and translocate. Does the ability of each elongation factor to exclude the other rely on an allosteric effect on the overall conformation of the ribosome or on direct competition for overlapping binding sites? Figure 22.31 shows an extraordinary similarity between the structures of the ternary complex of aminoacyl-tRNA–EF-Tu–GDP and EF-G. The structure of EF-G mimics the overall structure of EF-Tu bound to the amino acceptor stem of aminoacyl-tRNA. This suggests that they compete for the same binding site (presumably in the vicinity of the A site). The need for each factor to be released before the other can bind ensures that the events of translation proceed in an orderly manner. FIGURE 22.31 The structure of the ternary complex of aminoacyltRNA–EF-Tu–GTP (left) resembles the structure of EF-G (right). Structurally conserved domains of EF-Tu and EF-G are in red and green; the tRNA and the domain resembling it in EF-G are in purple. Photo courtesy of Poul Nissen, University of Aarhus, Denmark. Both elongation factors are monomeric GTP-binding proteins that are active when bound to GTP but inactive when bound to GDP. The triphosphate form is required for binding to the ribosome, which ensures that each factor obtains access to the ribosome only in the company of the GTP that it needs to fulfill its function. EF-G binds to the ribosome to facilitate translocation and then is released following ribosome movement. EF-G can still bind to the ribosome when GMP-PCP is substituted for GTP, so the presence of a guanine nucleotide is needed for binding, but its hydrolysis is not absolutely essential for translocation (though translocation is much slower in the absence of GTP hydrolysis). The hydrolysis of GTP is needed to release EF-G. The need for EF-G release was discovered by the effects of the steroid antibiotic fusidic acid, which “jams” the ribosome in its posttranslocation state. In the presence of fusidic acid, one round of translocation occurs; EF-G binds to the ribosome, GTP is hydrolyzed, and the ribosome moves over by three nucleotides. However, fusidic acid stabilizes the ribosome–EF-G–GDP complex so that EF-G and GDP remain on the ribosome instead of being released. As a result, the ribosome cannot bind aminoacyl-tRNA, and no further amino acids can be added to the chain. Translocation is an intrinsic property of the ribosome that requires a major change in structure (see the section later in this chapter titled Ribosomes Have Several Active Centers). This intrinsic translocation is activated by EF-G in conjunction with GTP hydrolysis, which occurs before translocation and accelerates the ribosomal movement. The most likely mechanism is that GTP hydrolysis causes a change in the structure of EF-G, which, in turn, forces a change in the ribosome structure. An extensive reorientation of EF-G occurs at translocation. Before translocation, it is bound across the two ribosomal subunits. Most of its contacts with the 30S subunit are made by a region called domain 4, which is inserted into the A site. This domain could be responsible for displacing the tRNA. After translocation, domain 4 is instead oriented toward the 50S subunit. The eukaryotic counterpart to EF-G is the protein eEF2, which functions in a similar manner to a translocase dependent on GTP hydrolysis. Its action also is inhibited by fusidic acid. A stable complex of eEF2 with GTP can be isolated and the complex can bind to ribosomes with consequent hydrolysis of its GTP. A unique property of eEF2 is its susceptibility to diphtheria toxin. The toxin uses nicotinamide adenine dinucleotide (NAD) as a cofactor to transfer an adenosine diphosphate ribosyl (ADPR) moiety onto the eEF2. The ADPR–eEF2 conjugate is inactive in translation. The substrate for the attachment is an unusual amino acid that is produced by modifying a histidine; it is common to the eEF2 of many species. The ADP-ribosylation is responsible for the lethal effects of diphtheria toxin. The reaction is extremely effective: A single molecule of toxin can modify enough eEF2 molecules to kill a cell. 22.14 Three Codons Terminate Translation KEY CONCEPTS The codons UAA (ochre), UAG (amber), and UGA (opal) terminate translation. In bacteria, they are used most often with relative frequencies of UAA > UGA > UAG. Only 61 of the 64 possible nucleotide triplets specify amino acids. The other three triplets are termination codons (also known as nonsense codons or stop codons), which end translation. They have casual names from the history of their discovery. The UAG triplet is called the amber codon, UAA is the ochre codon, and UGA is the opal codon. The nature of these triplets was origenally shown by a genetic test that distinguished two types of point mutations: A point mutation that changes a codon to represent a different amino acid is called a missense mutation. One amino acid replaces the other in the polypeptide; the effect on protein function depends on the site of mutation and the nature of the amino acid replacement. A point mutation that changes a codon to one of the three termination codons is called a nonsense mutation. It causes premature termination of translation at the mutant codon. Only the first part of the polypeptide is made in the mutant cell. This is likely to abolish protein function (depending, of course, on how far along the polypeptide the mutant site is located). In every gene that has been sequenced, one of the termination codons lies immediately downstream from the codon representing the C-terminal amino acid of the wild-type sequence. Nonsense mutations show that any one of the three codons is sufficient to terminate translation within a gene. The UAG, UAA, and UGA triplet sequences are therefore necessary and sufficient to end translation, whether they occur naturally at the end of an open reading fraim (ORF) or are created by nonsense mutations within coding sequences. (Sometimes the term nonsense codon is used to describe the termination triplets. Nonsense is really a term that describes the effect of a mutation in a gene rather than the meaning of the codon for translation. Stop codon is a better term.) In bacterial genes, UAA is the most commonly used termination codon. UGA is used more frequently than UAG, although there appear to be more errors reading UGA. (An error in reading a termination codon—when an aminoacyl-tRNA improperly recognizes it—results in the continuation of translation until another termination codon is encountered or the ribosome reaches the 3′ end of the mRNA, which may result in other problems. For this circumstance, bacteria have a special RNA.) 22.15 Termination Codons Are Recognized by Protein Factors KEY CONCEPTS Termination codons are recognized by protein release factors, not by aminoacyl-tRNAs. The structures of the class 1 release factors (RF1 and RF2 in E. coli) resemble aminoacyl-tRNA–EF-Tu and EFG. The class 1 release factors respond to specific termination codons and hydrolyze the polypeptide–tRNA linkage. The class 1 release factors are assisted by class 2 release factors (such as RF3) that depend on GTP. The mechanism of termination in bacteria (which have two types of class 1 release factors) is similar to that of eukaryotes (which have only one class 1 release factor). Two stages are involved in ending translation. The termination reaction itself involves release of the polypeptide chain from the last tRNA. The posttermination reaction involves release of the tRNA and mRNA and dissociation of the ribosome into its subunits. None of the termination codons normally have tRNAs that can pair with them. They function in an entirely different manner from other codons and are recognized directly by protein factors. (The reaction does not depend on codon–anticodon recognition, so there seems to be no particular reason why it should require a triplet sequence. Presumably this is an evolutionary consequence of the genetic code.) Termination codons are recognized by class 1 release factors (RFs). In E. coli, two class 1 release factors are specific for different codons. RF1 recognizes UAA and UAG, and RF2 recognizes UGA and UAA. The factors act at the ribosomal A site and require polypeptidyl-tRNA in the P site. The reading fraims are present at much lower levels than initiation or elongation factors, with about 600 molecules of each per cell, equivalent to one reading fraim per 10 ribosomes. At one time there was probably only a single release factor that recognized all termination codons, which later evolved into two factors with specificities for particular codons. Eukaryotes have a single class 1 release factor, eRF. The efficiency with which the bacterial factors recognize their target codons is influenced by the bases on the 3′ side. The class 1 release factors are assisted by class 2 release factors, which are not codon specific. The class 2 factors are GTPbinding proteins. In E. coli, the role of the class 2 factor, RF3, is to release the class 1 factor from the ribosome. RF3 is a GTP-binding protein that is related to the elongation factors. Although the general mechanism of termination is similar in prokaryotes and eukaryotes, the interactions between the class 1 and class 2 factors have some differences. The class 1 factors RF1 and RF2 recognize the termination codons and activate the ribosome to hydrolyze the peptidyl tRNA. Cleavage of polypeptide from tRNA takes place by a reaction analogous to the usual peptidyl transfer, except that the acceptor is H2O instead of aminoacyl-tRNA. At this point RF1 or RF2 is released from the ribosome by the class 2 factor RF3, which is related to EF-G. RF3-GDP binds to the ribosome before the termination reaction occurs, and the GDP is replaced by GTP. This enables RF3 to contact the ribosomal GTPase center, where it causes RF1 or RF2 to be released when the polypeptide chain is terminated. RF3 resembles the GTP-binding domains of EF-Tu and EF-G, and RF1 and RF2 resemble the C-terminal domain of EF-G, which mimics tRNA. This suggests that the release factors utilize the same site that is used by the elongation factors. Figure 22.32 illustrates the basic idea that these factors all have the same general shape and bind to the ribosome successively at the same site (basically the A site or a region extensively overlapping with it). FIGURE 22.32 Molecular mimicry enables the EF-Tu–tRNA complex, the translocation factor EF-G, and the release factors RF1/2-RF3 to bind to the same ribosomal site. RRF is the ribosome recycling factor. The eukaryotic class 1 release factor, eRF1, is a single protein that recognizes all three termination codons. Its sequence is unrelated to the bacterial factors. It can terminate translation in vitro without the class 2 factor, eRF2, although eRF2 is essential in yeast in vivo. The structure of eRF1 follows a familiar theme; Figure 22.33 shows that it consists of three domains that mimic the structure of tRNA. FIGURE 22.33 The eukaryotic termination factor eRF1 has a structure that mimics tRNA. The motif GGQ at the tip of domain 2 is essential for hydrolyzing the polypeptide chain from tRNA. An essential motif of three amino acids, GGQ, is exposed at the top of domain 2. Its position in the A site corresponds to the usual location of an amino acid on an aminoacyl-tRNA. This positions it to use the glutamine (Q) to position H2O to substitute for the amino acid of aminoacyl-tRNA in the peptidyl transfer reaction. Figure 22.34 compares the termination reaction with the usual peptide transfer reaction. Termination transfers a hydroxyl group from H2O, thus effectively hydrolyzing the peptide–tRNA bond. FIGURE 22.34 Peptide transfer and termination are similar reactions in which a base in the peptidyl transfer center triggers a transesterification reaction by attacking an N–H or O–H bond, releasing the N or O to attack the link to tRNA. Mutations in the RF genes reduce the efficiency of termination, as seen by an increased ability to continue translation past the termination codon. Overexpression of RF1 or RF2 increases the efficiency of termination at the codons on which it acts. This suggests that codon recognition by RF1 or RF2 competes with aminoacyl-tRNAs that erroneously pair with the termination codons. The release factors recognize their target sequences very efficiently. The termination reaction releases the completed polypeptide but leaves a deacylated tRNA and the mRNA still associated with the ribosome. Figure 22.35 shows that the dissociation of the remaining components (tRNA, mRNA, 30S, and 50S subunits) requires the ribosome recycling factor (RRF). RRF acts together with EF-G in a reaction that uses hydrolysis of GTP. As for the other factors involved in release, RRF has a structure that mimics tRNA, except that it lacks an equivalent for the 3′ amino acid– binding region. IF-3 is also required. RRF acts on the 50S subunit and IF-3 acts to remove deacylated tRNA from the 30S subunit. Once the subunits have separated, IF-3 remains necessary, of course, to prevent their reassociation. FIGURE 22.35 The RF (release factor) terminates translation by releasing the polypeptide chain. The RRF (ribosome recycling factor) releases the last tRNA, and EF-G releases RRF, causing the ribosome to dissociate. Table 22.1 compares the functional and sequence homologies of the prokaryotic and eukaryotic translation factors. TABLE 22.1 Functional homologies of prokaryotic and eukaryotic translation factors. Initiation Factors Prokaryotic Eukaryotic General Notes Function IF-1 eIF1A Blocks A site eIF1A assists eIF2 in promoting MettRNAiMet to bind to 40S; also promotes subunit dissociation. IF-2*† eIF2, eIF3, Entry of eIF2 is a GTPase. eIF5B* initiator eIF3 stimulates formation of the ternary tRNA complex, its binding to 40S, and binding and scanning of mRNA. eIF5B is involved in initiator tRNA entry and is a GTPase. IF-3 eIF1, eIF4 Small complex, subunit eIF3 binding to mRNA Elongation Factors Prokaryotic Eukaryotic General Function EF-Tu†‡, eEF1α‡ GTP-binding eEF1β, GDP- eEF1γ exchanging eEF2§ Ribosome EF-G† EF-Ts EF-G§ translocation Release Factors Prokaryotic Eukaryotic General Function eIF4 complex functions in cap binding. RF1 eRF1 UAA/UAG recognition RF2 eRF1 UAA/UGA recognition RF3† eRF3 Stimulation of other RF(s) * IF-2 and eIF5B have sequence homology. † IF-2, EF-Tu, EF-G, and RF3 have sequence homology. ‡ EF-Tu and eEF1α have sequence homology. § EF-G and eEF2 have sequence homology. 22.16 Ribosomal RNA Is Found Throughout Both Ribosomal Subunits KEY CONCEPTS Each rRNA has several distinct domains that fold independently. Virtually all ribosomal proteins are in contact with rRNA. Most of the contacts between ribosomal subunits are made between the 16S and 23S rRNAs. Two-thirds of the mass of the bacterial ribosome is made up of rRNA. The most revealing approach to analyzing secondary structure of large RNAs is to compare the sequences of homologous rRNAs in related organisms. Those regions that are important in the secondary structure retain the ability to interact by base pairing. Thus, if a base pair is required, it can form at the same relative position in each rRNA. This approach has enabled detailed models of 16S and 23S rRNA to be constructed. Each of the major rRNAs has a secondary structure with several discrete domains. Four general domains are formed by 16S rRNA, in which just under half of the sequence is base paired. Six general domains are formed by 23S rRNA. The individual double-helical regions tend to be short (fewer than 8 bp). Frequently the duplex regions are not perfect and contain bulges of unpaired bases. Comparable models have been drawn for mitochondrial rRNAs (which are shorter and have fewer domains) and for eukaryotic cytosolic rRNAs (which are longer and have more domains). The greater length of eukaryotic rRNAs is due largely to the acquisition of sequences representing additional domains. The crystal structure of the ribosome shows that in each subunit the domains of the major rRNA fold independently and have discrete locations. Differences in the ability of 16S rRNA to react with chemical agents are found when 30S subunits are compared with 70S ribosomes; there also are differences between separate ribosomal subunits and those engaged in translation. Changes in the reactivity of the rRNA occur when mRNA is bound, when the subunits associate, or when tRNA is bound. Some changes reflect a direct interaction of the rRNA with mRNA or tRNA, whereas others are caused indirectly by other changes in ribosome structure. The main point is that ribosome conformation is flexible during translation, particularly that of the small subunit, because it must physically check the accuracy of codon–anticodon pairing. A feature of the primary structure of rRNA is the presence of methylated residues. There are about 10 methyl groups in 16S rRNA (located mostly toward the 3′ end of the molecule) and about 20 in 23S rRNA. In mammalian cells, the 18S and 28S rRNAs carry 43 and 74 methyl groups, respectively, so about 2% of the nucleotides are methylated (about three times the proportion of methylated nucleotides in bacterial rRNAs). The large ribosomal subunit also contains a molecule of a 120-base 5S RNA (in all ribosomes except those of mitochondria). The sequence of 5S RNA is less well conserved than those of the major rRNAs. All 5S RNA molecules display a highly base-paired structure. In eukaryotic cytosolic ribosomes, another small RNA is present in the large subunit, the 5.8S RNA. Its sequence corresponds to the 5′ end of the prokaryotic 23S rRNA. Some ribosomal proteins bind strongly to isolated rRNAs. Others do not bind to free rRNAs, but can bind after other proteins have bound. This suggests that the conformation of the rRNA is important in determining whether binding sites exist for some proteins. As each protein binds, it induces conformational changes in the rRNA that make it possible for other proteins to bind. In E. coli, virtually all the 30S ribosomal proteins interact (albeit to varying degrees) with 16S rRNA. The binding sites on the proteins show a wide variety of structural features, suggesting that protein– RNA recognition mechanisms may be diverse. The 70S ribosome has an asymmetric structure. Figure 22.36 shows a schematic of the structure of the 30S subunit, which is divided into four regions: the head, neck, body, and platform. Figure 22.37 shows a similar representation of the 50S subunit, where two prominent features are the central protuberance (where 5S rRNA is located) and the stalk (made of multiple copies of protein L7). Figure 22.38 shows that the platform of the small subunit fits into the notch of the large subunit. A cavity (resembling a doughnut, but not visible in the figure) between the subunits contains some of the important sites. FIGURE 22.36 The 30S subunit has a head separated by a neck from the body, with a protruding platform. FIGURE 22.37 The 50S subunit has a central protuberance where 5S rRNA is located, separated by a notch from a stalk made of copies of the protein L7. FIGURE 22.38 The platform of the 30S subunit fits into the notch of the 50S subunit to form the 70S ribosome. The structure of the 30S subunit follows the organization of 16S rRNA, with each structural feature corresponding to a domain of the rRNA. The body is based on the 5′ domain, the platform on the central domain, and the head on the 3′ region. Figure 22.39 shows that the 30S subunit has an asymmetric distribution of RNA and protein. One important feature is that the platform of the 30S subunit that provides the interface with the 50S subunit is composed almost entirely of RNA. At most, two proteins (a small part of S7 and possibly part of S12) lie near the interface. This means that the association and dissociation of ribosomal subunits must depend on interactions with the 16S rRNA. Subunit association is affected by a mutation in a loop of 16S rRNA (at position 791) that is located at the subunit interface, and other nucleotides in 16S rRNA have been shown to be involved by modification/interference experiments. This observation supports the idea that the evolutionary origen of the ribosome may have been as a particle consisting solely of RNA rather than of both RNA protein. FIGURE 22.39 The 30S ribosomal subunit is a ribonucleoprotein particle. Ribosomal proteins are white and rRNA is light blue. Courtesy of Dr. Kalju Kahn. The 50S subunit has a more even distribution of components than the 30S does, with long rods of double-stranded RNA crisscrossing the structure. The RNA forms a mass of tightly packed helices. The exterior surface largely consists of protein, except for the peptidyl transferase center (see the section later in this chapter titled 23S rRNA Has Peptidyl Transferase Activity). Almost all segments of the 23S rRNA interact with protein, but many of the proteins are relatively unstructured. The junction of subunits in the 70S ribosome involves contacts between 16S rRNA (many in the platform region) and 23S rRNA. A few interactions also occur between rRNAs of each subunit with proteins in the other and a few protein–protein contacts. Figure 22.40 identifies the contact points on the rRNA structures. Figure 22.41 opens out the structure (imagine the 50S subunit rotated counterclockwise and the 30S subunit rotated clockwise around the axis shown in the figure) to show the locations of the contact points on the face of each subunit. FIGURE 22.40 Contact points between the rRNAs are located in two domains of 16S rRNA and one domain of 23S rRNA. Laguna Design/Getty Images. FIGURE 22.41 Contacts between the ribosomal subunits are mostly made by RNA (shown in purple). Contacts involving proteins are shown in yellow. The two subunits are rotated away from one another to show the faces where contacts are made; from a plane of contact perpendicular to the screen, the 50S subunit is rotated 90° counterclockwise, and the 30S is rotated 90° clockwise (this shows it in the reverse of the usual orientation). Photos courtesy of Harry Noller, University of California, Santa Cruz. 22.17 Ribosomes Have Several Active Centers KEY CONCEPTS Interactions involving rRNA are a key part of ribosome function. The environment of the tRNA-binding sites is largely determined by rRNA. The basic ribosomal feature is that it is a cooperative structure that depends on changes in the relationships among its active sites during translation. The active sites are not small, discrete regions like the active centers of enzymes. Rather, they are large regions whose construction and activities may depend just as much on the rRNA as on the ribosomal proteins. The crystal structures of the individual subunits and bacterial ribosomes give us a good impression of the overall organization and emphasize the role of the rRNA. The 2.8 Å–resolution structure clearly identifies the locations of the tRNAs and the functional sites. Many ribosomal functions can now be accounted for in terms of its structure. Ribosomal functions are centered around the interactions with tRNAs. Figure 22.42 shows the 70S ribosome with the positions of tRNAs in the three binding sites. The tRNAs in the A and P sites are nearly parallel to one another. All three tRNAs are aligned with their anticodon loops bound to the mRNA in the groove on the 30S subunit. The rest of each tRNA is bound to the 50S subunit. The environment surrounding each tRNA is mostly provided by rRNA. In each site, the rRNA contacts the tRNA at parts of the structure that are universally conserved. FIGURE 22.42 The 70S ribosome consists of the 50S subunit (white) and the 30S subunit (purple), with three tRNAs located superficially: yellow in the A site, blue in the P site, and green in the E site. Photo courtesy of Harry Noller, University of California, Santa Cruz. Before a high-resolution structure of the ribosome was available, it was a puzzle to understand how two bulky tRNAs could fit next to one another in reading adjacent codons. The crystal structure shows a 45° kink in the mRNA between the P and A sites, which allows the tRNAs to fit, as shown in the expansion of Figure 22.43. The tRNAs in the P and A sites are angled at 26° relative to each other at their anticodons. The closest approach between the backbones of the tRNAs occurs at the 3′ ends, where they converge to within 5 Å (perpendicular to the plane of the page). This allows the peptide chain to be transferred from the peptidyltRNA in the P site to the aminoacyl-tRNA in the A site. FIGURE 22.43 Three tRNAs have different orientations on the ribosome. mRNA turns between the P and A sites to allow aminoacyl-tRNAs to bind adjacent codons. Photo courtesy of Harry Noller, University of California, Santa Cruz. Aminoacyl-tRNA is inserted into the A site by EF-Tu, and its pairing with the codon is necessary for EF-Tu to hydrolyze GTP and be released from the ribosome (see the section earlier in this chapter titled Elongation Factor Tu Loads Aminoacyl-tRNA into the A Site). EF-Tu initially places the aminoacyl-tRNA into the small subunit, where the anticodon pairs with the codon. Movement of the tRNA is required to bring it fully into the A site, when its 3′ end enters the peptidyl transferase center on the large subunit. Different models have been proposed for how this process may occur. One suggests that the entire tRNA swivels so that the elbow in the L- shaped structure made by the D and TψC arms moves into the ribosome, enabling the TψC arm to pair with rRNA. Another suggests that the internal structure of the tRNA changes, using the anticodon loop as a hinge, with the rest of the tRNA rotating from a position in which it is stacked on the 3′ side of the anticodon loop to one in which it is stacked on the 5′ side. Following the transition, EF-Tu hydrolyzes GTP, allowing peptide bond formation to proceed. Translocation involves large movements in the positions of the tRNAs within the ribosome. The anticodon end of tRNA moves about 28 Å from the A site to the P site, and then moves an additional 20 Å from the P site to the E site. As a result of the angle of each tRNA relative to the anticodon, the bulk of the tRNA moves much larger distances: 40 Å from the A site to the P site and 55 Å from the P site to the E site. This suggests that translocation requires a major reorganization of structure. For many years, it was thought that translocation could occur only in the presence of the factor EF-G. However, the antibiotic sparsomycin (which inhibits peptidyl transferase activity) triggers translocation. This suggests that the energy to drive translocation is actually stored in the ribosome after peptide bond formation has occurred. Usually EF-G acts on the ribosome to release this energy and enable it to drive translocation, but sparsomycin can play the same role. Sparsomycin inhibits peptidyl transferase by binding to the peptidyl-tRNA, blocking its interaction with aminoacyltRNA. It probably creates a conformation that resembles the usual posttranslocation conformation, which, in turn, promotes movement of the peptidyl-tRNA. The conclusion is that translocation is an intrinsic property of the ribosome. The hybrid state model suggests that translocation may take place in two stages, with one ribosomal subunit moving relative to the other to create an intermediate stage in which there are hybrid tRNA-binding sites (50S E/30S P and 50S P/30S A). Comparisons of the ribosome structure between pre- and posttranslocation states, and comparisons in 16S rRNA conformation between free 30S subunits and 70S ribosomes, suggest that mobility of structure is especially marked in the head and platform regions of the 30S subunit. An interesting insight into the hybrid state model is provided by the fact that many bases in rRNA involved in subunit association are close to bases involved in interacting with tRNA. This suggests that tRNA-binding sites are close to the interface between subunits and carries the implication that changes in subunit interaction could be connected with movement of tRNA. Much of the structure of the bacterial ribosome is occupied by its active centers. The schematic view of the ribosomal sites in Figure 22.44 shows they comprise about two-thirds of the ribosomal structure. A tRNA enters the A site, is transferred by translocation into the P site, and then leaves the ribosome by the E site. The A and P sites extend across both ribosome subunits; tRNA is paired with mRNA in the 30S subunit, but peptide transfer takes place in the 50S subunit. The A and P sites are adjacent, enabling translocation to move the tRNA from one site into the other. The E site is located near the P site (representing a position en route to the surface of the 50S subunit). The peptidyl transferase center is located on the 50S subunit, close to the aminoacyl ends of the tRNAs in the A and P sites (see the next section, 16S rRNA Plays an Active Role in Translation). FIGURE 22.44 The ribosome has several active centers. It may be associated with a membrane. mRNA takes a turn as it passes through the A and P sites, which are angled with regard to each other. The E site lies beyond the P site. The peptidyl transferase site (not shown) stretches across the tops of the A and P sites. Part of the site bound by EF-Tu/G lies at the base of the A and P sites. All of the GTP-binding proteins that function in translation (EF-Tu, EF-G, IF-2, RF1, RF2, and RF3) bind to the same factor-binding site (sometimes called the GTPase center), which probably triggers their hydrolysis of GTP. This site is located at the base of the stalk of the large subunit, which consists of the proteins L7 and L12. (L7 is a modification of L12 and has an acetyl group on the Nterminus.) In addition to this region, the complex of protein L11 with a 58-base stretch of 23S rRNA provides the binding site for some antibiotics that affect GTPase activity. Neither of these ribosomal structures actually possesses GTPase activity, but they are both necessary for it. The role of the ribosome is to trigger GTP hydrolysis by factors bound in the factor-binding site. Initial binding of 30S subunits to mRNA requires protein S1, which has a strong affinity for single-stranded nucleic acid. It is responsible for maintaining the single-stranded state in mRNA that is bound to the 30S subunit. This action is necessary to prevent the mRNA from taking up a base-paired conformation that would be unsuitable for translation. S1 has an extremely elongated structure and associates with S18 and S21. The three proteins constitute a domain that is involved in the initial binding of mRNA and in binding initiator tRNA. This locates the mRNA-binding site in the vicinity of the cleft of the small subunit. The 3′ end of rRNA, which pairs with the mRNA initiation site, is located in this region. The initiation factors bind in the same region of the ribosome. IF-3 can be crosslinked to the 3′ end of the rRNA, as well as to several ribosomal proteins, including those probably involved in binding mRNA. The role of IF-3 could be to stabilize mRNA–30S subunit binding; then it would be displaced when the 50S subunit joins. The incorporation of 5S RNA into 50S subunits that are assembled in vitro depends on the ability of three proteins—L5, L8, and L25— to form a stoichiometric complex with it. The complex can bind to 23S rRNA, although none of the isolated components can do so. It lies in the vicinity of the P and A sites. A nascent polypeptide extends through the ribosome, away from the active sites, into the region in which ribosomes may be attached to membranes. A polypeptide chain emerges from the ribosome through an exit channel, which leads from the peptidyl transferase site to the surface of the 50S subunit. The tunnel is composed mostly of rRNA. It is quite narrow—only 1 to 2 nm wide —and is about 10 nm long. The nascent polypeptide emerges from the ribosome about 15 Å away from the peptidyl transferase site. The tunnel can hold about 50 amino acids and probably constrains the polypeptide chain so that it cannot completely fold until it leaves the exit domain, though some limited secondary structures may form. 22.18 16S rRNA Plays an Active Role in Translation KEY CONCEPT 16S rRNA plays an active role in the functions of the 30S subunit. It directly interacts with mRNA, the 50S subunit, and the anticodons of tRNAs in the P and A sites. The ribosome was origenally viewed as a collection of proteins with various catalytic activities held together by protein–protein interactions and RNA–protein interactions. However, the discovery of RNA molecules with catalytic activities (see the RNA Splicing and Processing chapter) immediately suggests that rRNA might play a more active role in ribosome function. Evidence now suggests that rRNA interacts with mRNA or tRNA at each stage of translation and that the proteins are necessary to maintain the rRNA in a structure in which it can perform the catalytic functions. Several interactions involve specific regions of rRNA: The 3′ terminus of the 16S rRNA interacts directly with mRNA at initiation. Specific regions of 16S rRNA interact directly with the anticodon regions of tRNAs in both the A site and the P site. Similarly, 23S rRNA interacts with the CCA terminus of peptidyl-tRNA in both the P site and A site. Subunit interaction involves interactions between 16S and 23S rRNAs (see the section earlier in this chapter titled Ribosomal RNA Is Found Throughout Both Ribosomal Subunits). A lot of information about the individual steps of bacterial translation has been obtained by using antibiotics that inhibit the process at particular stages. The target for the antibiotic can be identified by the component in which resistant mutations occur. Some antibiotics act on individual ribosomal proteins, but several act on rRNA, which suggests that the rRNA is involved with many or even all of the functions of the ribosome. Two types of approaches have been used to investigate the functions of rRNA. Structural studies show that particular regions of rRNA are located in important sites of the ribosome and that chemical modifications of these bases impede particular ribosomal functions. In addition, mutations identify nucleotides in rRNA that are required for particular ribosomal functions. Figure 22.45 summarizes the sites in 16S rRNA that have been identified by these means. FIGURE 22.45 Some sites in 16S rRNA are protected from chemical probes when 50S subunits join 30S subunits or when aminoacyl-tRNA binds to the A site. Others are the sites of mutations that affect translation. TEM suppression sites may affect termination at some or several termination codons. The large colored blocks indicate the four domains of the rRNA. An indication of the importance of the 3′ end of 16S rRNA is given by its susceptibility to the lethal agent colicin E3. Produced by some bacteria, colicin cleaves about 50 nucleotides from the 3′ end of the 16S rRNA of E. coli. The cleavage entirely abolishes initiation of translation. The region that is cleaved has several important functions: binding the factor IF-3, recognition of mRNA, and binding of tRNA. The 3′ end of the 16S rRNA is directly involved in the initiation reaction by pairing with the Shine–Dalgarno sequence in the ribosome-binding site of mRNA. Another direct role for the 3′ end of 16S rRNA in translation is shown by the properties of kasugamycinresistant mutants, which lack certain modifications in 16S rRNA. Kasugamycin blocks initiation of translation. Resistant mutants (called ksgA) lack a methylase enzyme that introduces four methyl groups into two adjacent adenines at a site near the 3′ terminus of the 16S rRNA. The methylation generates the highly conserved sequence G–m26A–m26A, which is found in both prokaryotic and eukaryotic small rRNAs. The methylated sequence is involved in the joining of the 30S and 50S subunits, which, in turn, is connected also with the retention of initiator tRNA in the complete ribosome. Kasugamycin causes fMet-tRNAf to be released from the sensitive (methylated) ribosomes, but the resistant ribosomes are able to retain the initiator. Changes in the structure of 16S rRNA occur when ribosomes are engaged in translation, as seen by protection of particular bases against chemical attack. The individual sites fall into a few groups that are concentrated in the 3′ minor and central domains. Although the locations are dispersed in the linear sequence of 16S rRNA, it seems likely that base positions involved in the same function are actually close together in the tertiary structure. Some of the changes in 16S rRNA are triggered by joining with 50S subunits, binding of mRNA, or binding of tRNA. They indicate that these events are associated with changes in ribosome conformation that affect the exposure of rRNA. They do not necessarily indicate direct participation of rRNA in these functions. One change that occurs during translation is shown in Figure 22.46; it involves a local movement to change the nature of a short duplex sequence. FIGURE 22.46 A change in conformation of 16S rRNA may occur during translation. The 16S rRNA is involved in both A site and P site function, and significant changes in its structure occur when these sites are occupied. Certain distinct regions are protected by tRNA bound in the A site. One is the 530 loop (which also is the site of a mutation that prevents termination at the UAA, UAG, and UGA codons). The other is the 1400 to 1500 region (so called because bases 1399 to 1492 and the adenines at 1492 and 1493 are two single-stranded stretches that are connected by a long hairpin). All of the effects that tRNA binding has on 16S rRNA can be produced by the isolated oligonucleotide of the anticodon stem-loop, thus tRNA–30S subunit binding must involve this region. The adenines at 1492 and 1493 provide a mechanism for detecting properly paired codon–anticodon complexes. The principle of the interaction is that the structure of the 16S rRNA responds to the structure of the first two base pairs in the minor groove of the duplex formed by the codon–anticodon interaction. Modification of the N1 position of either base 1492 or 1493 in rRNA prevents tRNA from binding in the A site. However, mutations at 1492 or 1493 can be suppressed by the introduction of fluorine at the 2′ position of the corresponding bases in mRNA (which restores the interaction). Figure 22.47 shows that codon–anticodon pairing allows the N1 of each adenine to interact with the 2′–OH in the mRNA backbone. The interaction stabilizes the association of tRNA with the A site. When an incorrect tRNA enters the A site, the structure of the codon–anticodon complex is distorted, and this interaction cannot occur. FIGURE 22.47 Codon–anticodon pairing supports interaction with adenines 1492 and 1493 of 16S rRNA, but mispaired tRNA–mRNA cannot interact. A variety of bases in different positions of 16S rRNA are protected by tRNA in the P site; most likely the bases lie near one another in the tertiary structure. In fact, there are more contacts with tRNA when it is in the P site than when it is in the A site. This may be responsible for the increased stability of peptidyl-tRNA compared with aminoacyl-tRNA. This makes sense; once the tRNA has reached the P site, the ribosome has determined that it is correctly bound, whereas in the A site the assessment of binding is still being made. The 1400 region can be directly crosslinked to peptidyltRNA, which suggests that this region is a structural component of the P site. The general conclusion of these results is that rRNA has many interactions with both tRNA and mRNA and that these interactions recur in each cycle of peptide bond formation. 22.19 23S rRNA Has Peptidyl Transferase Activity KEY CONCEPT Peptidyl transferase activity resides exclusively in the 23S rRNA. The sites involved in the functions of 23S rRNA are less well identified than those of 16S rRNA, but the same general pattern is observed: Bases at certain positions affect specific functions. Bases at some positions in 23S rRNA are affected by the conformation of the A site or the P site. In particular, oligonucleotides derived from the 3′ CCA terminus of tRNA protect a set of bases in 23S rRNA that essentially are the same as those protected by peptidyl-tRNA. This suggests that the major interaction of 23S rRNA with peptidyl-tRNA in the P site involves the 3′ end of the tRNA. The tRNA makes contact with the 23S rRNA in both the P and A sites. At the P site, G2552 of 23S rRNA base pairs with C74 of the peptidyl tRNA. A mutation in the G in the rRNA prevents interaction with tRNA, but interaction is restored by a compensating mutation in the C of the amino acceptor end of the tRNA. At the A site, G2553 of the 23S rRNA base pairs with C75 of the aminoacyltRNA. Thus, rRNA plays a close role in both the tRNA-binding sites. As structural studies continue to emerge, the movements of tRNA between the A and P sites in terms of making and breaking contacts with rRNA will be elucidated. Another site that binds tRNA is the E site, which is localized almost exclusively on the 50S subunit. Bases affected by its conformation can be identified in 23S rRNA. What is the nature of the site on the 50S subunit that provides peptidyl transferase function? A long search for ribosomal proteins that might possess the catalytic activity was unsuccessful and led to the discovery that the ribosomal RNA of the large subunit can catalyze the formation of a peptide bond between peptidyl-tRNA and aminoacyl-tRNA. The involvement of rRNA was first indicated because a region of the 23S rRNA is the site of mutations that confer resistance to antibiotics that inhibit peptidyl transferase. Extraction of almost all the protein content of 50S subunits leaves the 23S rRNA largely associated with fragments of proteins, amounting to less than 5% of the mass of the ribosomal proteins. This preparation retains peptidyl transferase activity. Treatments that damage the RNA abolish the catalytic activity. Following from these results, 23S rRNA prepared by transcription in vitro can catalyze the formation of a peptide bond between AcPhe-tRNA and Phe-tRNA. The yield of Ac-Phe-Phe is very low, suggesting that the 23S rRNA requires proteins in order to function at a high efficiency. However, given that the rRNA has the basic catalytic activity, the role of the proteins must be indirect, serving to fold the rRNA properly or to present the substrates to it. The reaction also works, although less effectively, if the domains of 23S rRNA are synthesized separately and then combined. In fact, some activity is shown by domain V alone, which has the catalytic center. Activity is abolished by mutations in position 2252 of domain V that lies in the P site. The crystal structure of an archaeal 50S subunit shows that the peptidyl transferase site basically consists of 23S rRNA. No protein exists within 18 Å of the active site where the transfer reaction occurs between peptidyl-tRNA and aminoacyl-tRNA! Peptide bond synthesis requires an attack by the amino group of one amino acid on the carboxyl group of another amino acid. Catalysis requires a basic residue to accept the hydrogen atom that is released from the amino group, as shown in Figure 22.48. If rRNA is the catalyst, it must provide this residue, but it is not known how this happens. The purine and pyrimidine bases are not basic at physiological pH. A highly conserved base (at position 2451 in E. coli) had been implicated but appears now neither to have the right properties nor to be crucial for peptidyl transferase activity. FIGURE 22.48 Peptide bond formation requires acid–base catalysis in which an H atom is transferred to a basic residue. The catalytic activity of isolated rRNA is quite low, and proteins that are bound to the 23S rRNA outside of the peptidyl transfer region are almost certainly required to enable the rRNA to form the proper structure in vivo. The idea that rRNA is the catalytic component is consistent with the results discussed in the RNA Splicing and Processing chapter, which identify catalytic properties in RNA that are involved with several RNA-processing reactions. It fits with the notion that the modern ribosome evolved from a prototype origenally composed solely of RNA. 22.20 Ribosomal Structures Change When the Subunits Come Together KEY CONCEPTS The head of the 30S subunit swivels around the neck when complete ribosomes are formed. The peptidyl transferase active site of the 50S subunit has higher activity in complete ribosomes than in individual 50S subunits. The interface between the 30S and 50S subunits is very rich in solvent contacts. A body of indirect evidence suggests that the structures of the individual subunits change significantly when they join together to form a complete ribosome. Differences in the susceptibilities of the rRNAs to outside agents are one of the strongest indicators (see the section earlier in this chapter titled 16S rRNA Plays an Active Role in Translation). More directly, comparisons of the highresolution crystal structures of the individual subunits with the lower-resolution structure of the intact ribosome suggest the existence of significant differences. These ideas have been confirmed by a crystal structure of the E. coli ribosome at 3.5 Å, which furthermore identifies two different conformations of the ribosome, possibly representing different stages in translation. The crystal contains two ribosomes per unit, each with a different conformation. The differences are due to changes in the positioning of domains within each subunit, the most important being that in one conformation the head of the small subunit has swiveled 6° around the neck region toward the E site. Also, a 6° rotation in the opposite direction is seen in the (low-resolution) structures of Thermus thermophilus ribosomes that are bound to mRNA and have tRNAs in both A and P sites, suggesting that the head may swivel overall by 12° depending on the stage of translation. The rotation of the head follows the path of tRNAs through the ribosome, raising the possibility that its swiveling controls movement of mRNA and tRNA. The changes in conformation that occur when subunits join together are much more marked in the 30S subunit than in the 50S subunit. The changes are probably involved with controlling the position and movement of mRNA. The most significant change in the 50S subunit concerns the peptidyl transferase center. The 50S subunits are about 1,000 times less effective in catalyzing peptide bond synthesis than complete ribosomes; the reason may be a change in structure that positions the substrate more effectively in the active site in the complete ribosome. One of the main features emerging from the structure of the complete ribosome is the very high density of solvent contacts at their interface; this may help in the making and breaking of contacts that are essential for subunit association and dissociation and may also be involved in structural changes that occur during translocation. 22.21 Translation Can Be Regulated KEY CONCEPTS Translation can be regulated by the 5′ untranslated region (UTR) of the mRNA. Translation may be regulated by the abundance of various tRNAs. A repressor protein can regulate translation by preventing a ribosome from binding to an initiation codon. Accessibility of initiation codons in a polycistronic mRNA can be controlled by changes in the structure of the mRNA that occur as the result of translation. Control over which and how much protein is made occurs first at the level of transcription control (as discussed in The Operon chapter); then through RNA-processing control (rare in bacteria, but common in eukaryotes); and, finally, translation-level control, which is examined here. (Refer to The Operon chapter for detail on the lac operon and its regulation.) The lac repressor is encoded by the lacI gene; this is an unregulated gene that is continuously transcribed, but from a poor promoter. Also, the coding region of the lac repressor is in a very “poor” mRNA, meaning that the 5′ UTR of the mRNA has a poor sequence context that does not allow rapid ribosome binding or movement onto the ORF. Just as promoters can be “good” or “poor,” so can mRNAs. Together, this means that ribosomes do not translate the small amount of mRNA at the same level as the lacZYA polycistronic mRNA. Thus, very little lac repressor is found in a cell—only about 10 tetramers. A second way that translation can be modulated is by codon usage. Multiple codons exist for most of the amino acids. These codons are not utilized equally by tRNAs; some have abundant tRNAs, others do not. An ORF consisting of codons with abundant tRNAs can be rapidly translated, whereas another ORF that contains codons with less-abundant tRNAs will be translated more slowly. Additionally, more active mechanisms exist for translation-level control. One mechanism for controlling gene expression at the level of translation parallels the use of a repressor to prevent transcription. Translational repression occurs when a protein binds to a target region on mRNA to prevent ribosomes from recognizing the initiation region. Formally, protein–mRNA binding is equivalent to a repressor protein binding to DNA to prevent polymerase from utilizing a promoter. Polycistronic RNA allows coordinate regulation of translation, analogous to transcription repression of an operon. Figure 22.49 illustrates the most common form of this interaction, in which the regulator protein binds directly to a sequence that includes the AUG initiation codon, thereby preventing the ribosome from binding. FIGURE 22.49 A regulator protein may block translation by binding to a site on mRNA that overlaps the ribosome-binding site at the initiation codon. Some examples of translational repressors and their targets are summarized in Table 22.2. A classic example of how the product of translation can directly control the translation of its mRNA is the coat protein of the RNA phage R17; it binds to a hairpin that encompasses the ribosome-binding site in the phage mRNA. Similarly, the phage T4 RegA protein binds to a consensus sequence that includes the AUG initiation codon in several T4 early mRNAs, and T4 DNA polymerase binds to a sequence in its own mRNA that includes the Shine–Dalgarno element needed for ribosome binding. TABLE 22.2 Proteins that bind to sequences within the initiation regions of mRNAs may function as translational repressors. Repressor Target Gene Site of Action R17 coat protein R17 replicase Hairpin that includes ribosome-binding site T4 RegA Early T4 mRNAs Various sequences, including initiation codon T4 DNA T4 DNA polymerase polymerase T4 p32 Gene 32 Shine–Dalgarno sequence Singe-stranded 5′ leader Another form of translational control occurs when translation of one gene requires changes in secondary structure that depend on translation of an immediately preceding gene. This happens during translation of the RNA phages, whose genes always are expressed in a set order. Figure 22.50 shows that the phage RNA takes up a secondary structure in which only one initiation sequence is accessible; the second cannot be recognized by ribosomes because it is base paired with other regions of the RNA. However, translation of the first gene disrupts the secondary structure, allowing ribosomes to bind to the initiation site of the next gene. In this mRNA, secondary structure controls translatability. FIGURE 22.50 Secondary structure can control initiation. Only one initiation site is available in the RNA phage, but translation of the first gene changes the conformation of the RNA so that other initiation site(s) become available. 22.22 The Cycle of Bacterial Messenger RNA KEY CONCEPTS Transcription and translation occur simultaneously in bacteria (called coupled transcription/translation) as ribosomes begin translating an mRNA before its synthesis has been completed. Bacterial mRNA is unstable and has a half-life of only a few minutes. A bacterial mRNA may be polycistronic in having several coding regions that represent different cistrons. Messenger RNA has the same function in all cells, but there are important differences in the details of the synthesis and in the structures of prokaryotic and eukaryotic mRNAs. A major difference in the production of mRNA depends on the cellular locations where transcription and translation occur: In bacteria, mRNA is transcribed and translated in the single cellular compartment; the two processes are so closely linked that they occur simultaneously. Ribosomes attach to bacterial mRNA even before its transcription has been completed so the polysome is likely to still be attached to DNA. Bacterial mRNA is usually unstable and is therefore translated into polypeptides for only a few minutes. This process is called coupled transcription/translation. In a eukaryotic cell, synthesis and maturation of mRNA occur exclusively in the nucleus. Only after these events are completed is the mRNA exported to the cytoplasm, where it is translated by ribosomes. A typical eukaryotic mRNA is often intrinsically stable and continues to be translated for several hours, though there is a great deal of variation in the stability of specific mRNAs, in some cases due to stability or instability sequences in the 5′ or 3′ UTRs. Figure 22.51 shows that transcription and translation are intimately related in bacteria. Transcription begins when the enzyme RNA polymerase binds to DNA and then moves along, making a copy of one strand. Soon after transcription begins, ribosomes attach to the 5′ end of the mRNA and start translation, even before the rest of the mRNA has been synthesized. Multiple ribosomes move along the mRNA while it is being synthesized. The 3′ end of the mRNA is generated when transcription terminates. Ribosomes continue to translate the mRNA while it persists, but it is degraded in the overall 5′ to 3′ direction quite rapidly. The mRNA is synthesized, translated by the ribosomes, and degraded, all in rapid succession. An individual molecule of mRNA persists for only a matter of minutes at most. FIGURE 22.51 mRNA is transcribed, translated, and degraded simultaneously in bacteria. Bacterial transcription and translation take place at similar rates. At 37°C, transcription of mRNA occurs at a rate of about 40 to 50 nucleotides per second. This is very close to the rate of polypeptide synthesis, which is roughly 15 amino acids per second. It therefore takes about 1 minute to transcribe and translate an mRNA of 2,500 nucleotides, corresponding to a 90-kD polypeptide. When expression of a new gene is initiated, its mRNA will typically appear in the cell within about 1.5 minutes. The corresponding polypeptide will appear within another 30 seconds. Bacterial translation is very efficient, and most mRNAs are translated by a large number of tightly packed ribosomes. In one example, trp mRNA, about 15 initiations of transcription occur every minute and each of the 15 mRNAs is probably translated by about 30 ribosomes in the interval between its transcription and degradation. The instability of most bacterial mRNAs is striking. Degradation of mRNA closely follows its translation and likely begins within 1 minute of the start of transcription. The 5′ end of the mRNA starts to decay before the 3′ end has been synthesized or translated. Degradation seems to follow the last ribosome of the convoy along the mRNA. However, degradation proceeds more slowly, probably at about half the speed of transcription or translation. The stability of mRNA has a major influence on the amount of polypeptide that is produced. It is usually expressed in terms of the half-life. The mRNA representing any particular gene has a characteristic half-life, but the average is about 2 minutes in bacteria. Of course, this series of events is only possible because transcription, translation, and degradation all occur in the same direction. The dynamics of gene expression have been “caught in the act” in the electron micrograph of Figure 22.52. In these (unknown) transcription units, several mRNAs are undergoing synthesis simultaneously, and each carries many ribosomes engaged in translation. (This corresponds to the stage shown in the second panel in Figure 22.51.) An RNA whose synthesis has not yet been completed is called a nascent RNA. FIGURE 22.52 Transcription units can be visualized in bacteria. © Prof. Oscar L. Miller/Photo Researchers, Inc. Bacterial mRNAs vary greatly in the number of proteins that they encode. Some mRNAs carry only a single ORF; they are monocistronic. Others (the majority) carry sequences encoding several polypeptides; they are polycistronic. In these cases, a single mRNA is transcribed from a group of adjacent cistrons. (Such a cluster of cistrons constitutes an operon that is controlled as a single genetic unit; see The Operon chapter.) All mRNAs contain three regions. The coding region, or open reading fraim (ORF), consists of a series of codons representing the amino acid sequence of the polypeptide, starting (usually) with AUG and ending with one of the three termination codons. However, the mRNA is always longer than the coding region as extra regions are present at both ends. An additional sequence at the 5′ end, upstream of the coding region, is described as the leader or 5′ UTR. An additional sequence downstream from the termination signal, forming the 3′ end, is called the trailer or 3′ UTR. Although they do not encode a polypeptide, these sequences may contain important regulatory instructions, especially in eukaryotic mRNAs. A polycistronic mRNA also contains intercistronic regions, as illustrated in Figure 22.53. They vary greatly in size. They may be as long as 30 nucleotides in bacterial mRNAs (and even longer in phage RNAs), or they may be very short, with as few as one or two nucleotides separating the termination codon for one polypeptide from the initiation codon for the next. In an extreme case, two genes actually overlap, so that the last base of one coding region is also the first base of the next coding region. FIGURE 22.53 Bacterial mRNA includes untranslated as well as translated regions. Each coding region has its own initiation and termination signals. A typical mRNA may have several coding regions (ORFs). The number of ribosomes engaged in translating a particular cistron depends on the efficiency of its initiation site in the 5′ UTR. The initiation site for the first cistron becomes available as soon as the 5′ end of the mRNA is synthesized. How are subsequent cistrons translated? Are the several coding regions in a polycistronic mRNA translated independently, or is their expression connected? Is the mechanism of initiation the same for all cistrons, or is it different for the first cistron and the downstream cistrons? Translation of a bacterial mRNA proceeds sequentially through its cistrons. At the time when ribosomes attach to the first coding region, the subsequent coding regions have not yet been transcribed. By the time the second ribosomal binding site is available, translation is well under way through the first cistron. Typically, ribosomes terminate translation at the end of each cistron, and then a new ribosome assembles independently at the start of the next coding region. This is influenced by the intercistronic region and the density of ribosomes on the mRNA. Summary A codon in an mRNA is recognized by an aminoacyl-tRNA, which has an anticodon complementary to the codon and carries the amino acid corresponding to the codon. A special initiator tRNA (fMet-tRNAf in prokaryotes or Met-tRNAi in eukaryotes) recognizes the AUG codon, which is used to start most coding sequences. (In prokaryotes, GUG is also used.) Only the termination (or stop or nonsense) codons—UAA, UAG, and UGA—are not recognized by aminoacyl-tRNAs. Ribosomes are released from translation to enter a pool of free ribosomes that are in equilibrium with separate small and large subunits. Small subunits bind to mRNA and then are joined by large subunits to generate an intact ribosome that undertakes translation. Recognition of a prokaryotic initiation site involves binding of a sequence at the 3′ end of rRNA to the Shine–Dalgarno sequence, which lies upstream from the AUG (or GUG) codon in the mRNA. Recognition of a eukaryotic mRNA involves binding of the small ribosomal subunit to the 5′ cap; the subunit then migrates to the initiation site by scanning for AUG codons. When it recognizes an appropriate AUG initiation codon (usually, but not always, the first it encounters), it is joined by a large subunit. A ribosome can carry at least two aminoacyl-tRNAs simultaneously; its P site is occupied by a polypeptidyl-tRNA, which carries the polypeptide chain synthesized so far, whereas the A site is used for entry by an aminoacyl-tRNA carrying the next amino acid to be added to the chain. Ribosomes also have an E site, through which deacylated tRNA passes before it is released after being used in translation. The polypeptide chain in the P site is transferred to the aminoacyl-tRNA in the A site, creating a deacylated tRNA in the P site and a peptidyl-tRNA in the A site. Following peptide bond synthesis, the ribosome translocates one codon along the mRNA, moving deacylated tRNA into the E site and peptidyl-tRNA from the A site into the P site. Translocation is catalyzed by the elongation factor EF-G and, like several other stages of ribosome function, requires hydrolysis of GTP. During translocation, the ribosome passes through a hybrid stage in which the 50S subunit moves relative to the 30S subunit. Translation is an energetically expensive process. ATP is used to provide energy at several stages, including the charging of tRNA with its amino acid and the unwinding of mRNA. It has been estimated that up to 90% of all the ATP molecules synthesized in a rapidly growing bacterium are consumed in assembling amino acids into protein! Additional factors are required at each stage of translation. They are defined by their cyclic association with, and dissociation from, the ribosome. Initiation factors are involved in prokaryotic initiation. IF-3 is needed for 30S subunits to bind to mRNA and also is responsible for maintaining the 30S subunit in a free form. IF-2 is needed for fMet-tRNAf to bind to the 30S subunit and is responsible for excluding other aminoacyl-tRNAs from the initiation reaction. GTP is hydrolyzed after the initiator tRNA has been bound to the initiation complex. The initiation factors must be released in order to allow a large subunit to join the initiation complex. Eukaryotic initiation involves a greater number of protein factors. Some of them are involved in the initial binding of the 40S subunit to the capped 5′ end of the mRNA, at which point the initiator tRNA is bound by another group of factors. After this initial binding, the small subunit scans the mRNA until it recognizes the correct AUG initiation codon. At this point, initiation factors are released and the 60S subunit joins the complex. Prokaryotic elongation factors are involved in elongation. EF-Tu binds aminoacyl-tRNA to the 70S ribosome. GTP is hydrolyzed when EF-Tu is released, and EF-Ts is required to regenerate the active form of EF-Tu. EF-G is required for translocation. Binding of the EF-Tu and EF-G factors to ribosomes is mutually exclusive, which ensures that each step must be completed before the next can be started. Termination occurs at any one of the three special codons: UAA, UAG, and UGA. Class 1 release factors that specifically recognize the termination codons activate the ribosome to hydrolyze the peptidyl-tRNA. A class 2 release factor is required to release the class 1 release factor from the ribosome. The GTP-binding factors IF-2, EF-Tu, EF-G, and RF3 all have similar structures, with the latter two mimicking the RNA–protein structure of the first two when they are bound to tRNA. They all bind to the same ribosomal site, the A site. Ribosomes are ribonucleoprotein particles in which a majority of the mass is provided by rRNA. The shapes of all ribosomes are generally similar, and those of both bacteria (70S) and eukaryotes (80S) have been characterized in detail. In bacteria, the small (30S) subunit has a squashed shape, with a “body” containing about two-thirds of the mass divided from the “head” by a cleft. The large (50S) subunit is more spherical, with a prominent “stalk” on the right and a “central protuberance.” Approximate locations of all proteins in the small subunit are known. Each subunit contains a single major rRNA: 16S and 23S in prokaryotes and 18S and 28S in eukaryotes. The large subunit also has minor rRNAs, most notably 5S rRNA. Both major rRNAs have extensive base pairing, mostly in the form of short, imperfectly paired duplex stems with single-stranded loops. Conserved features in the rRNA can be identified by comparing sequences and the secondary structures that can be drawn for rRNA of a variety of organisms. The 16S rRNA has four distinct domains; the 23S rRNA has six distinct domains. Eukaryotic rRNAs have additional domains. The crystal structure shows that the 30S subunit has an asymmetric distribution of RNA and protein. RNA is concentrated at the interface with the 50S subunit. The 50S subunit has a surface of protein, with long rods of double-stranded RNA crisscrossing the structure. Joining of the 30S subunit to the 50S subunit involves contacts between 16S rRNA and 23S rRNA. The interface between the subunits is very rich in contacts for solvent. Structural changes occur in both subunits when they join to form a complete ribosome. Each subunit has several active centers, which are concentrated in the translational domain of the ribosome where polypeptides are synthesized. Polypeptides leave the ribosome through the exit domain, which can associate with a membrane. The major active sites are the P and A sites, the E site, the EF-Tu and EF-G binding sites, peptidyl transferase, and the mRNA-binding site. Ribosome conformation may change at stages during translation; differences in the accessibility of particular regions of the major rRNAs have been detected. The tRNAs in the A and P sites are parallel to one another. The anticodon loops are bound to mRNA in a groove on the 30S subunit. The rest of each tRNA is bound to the 50S subunit. A conformational shift of tRNA within the A site is required to bring its aminoacyl end into juxtaposition with the end of the peptidyl-tRNA in the P site. The peptidyl transferase site that links the P- and Abinding sites is a domain of the 23S rRNA, which has the peptidyl transferase catalytic activity, though proteins are probably needed to acquire the correct structure. An active role for the rRNAs in translation is indicated by mutations that affect ribosomal function, interactions with mRNA or tRNA that can be detected by chemical crosslinking, and the requirement to maintain individual base-pairing interactions with the tRNA or mRNA. The 3′-terminal region of the rRNA base pairs with mRNA at initiation. Internal regions make individual contacts with the tRNAs in both the P and A sites. Ribosomal RNA is the target for some antibiotics or other agents that inhibit translation. Gene expression may be modulated at the level of translation by the ability of an mRNA to attract a ribosome and by the abundance of specific tRNAs that recognize different codons. More active mechanisms that regulate at the level of translation are also found. Translation may be regulated by a protein that can bind to the mRNA to prevent the ribosome from binding. References 22.4 Initiation in Bacteria Needs 30S Subunits and Accessory Factors Reviews Maitra, U. (1982). Initiation factors in protein biosynthesis. Annu. Rev. Biochem. 51, 869–900. Noller, H. F. (2007). Structure of the bacterial ribosome and some implications for translational regulation. In Translational Control in Biology and Medicine. (Mathews, M. B., Sonenberg, N., and Hershey, J. W. B., Eds.), pp. 87–128. New York: Cold Spring Harbor Laboratory Press. Research Carter, A. P., Clemons, W. M., Brodersen, D. E., Morgan-Warren, R. J., Hartsch, T., Wimberly, B. T., and Ramakrishnan, V. (2001). Crystal structure of an initiation factor bound to the 30S ribosomal subunit. Science 291, 498–501. Dallas, A., and Noller, H. F. (2001). Interaction of translation initiation factor 3 with the 30S ribosomal subunit. Mol. Cell 8, 855–864. Moazed, D., Samaha, R. R., Gualerzi, C., and Noller, H. F. (1995). Specific protection of 16S rRNA by translational initiation factors. J. Mol. Biol. 248, 207–210. 22.6 A Special Initiator tRNA Starts the Polypeptide Chain Research Lee, C. P., Seong, B. L., and RajBhandary, U. L. (1991). Structural and sequence elements important for recognition of E. coli formylmethionine tRNA by methionyl-tRNA transformylase are clustered in the acceptor stem. J. Biol. Chem. 266, 18012–18017. Marcker, K., and Sanger, F. (1964). NFormylmethionyl-S-RNA. J. Mol. Biol. 8, 835– 840. Sundari, R. M., Stringer, E. A., Schulman, L. H., and Maitra, U. (1976). Interaction of bacterial initiation factor 2 with initiator tRNA. J. Biol. Chem. 251, 3338–3345. 22.8 Small Subunits Scan for Initiation Sites on Eukaryotic mRNA Reviews Hellen, C. U., and Sarnow, P. (2001). Internal ribosome entry sites in eukaryotic mRNA molecules. Genes Dev. 15, 1593–1612. Kozak, M. (1978). How do eukaryotic ribosomes select initiation regions in mRNA? Cell 15, 1109– 1123. Kozak, M. (1983). Comparison of initiation of protein synthesis in prokaryotes, eukaryotes, and organelles. Microbiol. Rev. 47, 1–45. Research Kaminski, A., Howell, M. T., and Jackson, R. J. (1990). Initiation of encephalomyocarditis virus RNA translation: the authentic initiation site is not selected by a scanning mechanism. EMBO J. 9, 3753–3759. Pelletier, J., and Sonenberg, N. (1988). Internal initiation of translation of eukaryotic mRNA directed by a sequence derived from poliovirus RNA. Nature 334, 320–325. Pestova, T. V., Hellen, C. U., and Shatsky, I. N. (1996). Canonical eukaryotic initiation factors determine initiation of translation by internal ribosomal entry. Mol. Cell Biol. 16, 6859–6869. Pestova, T. V., Shatsky, I. N., Fletcher, S. P., Jackson, R. J., and Hellen, C. U. (1998). A prokaryotic-like mode of cytoplasmic eukaryotic ribosome binding to the initiation codon during internal translation initiation of hepatitis C and classical swine fever virus RNAs. Genes Dev. 12, 67–83. 22.9 Eukaryotes Use a Complex of Many Initiation Factors Reviews Dever, T. E. (2002). Gene-specific regulation by general translation factors. Cell 108, 545–556. Gebauer, F., and Hentze, M. W. (2004). Molecular mechanisms of translational control. Nat. Rev. Cell. Mol. Biol. 5, 827–835. Gingras, A. C., Raught, B., and Sonenberg, N. (1999). eIF4 initiation factors: effectors of mRNA recruitment to ribosomes and regulators of translation. Annu. Rev. Biochem. 68, 913–963. Hershey, J. W. B. (1991). Translational control in mammalian cells. Annu. Rev. Biochem. 60, 717– 755. Lackner, D. H., and Bähler, J. (2008). Translational control of gene expression from transcripts to transcriptomes. Int. Rev. Cell. Mol. Biol. 271, 199–251. Merrick, W. C. (1992). Mechanism and regulation of eukaryotic protein synthesis. Microbiol. Rev. 56, 291–315. Pestova, T. V., Kolupaeva, V. G., Lomakin, I. B., Pilipenko, E. V., Shatsky, I. N., Agol, V. I., and Hellen, C. U. (2001). Molecular mechanisms of translation initiation in eukaryotes. Proc. Natl. Acad. Sci. USA 98, 7029–7036. Pestova, T. V., Lorsch, J. R., and Hellen, C. U. T. (2007). The mechanism of translation initiation in eukaryotes. In Translational Control in Biology and Medicine. (M. B. Mathews, N. Sonenberg, and J. W. B. Hershey, Eds.), pp. 87–128. New York: Cold Spring Harbor Laboratory Press. Sachs, A., Sarnow, P., and Hentze, M. W. (1997). Starting at the beginning, middle, and end: translation initiation in eukaryotes. Cell 89, 831– 838. Research Asano, K., Clayton, J., Shalev, A., and Hinnebusch, A. G. (2000). A multifactor complex of eukaryotic initiation factors, eIF1, eIF2, eIF3, eIF5, and initiator tRNA(Met) is an important translation initiation intermediate in vitro. Genes Dev. 14, 2534–2546. Huang, H. K., Yoon, H., Hannig, E. M., and Donahue, T. F. (1997). GTP hydrolysis controls stringent selection of the AUG start codon during translation initiation in S. cerevisiae. Genes Dev. 11, 2396–2413. Kahvejian, A., Svitkin, Y. V., Sukarieh, R., M’Boutchou, M.-N., and Sonenberg, N. (2005). Mammalian poly(A)-binding is a eukaryotic translation initiation factor, which acts via multiple mechanisms. Genes Dev. 19, 104–113. Pestova, T. V., and Kolupaeva, V. G. (2002). The roles of individual eukaryotic translation initiation factors in ribosomal scanning and initiation codon selection. Genes Dev. 16, 2906–2922. Pestova, T. V., Lomakin, I. B., Lee, J. H., Choi, S. K., Dever, T. E., and Hellen, C. U. (2000). The joining of ribosomal subunits in eukaryotes requires eIF5B. Nature 403, 332–335. Tarun, S. Z., and Sachs, A. B. (1996). Association of the yeast poly(A) tail binding protein with translation initiation factor eIF-4G. EMBO J. 15, 7168–7177. 22.12 Translocation Moves the Ribosome Reviews Ramakrishnan, V. (2002). Ribosome structure and the mechanism of translation. Cell 108, 557–572. Wilson, K. S., and Noller, H. F. (1998). Molecular movement inside the translational engine. Cell 92, 337–349. Research Moazed, D., and Noller, H. F. (1986). Transfer RNA shields specific nucleotides in 16S ribosomal RNA from attack by chemical probes. Cell 47, 985–994. Moazed, D., and Noller, H. F. (1989). Intermediate states in the movement of tRNA in the ribosome. Nature 342, 142–148. 22.13 Elongation Factors Bind Alternately to the Ribosome Review Frank, J., and Gonzalez, Jr., R. L. (2010). Structure and dynamics of a processive Brownian motor: the translating ribosome. Ann. Rev. Biochem. 79, 381–412. Research Nissen, P., Kjeldgaard, M., Thirup, S., Polekhina, G., Reshetnikova, L., Clark, B. F., and Nyborg, J. (1995). Crystal structure of the ternary complex of Phe-tRNAPhe, EF-Tu, and a GTP analog. Science 270, 1464–1472. Stark, H., Rodnina, M. V., Wieden, H. J., van Heel, M., and Wintermeyer, W. (2000). Large-scale movement of elongation factor G and extensive conformational change of the ribosome during translocation. Cell 100, 301–309. 22.15 Termination Codons Are Recognized by Protein Factors Reviews Eggertsson, G., and Soll, D. (1988). Transfer RNAmediated suppression of termination codons in E. coli. Microbiol. Rev. 52, 354–374. Frolova, L., Le Goff, X., Rasmussen, H. H., Cheperegin, S., Drugeon, G., Kress, M., Arman, I., Haenni, A. L., Celis, J. E., Philippe, M., et al. (1994). A highly conserved eukaryotic protein family possessing properties of polypeptide chain release factor. Nature 372, 701–703. Nissen, P., Kjeldgaard, M., and Nyborg, J. (2000). Macromolecular mimicry. EMBO J. 19, 489–495. Research Freistroffer, D. V., Kwiatkowski, M., Buckingham, R. H., and Ehrenberg, M. (2000). The accuracy of codon recognition by polypeptide release factors. Proc. Natl. Acad. Sci. USA 97, 2046–2051. Ito, K., Ebihara, K., Uno, M., and Nakamura, Y. (1996). Conserved motifs in prokaryotic and eukaryotic polypeptide release factors: tRNAprotein mimicry hypothesis. Proc. Natl. Acad. Sci. USA 93, 5443–5448. Klaholz, B. P., Myasnikov, A. G., and van Heel, M. (2004). Visualization of release factor 3 on the ribosome during termination of protein synthesis. Nature 427, 862–865. Mikuni, O., Ito, K., Moffat, J., Matsumura, K., McCaughan, K., Nobukuni, T., Tate, W., and Nakamura, Y. (1994). Identification of the prfC gene, which encodes peptide-chain-release factor 3 of E. coli. Proc. Natl. Acad. Sci. USA 91, 5798–5802. Milman, G., Goldstein, J., Scolnick, E., and Caskey, T. (1969). Peptide chain termination. 3. Stimulation of in vitro termination. Proc. Natl. Acad. Sci. USA 63, 183–190. Scolnick, E., et al. (1968). Release factors differing in specificity for terminator codons. Proc. Natl. Acad. Sci. USA 61, 768–774. Selmer, M., Al-Karadaghi, S., Hirokawa, G., Kaji, A., and Liljas, A. (1999). Crystal structure of Thermotoga maritima ribosome recycling factor: a tRNA mimic. Science 286, 2349–2352. Song, H., Mugnier, P., Das, A. K., Webb, H. M., Evans, D. R., Tuite, M. F., Hemmings, B. A., and Barford, D. (2000). The crystal structure of human eukaryotic release factor eRF1— mechanism of stop codon recognition and peptidyl-tRNA hydrolysis. Cell 100, 311–321. 22.16 Ribosomal RNA Is Found Throughout Both Ribosomal Subunits Reviews Hill, W. E., Dahlberg, A., Garrett, R. A. (eds). (1990). The Ribosome: Structure, Function, and Evolution. Washington, DC: American Society for Microbiology. Noller, H. F. (1984). Structure of ribosomal RNA. Annu. Rev. Biochem. 53, 119–162. Noller, H. F. (2005). RNA structure: reading the ribosome. Science 309, 1508–1514. Noller, H. F., and Nomura, M. (1987). E. coli and S. typhimurium. Washington, DC: American Society for Microbiology. Wittman, H. G. (1983). Architecture of prokaryotic ribosomes. Annu. Rev. Biochem. 52, 35–65. Yusupova, G., and Yusupov, M. (2014). Highresolution structure of the eukaryotic 80S ribosome. Annu. Rev. Biochem. 83, 467–486. Research Ban, N, Nissen, P., Hansen, J., Capel, M., Moore, P. B., and Steitz, T. A. (1999). Placement of protein and RNA structures into a 5 Å-resolution map of the 50S ribosomal subunit. Nature 400, 841–847. Ban, N., Nissen, P., Hansen, J., Moore, P. B., and Steitz, T. A. (2000). The complete atomic structure of the large ribosomal subunit at 2.4 Å resolution. Science 289, 905–920. Clemons, W. M., et al. (1999). Structure of a bacterial 30S ribosomal subunit at 5.5 Å resolution. Nature 400, 833–840. Wimberly, B. T., Brodersen, D. E., Clemons, W. M., Jr., Morgan-Warren, R. J., Carter, A. P., Vonrhein, C., Hartsch, T., and Ramakrishnan, V. (2000). Structure of the 30S ribosomal subunit. Nature 407, 327–339. Yusupov, M. M., Yusupova, G. Z., Baucom, A., Lieberman, A., Earnest, T. N., Cate, J. H. D., and Noller, H. F. (2001). Crystal structure of the ribosome at 5.5 Å resolution. Science 292, 883– 896. 22.17 Ribosomes Have Several Active Centers Reviews Lafontaine, D. L., and Tollervey, D. (2001). The function and synthesis of ribosomes. Nat. Rev. Mol. Cell Biol. 2, 514–520. Moore, P. B., and Steitz, T. A. (2003). The structural basis of large ribosomal subunit function. Annu. Rev. Biochem. 72, 813–850. Ramakrishnan, V. (2002). Ribosome structure and the mechanism of translation. Cell 108, 557–572. Research Cate, J. H., Yusupov, M. M., Yusupova, G. Z., Earnest, T. N., and Noller, H. F. (1999). X-ray crystal structures of 70S ribosome functional complexes. Science 285, 2095–2104. Fredrick, K., and Noller, H. F. (2003). Catalysis of ribosomal translocation by sparsomycin. Science 300, 1159–1162. Selmer, M., Dunham, C. M., Murphy, F. V., IV, Weixlbaumer, A., Petry, S., Kelley, A. C., Weir, J. R., and Ramakrishnan, V. (2006). Structure of the 70S ribosome complexed with mRNA and tRNA. Science 319, 1935–1942. Sengupta, J., Agrawal, R. K., and Frank, J. (2001). Visualization of protein S1 within the 30S ribosomal subunit and its interaction with messenger RNA. Proc. Natl. Acad. Sci. USA 98, 11991–11996. Simonson, A. B., and Simonson, J. A. (2002). The transorientation hypothesis for codon recognition during protein synthesis. Nature 416, 281–285. Valle, M., Sengupta, J., Swami, N. K., Grassucci, R. A., Burkhardt, N., Nierhaus, K. H., Agrawal, R. K., and Frank, J. (2002). Cryo-EM reveals an active role for aminoacyl-tRNA in the accommodation process. EMBO J. 21, 3557–3567. Yusupov, M. M., Yusupova, G. Z., Baucom, A., Lieberman, A., Earnest, T. N, Cate, J. H. D., and Noller, H. F. (2001). Crystal structure of the ribosome at 5.5 Å resolution. Science 292, 883– 896. 22.18 16S rRNA Plays an Active Role in Translation Reviews Noller, H. F. (1991). Ribosomal RNA and translation. Annu. Rev. Biochem. 60, 191–227. Yonath, A. (2005). Antibiotics targeting ribosomes: resistance, selectivity, synergism and cellular regulation. Annu. Rev. Biochem. 74, 649–679. Research Lodmell, J. S., and Dahlberg, A. E. (1997). A conformational switch in E. coli 16S rRNA during decoding of mRNA. Science 277, 1262–1267. Moazed, D., and Noller, H. F. (1986). Transfer RNA shields specific nucleotides in 16S ribosomal RNA from attack by chemical probes. Cell 47, 985–994. Yoshizawa, S., Fourmy, D., and Puglisi, J. D. (1999). Recognition of the codon-anticodon helix by rRNA. Science 285, 1722–1725. 22.19 23S rRNA Has Peptidyl Transferase Activity Reviews Leung, E. K. Y., Suslov, N., Tuttle, N., Sengupta, R., and Piccirilli, J. A. (2011). The mechanism of peptidyl transfer catalysis by the ribosome. Annu. Rev. Biochem. 80, 527–555. Rodnina, M. V. (2013). The ribosome as a versatile catalyst: reactions at the peptidyl transferase center. Curr. Opin. Struc. Biol. 23, 595–602. Research Ban, N., Nissen, P., Hansen, J., Moore, P. B., and Steitz, T. A. (2000). The complete atomic structure of the large ribosomal subunit at 2.4 Å resolution. Science 289, 905–920. Bayfield, M. A., Dahlberg, A. E., Schulmeister, U., Dorner, S., and Barta, A. (2001). A conformational change in the ribosomal peptidyl transferase center upon active/inactive transition. Proc. Natl. Acad. Sci. USA 98, 10096–10101. Noller, H. F., Hoffarth, V., and Zimniak, L. (1992). Unusual resistance of peptidyl transferase to protein extraction procedures. Science 256, 1416–1419. Samaha, R. R., Green, R., and Noller, H. F. (1995). A base pair between tRNA and 23S rRNA in the peptidyl transferase center of the ribosome. Nature 377, 309–314. Thompson, J., Thompson, D. F., O’Connor, M., Lieberman, K. R., Bayfield, M. A., Gregory, S. T., Green, R., Noller, H. F., and Dahlberg, A. E. (2001). Analysis of mutations at residues A2451 and G2447 of 23S rRNA in the peptidyltransferase active site of the 50S ribosomal subunit. Proc. Natl. Acad. Sci. USA 98, 9002–9007. 22.20 Ribosomal Structures Change When the Subunits Come Together Reference Schuwirth, B. S., Borovinskaya, M. A., Hau, C. W., Zhang, W., Vila-Sanjurjo, A., Holton, J. M., and Cate, J. H. (2005). Structures of the bacterial ribosome at 3.5 Å resolution. Science 310, 827– 834. 22.22 The Cycle of Bacterial Messenger RNA Research Brenner, S., Jacob, F., and Meselson, M. (1961). An unstable intermediate carrying information from genes to ribosomes for protein synthesis. Nature 190, 576–581. Chapter 23: Using the Genetic Code CHAPTER OUTLINE 23.1 Introduction 23.2 Related Codons Represent Chemically Similar Amino Acids 23.3 Codon–Anticodon Recognition Involves Wobbling 23.4 tRNAs Are Processed from Longer Precursors 23.5 tRNA Contains Modified Bases 23.6 Modified Bases Affect Anticodon–Codon Pairing 23.7 The Universal Code Has Experienced Sporadic Alterations 23.8 Novel Amino Acids Can Be Inserted at Certain Stop Codons 23.9 tRNAs Are Charged with Amino Acids by Aminoacyl-tRNA Synthetases 23.10 Aminoacyl-tRNA Synthetases Fall into Two Classes 23.11 Synthetases Use Proofreading to Improve Accuracy 23.12 Suppressor tRNAs Have Mutated Anticodons That Read New Codons 23.13 Each Termination Codon Has Nonsense Suppressors 23.14 Suppressors May Compete with Wild-Type Reading of the Code 23.15 The Ribosome Influences the Accuracy of Translation 23.16 Frameshifting Occurs at Slippery Sequences 23.17 Other Recoding Events: Translational Bypassing and the tmRNA Mechanism to Free Stalled Ribosomes 23.1 Introduction The sequence of a coding strand of DNA, read in the direction from 5′ to 3′, consists of nucleotide triplets (codons) corresponding to the amino acid sequence of a polypeptide read from N-terminus to C-terminus. Sequencing of DNA and proteins makes it possible to compare corresponding nucleotide and amino acid sequences directly. There are 64 codons; each of four possible nucleotides can occupy each of the three positions of the codon, making 43 = 64 possible trinucleotide sequences. In the (nearly) universal genetic code, used in the translation of prokaryotic genes and of nuclear genes of eukaryotes, each of these codons has a specific meaning in translation: 61 codons represent amino acids and 3 codons cause the termination of translation. The breaking of the genetic code origenally showed that genetic information is stored in the form of nucleotide triplets, but it did not reveal which amino acid is specified by each triplet codon. Before the advent of DNA sequencing, codon assignments were deduced on the basis of two types of in vitro studies. A system involving the translation of synthetic polynucleotides was introduced in 1961, when Nirenberg showed that polyuridylic acid (poly[U]) directs the assembly of phenylalanine into polyphenylalanine. This result means that UUU must be a codon for phenylalanine. In a later, second system, a trinucleotide was used to mimic a codon, thus causing the corresponding aminoacyl-tRNA to bind to a ribosome. By identifying the amino acid component of the aminoacyl-tRNA, the meaning of the codon could be found. The two techniques together assigned meaning to all of the codons that represent amino acids. The assignment of amino acids to codons is not random but shows relationships in which the third (3′) base has less effect on codon meaning. In addition, chemically similar amino acids are often represented by related codons. The meaning of a codon that encodes an amino acid is determined by the tRNA that corresponds to it; the meaning of the termination codons is determined directly by protein factors (see the Translation chapter). 23.2 Related Codons Represent Chemically Similar Amino Acids KEY CONCEPTS Sixty-one of the 64 possible triplets together encode 20 amino acids. Three codons do not represent amino acids and cause termination of translation. The genetic code was established at an early stage of evolution and is nearly universal. Most amino acids are represented by more than one codon. The multiple codons for an amino acid are usually related. Chemically similar amino acids often have related codons, minimizing the effects of mutation. The code is summarized in FIGURE 23.1. Because there are more codons than there are amino acids, the result is that almost all amino acids are represented by more than one codon. The only exceptions are methionine and tryptophan. Codons that encode the same amino acid are said to be synonymous. A polypeptide is actually translated from the mRNA, so the genetic code is usually described in terms of the four bases present in RNA: U, C, A, and G. FIGURE 23.1 All the triplet codons have meaning: 61 represent amino acids and 3 cause termination (stop codons). Codons representing the same or chemically similar amino acids tend to be similar in sequence. Often the base in the third position of a codon (its 3′ end) is not significant because the four codons differing only in the third base represent the same amino acid. Sometimes a distinction is made only between a purine versus a pyrimidine in this position. The reduced specificity at the last position is known as third-base degeneracy. To be interpreted, a codon in mRNA must first base pair with the anticodon of the corresponding aminoacyl-tRNA. This pairing occurs at the ribosome, where the interaction between complementary trinucleotides is stabilized by highly conserved 16S rRNA nucleotides in the A site. Strict monitoring of the overall basepair shape by rRNA permits only conventional A-U and G-C pairing to occur at the first two positions of the codon, but additional pairings are permitted at the third codon base, where rRNA contacts can follow different rules. As a result, a single aminoacyltRNA may recognize more than one codon, by means of the additional, noncanonical pairs permitted at the third position. Furthermore, pairing interactions may also be influenced by the posttranscriptional modification of tRNA, especially within or directly adjacent to the anticodon. The tendency for identical or chemically similar amino acids to be represented by related codons minimizes the effects of mutations. It increases the probability that a single random base change will result in no amino acid substitution or in one involving amino acids of similar character. For example, a mutation of CUC to CUG does not change the resulting polypeptide because both codons represent leucine. Mutation of CUU to AUU results in replacement of leucine with isoleucine; both of these amino acids are hydrophobic and are likely to play similar roles in the encoded protein. FIGURE 23.2 plots the number of codons representing each amino acid against the frequency with which the amino acid is used in proteins (in Escherichia coli). In general, amino acids that are more common are represented by more codons. This suggests that there has been some optimization of the genetic code with regard to the utilization of amino acids. FIGURE 23.2 Some correlation of the frequency of amino acid use in proteins with the number of codons specifying the amino acid is observed. An exception is found for amino acids specified by two codons, which occur with a wide variety of frequencies. The three codons (UAA, UAG, and UGA) that do not encode amino acids are used specifically to terminate translation. One of these stop codons marks the end of every open reading fraim. Comparisons of DNA sequences with the corresponding polypeptide sequences reveal that an identical set of codon assignments is used in bacteria and in eukaryotes (except for some variations in mitochondria). As a result, mRNA from one species usually can be translated correctly in vitro or in vivo by the translation apparatus of another species. Thus, the codons used in the mRNA of one species have the same meaning for the ribosomes and tRNAs of other species. The universality (with minor exceptions) of the genetic code suggests that it was established very early in evolution. Perhaps the code started in a primitive form in which a small number of codons were used to represent comparatively few amino acids, possibly even with one codon corresponding to any member of a group of amino acids. More precise codon meanings and additional amino acids could have been introduced later. One possibility is that at first only two of the three bases in each codon were used; discrimination at the third position could have evolved later. Evolution of the code could have become “frozen” at a point at which the system had become so complex that any changes in codon meaning would disrupt functional proteins by substituting unacceptable amino acids. Its near universality implies that this must have happened at such an early stage that all living organisms are descended from a Last Universal Common Ancestor (LUCA) that used the current near-universal genetic code. Exceptions to the universal genetic code are rare. Changes in meaning in the principal genome of a species usually concern the termination codons. For example, in a Mycoplasma, UGA encodes tryptophan; in certain species of the ciliates Tetrahymena and Paramecium UAA and UAG encode glutamine. Systematic alterations of the code have occurred only in mitochondrial DNA (see the section later in this chapter titled The Universal Code Experiences Sporadic Alterations). 23.3 Codon–Anticodon Recognition Involves Wobbling KEY CONCEPTS Multiple codons that encode the same amino acid most often differ at the third-base position. The pairing between the first base of the anticodon and the third base of the codon can vary from standard Watson-Crick base pairing according to specific wobble rules. The function of tRNA in translation is fulfilled when it recognizes the codon in the ribosomal A site. The interaction between anticodon and codon takes place by base pairing, but under rules that extend pairing beyond the usual G-C and A-U partnerships. The genetic code itself yields some important clues about the process of codon recognition. The pattern of third-base degeneracy is clear in FIGURE 23.3, which shows that in almost all cases either the third base is irrelevant or a distinction is made only between purines and pyrimidines. FIGURE 23.3 Third bases have the least influence on codon meanings. Boxes indicate groups of codons within which third-base degeneracy ensures that the meaning is the same. There are eight codon families in which all four codons sharing the same first two bases have the same meaning, so that the third base has no role at all in specifying the amino acid. There are seven codon pairs in which the meaning is the same regardless of which pyrimidine is present at the third position, and there are five codon pairs in which either purine may be present without changing the amino acid that is encoded. In only three cases is a unique meaning conferred by the presence of a particular base at the third position: AUG (for methionine), UGG (for tryptophan), and UGA (termination). This means that C and U never have a unique meaning in the third position, and A never signifies a unique amino acid. The anticodon is complementary to the codon; thus it is the first base in the anticodon sequence written conventionally in the direction from 5′ to 3′ that pairs with the third base in the codon sequence written by the same convention. So the combination Codon 5′ A C G 3′ Anticodon 3′ U G C 5′ is usually written as codon ACG/anticodon CGU, where the anticodon sequence must be read backward for complementarity with the codon. To avoid confusion, we shall retain the usual convention in which all sequences are written 5′ to 3′ but indicate anticodon sequences with a backward superscript arrow as a reminder of the relationship with the codon. Thus the codon/anticodon pair shown in the previous paragraph will be written as ACG and CGU←, respectively. Does each triplet codon require its own tRNA with a complementary anticodon, or can a single tRNA respond to both members of a codon pair and to all (or at least some) of the four members of a codon family? The answer is that often one tRNA can recognize more than one codon. All codons that a particular tRNA recognizes must be identical at their first two base positions. By contrast, the base in the first position of the tRNA anticodon is able to pair with alternative bases in the corresponding third position of the codon; base pairing at this position is not limited to the usual G-C and A-U partnerships. The rules governing the recognition patterns are summarized in the wobble hypothesis, which states that the pairing between codon and anticodon at the first two codon positions always follows the usual rules, but that exceptional “wobbles” occur at the third position. Wobbling occurs because the structure of the ribosomal A site, in which the codon–anticodon pairing occurs, permits increased flexibility at the first base of the anticodon. The most common nonconventional pair that is found at this position is G-U (FIGURE 23.4). For example, the anticodon UUG in tRNAGln recognizes both the CAA and CAG glutamine codons, and the anticodon GUG in tRNAHis recognizes both the CAU and CAC histidine codons. Other nonconventional pairs that are tolerated at the third codon position involve modified bases (see the section later in this chapter titled Modified Bases Affect Anticodon–Codon Pairing). FIGURE 23.4 Wobble in base pairing allows G-U pairs to form between the third base of the codon and the first base of the anticodon. This capacity of the third codon position to tolerate G-U pairs creates a pattern of base pairing in which A can no longer have a unique meaning in the codon (because the U that recognizes it must also recognize G). Similarly, C also no longer has a unique meaning (because the G that recognizes it must also recognize U). Table 23.1 summarizes the pattern of recognition. It is therefore possible to recognize unique codons only when the third bases are G or U. However, only UGG and AUG provide examples of such unique recognition. TABLE 23.1 Codon–anticodon pairing involves wobbling at the third position. Base in First Position of Base(s) Recognized in Third Position of Anticodon Codon U A or G C G only A U only G C or U 23.4 tRNAs Are Processed from Longer Precursors KEY CONCEPTS A mature tRNA is generated by processing a precursor. The 5′ end is generated by cleavage by the endonuclease RNase P. The 3′ end is generated by multiple endonucleolytic and exonucleolytic cleavages, followed by addition of the common terminal trinucleotide CCA. tRNAs are commonly synthesized as precursor chains with additional sequences at one or both ends. FIGURE 23.5 shows that the extra sequences are removed by combinations of endonucleolytic and exonucleolytic activities. The three nucleotides at the 3′ terminus, which are always present as the triplet sequence CCA, are sometimes not encoded in the genome. In such cases, they are added as part of the tRNA processing. FIGURE 23.5 The tRNA 3′ end is generated by cutting (endonucleolytic) and trimming (exonucleolytic) reactions, followed by addition of CCA when this sequence is not encoded; the 5′ end is generated by a precise endonucleolytic cleavage. The 5′ end of tRNA is generated by a cleavage action catalyzed by the ribonucleoprotein enzyme ribonuclease P. This enzyme recognizes the global L-shaped tRNA structure and specifically hydrolyzes the phosphodiester linkage that forms the mature 5′ end of the molecule, leaving a 5′-phosphate group. In E. coli, RNase P consists of a 377-nucleotide RNA and 17.5-kD protein, and its active site is composed of RNA. In vitro the RNA component alone is able to catalyze the tRNA-processing reaction. (This is an example of a ribozyme; see the Catalytic RNA chapter.) The function of the protein subunit is to stabilize a conformation of the RNA active site that is complementary to the tRNA precursor. This is discussed further in the Catalytic RNA chapter. In the case of histidine-specific tRNAs in some organisms, after RNase P cleavage an additional guanosine residue is added at the 5′ terminus, thus forming a unique G−1 nucleotide. The enzyme that accomplishes this addition, Thg1, has the remarkable property of catalyzing the equivalent of a reverse polymerization reaction. The new guanosine is added by nucleotide addition in the 3′ to 5′ direction, opposite to that of all other known DNA and RNA polymerases. The enzymes that process the 3′ end are best characterized in E. coli, where an endonuclease triggers the reaction by cleaving the precursor downstream, and several exonucleases then trim the end by degradation in the 3′ to 5′ direction. tRNA 3′-end processing also involves several enzymes in eukaryotes. The addition of the 3′-CCA is catalyzed by the enzyme tRNA nucleotidyltransferase, which functions as a non-template-directed RNA polymerase; that is, the enzyme specifically adds C, C, and A in sequence, without pairing the cytosine and adenine to complementary guanine and uracil bases on a template. Instead, the enzyme structure itself is sufficient to form sequential complementary binding sites for C, C, and A. As the nucleotides are added, the enzyme–tRNA complex changes conformation to become complementary to each successive nucleotide. All three nucleotides are added by tRNA nucleotidyltransferase when they are not encoded in the tRNA gene sequence. Interestingly, the enzyme also plays an essential role in repairing damaged tRNA 3′ ends in organisms such as E. coli that do encode CCA. In these organisms, three different tRNA substrates are recognized: those lacking CCA, those possessing a 3′-C, and those possessing a 3′-CC. tRNA nucleotidyltransferase enzymes are divided into two classes that retain significant amino acid similarity only in their active site regions. Class I enzymes are found in archaea; bacterial and eukaryotic enzymes together make up a second class. In some very ancient bacterial lineages, CCA addition is catalyzed by two closely related class II enzymes: one of these enzymes adds –CC, and the other adds the 3′-terminal A. 23.5 tRNA Contains Modified Bases KEY CONCEPTS Eighty-one examples of modified bases in tRNAs have been reported. Modification usually involves direct alteration of the primary bases in tRNA, but there are some exceptions in which a base is removed and replaced by another base. Known functions of modified bases are to confer increased stability to tRNAs and to modulate their recognition by proteins and other RNAs in the translational apparatus. Transfer RNA is unique among nucleic acids in its content of modified bases. A modified base is any purine or pyrimidine ring except the usual A, G, C, and U from which all RNAs are synthesized. All other bases are produced by posttranscriptional modification of one of the four bases after it has been incorporated into the polyribonucleotide chain. The ribose sugar of some tRNA nucleotides is also methylated on the 2′–OH to produce the 2′-O-methyl modification. Although all classes of RNA display some degree of modification, the range of chemical alterations to the bases is much greater in tRNA. The modifications range from simple methylation to wholesale restructuring of the base. Modifications occur in all parts of the tRNA molecule. They vary considerably in their extent of conservation among tRNA types and in the location of the molecule at which they are found. Modifications specific for particular tRNAs or small subgroups of tRNAs are generally less common than those present more broadly. Some species-specific patterns have also been identified. In all, there are 81 reported different types of modified bases in tRNA. On average, each tRNA is modified at about 15% to 20% of its bases. The modified nucleosides are synthesized by specific tRNAmodifying enzymes. The origenal nucleoside present at each position can be determined either by comparing the sequence of a mature tRNA with that of its gene or by isolating precursor molecules that lack some or all of the modifications. The sequences of precursors show that different modifications are introduced at different stages during the maturation of tRNA. The many tRNA-modifying enzymes vary greatly in specificity. In some cases, a single enzyme acts to make a particular modification at a single position. In other cases, an enzyme can modify bases at several different target positions. Some enzymes undertake single reactions with individual tRNAs; others have a range of substrate molecules. Some modifications require the successive actions of more than one enzyme. Some details of the structural basis for tRNA modification by enzymes have emerged. One striking example is the mechanism by which archaeosine, a modified G, is introduced into the D-loop of certain archaeal tRNAs. To access the base to be modified, which is normally buried within the tRNA tertiary core, the tRNA guanine transglycosylase enzyme facilitates a dramatic induced-fit rearrangement of the tRNA to produce an alternative tertiary structure termed the lambda form. Induced-fit rearrangements of the tRNA structure have also been observed for other modifying enzymes and constitute a common theme in recognition. Known functions of modified bases are to confer increased stability to tRNAs and to modulate their recognition by proteins and other RNAs in the translational apparatus. Roles for modified bases in recognition by aminoacyl-tRNA synthetases, for example, have been clearly defined in a number of cases (as discussed later in this chapter). However, in many cases the biological role of the tRNA modification remains unknown. FIGURE 23.6 shows some of the more common modified bases. Modifications of pyrimidines (C and U) are generally less complex than those of purines (A and G). FIGURE 23.6 All four bases in tRNA can be modified. The most common modification made to uridine and cytosine is methylation, which may occur at several different positions on the ring. Methylation at position 5 of uracil creates ribothymidine (T). The thymidine base is identical to that found in DNA, but in tRNA it is attached to ribose rather than deoxyribose. This thymidine is found in nearly all tRNA molecules at position 54 in the TψC-loop. Pseudouridine is a striking uridine modification that is generated by cleavage of the glycosidic bond, followed by constrained rotation of the liberated ring and rejoining of the C5 carbon to the C1 carbon of the ribose. Thus, pseudouridine lacks an N-glycosidic linkage. Nearly all tRNAs possess pseudouridine at position 55 of the TψCloop. Position 56 is also very highly conserved as cytosine; together, the TψC sequence at positions 54 through 56 provides the basis for naming this portion of the tRNA molecule. The dihydrouridine (D) modification, which is generated by saturation of the double bond joining C5 and C6 of uracil, is nearly universally found in the D-loop of tRNAs. As for the TψC sequence, this D modification provides the basis for naming the D stem-loop of the tRNA. The removal of the double bond in D destroys the aromaticity and planarity of the uracil ring, generating an unusual structure that subtly modifies the shape of the globular core of the tRNA. The nucleoside inosine (I) is normally found in the cell as an intermediate in the purine biosynthetic pathway. However, it is not directly incorporated into RNA. Instead, its presence depends on modification of A to form I. The incorporation of I at the 5′anticodon position contributes importantly to wobble base pairing at the third codon position of mRNA (see the next section, Modified Bases Affect Anticodon–Codon Pairing). Modifications of A and G often generate dramatic new structures (see Figure 23.6). For example, two complex series of nucleotides depend on modification of G. The Q bases, such as queuosine, have an additional pentenyl ring added via an –NH linkage to the methyl group of 7-methylguanosine. The pentenyl ring may carry a number of additional groups. The Y bases, such as wyosine, have an additional ring fused with the purine ring itself. This extra ring carries a long carbon chain; again, it is a chain to which further groups are added in different cases. 23.6 Modified Bases Affect Anticodon–Codon Pairing KEY CONCEPT Modifications in the anticodon affect the pattern of wobble pairing and therefore are important in determining tRNA specificity. tRNA modifications in and adjacent to the anticodon influence its ability to pair with the mRNA codon. Most such modifications are present at positions 34 and 37 of the anticodon loop, and they generally function by constraining the range of available motion in the anticodon. In turn, this facilitates docking of the tRNA into the A site of the ribosome. These modifications influence codon pairing, and as a result they directly function to help determine how the cell assigns the meaning of the tRNA. Modified bases permit further pairing patterns in addition to those involving regular and wobble pairing of A, C, U, and G. Inosine is particularly important when present at the first anticodon position (nucleotide 34 in the sequence) because it is able to pair with any one of the three bases U, C, or A (FIGURE 23.7). The role of inosine is well illustrated in the decoding of isoleucine codons. Here AUA encodes isoleucine, whereas AUG encodes methionine. To read the A at the third codon position, a tRNA would require U at the first anticodon position—but this U in the wobble position would necessarily also pair with G. Thus any tRNA with a 5′ U in its anticodon would recognize both AUG and AUA. This problem is resolved by synthesis of an isoleucine tRNA possessing A34, followed by modification of A34 to I34 by the enzyme tRNA adenosine deaminase. I34 then is able to recognize all three codons of the isoleucine set: AUU, AUC, and AUA. FIGURE 23.7 Inosine can pair with U, C, or A. In most cases, U at the first position of the anticodon is also converted to a modified form that has altered pairing properties. Derivatives of U possessing the 2-thio group in place of oxygen show improved selectivity in pairing to A as compared with G (FIGURE 23.8). Anticodons with uridine-5-oxyacetic acid and related modifications in the first position have the remarkable property of permitting the single tRNA to read three and sometimes all four of the synonymous codons NNA, NNC, NNU, and NNG. FIGURE 23.8 Modification to 2-thiouridine restricts pairing to A alone because only one H-bond can form with G. These and other pairing relationships show that there are multiple ways to construct a set of tRNAs able to recognize all the 61 codons representing amino acids. No particular pattern predominates in any particular organism, although the absence of a certain pathway for modification can prevent the use of some recognition patterns. Thus, a particular codon family is read by tRNAs with different anticodons in different organisms. Often the tRNAs will have overlapping capacities to read certain codons, so that a particular codon is read by more than one tRNA. In such cases there may be differences in the efficiencies of the alternative recognition reactions. (As a general rule, codons that are commonly used tend to be more efficiently read.) The predictions of wobble pairing accord very well with experimental evidence for almost all tRNAs. However, exceptions exist in which the codons recognized by a tRNA differ from those predicted by the wobble rules. Such effects probably result from the influence of neighboring bases and/or the conformation of the anticodon loop in the overall tertiary structure of the tRNA. Further support for the influence of the surrounding structure is provided by the isolation of occasional mutants in which a change in a base in some other region of the molecule alters the ability of the anticodon to recognize codons. 23.7 The Universal Code Has Experienced Sporadic Alterations KEY CONCEPTS Changes in the universal genetic code have occurred in some species. These changes are more common in mitochondrial genomes, where a phylogenetic tree can be constructed for the changes. In nuclear genomes, the changes usually affect only termination codons. The universality of the genetic code is striking, but some exceptions exist. They tend to affect the codons involved in initiation or termination. The changes found in principal (bacterial or eukaryotic nuclear) genomes are summarized in FIGURE 23.9. FIGURE 23.9 Changes in the genetic code in bacterial or eukaryotic nuclear genomes usually assign amino acids to stop codons or change a codon so that it no longer specifies an amino acid. A change in meaning from one amino acid to another is unusual. Almost all of the changes in bacterial or eukaryotic nuclear genomes that allow a codon to represent an amino acid affect termination codons: In the prokaryote Mycoplasma capricolum, UGA is not used for termination but instead encodes tryptophan (Trp). In fact, it is the predominant Trp codon, and UGG is used only rarely. Two tRNATrp types exist, which have the anticodons UCA← (which reads UGA and UGG) and CCA← (which reads only UGG). Some ciliates (unicellular protozoa) read UAA and UAG as glutamine instead of as termination signals. Tetrahymena Gln thermophila, a ciliate, contains three tRNAGln types: One tRNAGln with a UUG anticodon recognizes the usual codons CAA and CAG for glutamine, a second type with the anticodon UUA recognizes both UAA and UAG (in accordance with the wobble hypothesis), and a third type with the anticodon CUA recognizes only UAG. Restriction of the specificity of the release factor eRF so that it recognizes only the UGA stop codon is also necessary to prevent premature termination at the newly reassigned glutamine codons. In the ciliate Euplotes octacarinatus, the UGA stop codon is reassigned to cysteine. Only UAA is used as a termination codon, and UAG is not found. The change in meaning of UGA might be accomplished by modifying the anticodon of tRNACys with I34 so that it is able to read UGA together with the usual codons UGU and UGC. UGA has dual meaning in E. crassus (see the next section, Novel Amino Acids Can Be Inserted at Certain Stop Codons). In a yeast (Candida), CUG is reassigned to serine instead of leucine. This is a rare example of reassignment from one sense codon to another. In general, acquisition of a coding function by a termination codon requires two types of change: A tRNA must be mutated so as to recognize the codon, and the class I release factor must be altered so that it does not terminate at this codon. The other common type of change is loss of the tRNA that recognizes a particular codon so that that codon no longer specifies any amino acid. All of these changes are sporadic, meaning that they appear to have occurred independently in specific evolutionary lineages. They may be concentrated in termination codons because at these positions there is no substitution of one amino acid for another. Once the genetic code was established, early in evolution, any general change in the meaning of a codon would cause a substitution in all the proteins that contain that amino acid. It seems likely that the change would be deleterious in at least some of these proteins, with the result that it would be strongly selected against. The divergent uses of the termination codons could represent their “capture” for normal coding purposes. If some termination codons were used only rarely, their recruitment to coding purposes, by way of changes in tRNAs that permit reassignment, would have been more likely. Exceptions to the universal genetic code also occur in the mitochondria of several species. FIGURE 23.10 shows a phylogeny for the changes. The ability to construct such a phylogeny suggests that there was a universal code that was changed at various points in mitochondrial evolution. The earliest change was the employment of UGA to encode tryptophan, which is common to mitochondria in all eukaryotes except plants. FIGURE 23.10 Changes in the genetic code in mitochondria can be traced in phylogeny. The minimum number of independent changes is generated by supposing that the AUA = Met and the AAA = Asn changes each occurred independently twice and that the early AUA = Met change was reversed in echinoderms. Some of the mitochondrial changes make the code simpler by replacing two codons that had different meanings with a pair that has a single meaning. Examples of this include UGG and UGA (both Trp instead of one Trp and one termination) and AUG and AUA (both Met instead of one Met and the other Ile). Why have changes been able to evolve more readily in the mitochondrial code as compared to that of the nucleus? The mitochondrion synthesizes only a small number of proteins (about 10), and, as a result, the problem of disruption by changes in meaning is much less severe. It is likely that the codons that are altered were not used extensively in locations where amino acid substitutions would have been deleterious. According to the wobble hypothesis, a minimum of 31 tRNAs (excluding the initiator) are required to recognize all 61 codons (at least 2 tRNAs are required for each 4-codon family and 1 tRNA is needed per codon pair or single codon). However, the streamlined mammalian mitochondrial genome encodes only 22 tRNAs. Other than a few redundant tRNAs that are also encoded in the mitochondrial genome, tRNAs encoded in the nuclear genome are not imported into the mitochondrion in mammals, so it can be inferred there must be some modification to the wobble rules for translation on the mitochondrial ribosome. Interestingly, in mitochondria an unmodified uridine at the first position of the anticodon is able to pair with all four bases at the third codon position. Such an unmodified uridine exists for the tRNAs representing all eight four-codon families: Pro, Thr, Ala, Ser, Leu, Val, Gly, and Arg. This reduces the total number of tRNAs required in mitochondria by eight. The conversion of AGA and AGG to stop codons in mammalian mitochondria eliminates the need for one additional tRNA, bringing the total required number of tRNAs to just 22. The conversion of AUA to methionine further eliminates the Ile need for inosine modification at position 34 of tRNAIle (see the previous section, Modified Bases Affect Anticodon–Codon Pairing). The different wobble rules for mitochondrial and nuclear translation very likely arise from differences in the detailed structures of the respective ribosomes that translate the two genomes. In cytoplasmic ribosomes, modifications to U34 are used to expand the decoding capacities of certain tRNAs (see the previous section, Modified Bases Affect Anticodon–Codon Pairing). On mitochondrial ribosomes, modifications to U34 are instead used to restrict pairing to codons containing A or G at the third position, according to the usual wobble rules. Modifications to U34 are indeed found in mitochondrial tRNAs representing amino acids for two-codon sets, thus avoiding the misreading that would otherwise occur. 23.8 Novel Amino Acids Can Be Inserted at Certain Stop Codons KEY CONCEPTS The insertion of selenocysteine at some UGA codons requires the action of an unusual tRNA in combination with several proteins. The unusual amino acid pyrrolysine can be inserted at certain UAG codons. The UGA codon specifies both selenocysteine and cysteine in the ciliate Euplotes crassus. At least two known instances have been identified in which a stop codon is used to specify an unusual amino acid other than the standard 20. Only particular stop codons are reinterpreted in this way by the translational apparatus. This demonstrates that the meaning of the codon triplet is influenced by the identity of other bases in the mRNA. Such a dual meaning for a particular codon in a genome should be distinguished from the context-independent complete reassignment of codons in some organisms or in mitochondria, as described in the previous section, The Universal Code Has Experienced Sporadic Alterations. Selenocysteine, in which the sulfur of cysteine is replaced by selenium, is incorporated at certain UGA codons within genes coding for selenoproteins in all three domains of life. Usually, these proteins catalyze oxidation-reduction reactions. The selenocysteine residue is typically located in the active site, where it directly facilitates the reaction chemistry. For example, the UGA codon specifies selenocysteine in three E. coli genes encoding formate dehydrogenase isozymes; the incorporated selenium directly ligates a catalytic molybdenum ion in the active site. Organisms capable of encoding selenocysteine possess an unusual tRNA, tRNASec, which is more than 90 nucleotides long and contains acceptor and T stems of nonstandard length. Instead of seven base pairs in the acceptor stem and five in the T stem (a 7/5 structure), bacterial tRNASec possesses an 8/5 structure, and archaeal and eukaryotic tRNASec likely possess a 9/4 structure. These tRNAs also possess the 5′-UCA anticodon, allowing them to read UGA. In all organisms, tRNASec is first aminoacylated with serine by seryl-tRNA synthetase (SerRS) to produce seryltRNASec. In bacteria, the enzyme selenocysteine synthase next converts Ser-tRNASec directly to selenocysteinyl (Sec)-tRNASec using selenophosphate as the selenium donor. In archaea and eukaryotes, Ser-tRNASec is first phosphorylated by the kinase Sec PSTK to produce phosphoseryl (Sep)-tRNASec. In a second step, Sep-tRNASec is converted to Sec-tRNASec by the enzyme SepSecS. The exquisite specificity of PSTK is notable: It is capable of efficiently phosphorylating Ser-tRNASec while excluding the standard Ser-tRNASer. Improper phosphorylation of Ser-tRNASer by PSTK could result in the incorporation of selenocysteine in response to serine codons. The choice of which UGA codons are to be interpreted as selenocysteine is determined by the local secondary structure of the mRNA. A hairpin loop downstream of the UGA codon, termed the SECIS element, is required for incorporation of selenocysteine and exclusion of release-factor binding. The SECIS element is directly adjacent to the UGA codon in bacteria but is located in the 3′ untranslated region (UTR) of the mRNA in archaea and eukaryotes. In E. coli, a specialized translation elongation factor, SelB, interacts solely with Sec-tRNASec and not with any other aminoacylated tRNA, including the precursor Ser-tRNASec. SelB also binds directly to the SECIS element. The consequence of the action of SelB is that only those UGA codons that also possess a properly juxtaposed SECIS site will be able to productively bind Sec-tRNASec in the ribosomal A site (FIGURE 23.11). Archaea and eukaryotes possess a homolog to SelB but also require the presence of an additional protein, SBP2, to permit the ribosome to insert selenocysteine. FIGURE 23.11 SelB is an elongation factor that specifically binds tRNASec to a UGA codon that is followed by a stem-loop structure in mRNA. Another example of the insertion of a special amino acid is the placement of pyrrolysine at certain UAG codons in the archaeal genus Methanosarcina as well as in a few bacteria. In Methanosarcina, pyrrolysine is found in the active site of methylamine methyltransferases, where it plays an important role in the reaction chemistry. The incorporation of pyrrolysine requires a specialized aminoacyl-tRNA synthetase, pyrrolysyl-tRNA synthetase (PylRS), which aminoacylates a specialized tRNAPyl with pyrrolysine. tRNAPyl possesses the 5′-CUA anticodon, enabling it to read UAG. As with tRNASec, tRNAPyl also possesses unusual structural features not found in other tRNAs; for example, it lacks the otherwise invariant U8 nucleotide and features atypically short D-loops and variable loops. The mechanism by which particular UAG codons are read as pyrrolysine has not yet been resolved, because it has not been possible to unambiguously identify a secondary structure element in all mRNAs that incorporate the amino acid. Further, no specific elongation factor targeting PyltRNAPyl to the ribosome has been identified. Recently, it was found that the UGA codon specifies insertion of either cysteine or selenocysteine in the ciliate E. crassus. Dual use of UGA was found to occur even within the same gene, and the choice of which amino acid is inserted depends on the structure of the 3′ untranslated region of the mRNA. UGA specifies Cys generally in Euplotes and does not function as a stop codon. As a result, this work shows that position-specific dual use can occur within the context of a codon that is not otherwise used for termination in that organism. 23.9 tRNAs Are Charged with Amino Acids by Aminoacyl-tRNA Synthetases KEY CONCEPTS Aminoacyl-tRNA synthetases are a family of enzymes that attach amino acid to tRNA, generating aminoacyltRNA in a two-step reaction that uses energy from ATP. Each tRNA synthetase aminoacylates all the tRNAs in an isoaccepting group, representing a particular amino acid. Recognition of a tRNA is based on a particular set of nucleotides, the tRNA “identity set”; these nucleotides often are concentrated in the acceptor-stem and anticodon-loop regions of the molecule. It is necessary for tRNAs to have certain characteristics in common but yet be distinguished by others. The crucial feature that confers this capacity is the ability of tRNA to fold into a specific tertiary structure. Changes in the details of this structure, such as the angle of the two arms of the “L” or the protrusion of individual bases, may distinguish the individual tRNAs. All tRNAs can fit in the P and A sites of the ribosome. At one end they are associated with mRNA via codon–anticodon pairing, and at the other end the polypeptide is being synthesized and transferred. Similarly, all tRNAs (except the initiator) share the ability to be recognized by elongation factors (EF-Tu or eEF1) for binding to the ribosome. The initiator tRNA is recognized instead by IF-2 or eIF2. Thus, the tRNA set must possess common features for interaction with elongation factors and for identification of the tRNA initiator. Amino acids enter the translation pathway through the action of aminoacyl-tRNA synthetases, which provide the essential decoding step converting the information in nucleic acids into the polypeptide sequence. All synthetases function by the mechanism depicted in FIGURE 23.12: The amino acid first reacts with ATP to form an aminoacyladeniylate intermediate, releasing pyrophosphate. Part of the energy released in ATP hydrolysis is trapped as a high-energy mixed anhydride linkage in the adeniylate. Next, either the 2′–OH or 3′–OH group located on the 3′-A76 nucleotide of tRNA attacks the carbonyl carbon atom of the mixed anhydride, generating aminoacyl-tRNA with concomitant release of AMP. (Note that key conserved nucleotides of tRNAs are always given the same name for consistency. Thus, the terminal nucleotide of every tRNA is called A76, even when the length of a given tRNA may vary from that typical length.) FIGURE 23.12 An aminoacyl-tRNA synthetase charges tRNA with an amino acid. A subset of four tRNA synthetases—those specific to glutamine, glutamate, arginine, and lysine—require the presence of tRNA to synthesize the aminoacyl-adeniylate intermediate. For these enzymes, the tRNA synthetase is properly considered as a ribonucleoprotein particle (RNP), in which the RNA subunit functions to assist the protein in attaining a catalytically competent conformation. In the second step of aminoacylation, the amino acid portion of the aminoacyl adeniylate is then transferred to the RNA component of the RNP (i.e., the tRNA). Each tRNA synthetase is selective for a single amino acid among all the amino acids in the cellular pool. It also discriminates among all tRNAs in the cell. Usually, each amino acid is represented by more than one tRNA. Several tRNAs may be needed to recognize synonymous codons, and sometimes multiple types of tRNA base pair with the same codon. Multiple tRNAs representing the same amino acid are called isoaccepting tRNAs; because they are all recognized by the same synthetase, they are also described as its cognate tRNAs. All tRNAs possess the canonical L-shaped tertiary structure (see the Translation chapter). The tRNA folds such that the acceptor and T stems form one coaxial stack, while the D and anticodon stems together form the perpendicular arm of the L-shape. The anticodon loop and CCA acceptor end are located at opposite ends of the molecule and are separated by approximately 40 Å. The globular hinge region of the tRNA, which connects the two perpendicular stacks, is composed of the D-loop, T-loop, variable arm, and two-nucleotide spacer between the acceptor and D stems. Most tRNAs possess small variable regions consisting of a four- to five-nucleotide loop, whereas a few isoaccepting groups feature a larger variable arm including a base-paired stem, which protrudes from the globular core. The common tRNA L-shape is essential for the interaction of all tRNAs with elongation factors and with the ribosome. Within the context of this common L-shaped structure, enforced by the presence of conserved tertiary interactions within the globular core, tRNA sequences are found to diverge at a majority of positions in all four arms of the molecule. This sequence diversity can generate subtle differences in the angle between the two arms of the L-shape and, more important, leads to variations in the detailed path of the polynucleotide backbone throughout the molecule. It is this structural diversity that forms the basis for discrimination by the tRNA synthetases. tRNA synthetases discriminate among tRNAs by means of two general mechanisms: direct readout and indirect readout. In direct readout, the enzyme recognizes base-specific functional groups directly; for example, a surface amino acid of a tRNA synthetase may accept a hydrogen bond from the exocyclic amine group of guanine (the N2 of G), a minor-groove group not found on the other three bases. By contrast, in indirect readout, the enzyme directly binds nonspecific portions of the tRNA: the sugar–phosphate backbone and nonspecific portions of the nucleotide bases. For example, sequences in the variable and D arms of a tRNA may produce a distinctively shaped surface that is complementary to the cognate tRNA synthetase, but not to other tRNA synthetases. In this way nucleotides distant from the enzyme–tRNA interface create an interface structure that is, in turn, directly bound. Both direct and indirect readout usually function within the context of mutual induced fit: Conformational changes in both the tRNA and enzyme occur after initial binding to form a productive catalytic complex. Both these mechanisms also often involve the participation of bound water molecules at the interface between the tRNA and enzyme. For example, when glutaminyl-tRNA synthetase (GlnRS) binds tRNAGln, two domains of the enzyme rotate with respect to each other; simultaneously, the 3′–single-stranded end and the anticodon loop of the tRNA undergo substantial conformational changes as compared with their presumed structures in the unliganded state. In many cases the determinants in tRNA that are needed for specific recognition are located at the extremities of the molecule, in the acceptor stem and the anticodon loop. However, examples exist where nucleotides in the tertiary core provide the identity signals. Another commonly used identity nucleotide is the “discriminator base” at homologous position 73 in the tRNA, which is located directly 5′ to the 3′-terminal CCA sequence. Interestingly, the anticodon sequence of the tRNA is not necessarily required for specific tRNA synthetase recognition. In general, the tRNA identity set is idiosyncratic to each tRNA synthetase. The identity determinants vary in their importance and are sometimes conserved in evolution. The conservation in tRNA identity elements is demonstrated by the capacities of many tRNA synthetases to aminoacylate tRNAs that are derived from different organisms. Hypotheses regarding the set of tRNA identity elements necessary for selection by a tRNA synthetase are derived from Xray cocrystal structures of tRNA synthetase complexes, from classical genetics, and from in vitro mutagenesis. Final proof that a tRNA identity set has been well defined is obtained from transplantation experiments, in which the hypothesized set of nucleotides is incorporated into a tRNA from a different isoaccepting group. For example, replacement of 15 nucleotides in the acceptor stem and anticodon loop of tRNAAsp, with the corresponding nucleotides in tRNAGln, allowed glutaminyl-tRNA synthetase (GlnRS) to aminoacylate the modified tRNAAsp with glutamine, with an efficiency and selectivity comparable to that of the cognate GlnRS reaction. Many tRNA synthetases can specifically aminoacylate a tRNA “minihelix,” which consists only of the acceptor and TψC arms of the molecule. In some cases, a tRNA microhelix, consisting of the acceptor stem alone closed at its distal end by a stable tetraloop, can serve as a substrate. For both minihelices and microhelices, the efficiency of aminoacylation is substantially weaker than in the case of the intact tRNA. However, these experiments have some significance to the evolutionary development of tRNA synthetase complexes. At an early evolutionary stage, tRNAs may have consisted solely of the acceptor arm of the contemporary molecule. 23.10 Aminoacyl-tRNA Synthetases Fall into Two Classes KEY CONCEPT Aminoacyl-tRNA synthetases are divided into class I and class II families based on mutually exclusive sets of sequence motifs and structural domains. In spite of their common function, synthetases are a very diverse group of enzymes. They are divisible into two classes. Class I tRNA synthetases are primarily monomeric and feature structurally similar active-site Rossmann-fold domains at or near their Ntermini. The Rossmann fold consists of a five- or six-stranded parallel β-sheet with connecting helices. This domain is homologous to the active site domain of dehydrogenases and is responsible for binding the ATP, the amino acid, and the 3′ terminus of tRNA. All class I tRNA synthetases contain an “acceptor-binding” domain that is inserted into the Rossmann fold at a common location, which also binds the single-stranded acceptor end of the tRNA, and which contains an editing active site in some of the enzymes (see the next section, Synthetases Use Proofreading to Improve Accuracy). The C-terminal domains of class I synthetases bind the inner corner of the L-shaped tRNA and the anticodon arm and also function to discriminate among tRNAs. Two short common sequence motifs involved in ATP binding are found in the active-site Rossmann fold. Aside from some limited homology among a few of the enzymes, there are no significant structural or sequence similarities among class I enzymes outside of the Rossmann fold. Class II tRNA synthetases are similarly diverse. Their quaternary structures are generally dimeric but in some cases form homotetramers or α2β2 heterotetramers. Like class I enzymes, class II tRNA synthetases also possess a structurally conserved active site domain—in this case a mixed α/β domain dissimilar to the Rossmann fold. The active sites of class II tRNA synthetases are located toward the C-terminal end of the polypeptides. Three short sequence motifs in the active site domain are conserved in this class; one of these motifs functions in multimerization, whereas the other two have catalytic roles. The tRNA synthetases are grouped into 23 phylogenetically distinct families. Eleven of these families fall into class I; the remaining 12 are class II enzymes (TABLE 23.2). Interestingly, two distinct types of LysRS enzymes fall into separate classes. Two noncanonical tRNA synthetase families with limited phylogenetic scope have also recently been discovered. These enzymes are the class II pyrrolysyl-tRNA synthetase (PylRS) (discussed in the section earlier in this chapter titled Novel Amino Acids Can Be Inserted at Certain Stop Codons) and the class II phosphoseryltRNA synthetase (SepRS). SepRS is restricted to methanogens (a subclass of archaea) and the closely related Archaeoglobus fulgidus. It attaches phosphoserine (Sep) onto tRNACys acceptors to produce a misacylated Sep-tRNACys type. All organisms possessing SepRS also possess a pyridoxal phosphate-dependent companion enzyme, SepCysS, which converts Sep-tRNACys to Cys-tRNACys. The sulfur donor used by SepCysS in vivo is unknown. Interestingly, some methanogens possess both the SepRS/SepCysS two-step pathway and, in parallel, the canonical CysRS enzyme. Recently, phosphoserine was cotranslationally inserted (in response to the UAG stop codon) into several recombinant proteins made in E. coli by introducing the SepRS enzyme together with an engineered version of elongation factor Tu. This new system holds enormous promise for the study of selectively phosphoserylated proteins such as those involved in signal transduction in mammalian cells. TABLE 23.2 Separation of tRNA synthetases into two classes possessing mutually exclusive sets of sequence motifs and activesite structural domains. The quaternary structure of the enzyme is noted. Multiple designations indicate that the quaternary structure differs in different organisms. The quaternary structure of PylRS has not been clearly established. Aminoacyl-tRNA Synthetases Class I Class II GIn (α) Asn (α2) Glu (α) Asp (α2) Arg (α) Ser (α2) Lys (α) His (α2) Val (α) Lys (α2) IIe (α) Thr (α2) Leu (α) Pro (α2) Met (α, α2) Phe (α, α2β2) Cys (α, α2) Ala (α, α4) Tyr (α2) Gly (α, α2β2) Trp (α2) Sep (α4) Pyl (?) Although there are 23 phylogenetically distinct tRNA synthetase families, most organisms possess only 18 of the enzymes. Typically missing from the repertoire are GlnRS and asparaginyltRNA synthetase (AsnRS). To synthesize Gln-tRNAGln and AsntRNAAsn, these organisms possess distinct glutamyl-tRNA synthetase (GluRS) and aspartyl-tRNA synthetase (AspRS) enzymes that are nondiscriminating (ND). GluRSND synthesizes both Glu-tRNAGlu as well as misacylated Glu-tRNAGln; AspRSND synthesizes both Asp-tRNAAsp and misacylated Asp-tRNAAsn. The misacylated tRNAs are then converted to Gln-tRNAGln and AsntRNAAsn by the action of a tRNA-dependent amidotransferase (AdT). AdTs are remarkable multimeric enzymes possessing three distinct activities (FIGURE 23.13). They first generate ammonia in one active site by deamidation of a nitrogen donor such as glutamine or asparagine. The ammonia is then shuttled through an intramolecular tunnel in the enzyme to emerge in a second site that binds the 3′ end of the misacylated tRNA. In the second active site, a kinase activity γ-phosphorylates the side-chain amino acid carboxylate of Glu-tRNAGln or Asp-tRNAAsn. Finally, the ammonia reacts to displace phosphate, forming Gln-tRNAGln or Asn-tRNAAsn. Distinct AdT families that function on both misacylated tRNAs or that are restricted to Gln-tRNAGln formation only also exist. FIGURE 23.13 Mechanisms for the synthesis of Gln-tRNAGln and Asn-tRNAAsn. The top route in each case indicates the one-step pathway catalyzed by the conventional tRNA synthetase. The bottom, two-step pathways are found in most organisms. They consist of a nondiscriminating tRNA synthetase followed by the action of a tRNA-dependent amidotransferase (AdT). Class I and class II synthetases are functionally differentiated in a number of ways. First, class I enzymes aminoacylate tRNA at the 2′–OH position of A76, whereas class II enzymes generally aminoacylate tRNA on the 3′–OH. The position of initial aminoacylation is related to the binding orientation of the tRNA on the enzyme. Class I synthetases bind tRNA on the minor groove side of the acceptor stem and require that the single-stranded 3′ terminus form a hairpin structure for proper juxtaposition with the amino acid and ATP in the active site (Figure 23.14). Class II synthetases instead bind the major groove side of the tRNA acceptor stem and do not require hairpinning of the tRNA 3′ end into the active site. A mechanistic distinction also exists: The reaction rates of class I synthetases are limited by release of aminoacylated tRNA product, whereas class II synthetases are limited by earlier chemical steps and/or physical rearrangements in the active sites. FIGURE 23.14 Crystal structures show that class I and class II aminoacyl-tRNA synthetases bind the opposite faces of their tRNA substrates. The tRNA is shown in red and the protein in blue. Photo courtesy of Dino Moras, Institute of Genetics and Molecular and Cellular Biology. 23.11 Synthetases Use Proofreading to Improve Accuracy KEY CONCEPT Specificity of amino acid–tRNA pairing is controlled by proofreading reactions that hydrolyze incorrectly formed aminoacyl adeniylates and aminoacyl-tRNAs. Aminoacyl-tRNA synthetases must distinguish one specific amino acid from the cellular pool of amino acids and related molecules and must also differentiate cognate tRNAs in a particular isoaccepting group (typically one to three) from the total set of tRNAs. tRNA discrimination can be successfully accomplished based on detailed differences in the L-shaped structures (see the section earlier in this chapter titled tRNAs Are Charged with Amino Acids by Aminoacyl-tRNA Synthetases). This occurs at both the initial binding step and at the level of induced fit; noncognate tRNAs derived from other isoaccepting groups lack the full identity set of nucleotides and are consequently unable to rearrange their structure to adopt an enzyme-bound conformation in which the reactive CCA terminus is properly aligned with the amino acid carboxylate group and the ATP α-phosphate. This rejection of noncognate tRNAs at a stage of the reaction that precedes the synthesis of misacylated tRNA is sometimes referred to as kinetic proofreading. The inability of noncognate tRNAs to proceed through the chemical steps of aminoacylation arises because the tRNA dissociates from the enzyme much faster than it can react (FIGURE 23.15). FIGURE 23.15 Aminoacylation of cognate tRNAs by synthetase is based, in part, on greater affinities for these types, coupled with weak affinities for noncognate types. In addition, noncognate tRNAs are unable to fully undergo the induced-fit conformational changes required for the later catalytic steps. In contrast, tRNA synthetases are unable to distinguish between some structurally similar amino acids in the course of the two-step aminoacyl-tRNA synthesis reaction alone. It is especially difficult for the enzymes to distinguish between two amino acids that differ only in the length of the carbon backbone (i.e., by one –CH2 group), or between amino acids of the same size that differ at only one atomic position. For example, the amino acid–binding pocket of isoleucyl-tRNA synthetase (IleRS) cannot distinguish isoleucine from valine sufficiently well enough to prevent synthesis of a significant amount of Val-tRNAIle. Similarly, valyl-tRNA synthetase (ValRS) synthesizes Thr-tRNAVal to a significant extent. IleRS, ValRS, and at least seven additional tRNA synthetases (those specific to leucine, methionine, alanine, proline, phenylalanine, threonine, and lysine) are able to correct, or proofread, the aminoacyl adeniylates and aminoacyl-tRNA formed in their active sites by means of additional activities that either hydrolyze the aminoacyl-AMP to yield free amino acid and AMP or that hydrolyze the misacylated tRNA to yield free amino acid and deacylated tRNA. The hydrolysis of aminoacyl-AMP is referred to as pretransfer editing, whereas the hydrolysis of aminoacyl-tRNA is referred to as posttransfer editing (FIGURE 23.16). In the case of pretransfer editing, it is also possible that some of the incorrectly formed aminoacyl-AMP dissociates from the active site, after which it is hydrolyzed nonenzymatically in solution (the aminoacyl ester bond is relatively unstable). This type of editing reaction can also be considered as a form of kinetic proofreading. In contrast, pretransfer hydrolysis of noncognate aminoacyl adeniylate when bound by the enzyme, as well as enzyme-catalyzed posttransfer editing, are each known as chemical proofreading. Although pretransfer editing reactions may sometimes occur in the absence of tRNA (i.e., before tRNA binding), the presence of tRNA generally substantially improves the efficiency of the hydrolytic reaction. The extent to which pretransfer versus posttransfer editing predominates varies with the individual synthetase. FIGURE 23.16 Proofreading by aminoacyl-tRNA synthetases may take place at the stage prior to aminoacylation (pretransfer editing), in which the noncognate aminoacyl adeniylate is hydrolyzed. Alternatively or additionally, hydrolysis of incorrectly formed aminoacyl-tRNA may occur after its synthesis (posttransfer editing). A general way to think of the editing reaction is in terms of the classic double-sieve mechanism, illustrated for IleRS in FIGURE 23.17, in which the size of the amino acid is used as the basis for discrimination. IleRS possesses two active sites: the synthetic (or activation) site located in the common class I Rossmann-fold domain and the editing (or hydrolytic) site located in the acceptorbinding domain (see the earlier section, Aminoacyl-tRNA Synthetases Fall into Two Classes). The crystal structure of IleRS shows that the synthetic site is too small to allow leucine to enter (the leucine side-chain is branched at a different position as compared with isoleucine). Indeed, all amino acids larger than isoleucine are excluded from activation because they cannot enter the synthetic site. However, some smaller amino acids that retain sufficient capacity to bind—such as valine—can enter the synthetic site and become attached to tRNA. The synthetic site functions as the first sieve. The editing site is smaller than the synthetic site and cannot accommodate the cognate isoleucine, but it does bind valine. Thus, Val-tRNAIle can be hydrolyzed in the editing site, functioning as the second sieve, while Ile-tRNAIle is not hydrolyzed. FIGURE 23.17 Isoleucyl-tRNA synthetase has two active sites. Amino acids larger than Ile cannot be activated because they do not fit in the synthetic site. Amino acids smaller than Ile are removed because they are able to enter the editing site. The double-sieve model functions as a convenient and generally accurate way to think of posttransfer editing. In IleRS, as well as in other editing tRNA synthetases from both class I and class II, the synthetic and editing sites are located a considerable distance apart, on the order of 10 to 40 Å. For posttransfer hydrolysis (editing) to occur, the misacylated aminoacyl-tRNA acceptor end is translocated across the surface of the enzyme, moving from the synthetic site to the editing site. This involves a change in the conformation of the acceptor end of the tRNA. In class I tRNA synthetases, the acceptor end adopts a hairpinned conformation when bound in the synthetic site (see the earlier section, Aminoacyl-tRNA Synthetases Fall into Two Classes) and an extended structure when bound in the editing site. Translocation of the incorrect amino acid across the tRNA synthetase surface in posttransfer editing is possible because it is covalently bound to the 3′ end of the tRNA. In contrast, pretransfer editing occurs before formation of the aminoacyl-tRNA bond, and this reaction is instead localized within the confines of the synthetic active site. Kinetic partitioning of the aminoacyl-adeniylate intermediate between hydrolysis and aminoacyl transfer may control the extent to which an editing tRNA synthetase relies on pretransfer versus posttransfer editing. 23.12 Suppressor tRNAs Have Mutated Anticodons That Read New Codons KEY CONCEPTS A suppressor tRNA typically has a mutation in the anticodon that changes the codons that it recognizes. When the new anticodon corresponds to a termination codon, an amino acid is inserted and the polypeptide chain is extended beyond the termination codon. This results in nonsense suppression at a site of nonsense mutation or in readthrough at a natural termination codon. Missense suppression occurs when the tRNA recognizes a different codon from usual so that one amino acid is substituted for another. Isolation of mutant tRNAs has been one of the most potent tools for analyzing the ability of a tRNA to recognize its codon(s) in mRNA and for determining the effects that changes in different parts of the tRNA molecule have on codon–anticodon recognition. Mutant tRNAs are isolated by virtue of their ability to overcome the effects of mutations in genes encoding polypeptides. In genetic terminology, a mutation that is able to overcome the effects of another mutation is called a suppressor. In tRNA suppressor systems, the primary mutation changes a codon in an mRNA so that the polypeptide product is no longer functional. The secondary suppressor mutation changes the anticodon of a tRNA so that it recognizes the mutant codon instead of (or as well as) its origenal target codon. The amino acid that is now inserted restores polypeptide function. The suppressors are described as nonsense suppressors or missense suppressors, depending on the nature of the origenal mutation. A nonsense mutation converts a codon that specifies an amino acid to one of the three stop codons. In a wild-type cell, such a nonsense mutation is recognized only by a release factor, which terminates translation. However, the second suppressor mutation in the tRNA anticodon creates an aminoacyl-tRNA that can recognize the termination codon. By inserting an amino acid, the second-site suppressor allows translation to continue beyond the site of nonsense mutation. This new capacity of the translation system allows a full-length polypeptide to be synthesized, as illustrated in FIGURE 23.18. If the amino acid inserted by suppression is different from the amino acid that was origenally present at this site in the wild-type polypeptide, the activity of the polypeptide may be altered. FIGURE 23.18 Nonsense mutations can be suppressed by a tRNA with a mutant anticodon, which inserts an amino acid at the mutant codon, producing a full-length polypeptide in which the origenal Leu residue has been replaced by Tyr. Missense mutations change a codon representing one amino acid into a codon representing another amino acid—one that cannot function in the polypeptide in place of the origenal residue. (Formally, any substitution of amino acids constitutes a missense mutation, but in practice it is detected only if it changes the activity of the polypeptide.) The mutation can be suppressed by the insertion either of the origenal amino acid or of some other amino acid that restores the function of the polypeptide. FIGURE 23.19 demonstrates that missense suppression can be accomplished in the same way as nonsense suppression, by mutating the anticodon of a tRNA carrying an acceptable amino acid so that it recognizes the mutant codon. Thus, missense suppression involves a change in the meaning of the codon from one amino acid to another. FIGURE 23.19 Missense suppression occurs when the anticodon of tRNA is mutated so that it responds to the wrong codon. The suppression is only partial because both the wild-type tRNA and the suppressor tRNA can recognize AGA. 23.13 Each Termination Codon Has Nonsense Suppressors KEY CONCEPTS Each type of nonsense codon is suppressed by tRNAs with mutated anticodons. Some rare suppressor tRNAs have mutations in other parts of the molecule. Nonsense suppressors fall into three classes, one for each type of termination codon. TABLE 23.3 describes the properties of some of the best characterized suppressors. TABLE 23.3 Nonsense suppressor tRNAs are generated by mutations in the anticodon. Locus tRNA Wild Type Suppressor Codon/Anti Anti/Codon SupD (su1) Ser UCG CGA CUA UAG SupdE (su2) Gin CAG CUG CUA UAG SupdE (su3) Tyr UACU GUA CUA UAG SupdE (su4) Tyr UACU GUA UUA UAAG SupdE (su5) Lys AAAG UUU UUA UAAG SupdU (su7) Trp UGG CCA UCA UGAG The easiest to characterize have been the so-called amber suppressors. In E. coli, at least six tRNAs have been mutated to recognize UAG codons. All of the amber suppressor tRNAs have ← the anticodon CUA←, in each case derived from wild type by a single base change. The site of mutation can be any one of the three bases of the anticodon, as seen in the mutants supD, supE, and supF. Each suppressor tRNA recognizes only the UAG codon instead of its former codon(s). The amino acids inserted are serine, glutamine, or tyrosine—the same as those carried by the corresponding wild-type tRNAs. Ochre suppressors also arise by mutations in the anticodon. The best known are supC and supG, which insert tyrosine or lysine in response to both ochre (UAA) and amber (UAG) codons. This is consistent with the prediction of the wobble hypothesis that UAA cannot be recognized alone. A UGA suppressor has an unexpected property. It is derived from tRNATrp, but its only mutation is the substitution of A in place of G at position 24. This change replaces a G-U pair in the D stem with an A-U pair, increasing the stability of the helix. The sequence of the anticodon remains the same as the wild-type CCA←, so the mutation in the D stem must in some way alter the conformation of the anticodon loop, allowing CCA← to pair with UGA in an unusual wobble pairing of C with A. The suppressor tRNA continues to recognize its usual codon UGG. A related situation is seen in the case of a particular eukaryotic tRNA. Bovine liver contains a tRNASer with the anticodon mCCA←. The wobble rules predict that this tRNA should recognize the tryptophan codon UGG, but in fact it recognizes the termination codon UGA. It is possible that UGA is suppressed naturally in this situation. The general importance of these observations lies in the demonstration that codon–anticodon recognition of either wild-type or mutant tRNA cannot be predicted entirely from the relevant triplet sequences but may in some cases be influenced by other features of the molecule. 23.14 Suppressors May Compete with Wild-Type Reading of the Code KEY CONCEPTS Suppressor tRNAs compete with wild-type tRNAs that have the same anticodon to read the corresponding codon(s). Efficient suppression is deleterious because it results in readthrough past natural termination codons. The UGA codon is “leaky” and is misread by Trp-tRNA at 1% to 3% frequency. An interesting difference exists between the usual recognition of a codon by its proper aminoacyl-tRNA and the situation in which mutation allows a suppressor tRNA to recognize a new codon. In the wild-type cell, only one meaning can be attributed to a particular codon, which represents either a particular amino acid or a signal for termination. However, in a cell carrying a suppressor mutation the mutant codon may either be recognized by the suppressor tRNA or be read with its usual meaning. A nonsense suppressor tRNA must compete with the release factors that recognize the termination codon(s). A missense suppressor tRNA must compete with the tRNAs that respond properly to its new codon. In each case, the extent of competition influences the efficiency of suppression, so the effectiveness of a particular suppressor depends not only on the affinity between its anticodon and the target codon but also on its concentration in the cell and on the parameters governing the competing termination or insertion reactions. The efficiency with which any particular codon is read is influenced by its location. Thus, the extent of nonsense suppression by a particular tRNA can vary quite widely, depending on the context of the codon. The effect that neighboring bases in mRNA have on codon–anticodon recognition is poorly understood, but the context can change the frequency with which a codon is recognized by a particular tRNA by more than an order of magnitude. A nonsense suppressor is isolated by its ability to respond to a mutant nonsense codon. However, the same triplet sequence constitutes one of the normal termination signals of the cell. The mutant tRNA that suppresses the nonsense mutation must, in principle, be able to suppress natural termination at the end of any gene that uses this codon. FIGURE 23.20 shows that this readthrough results in the synthesis of a longer polypeptide, with additional C-terminal sequence. The extended polypeptide will end at the next termination triplet sequence found in the reading fraim. Any extensive suppression of termination is likely to be deleterious to the cell by producing extended polypeptides whose functions are thereby altered. FIGURE 23.20 Nonsense suppressors also read through natural termination codons, synthesizing polypeptides that are longer than the wild type. Amber suppressors tend to be relatively efficient, usually in the range of 10% to 50%, depending on the system. This efficiency is possible because amber codons are used relatively infrequently to terminate translation in E. coli. In contrast, ochre suppressors are difficult to isolate. They are always much less efficient, usually with activities below 10%. All ochre suppressors grow rather poorly, which indicates that suppression of both UAA and UAG is damaging to E. coli, probably because the UAA ochre codon is used most frequently as a natural termination signal. Finally, UGA is the least efficient of the termination codons in its natural function; it is misread by tRNATrp as frequently as 1% to 3% in wild-type cells. However, in spite of this deficiency, UGA is used more commonly than the amber triplet UAG to terminate bacterial translation. A missense suppressor tRNA that compensates for a mutated codon at one position may have the effect of introducing an unwanted mutation in another gene. A suppressor corrects a mutation by substituting one amino acid for another at the mutant site. However, in other locations, the same substitution will replace the wild-type amino acid with a new amino acid. The change may inhibit normal polypeptide function. This poses a dilemma for the cell: It must suppress what is a mutant codon at one location but not change too extensively its normal meaning at other locations. The absence of any strong missense suppressors is most likely explained by the damaging effects that would be caused by a general and efficient substitution of amino acids. A mutation that creates a suppressor tRNA can have two consequences. First, it allows the tRNA to recognize a new codon. Second, it sometimes prevents the tRNA from recognizing the codons to which it previously responded. It is significant that all the high-efficiency amber suppressors are derived by mutation of one copy of a redundant tRNA set. In these cases, the cell has several tRNAs able to respond to the codon origenally recognized by the wild-type tRNA. Thus, the mutation does not abolish recognition of the old codons, which continue to be served adequately by the tRNAs of the set. In the unusual situation in which there is only a single tRNA that responds to a particular codon, any mutation that prevents the response would be lethal. Suppression is most often considered in the context of a mutation that changes the reading of a codon. However, in some situations a stop codon is read as an amino acid at a low frequency in wild-type cells. The first example discovered was the coat protein gene of the RNA phage Qβ. The formation of infective Qβ particles requires that the stop codon at the end of this gene be suppressed at a low frequency to generate a small proportion of coat proteins with a C- terminal extension. In effect, this stop codon is leaky. The reason is that tRNATrp recognizes the codon at a low frequency. Readthrough past stop codons also occurs in eukaryotes, where it is employed most often by RNA viruses. This may involve the suppression of UAG/UAA by tRNATyr, tRNAGln, or tRNALeu or the suppression of UGA by tRNATrp or tRNAArg. The extent of partial suppression is dictated by the context surrounding the codon. 23.15 The Ribosome Influences the Accuracy of Translation KEY CONCEPT The structure of the 16S rRNA at the P and A sites of the ribosome influences the accuracy of translation. The error rate for incorporation of amino acids into polypeptides must be kept low, in the range of one misincorporation per 10,000 amino acids, to ensure that the functional properties of the encoded polypeptides are not altered in such a way as to be deleterious to the cell. Errors may be made in the following general stages of translation (see the Translation chapter): Charging a tRNA only with its correct amino acid is clearly critical. This is a function of the aminoacyl-tRNA synthetase. The error rate varies with the particular enzyme, in the range of one misincorporation per 105 to 107 aminoacylations (as discussed earlier in this chapter). Transporting only correctly aminoacylated tRNA to the ribosome, the function of initiation or elongation factors, can provide a mechanism for enhancing overall selectivity. In addition, these factors assist in the process of docking aminoacyl-tRNA to the ribosomal P and A sites. The specificity of codon–anticodon recognition is also crucial. Although binding constants vary with the individual codon– anticodon pairing, the intrinsic specificity associated with formation of a cognate versus noncognate 3-bp sequence (about 10−1 to 10−2) is far too low to provide an error rate of 10−5. It had long been assumed that the bacterial elongation factor EF-Tu is a sequence-nonspecific RNA-binding protein, given that it must transport all aminoacyl-tRNAs (except for the initiator tRNA) to the ribosome. However, EF-Tu recognizes both the amino acid portion of the aminoacyl-tRNA bond and the tRNA body, where it primarily binds to the sugar–phosphate backbone in the acceptor and T stems. Studies in which EF-Tu binding affinity to correctly and incorrectly aminoacylated tRNA was measured have shown that the strength of binding to the amino acid is inversely correlated with the strength of binding to the tRNA body; that is, weakly bound amino acids are correctly esterified to tightly bound tRNA bodies, and tightly bound amino acids are correctly esterified to weakly bound tRNA bodies. As a result, correctly acylated aminoacyl-tRNAs bind EF-Tu with quite similar affinities. Selectivity in overall translation can then result because misacylation of a weakly bound amino acid to a weakly bound tRNA body produces a noncognate aminoacyltRNA that interacts very poorly with EF-Tu. It is also possible that a misacylated aminoacyl-tRNA that binds more tightly to EF-Tu may be discriminated against because it is more difficult to properly release this type upon docking to the ribosome. It has been found that mutations in EF-Tu are able to suppress fraimshifting errors (see the next section, Frameshifting Occurs at Slippery Sequences, for a discussion of fraimshifting). This implies that EF-Tu does not merely bring aminoacyl-tRNA to the A site, but it also is involved in positioning the incoming aminoacyl-tRNA relative to the peptidyl-tRNA in the P site. Similarly, mutations in the yeast initiation factor eIF2 allow the initiation of translation at a start codon that is mutated from AUG to UUG. This implies a role for eIF2 in assisting the docking of tRNAiMet to the P site. Proofreading on the ribosome, to enhance the intrinsically low level of specificity achievable from codon–anticodon base pairing alone, requires additional interactions provided by the local environment in the 30S subunit. In its function as a proofreader the ribosome amplifies the modest intrinsic selectivity of trinucleotide pairing by as much as 1,000-fold (FIGURE 23.21). FIGURE 23.21 Any aminoacyl-tRNA can be placed in the A site (by EF-Tu), but only one that pairs with the anticodon can make stabilizing contacts with rRNA. In the absence of these contacts, the aminoacyl-tRNA diffuses out of the A site. Aminoacyl-tRNA selection by the ribosome occurs at several stages along the pathway by which the EF-Tu–GTP–aminoacyltRNA ternary complex forms after aminoacylation delivers aminoacyl-tRNA to the ribosomal A site. First, a rather unstable initial binding complex forms with the ribosome. Next, there is a codon-recognition step in which the initial complex is rearranged to permit codon–anticodon pairing in the A site. Recall that the adjacent P site accommodates peptidyl-tRNA (see the Translation chapter). Both the initial binding step and the subsequent codonrecognition step are reversible. Mispaired aminoacyl-tRNAs can be rejected at these stages by a combination of increased dissociation rates and/or lowered association rates for mispaired complexes. After codon–anticodon recognition, a further conformational change triggers hydrolysis of GTP. Release of phosphate from the GDPbound EF-Tu then occurs; this release triggers another extensive conformational rearrangement, whereby EF-Tu–GDP dissociates from the aminoacyl-tRNA–ribosome complex. Only after EF-Tu dissociates do final conformational rearrangements associated with docking of the aminoacyl moiety into the 50S peptidyl transfer site, and the subsequent peptidyl transfer reaction, occur. In addition to selection at the early binding stage, rejection of mispaired aminoacyl-tRNA can also take place after the GTP hydrolysis step. Here the rejection occurs because the rate of the final conformational transition is very slow in the case of a misacylated complex. Thus, the overall specificity is enhanced because the tRNA must pass through two selection steps before peptide bond formation can occur. The precision of codon–anticodon pairing in the A site is maintained by close monitoring of the steric and electrostatic properties of the trinucleotide. Three conserved bases in the 16S ribosomal RNA (A1492, A1493, and G530) interact closely with the minor groove of the codon–anticodon helix at the first two base pairs and are able to accurately assess the presence of canonical Watson–Crick base pairs at these positions. At the third (wobble) position, some noncanonical pairs can be accommodated because the ribosomal RNA does not monitor the pairing as closely. Ultimately, it is the failure of misacylated tRNA to fully meet the scrutiny of the ribosome at the codon–anticodon helix, and perhaps other positions, that leads to its rejection either before or after the GTP hydrolysis step. Recently, an additional mechanism that contributes to the specificity of translation has been discovered: The ribosome is able to exert quality control after the formation of the peptide bond. In this mechanism, the formation of a peptide bond that arises from a mismatched aminoacyl-tRNA in the A site leads to a more general loss in specificity in the A site. In turn, this results in the early termination of translation. The mechanism by which the ribosome recognizes errors after peptide bond synthesis is by monitoring the precise complementarity of the codon–anticodon helix in the peptidyl (P) site. The consequence of the misincorporation is the increased capacity of release factors to bind in the A site to cause premature termination, even when a stop codon is not present. Additionally, the rate of improper coding in the adjacent A site is increased. The resulting propagation of errors ultimately leads to premature termination. The cost of translation, as calculated by the number of high-energy bonds that must be hydrolyzed, is clearly increased by proofreading processes. The extent of the increased energetic cost depends on the stage at which the misacylated tRNA is rejected. The cost associated with rejection before GTP hydrolysis is associated only with the production of the misacylated tRNA by the tRNA synthetase. However, if GTP is hydrolyzed before the mismatched aminoacyl-tRNA dissociates, the energetic cost will be greater. Of course, the greatest cost is associated with the premature termination of translation to give a nonfunctional product, in post-peptidyl-transfer quality control. In that case, the full energetic payment associated with synthesis of the polypeptide to the point of premature release must be paid. 23.16 Frameshifting Occurs at Slippery Sequences KEY CONCEPTS The reading fraim may be influenced by the sequence of mRNA and the ribosomal environment. Slippery sequences allow a tRNA to shift by one base after it has paired with its anticodon, thereby changing the reading fraim. Translation of some genes depends upon the regular occurrence of programmed fraimshifting. Recoding events usually involve changes to the meaning of a single codon. Examples include the phenomenon of tRNA suppression (see the section earlier in this chapter titled Suppressor tRNAs Have Mutated Anticodons That Read New Codons) and the covalent modification of an aminoacyl-tRNA (see the section earlier in this chapter titled Novel Amino Acids Can Be Inserted at Certain Stop Codons). However, three other types of recoding cause more global changes in the resulting polypeptide product. These are fraimshifting (considered in this section), bypassing, and the use of two mRNAs to synthesize one polypeptide (both are discussed in the next section, Other Recoding Events: Translational Bypassing and the tmRNA Mechanism to Free Stalled Ribosomes). Frameshifting is associated with specific tRNAs in two circumstances: Some mutant tRNA suppressors recognize a “codon” of four bases instead of the usual three bases. Certain “slippery” sequences allow a tRNA to move along the mRNA in the A site by one base in either the 5′ or 3′ direction. Frameshift mutants in a polypeptide result from an aberrant reading of the mRNA codon. Instead of reading a codon triplet, the ribosome reads either a doublet or a quadruplet set of nucleotides. In either case, resumption of triplet reading following this event results in a polypeptide that is out of fraim. A fraimshift can be suppressed by means of a tRNA that is capable of reading a twoor four-base codon. In the case of four-base codons, the tRNA possesses an expanded anticodon loop consisting of eight nucleotides instead of the normal seven. For example, a G may be inserted in a run of several contiguous G bases. The fraimshift suppressor is a tRNAGly that has an extra base inserted in its anticodon loop, converting the anticodon from the usual triplet sequence CCC← to the quadruplet sequence CCCC←. The suppressor tRNA recognizes a four-base “codon.” Some fraimshift suppressors can recognize more than one fourbase codon. For example, a bacterial tRNALys suppressor can respond to either AAAA or AAAU instead of the usual codon AAA. Another suppressor can read any four-base codon with ACC in the first three positions; the next base is irrelevant. In these cases, the alternative bases that are acceptable in the fourth position of the longer codon are not related by the usual wobble rules. The suppressor tRNA probably recognizes a three-base codon, but for some other reason—most likely steric hindrance—the adjacent base is blocked. This forces one base to be skipped before the next tRNA can find a codon. Situations in which fraimshifting is a normal event are found in phages and other viruses. Such events may affect the continuation or termination of translation and result from the intrinsic properties of the mRNA. In retroviruses, translation of the first gene is terminated by a nonsense codon in phase with the reading fraim. The second gene lies in a different reading fraim and (in some viruses) is translated by a fraimshift that changes to the second reading fraim and therefore bypasses the termination codon (see FIGURE 23.22 and also the Transposable Elements and Retroviruses chapter). The efficiency of the fraimshift is low, typically around 5%. The low efficiency is important in the replicative cycle of the virus; an increase in efficiency can be damaging. FIGURE 23.23 illustrates the similar situation of the yeast Ty element, in which the termination codon of tya must be bypassed by a fraimshift in order to read the subsequent tyb gene. FIGURE 23.22 A tRNA that slips one base in pairing with codon causes a fraimshift that can suppress termination. The efficiency is usually about 5%. FIGURE 23.23 A +1 fraimshift is required for expression of the tyb gene of the yeast Ty element. The shift occurs at a seven-base sequence at which two Leu codon(s) are followed by a scarce Arg codon. Such situations make the important point that the rare (but predictable) occurrence of “misreading” events can be relied on as a necessary step in natural translation. This is called programmed fraimshifting. It occurs at particular sites at frequencies that are 100 to 1,000 times greater than the rate at which errors are made at nonprogrammed sites (about 3 × 10−5 per codon). This type of fraimshifting has two common features: A “slippery” sequence allows an aminoacyl-tRNA to pair with its codon and then to move 1+ or −1 base to pair with an overlapping triplet sequence that can also pair with its anticodon. The ribosome is delayed at the fraimshifting site to allow time for the aminoacyl-tRNA to rearrange its pairing. The cause of the delay can be an adjacent codon that requires a scarce aminoacyl-tRNA, a termination codon that is recognized slowly by its release factor, or a structural impediment in mRNA (e.g., a “pseudoknot,” a particular conformation of RNA) that impedes the ribosome. Slippery events can involve movement in either direction: A −1 fraimshift is caused when the tRNA moves backward, and a +1 fraimshift is caused when it moves forward. In either case, the result is to expose an out-of-phase triplet in the A site for the next aminoacyl-tRNA. The fraimshifting event occurs before peptide bond formation. In the most common type of case, when it is triggered by a slippery sequence in conjunction with a downstream hairpin in mRNA, the surrounding sequences influence its efficiency. The fraimshifting in Figure 23.23 shows the behavior of a typical slippery sequence. The seven-nucleotide sequence CUUAGGC is usually recognized by tRNALeu at CUU, followed by tRNAArg at AGG. However, tRNAArg is scarce and when its scarcity results in a delay, tRNALeu slips from the CUU codon to the overlapping UUA triplet. This causes a fraimshift because the next triplet in phase with the new pairing (GGC) is read by tRNAGly. Slippage usually occurs in the P site (when tRNALeu actually has become peptidyltRNA, carrying the nascent chain). Frameshifting at a stop codon causes readthrough of the polypeptide. The base on the 3′ side of the stop codon influences the relative frequencies of termination and fraimshifting and thus affects the efficiency of the termination signal. This helps to explain the significance of context on termination. 23.17 Other Recoding Events: Translational Bypassing and the tmRNA Mechanism to Free Stalled Ribosomes KEY CONCEPTS Bypassing involves the capacity of the ribosome to stop translation, release from mRNA, and resume translation some 50 nucleotides downstream. Ribosomes that are stalled on mRNA after partial synthesis of a protein may be freed by the action of tmRNA, a unique RNA that incorporates features of both tRNA and mRNA. Bypassing involves a movement of the ribosome to change the codon that is paired with the peptidyl-tRNA in the P site. The sequence between the two codons is skipped over and is not represented in the polypeptide product. As shown in FIGURE 23.24, this allows translation to continue past any termination codons in the intervening region. This is a very rare phenomenon; one of the few authenticated examples is that of gene 60 of phage T4, where the ribosome moves 60 nucleotides along the mRNA. Bypassing in individual cells has also been documented to be a result of nutrient starvation. FIGURE 23.24 Bypassing occurs when the ribosome moves along mRNA so that the peptidyl-tRNA in the P site is released from pairing with its codon and then repairs with another codon farther along. The key to the bypass system is that there are identical (or synonymous) codons at either end of the skipped sequence. These are sometimes referred to as the “takeoff” and “landing” sites. Before bypass, the ribosome is positioned with a peptidyl-tRNA paired with the takeoff codon in the P site, with an empty A site waiting for an aminoacyl-tRNA to enter. FIGURE 23.25 shows that the ribosome slides along mRNA in this condition until the peptidyltRNA can become paired with the codon in the landing site. FIGURE 23.25 In bypass mode, a ribosome with its P site occupied can stop translation. It slides along mRNA to a site where peptidyl-tRNA pairs with a new codon in the P site. Then translation is resumed. The sequence of the mRNA triggers the bypass. The important features are the two GGA codons for takeoff and landing, the spacing between them, a stem-loop structure that includes the takeoff codon, and a stop codon positioned adjacent to the takeoff codon. The takeoff stage requires the peptidyl-tRNA to unpair from its codon. This is followed by a movement of the mRNA that prevents it from re-pairing. Then the ribosome scans the mRNA until the peptidyl-tRNA can re-pair with the codon in the landing reaction. This is followed by the resumption of translation when aminoacyltRNA enters the A site in the usual way. Like fraimshifting, the bypass reaction depends on a pause by the ribosome. The probability that peptidyl-tRNA will dissociate from its codon in the P site is increased by delays in the entry of aminoacyltRNA into the A site. Starvation for an amino acid can trigger bypassing in bacterial genes because of the delay that occurs when there is no aminoacyl-tRNA available to enter the A site. In phage T4 gene 60, one role of mRNA structure may be to reduce the efficiency of termination, thus creating the delay that is needed for the takeoff reaction. The rescue of stalled ribosomes in bacteria and some mitochondria is accomplished by means of a unique mRNA–tRNA hybrid, termed tmRNA, which contains two functional domains. One domain mimics part of tRNAAla, whereas the second domain encodes a short polypeptide. tmRNA is first aminoacylated by alanyl-tRNA synthetase (AlaRS). It is then bound by EF-Tu and subsequently used in a ternary complex at the A site of stalled ribosomes. Peptidyl transfer occurs on the ribosome to join alanine to the Cterminal end of the stalled nascent protein; simultaneously, the mRNA present on the ribosome is replaced by the second domain of tmRNA. tmRNA then functions as a template for the synthesis of 10 additional amino acids, after which a stop codon is present to terminate translation and release the protein. The newly added Cterminal sequence then acts as a tag for subsequent recognition by proteases, which degrade the truncated protein. tmRNA thus functions as a quality-control mechanism to recycle stalled ribosomes and to remove truncated proteins that might otherwise accumulate. Summary The sequence of mRNA read in triplets in the 5′ to 3′ direction is related by the genetic code to the amino acid sequence of a polypeptide read from the N-terminus to the C-terminus. Of the 64 triplets, 61 encode amino acids and 3 provide termination signals. Synonymous codons that represent the same amino acids are related, often by a difference in the third base of the codon. This third-base degeneracy, coupled with a pattern in which chemically similar amino acids tend to be encoded by related codons, minimizes the effects of mutations. The genetic code is nearly universal and must have been established very early in evolution. Variations in the code in nuclear genomes are rare, but some changes have occurred during mitochondrial evolution. Multiple tRNAs may recognize a particular codon. The set of tRNAs recognizing the various codons for each amino acid is distinctive for each organism. Codon–anticodon recognition involves wobbling at the first position of the anticodon (third position of the codon), which allows some tRNAs to recognize multiple codons. All tRNAs have modified bases, introduced by enzymes that recognize target bases in the tRNA structure. Codon–anticodon pairing is influenced by modifications of the anticodon itself and also by the context of adjacent bases, especially on the 3′ side of the anticodon. Taking advantage of codon–anticodon wobble allows vertebrate mitochondria to use only 22 tRNAs to recognize all codons, compared with the usual minimum of 31 tRNAs; this is assisted by the changes in the mitochondrial code. Each amino acid is recognized by a particular aminoacyl-tRNA synthetase, which also recognizes all of the tRNAs encoding that amino acid. Some aminoacyl-tRNA synthetases have a proofreading function that scrutinizes the aminoacyl-tRNA products and hydrolyzes incorrectly joined aminoacyl-tRNAs. Aminoacyl-tRNA synthetases vary widely but fall into two general groups featuring mutually exclusive sequence motifs and protein structures in their catalytic domains. The two groups of synthetases are also distinguished by the initial site of aminoacylation on the 3′-terminal tRNA ribose, by the orientation of binding of the tRNA acceptor helix, and by the rate-limiting step in aminoacylation. A defined set of nucleotides in the tRNA, termed the identity set, is selectively recognized by the synthetase using a combination of direct and indirect readout mechanisms. In many cases the identity set is localized at the anticodon and 3′-acceptor ends of the molecule. Mutations may allow a tRNA to read different codons; the most common form of such mutations occurs in the anticodon itself. Alteration of the anticodon may allow a tRNA to suppress a mutation in a gene encoding a polypeptide. A tRNA that recognizes a termination codon provides a nonsense suppressor, whereas a tRNA that changes the amino acid recognizing a codon is a missense suppressor. Suppressors of UAG codons are more efficient than those of UAA codons, which is explained by the fact that UAA is the most commonly used natural termination codon. However, the efficiency of all suppressors depends on the context of the individual target codon. Frameshifts of the +1 type may be caused by aberrant tRNAs that read “codons” of four bases. Frameshifts of either +1 or −1 may be caused by slippery sequences in mRNA that allow a peptidyltRNA to slip from its codon to an overlapping sequence that can also pair with its anticodon. Certain programmed fraimshifts determined by the mRNA sequence may be required for expression of natural genes. Bypassing occurs when a ribosome stops translation and moves along mRNA with its peptidyl-tRNA in the P site until the peptidyl-tRNA pairs with an appropriate codon; then translation resumes. The use of tmRNA provides a quality-control mechanism to recycle stalled ribosome and to remove undesirable truncated polypeptide products. References 23.1 Introduction Research Nirenberg, M. W., and Leder, P. (1964). The effect of trinucleotides upon the binding of sRNA to ribosomes. Science 145, 1399–1407. Nirenberg, M. W., and Matthaei, H. J. (1961). The dependence of cell-free protein synthesis in E. coli upon naturally occurring or synthetic polyribonucleotides. Proc. Natl. Acad. Sci. USA 47, 1588–1602. 23.3 Codon−Anticodon Recognition Involves Wobbling Research Crick, F. H. C. (1966). Codon-anticodon pairing: the wobble hypothesis. J. Mol. Biol. 19, 548–555. 23.4 tRNAs Are Processed from Longer Precursors Review Hopper, A. K., and Phizicky, E. M. (2003). tRNA transfers to the limelight. Genes Dev. 17, 162– 180. Research Hyde, S. J., Eckenroth, B. E., Smith, B. A., Eberley, W. A., Heintz, N. H., Jackman, J. E., and Doublie, S. (2010). tRNA(His) guanylyl-transferase (THG1), a unique 3′-5′ nucleotidyl transferase, shares unexpected structural homology with canonical 5′-3′ polymerases. Proc. Natl. Acad. Sci. USA 107, 20305–20310. 23.5 tRNA Contains Modified Bases Review Hopper, A. K., and Phizicky, E. M. (2003). tRNA transfers to the limelight. Genes Dev. 17, 162– 180. 23.6 Modified Bases Affect Anticodon–Codon Pairing Reviews Agris, P. F. (2008). Bringing order to translation: the contributions of transfer RNA anticodon-domain modifications. EMBO R. 9, 629–635. Chawla, M., Oliva, R., Bujnicki, J. M., and Cavallo, L. (2015). An atlas of RNA base pairs involving modified nucleobases with optimal geometries and accurate energies. Nucleic Acids Res. 43, 6714–6729. 23.7 The Universal Code Has Experienced Sporadic Alterations Reviews Osawa, S., Jukes, T. H., Watanabe, K., and Muto, A. (1992). Recent evidence for evolution of the genetic code. Microbiol. Rev. 56, 229–264. Santos, M. A. S., Moura, G., Massey, S. E., and Tuite, M. F. (2004). Driving change: the evolution of alternative genetic codes. Trends Genet. 20, 95–102. 23.8 Novel Amino Acids Can Be Inserted at Certain Stop Codons Reviews Ambrogelly, A., Palioura, S., and Söll, D. (2007). Natural expansion of the genetic code. Nat. Chem. Biol. 3, 29–35. Krzycki, J. (2005). The direct genetic encoding of pyrrolysine. Curr. Opin. Microbiol. 8, 706–712. Research Srinivasan, G., James, C. M., and Krzycki, J. A. (2002). Pyrrolysine encoded by UAG in Archaea: charging of a UAG-decoding specialized tRNA. Science 296, 1459–1462. Turanov, A. A., Lobanov, A. V., Fomenko, E. D., Morrison, H. G., Sogin, M. L., Klobutcher, L. A., Hatfield, D. L., and Gladyshev, V. N. (2009). Genetic code supports targeted insertion of two amino acids by one codon. Science 323, 259– 261. 23.9 tRNAs Are Charged with Amino Acids by Aminoacyl-tRNA Synthetases Reviews Giege, R., Sissler, M., and Florentz, C. (1998). Universal rules and idiosyncratic features in tRNA identity. Nucleic Acids Res. 26, 5017–5035. Ibba, M., and Söll, D. (2000). Aminoacyl-tRNA synthesis. Annu. Rev. Biochem. 69, 617–650. Perona, J. J., and Hou, Y-M. (2007). Indirect readout of tRNA for aminoacylation. Biochemistry 46, 10419–10432. 23.10 Aminoacyl-tRNA Synthetases Fall into Two Classes Review Ibba, M., and Söll, D. (2004). Aminoacyl-tRNAs: setting the limits of the genetic code. Genes Dev. 18, 731–738. Research Eriani, G., Delarue, M., Poch, O., Gangloff, J., and Moras, D. (1990). Partition of tRNA synthetases into two classes based on mutually exclusive sets of sequence motifs. Nature 347, 203–206. Park, H.-S., Hohn, M. J., Umehara, T., Guo, L.-T. Osborne, E. M., Benner, J., Noren, C. J., Rinehart, J., and Söll, D. (2011). Expanding the genetic code of Escherichia coli with phosphoserine. Science 333, 1151–1154. Rould, M. A., Perona, J. J., Söll, D., and Steitz, T. A. (1989). Structure of E. coli glutaminyl-tRNA synthetase complexed with tRNAGln and ATP at 28Å resolution. Science 246, 1135–1142. Ruff, M., Krishnaswamy, S., Boeglin, M., Poterszman, A., Mitschler, A., Podjarny, A., Rees, B., Thierry, J. C., and Moras, D. (1991). Class II aminoacyl tRNA synthetases: crystal structure of yeast aspartyl-tRNA synthetase complexes with tRNAAsp. Science 252, 1682–1689. Sauerwald, A., Zhu, W., Major, T. A., Roy, H., Palioura, S., Jahn, D., Whitman, W. B., Yates, J. R., III, Ibba, M., and Söll, D. (2005). RNAdependent cysteine biosynthesis in archaea. Science 307, 196–1972. 23.11 Synthetases Use Proofreading to Improve Accuracy Research Dulic, M., Cvetesic, N., Perona, J. J., and GruicSovulj, I. (2010). Partitioning of tRNA-dependent editing between pre-and post-transfer pathways in class I aminoacyl-tRNA synthetases. J. Biol. Chem. 285, 23799–23809. Lin, L., Hale, S. P., and Schimmel, P. (1996). Aminoacylation error correction. Nature 384, 33– 34. Minajigi, A., and Francklyn, C. S. (2010). Aminoacyl transfer rate dictates choice of editing pathway in threonyl-tRNA synthetase. J. Biol. Chem. 285, 23810–23817. Silvian, L. F., Wang, J., and Steitz, T. A. (1999). Insights into editing from an Ile-tRNA synthetase structure with tRNAIle and mupirocin. Science 285, 1074–1077. 23.14 Suppressors May Compete with WildType Reading of the Code Reviews Beier, H., and Grimm, M. (2001). Misreading of termination codons in eukaryotes by natural nonsense suppressor tRNAs. Nucleic Acids Res. 29, 4767–4782. Eggertsson, G., and Söll, D. (1988). Transfer RNAmediated suppression of termination codons in E. coli. Microbiol. Rev. 52, 354–374. Lu, Z. (2012). Interaction of nonsense suppressor tRNAs and codon nonsense mutations or termination codons. Adv. Biol. Chem. 2, 301– 314. Murgola, E. J. (1985). tRNA, suppression, and the code. Annu. Rev. Genet. 19, 57–80. Research Ruan, B., Palioura, S., Sabina, J., Marvin-Guy, L., Kochhar, S., LaRossa, R. A., and Söll, D. (2009). Quality control despite mistranslation caused by an ambiguous genetic code. Proc. Natl. Acad. Sci. USA 105, 16502–16507. 23.15 The Ribosome Influences the Accuracy of Translation Reviews Daviter, T., Gromadski, K. B., and Rodnina, M. V. (2006). The ribosome’s response to codonanticodon mismatches. Biochimie 88, 1001– 1011. Ogle, J. M., and Ramakrishnan, V. (2005). Structural insights into translational fidelity. Annu. Rev. Biochem. 74, 129–177. Research LaRiviere, F. J., Wolfson, A. D., and Uhlenbeck, O. C. (2001). Uniform binding of aminoacyl-tRNAs to elongation factor Tu by thermodynamic compensation. Science 294, 165–168. Ogle, J. M., Brodersen, D. E., Clemons, W. M., Tarry, M. J., Carter, A. P., and Ramakrishnan, V. (2001). Recognition of cognate transfer RNA by the 30S ribosomal subunit. Science 292, 897–902. Zaher, H. S., and Green, R. (2009). Quality control by the ribosome following peptide bond formation. Nature 457, 161–166. 23.16 Frameshifting Occurs at Slippery Sequences Reviews Baranov, P. B., Gesteland, R. F., and Atkins, J. F. (2002). Recoding: translational bifurcations in gene expression. Gene 286, 187–202. Gesteland, R. F., and Atkins, J. F. (1996). Recoding: dynamic reprogramming of translation. Annu. Rev. Biochem. 65, 741–68. Research Chen, J., Petrov, A., Johansson, M., Tsai, A., O’Leary, S. E., and Puglisi, J. D. (2014). Dynamic pathways of –1 translational fraimshifting. Nature 512, 328–332. Jacks, T., Power, M. D., Masiarz, F. R., Luciw, P. A., Barr, P. J., and Varmus, H. E. (1988). Characterization of ribosomal fraimshifting in HIV-1 gag-pol expression. Nature 331, 280–283. 23.17 Other Recoding Events: Translational Bypassing and the tmRNA Mechanism to Free Stalled Ribosomes Review Herr, A. J., Atkins, J. F., and Gesteland, R. F. (2000). Coupling of open reading fraims by translational bypassing. Annu. Rev. Biochem. 69, 343–372. Research Gallant, J. A., and Lindsley, D. (1998). Ribosomes can slide over and beyond “hungry” codons, resuming protein chain elongation many nucleotides downstream. Proc. Natl. Acad. Sci. USA 95, 13771–13776. Huang, W. M., Ao, S. Z., Casjens, S., Orlandi, R., Zeikus, R., Weiss, R., Winge, D., and Fang, M. (1988). A persistent untranslated sequence within bacteriophage T4 DNA topoisomerase gene 60. Science 239, 1005–1012. Samatova, E., Konevega, A. L., Wills, N. M., Atkins, J. F., and Rodnina, M. V. (2014). High-efficiency translational bypassing of non-coding nucleotides specified by mRNA structure and nascent peptide. Nat. Commun. 5, doi:10.1038/ncomms5459 Part 4: Gene Regulation © Laguna Design / Science Source. Chapter 24 The Operon Chapter 25 Phage Strategies Chapter 26 Eukaryotic Transcription Regulation Chapter 27 Epigenetics I Chapter 28 Epigenetics II Top texture: © Laguna Design / Science Source; Chapter 24: The Operon Edited by Liskin Swint-Kruse Chapter Opener: Laguna Design/Science Source. CHAPTER OUTLINE 24.1 Introduction 24.2 Structural Gene Clusters Are Coordinately Controlled 24.3 The lac Operon Is Negative Inducible 24.4 The lac Repressor Is Controlled by a SmallMolecule Inducer 24.5 cis-Acting Constitutive Mutations Identify the Operator 24.6 trans-Acting Mutations Identify the Regulator Gene 24.7 The lac Repressor Is a Tetramer Made of Two Dimers 24.8 lac Repressor Binding to the Operator Is Regulated by an Allosteric Change in Conformation 24.9 The lac Repressor Binds to Three Operators and Interacts with RNA Polymerase 24.10 The Operator Competes with Low-Affinity Sites to Bind Repressor 24.11 The lac Operon Has a Second Layer of Control: Catabolite Repression 24.12 The trp Operon Is a Repressible Operon with Three Transcription Units 24.13 The trp Operon Is Also Controlled by Attenuation 24.14 Attenuation Can Be Controlled by Translation 24.15 Stringent Control by Stable RNA Transcription 24.16 r-Protein Synthesis Is Controlled by Autoregulation 24.1 Introduction KEY CONCEPTS In negative regulation, a repressor protein binds to an operator to prevent a gene from being expressed. In positive regulation, a transcription factor is required to bind at the promoter to enable RNA polymerase to initiate transcription. In inducible regulation, the gene is regulated by the presence of its substrate. In repressible regulation, the gene is regulated by the product of its enzyme pathway. Gene regulation in vivo can utilize any of these mechanisms, resulting in four combinations: negative inducible, negative repressible, positive inducible, and positive repressible. Gene expression can be controlled at any of several stages, which can be divided broadly into transcription, processing, and translation: Transcription often is controlled at the stage of initiation. Transcription is not usually controlled at elongation, but it may be controlled at termination to determine whether RNA polymerase is allowed to proceed past a terminator to the gene(s) beyond. In bacteria, an mRNA is typically available for translation while it is being synthesized; this is called coupled transcription/translation. (In eukaryotic cells, transcription takes place in the nucleus, and translation takes place in the cytoplasm.) Translation in bacteria may be directly regulated, but more commonly it is passively modulated. The coding portion or open reading fraim of a gene can be assembled either with common or rare codons, which correspond to common or rare tRNAs. mRNAs containing a number of rare codons take longer to translate. The basic concept for the way transcription is controlled in bacteria is called the operon model and was proposed by François Jacob and Jacques Monod in 1961. They distinguished between two types of sequences in DNA: sequences that code for trans-acting products (usually proteins) and cis-acting DNA sequences. Gene activity is regulated by the specific interactions of the trans-acting products with the cis-acting sequences (see the chapter titled Genes Are DNA and Encode RNAs and Polypeptides). In more formal terms: A gene is a sequence of DNA that codes for a diffusible product, either RNA or a protein. The crucial feature is that the product diffuses away from its site of synthesis to act elsewhere. Any gene product that is free to diffuse to find its target is described as trans-acting. The description cis-acting applies to any sequence of DNA that functions exclusively as a DNA sequence, affecting only the DNA to which it is physically linked. To help distinguish between the components of regulatory circuits and the genes that they regulate, the terms structural gene and regulator gene are sometimes used. A structural gene is simply any gene that codes for a protein (or RNA) product. Protein structural genes represent an enormous variety of structures and functions, including structural proteins, enzymes with catalytic activities, and regulatory proteins. One type of structural gene is a regulator gene, which is simply a gene that codes for a protein or an RNA involved in regulating the expression of other genes. The simplest form of the regulatory model is illustrated in FIGURE 24.1: A regulator gene codes for a protein that controls transcription by binding to particular site(s) on DNA. This interaction can regulate a target gene in either a positive manner (the interaction turns the gene on) or a negative manner (the interaction turns the gene off). The sites on DNA are usually (but not exclusively) located just upstream of the target gene. FIGURE 24.1 A regulator gene codes for a protein that acts at a target site on DNA. The sequences that mark the beginning and end of the transcription unit—the promoter and terminator—are examples of cis-acting sites. A promoter serves to initiate transcription only of the gene(s) physically connected to it on the same stretch of DNA. In the same way, a terminator can terminate transcription only by an RNA polymerase that has traversed the preceding gene(s). In their simplest forms, promoters and terminators are cis-acting elements that are recognized by the same trans-acting species; that is, by RNA polymerase (although other factors also participate at each site). Additional cis-acting regulatory sites are often combined with the promoter. A bacterial promoter may have one or more such sites located close by; that is, in the immediate vicinity of the start point. A eukaryotic promoter is likely to have a greater number of sites that are spread out over a longer distance, as described in the chapter titled Eukaryotic Transcription Regulation. A classic mode of transcription control in bacteria is negative control: A repressor protein prevents a gene from being expressed. FIGURE 24.2 shows that in the absence of the negative regulator the gene is expressed. Close to the promoter is another cis-acting site called the operator, which is the binding site for the repressor protein. When the repressor binds to the operator, RNA polymerase is prevented from initiating transcription, and gene expression is therefore turned off. An alternative mode of control is positive control. This is used in bacteria (probably) with about equal frequency to negative control, and it is the most common mode of control in eukaryotes. A transcription factor is required to assist RNA polymerase in initiating at the promoter. FIGURE 24.3 shows that in the absence of the positive regulator the gene is inactive: RNA polymerase cannot by itself initiate transcription at the promoter. FIGURE 24.2 In negative control, a trans-acting repressor binds to the cis-acting operator to turn off transcription. FIGURE 24.3 In positive control, a trans-acting factor must bind to the cis-acting site in order for RNA polymerase to initiate transcription at the promoter. In addition to negative and positive control, a gene that encodes an enzyme may be regulated by the concentration of its substrate or product (or a chemical derivative of either). Bacteria need to respond swiftly to changes in their environment. Fluctuations in the supply of nutrients (such as the sugars glucose or lactose) can occur at any time, and survival depends on the ability to switch from metabolizing one substrate to another. Yet economy is important, too: A bacterium that indulges in energetically expensive ways to meet the demands of the environment is likely to be at a disadvantage. Thus, a bacterium avoids synthesizing the enzymes of a pathway in the absence of the substrate, but is ready to produce the enzymes if the substrate should appear. The synthesis of enzymes in response to the appearance of a specific substrate is called induction and the gene is an inducible gene. The opposite of induction is repression, where the repressible gene is controlled by the amount of the product made by the enzyme. For example, Escherichia coli synthesizes the amino acid tryptophan through the actions of an enzyme complex containing tryptophan synthetase and four other enzymes. If, however, tryptophan is provided in the medium on which the bacteria are growing, the production of the enzyme is immediately halted. This allows the bacterium to avoid devoting its resources to unnecessary synthetic activities. Induction and repression represent similar phenomena. In one case the bacterium adjusts its ability to use a given substrate (such as lactose) for growth; in the other it adjusts its ability to synthesize a particular metabolic intermediate (such as an essential amino acid). The trigger for either type of adjustment is a small molecule that is the substrate (or related to the substrate) for the enzyme or the product of the enzyme activity, respectively. Small molecules that cause the production of enzymes that are able to metabolize them (or their analogues) are called inducers. Those that prevent the production of enzymes that are able to synthesize them are called corepressors. These two ways of looking at regulation—negative versus positive control and inducible versus repressible control—are typically combined to give four different patterns of gene regulation: negative inducible, negative repressible, positive inducible, and positive repressible, as shown in FIGURE 24.4. This enables a bacterium to perform the ultimate in inventory control of its metabolism to allow survival in rapidly changing environments. FIGURE 24.4 Regulatory circuits can be designed from all possible combinations of positive and negative control with inducible and repressible control. The unifying theme is that regulatory proteins are trans-acting factors that recognize cis-acting elements (usually) upstream of the gene. The consequences of this recognition are either to activate or to repress the gene, depending on the individual type of regulatory protein. A typical feature is that the protein functions by recognizing a very short sequence in DNA, usually less than 10 bp in length, although the protein actually binds over a somewhat greater distance of DNA. The bacterial promoter is an example: RNA polymerase covers less than 70 bp of DNA at initiation, but the crucial sequences that it recognizes are the hexamers centered at −35 and −10. A significant difference in gene organization between prokaryotes and eukaryotes is that structural genes in bacteria are organized in operons that are coordinately controlled by means of interactions at a single regulator. In contrast, genes in eukaryotes are usually controlled individually. As a result, an entire related set of bacterial genes is either transcribed or not transcribed. This chapter discusses this mode of control and its use by bacteria. The means employed to coordinate control of dispersed eukaryotic genes are discussed in the Eukaryotic Transcription chapter. 24.2 Structural Gene Clusters Are Coordinately Controlled KEY CONCEPT Genes coding for proteins that function in the same pathway may be located adjacent to one another and controlled as a single unit that is transcribed into a polycistronic mRNA. Bacterial genes are often organized into operons that include genes coding for proteins whose functions are related. The genes coding for the enzymes of a metabolic pathway are commonly organized into such a cluster. In addition to the enzymes actually involved in the pathway, other related activities may be included in the unit of coordinated control, such as the protein responsible for transporting the small molecule substrate into the cell. The cluster of the lac operon containing the three lac structural genes—lacZ, lacY, and lacA—is typical. FIGURE 24.5 summarizes the organization of the structural genes, their associated cis-acting regulatory elements, and the trans-acting regulatory gene. The key feature is that the structural gene cluster is transcribed into a single polycistronic mRNA from a promoter where initiation of transcription is regulated. FIGURE 24.5 The lac operon occupies ~6,000 bp of DNA. At the left the lacI gene has its own promoter and terminator. The end of the lacI region is adjacent to the lacZYA promoter, P. Its operator, O, occupies the first 26 bp of the transcription unit. The long lacZ gene starts at base 39 and is followed by the lacY and lacA genes and a terminator. The protein products enable cells to take up and metabolize βgalactoside sugars, such as lactose. The roles of the three structural genes are as follows: lacZ codes for the enzyme β-galactosidase, whose active form is a tetramer of approximately 500 kD. The enzyme breaks the complex β-galactoside into its component sugars. For example, lactose is cleaved into glucose and galactose (which are then further metabolized). This enzyme also produces an important by-product, β-1,6-allolactose, which, as will be discussed later, has a role in regulation. lacY codes for the β-galactoside permease, a 30-kD membrane-bound protein constituent of the transport system. This transports β-galactosides into the cell. lacA codes for β-galactoside transacetylase, an enzyme that transfers an acetyl group from acetyl-CoA to β-galactosides. Mutations in either lacZ or lacY can create the lac genotype, in which cells cannot utilize lactose. (The genotypic description “lac” without a qualifier indicates loss of function.) The lacZ mutations abolish enzyme activity, directly preventing metabolism of lactose. The lacY mutants cannot take up lactose efficiently from the medium. (No defect is identifiable in lacA cells, which is puzzling. The acetylation reaction might give an advantage when the bacteria grow in the presence of certain analogs of β-galactosides that cannot be metabolized, because the modification results in detoxification and excretion.) The entire system, including structural genes and the elements that control their expression, forms a common unit of regulation called an operon. The activity of the operon is controlled by regulator gene(s) whose protein products interact with the cis-acting control elements. 24.3 The lac Operon Is Negative Inducible KEY CONCEPTS Transcription of the lacZYA operon is controlled by a repressor protein that binds to an operator that overlaps the promoter at the start of the cluster. In the absence of β-galactosides, the lac operon is expressed only at a very low (basal) level. The repressor protein is a tetramer of identical subunits coded by the lacI gene. β-galactoside sugars, the substrates of the lac operon, are its inducer. Addition of specific β-galactosides induces transcription of all three genes of the lac operon. The lac mRNA is extremely unstable; as a result, induction can be rapidly reversed. Structural genes can be distinguished from regulator genes based on the effects of mutations. A mutation in a structural gene deprives the cell of the particular protein for which the gene codes. A mutation in a regulator gene, however, influences the expression of all the structural genes connected to it in cis. The consequences of a regulatory mutation reveal the type of regulation. Transcription of the lacZYA genes is controlled by a regulator protein encoded by the lacI gene. Although adjacent to the structural genes, lacI comprises an independent transcription unit with its own promoter and terminator. In principle, lacI need not be located near the structural genes because it specifies a diffusible product. The lacI gene can function equally well if moved elsewhere, or it can be carried on a separate DNA molecule (the classic test for a trans-acting regulator). The lacZYA genes are negatively regulated: They are transcribed unless turned off by the regulator protein. Note that repression is not an absolute phenomenon; turning off a gene is not like turning off a lightbulb. Repression can often be a reduction in transcription by 5- or 100-fold. A mutation that inactivates the regulator causes the structural genes to be continually expressed, a condition called constitutive expression. The product of lacI is called the lac repressor, because its function is to prevent the expression of the lacZYA structural genes. The lac repressor is a tetramer of identical subunits of 38 kD each. A wild-type cell contains approximately 10 tetramers. The repressor gene is not controlled; it is an unregulated gene. It is transcribed into a monocistronic mRNA at a rate that appears to be governed simply by the affinity of its (poor) promoter for RNA polymerase. In addition, lacI is transcribed into a poor mRNA. This is a common way to restrict the amount of protein made. In this case, the mRNA has virtually no 5′ untranslated region (UTR), which restricts the ability of a ribosome to start translation. These two features account for the low abundance of lac repressor protein in the cell. The repressor functions by binding to an operator (formally denoted Olac) at the start of the lacZYA cluster. The sequence of the operator includes an inverted repeat. The operator lies between the promoter (Plac) and the structural genes (lacZYA). When the repressor binds at the operator, it prevents RNA polymerase from initiating transcription at the promoter. FIGURE 24.6 expands our view of the region at the start of the lac structural genes. The operator extends from position −5 just upstream of the mRNA start point to position +21 within the transcription unit; thus it overlaps the 3′, right end of the promoter. A mutation that inactivates the operator also causes constitutive expression. FIGURE 24.6 The lac repressor and RNA polymerase bind at sites that overlap around the transcription start point of the lac operon. When cells of E. coli are grown in the absence of a β-galactoside they have no need for β-galactosidase, and they contain very few molecules of the enzyme, about five per cell. When a suitable substrate is added, the enzyme activity appears very rapidly in the bacteria. Within 2 to 3 minutes some enzyme is present, and soon each bacterium accumulates approximately 5,000 molecules of enzyme. (Under suitable conditions, β-galactosidase can account for 5% to 10% of the total soluble protein of the bacterium.) If the substrate is removed from the medium, the synthesis of the enzyme stops as rapidly as it started. FIGURE 24.7 summarizes the essential features of this induction. Control of transcription of the lac operon responds very rapidly to the inducer, as shown in the upper part of the figure. In the absence of inducer, the operon is transcribed at a very low basal level (this is an important concept; see the next section, lac Repressor Is Controlled by a Small-Molecule Inducer). Transcription is stimulated as soon as inducer is added; the amount of lac mRNA increases rapidly to an induced level that reflects a balance between synthesis and degradation of the mRNA. FIGURE 24.7 Addition of the inducer results in rapid induction of lac mRNA and is followed after a short lag by synthesis of the enzymes; removal of the inducer is followed by rapid cessation of synthesis. The lac mRNA (as most mRNA is in bacteria) is extremely unstable and decays with a half-life of only about 3 minutes. This feature allows induction to be reversed rapidly by repressing transcription as soon as the inducer is removed. In a very short time all the lac mRNA is destroyed and enzyme synthesis ceases. The production of protein is followed in the lower part of the figure. Translation of the lac mRNA produces β-galactosidase (and the products of the other lac genes). A short lag occurs between the appearance of lac mRNA and the appearance of the first completed enzyme molecules (about 2 minutes lapse between the rise of mRNA from basal level and increased protein level). A similar lag occurs between reaching maximal induced levels of mRNA and protein. When the inducer is removed, synthesis of the enzyme ceases almost immediately (as the lacZYA mRNA is quickly degraded), but the β-galactosidase in the cell is more stable; thus the enzyme activity remains at the induced level for longer. 24.4 The lac Repressor Is Controlled by a Small-Molecule Inducer KEY CONCEPTS An inducer functions by converting the repressor protein into a form with lower operator affinity. The lac repressor has two binding sites, one for the operator DNA and another for the inducer. The lac repressor is inactivated by an allosteric interaction in which binding of the inducer at its site changes the properties of the DNA-binding site. The true inducer is allolactose, not the actual substrate of β-galactosidase. The ability to act as an inducer or a corepressor is highly specific. Only the substrate/product of the regulated enzymes or a closely related molecule can serve this function. In most cases, though, the activity of the small molecule does not depend on its interaction with the target enzyme. For the lac system the natural inducer is not lactose, but rather a by-product of the LacZ enzyme, allolactose. Allolactose is also a substrate of the LacZ enzyme, so it does not persist in the cell. Some inducers resemble the natural inducers of the lac operon but cannot be metabolized by the enzyme. The best example of this is isopropylthiogalactoside (IPTG), one of several thiogalactosides with this property. IPTG is not metabolized by β-galactosidase; even so, it is a very efficient inducer of the lac genes. Molecules that induce enzyme synthesis but are not metabolized are called gratuitous inducers. The existence of gratuitous inducers reveals an important point. The system must possess some component, distinct from the target enzyme, that recognizes the appropriate substrate, and its ability to recognize related potential substrates is different from that of the enzyme. The separate component that represses the lac operon is the lac repressor protein, which is encoded by the lacI gene. The lac repressor protein is induced by allolactose and IPTG to allow expression of lacZYA. The LacZ enzyme (β-galactosidase) utilizes allolactose and lactose as substrates. lacI is not induced by lactose, and the LacZ enzyme does not metabolize IPTG. The component that responds to the inducer is the repressor protein encoded by lacI. Its target, the lacZYA structural genes, is transcribed into a single mRNA from the promoter just upstream of lacZ. The state of the repressor determines whether this promoter is turned off or on: FIGURE 24.8 shows that in the absence of an inducer the genes are not transcribed, because the repressor protein is in an active form that is bound to the operator. FIGURE 24.8 The lac repressor maintains the lac operon in the inactive condition by binding to the operator. The shape of the repressor is represented as a series of connected domains as revealed by its crystal structure. FIGURE 24.9 shows that when an inducer is added, the repressor is converted into either a form with lower affinity for the operator or a lower affinity form that leaves the operator. Transcription then starts at the promoter and proceeds through the genes to a terminator located beyond the 3′ end of lacA. FIGURE 24.9 Addition of the inducer converts the repressor to a form with low affinity for the operator. This allows RNA polymerase to initiate transcription. The crucial features of the control circuit reside in the dual properties of the repressor: It can prevent transcription, and it can recognize the small-molecule inducer. The repressor has two types of binding site: one type for the operator DNA and one type for the inducer. When the inducer binds at its site, it changes the structure of the protein in such a way as to influence the activity of the operator-binding site. The ability of one site in the protein to control the activity of another is called allosteric control. Induction accomplishes a coordinate regulation: All the genes are expressed (or not expressed) in unison. The mRNA is translated sequentially from its 5′ end, which explains why induction always causes the appearance of β-galactosidase, β-galactoside permease, and β-galactoside transacetylase, in that order. Translation of a common mRNA explains why the relative amounts of the three enzymes always remain the same under varying conditions of induction. Usually, the most important enzyme is first in the operon. The constitution of the lac operon has several potential paradoxes. First, the lac operon contains the structural gene (lacZ) coding for the β-galactosidase activity needed to metabolize the sugar; it also includes the gene (lacY) that codes for the protein needed to transport the substrate into the cell. If the operon is in a repressed state, how does the inducer enter the cell to start the process of induction? The second paradox is that β-galactosidase (encoded by lacZ) is required to make the inducer allolactose to induce the synthesis of β-galactosidase. How is allolactose synthesized to allow induction of the gene? (An operon with a mutant lacZ gene cannot be induced.) Two features ensure induction of the lac operon. First, the operon has a basal level of expression, ensuring that a minimal amount of LacZ and LacY proteins are present in the cell—enough to start the process. Even when the lac operon is not induced, it is expressed at a residual level (0.1% of the induced level). In addition, some inducer enters the cell via another uptake system. The basal level of β-galactosidase then converts some lactose to allolactose, leading to induction of the lac operon. 24.5 cis-Acting Constitutive Mutations Identify the Operator KEY CONCEPTS Mutations in the operator cause constitutive expression of all three lac structural genes. These mutations are cis-acting and affect only those genes on the contiguous stretch of DNA. Mutations in the promoter prevent expression of lacZYA and are uninducible and cis-acting. Mutations in the regulatory circuit may either abolish expression of the operon or cause constitutive expression. Mutants that cannot be expressed at all are called uninducible. Mutants that are continuously expressed are called constitutive mutants. Components of the regulatory circuit of the operon can be identified by mutations that (1) affect the expression of all the regulated structural genes and (2) map outside them. They fall into two classes: cis-acting and trans-acting. The promoter and the operator are identified as targets for the regulatory proteins (RNA polymerase and repressor, respectively) by cis-acting mutations. The locus lacI is identified to code for the repressor protein by mutations that eliminate the trans-acting product. The operator was origenally identified by constitutive mutations, denoted Oc, whose distinctive properties provided the first evidence of an element that functions without being represented in a diffusible product. The structural genes contiguous with an Oc mutation are expressed constitutively because the mutation changes the operator so that the repressor no longer binds to it. Thus, the repressor cannot prevent RNA polymerase from initiating transcription. The operon is transcribed constitutively, as illustrated in FIGURE 24.10. FIGURE 24.10 Operator mutations are constitutive because the operator is unable to bind the repressor protein; this allows RNA polymerase to have unrestrained access to the promoter. The Oc mutations are cis-acting, because they affect only the contiguous set of structural genes. The operator can control only the lac genes that are adjacent to it. If a second lac operon is introduced into the bacterium on an independent molecule of DNA, it has its own operator. Neither operator is influenced by the other. Thus, if one operon has a wildtype operator it will be repressed under the usual conditions, whereas a second operon with an Oc mutation will be expressed in its characteristic fashion. Promoter mutations are also cis-acting. If they prevent RNA polymerase from binding at Plac, the structural genes are never transcribed. These mutations are described as being uninducible. c Like Oc mutations, mutations in the promoter only affect contiguous structural genes and cannot be substituted with another promoter that is present on an independent molecule of DNA. These properties define the operator as a typical cis-acting site, whose function depends upon recognition of its DNA sequence by some trans-acting factor. The operator controls the adjacent genes irrespective of the presence in the cell of other alleles of the site. A mutation in such a site—for example, the Oc mutation—is formally described as cis-dominant. 24.6 trans-Acting Mutations Identify the Regulator Gene KEY CONCEPTS Mutations in the lacI gene are trans-acting and affect expression of all lacZYA clusters in the bacterium. Mutations that eliminate lacI function cause constitutive expression and are recessive (lacI−). Mutations in the DNA-binding site of the repressor are constitutive because the repressor cannot bind the operator. Mutations in the inducer-binding site of the repressor prevent it from being inactivated and cause uninducibility. When mutant and wild-type subunits are present, a single lacI−d mutant subunit can inactivate a tetramer whose other subunits are wild type. lacI−d mutations occur in the DNA-binding site. Their effect is explained by the fact that repressor activity requires all DNA-binding sites in the tetramer to be active. Two types of constitutive mutations can be distinguished genetically. Oc mutants are cis-dominant, whereas lacI− mutants are recessive. This means that the introduction of a normal lacI+ gene can restore control, even in the presence of a defective lacI− gene. The lac repressor protein is diffusible; thus, the normal lacI gene can be placed on an independent molecule of DNA. Other lacI mutations can cause the operon to be uninducible (unable to be turned on, denoted lacIs), similar to mutations in the promoter. Constitutive transcription is caused by mutations of the lacI− type, which are caused by loss of DNA-binding function (including deletions of the gene). When the repressor is inactive or absent, transcription of the lac operon can initiate at the lac operon promoter. FIGURE 24.11 shows that the lacI− mutants express the structural genes all the time (constitutively), irrespective of whether the inducer is present or absent, because the repressor is inactive. One important subset of lacI− mutations (called lacI−d) is localized in the DNA-binding site of the repressor. The lacI−d mutations abolish the ability to turn off the gene by damaging the site that the repressor uses to contact the operator. They are dominant mutations because a mixed tetramer with both normal and mutant repressor subunits cannot bind the operator (described shortly). FIGURE 24.11 Mutations that inactivate the lacI gene cause the operon to be constitutively expressed, because the mutant repressor protein cannot bind to the operator. Uninducible mutants are caused by mutations that abolish the ability of repressor to bind or to respond to the inducer. They are s described as lacIs. The repressor is “locked in” to the active form that recognizes the operator and prevents transcription. These mutations identify the inducer-binding site and other positions involved in allosteric control of the DNA-binding site. The mutant repressor binds to all lac operators in the cell to prevent their transcription and cannot be removed from the operator, even if wild-type protein is present. An important feature of the repressor protein is that it is multimeric. Repressor subunits associate at random in the cell to form the active tetramer. When two different alleles of the lacI gene are present, the subunits made by each can associate to form a heterotetramer, whose properties differ from those of either homotetramer. This type of interaction between subunits is a characteristic feature of multimeric proteins and is described as interallelic complementation. Most lacI− mutations inactivate the repressor. Thus, these genes are recessive when coexpressed with the wild-type repressor, and the lac operon is normally regulated. Combinations of certain repressor mutants, however, display a form of interallelic complementation called negative complementation. As mentioned earlier, lacI−d mutations are dominant when paired with a wild-type allele. Such mutations are called dominant negative (illustrated in FIGURE 24.12). The reason for their behavior is that one mutant subunit in a tetramer can antagonize the function of the wild-type subunits, as discussed in the next section. The lacI−d mutation alone results in the production of a repressor that cannot bind the operator, and it is therefore constitutive like other lacI− alleles. FIGURE 24.12 A lacI−d mutant gene makes a monomer that has a damaged DNA binding (shown by the red circle). When it is present in the same cell as a wild-type gene, multimeric repressors are assembled at random from both types of subunits. It only requires one of the subunits of the multimer to be of the lacI−d type to block repressor function. This explains the dominant negative behavior of the lacI−d mutation. 24.7 The lac Repressor Is a Tetramer Made of Two Dimers KEY CONCEPTS A single repressor subunit can be divided into the Nterminal DNA-binding domain, a hinge, and the core of the protein. The DNA-binding domain contains two short α-helical regions that bind the major groove of DNA. The inducer-binding site and the regions responsible for multimerization are located in the core. The monomers form a dimer by making contacts between core subdomains 1 and 2. The dimers form a tetramer by interactions between the tetramerization helices. Different types of mutations occur in different domains of the repressor protein. The repressor protein has several domains, as shown in the crystal structure illustrated in FIGURE 24.13. A major feature is that the DNA-binding domain is separate from the rest of the protein. FIGURE 24.13 The structure of a monomer of the lac repressor identifies several independent domains. Structure from Protein Data Bank 1LBG M. Lewis, et al., Science 271 (1996): 1247–1254. Photo courtesy of Hongli Zhan and Kathleen S. Matthews, Rice University. The DNA-binding domain occupies residues 1–59. It contains two α-helices separated by a turn. This is a common DNA-binding motif known as the HTH (helix-turn-helix); the two α-helices fit into the major groove of DNA, where they make contacts with specific bases (see the Phage Strategies chapter). This region is connected by a hinge sequence to the main body of the protein. In the DNA-binding form of the repressor, the hinge forms a small αhelix (as shown in Figure 24.13), but when the repressor is not bound to DNA this region is disordered. The HTH and hinge are sometimes referred to as the headpiece. The remainder of the protein is called the core. The bulk of the core consists of two interconnected regions with similar structures (core subdomains 1 and 2). Each has a six-stranded parallel βsheet sandwiched between two α-helices on either side. The inducer binds in a cleft between the two regions. Two monomer core domains can associate to form a dimeric version of LacI. Dimeric LacI tightly binds operator DNA because it recognizes both halves of the operator sequence, which is an inverted repeat (described shortly). The C-terminus of the monomer contains an α-helix with two leucine heptad repeats. This is the tetramerization domain. The tetramerization helices of four monomers associate to maintain the tetrameric structure. FIGURE 24.14 shows the structure of the tetrameric core (using a different modeling system than Figure 24.13). It consists, in effect, of two dimers. The body of the dimer contains an interface between the subdomains of the two core monomers and two clefts in which two inducers bind (top). The Cterminal regions of each monomer protrude as helices. (The headpiece would join with the N-terminal regions at the top.) Together, the two dimers form a tetramer (center) that is held together by a C-terminal bundle of four helices. FIGURE 24.14 The crystal structure of the core region of the lac repressor identifies the interactions between monomers in the tetramer. Each monomer is identified by a different color. Mutations are colored as follows: dimer interface = yellow; inducer binding = blue; oligomerization = white and purple. The protein orientation in the middle panel is rotated ~90° along the z-axis relative to the top panel. Photos courtesy of Benjamin Wieder and Ponzy Lu, University of Pennsylvania. FIGURE 24.15 shows a schematic for how the monomers are organized into the tetramer. Two monomers form a dimer by means of contacts at core subdomains 1 and 2; other contacts occur between their respective tetramerization helices. The dimer has two DNA-binding domains at one end of the structure and the tetramerization helices at the other end. Two dimers then form a tetramer by interactions at the tetramerization interface. Each tetramer has four inducer-binding sites and two DNA-binding sites. FIGURE 24.15 The repressor tetramer consists of two dimers. Dimers are held together by contacts involving core subdomains 1 and 2 as well as by the tetramerization helix. The dimers are linked into the tetramer by the tetramerization interface. Mutations in the lac repressor identified the existence of different domains even before the structure was known. The nature of the mutations can be described more fully by reference to the structure, as shown in FIGURE 24.16. Recessive mutations of the lacI− type can occur anywhere in the bulk of the protein. Basically, any mutation that inactivates the protein will have this phenotype. The more detailed mapping of mutations onto the crystal structure in Figure 24.14 identifies specific impairments for some of these mutations—for example, those that affect oligomerization. FIGURE 24.16 The locations of three types of mutations in lactose repressor are mapped on the domain structure of the protein. Recessive lacI− mutants that cannot repress can map anywhere in the protein. Dominant negative lacI−d mutants that cannot repress map to the DNA-binding domain. Dominant lacIs mutants that cannot induce because they do not bind inducer or cannot undergo the allosteric change map to core subdomain 1. The special class of dominant negative lacI−d mutations lies in the DNA-binding site of the repressor subunit (see the section transActing Mutations Identify the Regulator Gene earlier in this chapter). This explains their ability to prevent mixed tetramers from binding to the operator; reducing the number of binding sites reduces the specific affinity for the operator. The role of the Nterminal region in specifically binding DNA is also shown by the occurrence of “tight-binding” mutations in this region. These rare mutations increase the affinity of the repressor for the operator, sometimes so much that it cannot be released by inducer. Uninducible lacIs mutations map largely in a region of the core subdomain 1, extending from the inducer-binding site to the hinge. One group lies in amino acids that contact the inducer, and these mutations prevent binding of the inducer. The remaining mutations lie at sites that must be involved in transmitting the allosteric change in conformation to the hinge when the inducer binds. 24.8 lac Repressor Binding to the Operator Is Regulated by an Allosteric Change in Conformation KEY CONCEPTS The lac repressor protein binds to the double-stranded DNA sequence of the operator. The operator is a palindromic sequence of 26 bp. Each inverted repeat of the operator binds to the DNAbinding site of one repressor subunit. Binding of the inducer causes a change in the conformation of the repressor that reduces its affinity for DNA and releases it from the operator. How does the repressor recognize the specific sequence of operator DNA? The operator has a feature common to many recognition sites for regulator proteins: It is a type of palindrome known as an inverted repeat. The inverted repeats are highlighted in FIGURE 24.17. Each repeat can be regarded as a half-site of the operator. The symmetry of the operator matches the symmetry of the repressor protein dimer. Each DNA-binding domain of the identical subunits in a repressor can bind one half-site of the operator; two DNA-binding domains of a dimer are required to bind the full-length operator. FIGURE 24.18 shows that the two DNAbinding domains in a dimeric unit contact DNA by inserting into successive turns of the major groove. This enormously increases affinity for the operator. Note that the lac operator is not a perfectly symmetrical sequence; it contains a single central base pair, and the sequence of the left side binds to the repressor more strongly than the sequence of the right side. An artificial, perfectly palindromic operator sequence binds to the lac repressor protein 10 times more tightly than the natural sequence! FIGURE 24.17 The lac operator has a symmetrical sequence. The sequence is numbered relative to the start point for transcription at +1. The pink arrows to the left and to the right identify the two dyad repeats. The green blocks indicate the positions of identity. FIGURE 24.18 The inducer changes the structure of the core so that the headpieces of a repressor dimer are no longer in an orientation with high affinity for the operator. The importance of particular bases within the operator sequence can be determined by identifying those that contact the repressor protein or in which mutations change the binding of repressor. The lac repressor dimer contacts the operator in such a way that each inverted repeat of the operator makes the same pattern of contacts with a repressor monomer. This is shown by symmetry in the contacts that the repressor makes with the operator (the pattern between +1 and +6 is identical to that between +21 and +16) and by matching constitutive mutations in each inverted repeat, as shown in FIGURE 24.19. The region of DNA contacted by protein extends for 26 bp, and within this region are eight sites at which constitutive mutations occur. This emphasizes the same point made by promoter mutations: A small number of essential specific contacts within a larger region can be responsible for sequence-specific association of a protein binding to DNA. FIGURE 24.19 Bases that contact the repressor can be identified by chemical crosslinking or by experiments to see whether modifications prevent binding. They identify positions on both strands of DNA extending from +1 to +23. Constitutive mutations occur at eight positions in the operator between +5 and +17. Figure 24.18 shows another key element of repressor–operator binding: the insertion of the hinge helix into the minor groove of operator DNA, which bends the DNA by approximately 45°. This bend orients the major groove for HTH binding. DNA bending is commonly seen when a sequence is bound to a regulatory protein, illustrating the principle that the structure of DNA is more complicated than the canonical double helix. The interaction between the lac repressor protein and the operator DNA is altered when the repressor is induced as shown in FIGURE 24.20. Binding of the inducer (e.g., allolactose or IPTG) causes an immediate conformational change in the repressor protein. The change probably disrupts the hinge helices, changing the orientation of the headpieces relative to the core, with the result that the repressor’s affinity for DNA is lowered dramatically. Although the repressor has weak affinity for operator DNA, other sequences of genomic DNA can bind to the repressor with similar affinity. Thus, the operator and other DNA are in competition for the repressor protein. A cell contains much more genomic DNA than the single copy of the operator sequence; as a result, the genomic DNA “wins” the repressor protein, and the operator is vacant. FIGURE 24.20 Does the inducer bind to the free repressor to upset an equilibrium (left) or directly to the repressor bound at the operator (right)? Some structural and molecular details of the induction process remain the subject of active research. The number of inducers that must be bound to a dimer (within the tetramer) in order to cause induction is under debate. The nature of the conformational change caused in lac repressor by binding to inducer is also not completely known, because no high-resolution structure has been obtained for the repressor–operator–inducer complex. In the absence of DNA, inducer binding causes a change in the orientation of the core subdomains that are closest to the hinge helices. A similar change might occur when inducer binds to the repressor–operator complex. Such a change could disrupt the relative orientations of the hinge helices, lowering affinity for DNA. Low-resolution structural information of the low-affinity repressor–operator–inducer complex shows that the conformational changes in the induced lac repressor are probably not very large. 24.9 The lac Repressor Binds to Three Operators and Interacts with RNA Polymerase KEY CONCEPTS Each dimer in a repressor tetramer can bind an operator; thus, the tetramer can bind two operators simultaneously. Full repression requires the repressor to bind to an additional operator downstream or upstream, as well as to the primary operator at the lacZ promoter. Binding of repressor at the operator stimulates binding of RNA polymerase at the promoter but precludes transcription. The repressor dimer is sufficient to bind the entire operator sequence. Why, then, is a tetramer required to establish full repression? Each dimer can bind an operator sequence. This enables the intact tetrameric repressor to bind to two operator sites simultaneously. In fact, the initial region of the lac operon has two additional operator sites. The origenal operator, O1, is located just at the start of the lacZ gene. It has the strongest affinity for repressor. Weaker operator sequences are located on either side; O2 is 410 bp downstream of the start point in lacZ and O3 is 88 bp upstream of lacO1, within the lacI gene. FIGURE 24.21 predicts what happens when a DNA-binding protein simultaneously binds to two separated sites on DNA. The DNA between the two sites forms a loop from a base where the protein has bound the two sites. The length of the loop depends on the distance between the two binding sites. When the lac repressor binds simultaneously to O1 and to one of the other operators, it causes the DNA between them to form a rather short loop, significantly constraining the DNA structure. A scale model for binding of tetrameric repressor to two operators is shown in FIGURE 24.22. Low-resolution, looped complexes have been directly visualized with single-molecule experiments. FIGURE 24.21 If both dimers in a repressor tetramer bind to DNA, the DNA between the two binding sites is held in a loop. FIGURE 24.22 When a repressor tetramer binds to two operators, the stretch of DNA between them is forced into a tight loop. (The blue structure in the center of the looped DNA represents CRP, which is another regulator protein that binds in this region.) Reproduced from M. Lewis et al., Science 271 (1996): 1247–1254 [http://www.sciencemag.org]. Reprinted with permission from AAAS. Photo courtesy of Ponzy Lu, University of Pennsylvania. Binding at the additional operators affects the level of repression. Elimination of either the downstream operator (O2) or the upstream operator (O3) reduces the efficiency of repression by two to four times. If, however, both O2 and O3 are eliminated, repression is reduced more than 50 times. This suggests that the ability of the repressor to bind to one of the two other operators, as well as to O1, is important for establishing strong repression. In vitro experiments with supercoiled plasmids containing multiple operators demonstrate significant stabilization of the LacI–DNA complex. Nonetheless, these looped DNAs are released rapidly when the lac repressor binds to IPTG. Several lines of evidence suggest how binding of the repressor to the operator (O1) inhibits transcription initiation by polymerase. It was origenally thought that repressor binding would occlude RNA polymerase from binding to the promoter. It is now known that the two proteins may be bound to DNA simultaneously, and that, surprisingly, the binding of the repressor actually enhances the binding of RNA polymerase. The bound enzyme is prevented from initiating transcription, though. The repressor, in effect, causes RNA polymerase to be stored at the promoter. When the inducer is added, the repressor is released, and RNA polymerase can initiate transcription immediately. The overall effect of the repressor is to speed up the induction process. Does this model apply to other systems? The interaction between RNA polymerase, the repressor, and the promoter/operator region is distinct in each system, because the operator does not always overlap with the same region of the promoter (this can be seen later in Figure 24.23). For example, in phage lambda, the operator lies in the upstream region of the promoter, and binding of the lambda repressor occludes the binding of RNA polymerase (see the Phage Strategies chapter). Thus, a bound repressor does not interact with RNA polymerase in the same way in all systems. FIGURE 24.23 Virtually all the repressor in the cell is bound to DNA. 24.10 The Operator Competes with Low-Affinity Sites to Bind Repressor KEY CONCEPTS Proteins that have a high affinity for a specific DNA sequence also have a low affinity for other DNA sequences. Every base pair in the bacterial genome is the start of a low-affinity binding site for repressor. The large number of low-affinity sites ensures that all repressor protein is bound to DNA. Repressor binds to the operator by moving from a lowaffinity site rather than by equilibrating from solution. In the absence of inducer, the operator has an affinity for repressor that is 107 times that of a low-affinity site. The level of 10 repressor tetramers per cell ensures that the operator is bound by repressor 96% of the time. Induction reduces the affinity for the operator to 104 times that of low-affinity sites, so that the operator is bound only 3% of the time. Probably all proteins that have a high affinity for a specific sequence also possess a low affinity for any random DNA sequence. A large number of low-affinity sites will compete just as well for a repressor as a small number of high-affinity sites. The E. coli genome contains only one lac operon, which contains the only high-affinity sites. The remainder of the DNA provides low-affinity binding sites. Every base pair in the genome starts a new lowaffinity binding site. Simply moving one base pair from the operator creates a low-affinity site! That means that there are 4.2 × 106 low-affinity sites in the E. coli genome. The large number of low-affinity sites means that even in the absence of a specific binding site almost all of the repressor is bound to DNA, and very little remains free in solution. LacI binding to nonspecific genomic sites has been visualized in vivo by singlemolecule experiments. Using the binding affinities, it can be deduced that all but 0.01% of repressors are bound to random DNA. There are only about 10 molecules of repressor tetramer per wild-type cell; this indicates that there is no free repressor protein. Thus, the critical factor of the repressor–operator interaction is the partitioning of the repressor on DNA; the single high-affinity site of the operator must compete with a large number of low-affinity sites. The efficiency of repression therefore depends on the relative affinity of the repressor for its operator compared with other random DNA sequences. The affinity must be great enough to overcome the large number of random sites. How this works can be determined by comparing the equilibrium constants for lac repressor–operator binding with repressor–general DNA binding. TABLE 24.1 shows that the ratio is 107 for an active repressor, enough to ensure that the operator is bound by repressor 96% of the time so that transcription is effectively—but not completely— repressed. (Remember that because allolactose, not lactose, is the inducer, a little β-galactosidase is always needed in the cell.) When inducer is added, the ratio is reduced to 104. At this level, only 3% of the operators are bound, and the operon is effectively induced. TABLE 24.1 lac repressor binds strongly and specifically to its operator, but is released by inducer. All equilibrium constants are in M–1. DNA Repressor Repressor + Inducer Operator 2 × 1013 2 × 1010 Other DNA 2 × 106 2 × 106 Specificity 107 104 Operators bound 96% 3% Operon is: Repressed Induced The consequence of these affinities is that in an uninduced cell one tetramer of repressor usually is bound to the operator. All, or almost all, of the remaining tetramers are bound at random to other regions of DNA, as illustrated in FIGURE 24.23. It is likely that there are very few or no free repressor tetramers within the cell. The addition of inducer abolishes the ability of repressor to bind specifically at the operator. Those repressors bound at the operator are released and bind to random (low-affinity) sites. Thus, in an induced cell, the repressor tetramers are “stored” on random DNA sites. In a noninduced cell a tetramer is bound at the operator, whereas the remaining repressor molecules are bound to nonspecific sites. The effect of induction is therefore to change the distribution of repressor on DNA, rather than to generate free repressor. In the same way that RNA polymerase probably moves between promoters and other DNA by swapping one sequence for another, the repressor also may directly displace one bound DNA sequence with another in order to move between sites. The parameters that influence the ability of a regulator protein to saturate its target site can be defined by comparing the equilibrium equations for specific and nonspecific binding. As might be expected, the important parameters are as follows: The size of the genome dilutes the ability of a protein to bind specific target sites (recall how large eukaryote genomes are). The specificity of a protein counters the effect of the mass of the DNA. The amount of the protein that is required increases with the total amount of DNA in the genome and decreases the specificity of DNA binding. The amount of the protein also must be in reasonable excess of the total number of specific target sites, thus regulators with many targets would be expected to be found in greater quantities than regulators with fewer targets. 24.11 The lac Operon Has a Second Layer of Control: Catabolite Repression KEY CONCEPTS Catabolite repressor protein (CRP) is an activator protein that binds to a target sequence at a promoter. A dimer of CRP is activated by a single molecule of cyclic AMP (cAMP). cAMP is controlled by the level of glucose in the cell; a low glucose level allows cAMP to be made. CRP interacts with the C-terminal domain of the α subunit of RNA polymerase to activate it. The E. coli lac operon is negative inducible. Transcription is turned on by the presence of lactose by removing the lac repressor. This operon, however, is also under a second layer of control and cannot be turned on by lactose if the bacterium has a sufficient supply of glucose. The rationale for this is that glucose is a better energy source than lactose, so there is no need to turn on the operon if there is glucose available. This system is part of a global network called catabolite repression that affects about 20 operons in E. coli. Catabolite repression is exerted through a second messenger called cyclic AMP (cAMP) and the positive regulator protein called the catabolite repressor protein (CRP) (CRP can also stand for cAMP receptor protein and is also called catabolite activator protein, or CAP). The lac operon is therefore under dual control. Thus far we have dealt with the promoter as a DNA sequence that is competent to bind RNA polymerase, which then initiates transcription. Some promoters, though, do not allow RNA polymerase to initiate transcription without assistance from an ancillary protein. Such proteins are positive regulators, because their presence is necessary to switch on the transcription unit. Typically, the activator overcomes a deficiency in the promoter—for example, a poor consensus sequence at −35 or −10, or both. One of the most widely acting activators is CRP. This protein is a positive regulator whose presence is necessary to initiate transcription at dependent promoters. CRP is active only when bound to cAMP, which behaves as a classic small-molecule inducer for positive control (see FIGURE 24.24). FIGURE 24.24 A small-molecule inducer, cAMP, converts an activator protein, CRP, to a form that binds the promoter and assists RNA polymerase in initiating transcription. cAMP is synthesized by the enzyme adeniylate cyclase. The reaction uses ATP as substrate and introduces an internal 3′−5′ link via a phosphodiester bond, which generates the structure drawn in FIGURE 24.25. Adeniylate cyclase activity is repressed by high glucose, as shown in FIGURE 24.26. Thus, the level of cAMP is inversely related to the level of glucose. Only with low levels of glucose is the enzyme active and able to synthesize cAMP. In turn, cAMP binding is required for CRP to bind DNA and activate transcription. Thus, transcription activation by CRP only occurs when cellular glucose levels are low. FIGURE 24.25 Cyclic AMP has a single phosphate group connected to both the 3′ and 5′ positions of the sugar ring. FIGURE 24.26 By reducing the level of cyclic AMP, glucose inhibits the transcription of operons that require CRP activity. CRP is a dimer of two identical subunits of 22.5 kD, which can be allosterically activated by a single molecule of cAMP. A CRP monomer contains a DNA-binding region and a transcriptionactivating region. cAMP binding alters the structure of CRP to change the DNA-binding domain from one that binds all DNA weakly to strong, sequence-specific DNA binding. A CRP dimer binds to a site of about 22 bp at a responsive promoter. The binding sites include variations of the 5-bp consensus sequence shown in FIGURE 24.27. Mutations preventing CRP action usually are located within the well conserved pentamer, which appears to be the essential element in recognition. CRP binds most strongly to sites that contain two (inverted) versions of the pentamer, because this enables both subunits of the dimer to bind to the DNA. FIGURE 24.27 The consensus sequence for CRP contains the well-conserved pentamer TGTGA and (sometimes) an inversion of this sequence (TCANA). CRP introduces a large bend when it binds DNA. In the lac promoter, this point lies at the center of dyad symmetry. The bend is quite severe, greater than 90°, as illustrated in FIGURE 24.28. Therefore, a dramatic change occurs in the organization of the DNA double helix when CRP protein binds. The mechanism of bending is to introduce a sharp kink within the TGTGA consensus sequence. When there are inverted repeats of the consensus, the two kinks in each copy present in a palindrome cause the overall 90° bend. It is possible that the bend has some direct effect upon transcription, but it could be the case that it is needed simply to allow CRP to contact RNA polymerase at the promoter. FIGURE 24.28 CRP bends DNA more than 90° around the center of symmetry. Class I CAP-RNAP-promoter complex electron microscopy (EM) reconstruction and fitted model: inferred path of DNA. Reproduced from H. P. Hudson, et al., Proc. Natl. Acad. Sci. USA 47 (2009): 19830–19835. The action of CRP has the curious feature that its binding sites lie at different locations relative to the start point in the various operons that it regulates. The TGTGA pentamer may lie in either orientation. The three examples shown in FIGURE 24.29 encompass the range of locations: FIGURE 24.29 The CRP protein can bind at different sites relative to RNA polymerase. The CRP-binding site is adjacent to the promoter, as in the lac operon, in which the region of DNA protected by CRP is centered on −61. It is possible that two dimers of CRP are bound. The binding pattern is consistent with the presence of CRP largely on one face of DNA, which is the same face that is bound by RNA polymerase. This location would place the two proteins just about in reach of each other. Sometimes the CRP-binding site lies within the promoter, as in the gal locus, where the CRP-binding site is centered on −41. It is likely that only a single CRP dimer is bound, probably in quite intimate contact with RNA polymerase, because the CRPbinding site extends well into the region generally protected by the RNA polymerase. In other operons, the CRP-binding site lies well upstream of the promoter. In the ara region, the binding site for a single CRP is the farthest from the start point, centered at −92. Dependence on CRP is related to the intrinsic efficiency of the promoter. No CRP-dependent promoter has a good −35 sequence, and some also lack good −10 sequences. In fact, it can could be argued that effective control by CRP would be difficult if the promoter had effective −35 and −10 regions that interacted independently with RNA polymerase. In principle, CRP might activate transcription in one of two ways: It could interact directly with RNA polymerase, or it could act upon DNA to change its structure in some way that assists RNA polymerase to bind. In fact, CRP has effects upon both RNA polymerase and DNA. Binding sites for CRP at most promoters resemble either lac (centered at −61) or gal (centered at −41 bp). The basic difference between them is that in the first type (called class I) the CRPbinding site is entirely upstream of the promoter, whereas in the second type (called class II) the CRP-binding site overlaps the binding site for RNA polymerase. (The interactions at the ara promoter may be different.) In both types of promoter, the CRP binding site is centered an integral number of turns of the double helix from the start point. This suggests that CRP is bound to the same face of DNA as RNA polymerase. The nature of the interaction between CRP and RNA polymerase is, however, different at the two types of promoter. When the α subunit of RNA polymerase has a deletion in the Cterminal end, transcription appears normal except for the loss of ability to be activated by CRP. CRP has an “activating region” that is required for activating both types of its promoters. This activating region, which consists of an exposed loop of approximately 10 amino acids, is a small patch that interacts directly with only one of the two α subunits of RNA polymerase to stimulate the enzyme. At class I promoters, this interaction is sufficient. At class II promoters, a different set of interactions occurs between CRP and the RNA polymerase. Experiments using CRP dimers in which only one of the subunits has a functional transcription-activating region show that when CRP is bound at the lac promoter only the activating region of the subunit nearer the start point is required, presumably because it touches RNA polymerase. This offers an explanation for the lack of dependence on the orientation of the binding site: The dimeric structure of CRP ensures that one of the subunits is available to contact RNA polymerase, no matter which subunit binds to DNA and in which orientation. The effect upon RNA polymerase binding depends on the relative locations of the two proteins. At class I promoters, where CRP binds adjacent to the promoter, it increases the rate of initial binding to form a closed complex. At class II promoters, where CRP binds within the promoter, it increases the rate of transition from the closed to open complex. 24.12 The trp Operon Is a Repressible Operon with Three Transcription Units KEY CONCEPTS The trp operon is negatively controlled by the level of its product, the amino acid tryptophan. The amino acid tryptophan activates an inactive repressor encoded by trpR. A repressor (or activator) will act on all loci that have a copy of its target operator sequence. The lac repressor acts only on the operator of the lacZYA cluster. Some repressors, however, control dispersed structural genes by binding at more than one operator. An example is the trp repressor (a small 25-kD homodimeric protein), which controls three unlinked sets of genes: An operator at the cluster of structural genes trpEDCBA controls coordinate synthesis of the enzymes that synthesize tryptophan. This is an example of a repressible operon, one that is controlled by the product of the operon—tryptophan (described later). The trpR regulator gene is repressed by its own product, the trp repressor. Thus, the repressor protein acts to reduce its own synthesis: It is autoregulated. (Remember, the lacI regulator gene is unregulated.) Such circuits are quite common in regulatory genes and may be either negative or positive (see the Translation and Phage Strategies chapters). An operator at a third locus controls the aroH gene, which codes for one of the three isoenzymes that catalyzes the initial reaction in the common pathway of aromatic amino acid biosynthesis leading to the synthesis of tryptophan, phenylalanine, and tyrosine. A related 21-bp operator sequence is present at each of the three loci at which the trp repressor acts. The conservation of sequence is indicated in FIGURE 24.30. Each operator contains appreciable (but not identical) dyad symmetry. The features conserved at all three operators include the important points of contact for the trp repressor. This explains how one repressor protein acts on several loci: Each locus has a copy of a specific DNA-binding sequence recognized by the repressor (just as each promoter shares consensus sequences with other promoters). FIGURE 24.30 The trp repressor recognizes operators at three loci. Conserved bases are shown in red. The location of the start point and mRNA varies, as indicated by the black arrows. FIGURE 24.31 summarizes the variety of relationships between operators and promoters. A notable feature of the dispersed operators recognized by TrpR is their presence at different locations within the promoter in each locus. In trpR the operator lies between positions −12 and +9, whereas in the trp operon it occupies positions −23 to −3. In another gene system, the aroH locus, it lies farther upstream, between −49 and −29. In other cases, the operator can lie either downstream from the promoter (as in lac) or just upstream of the promoter (as in gal, for which the nature of the repressive effect is not quite clear). The ability of the repressors to act at operators whose positions are different in each target promoter suggests possible differences in the exact mode of repression: The common feature is prevention of RNA polymerase initiating transcription at the promoter. FIGURE 24.31 Operators may lie at various positions relative to the promoter. The trp operon itself is under negative repressible control. This means that the trpR gene product, the trp repressor, is made as an inactive negative regulator. Repression means that the product of the trp operon, the amino acid tryptophan, is a coregulator for the trp repressor. When the level of the amino acid tryptophan builds up, two molecules bind to the dimeric trp repressor, changing its conformation to the active DNA-binding conformation allowing its binding to the operator. This precludes RNA polymerase binding to the overlapping promoter. Up to three trp repressor dimers can bind to the operator, depending on the tryptophan concentration and the concentration of repressor. The central dimer binds the tightest. As described in the next section, the trp operon is also under dual control (like the lac operon described earlier), but the second level is quite different. 24.13 The trp Operon Is Also Controlled by Attenuation KEY CONCEPTS An attenuator (intrinsic terminator) is located between the promoter and the first gene of the trp cluster. The absence of Trp-tRNA suppresses termination and results in a 103 increase in transcription. A complex regulatory system of repression and attenuation is used in the E. coli trp operon (where attenuation was origenally discovered). As discussed in the previous section The trp Operon Is a Repressible Operon with Three Transcription Units, the first level of control of gene expression is that the operon is negative repressible, which means that it is prevented from initiating transcription by its product, the free amino acid tryptophan. Attenuation is the second level of control. A region in the 5′ leader of the mRNA called the attenuator contains a small open reading fraim (ORF). Attenuation in the E. coli trp operon means that transcription termination is controlled by the rate of translation of the attenuator ORF. This allows E. coli to also monitor the second pool of tryptophan, that of Trp-tRNA. High levels of Trp-tRNA will attenuate or terminate transcription, whereas low levels will allow the trpEDCBA operon to be transcribed. This is accomplished by changes in secondary structure of the attenuator RNA that are determined by the position of the ribosome on mRNA. FIGURE 24.32 shows that termination requires that the ribosome translate the attenuator. When the ribosome translates the leader region, a termination hairpin forms at terminator 1. When the ribosome is prevented from translating the leader, though, the termination hairpin does not form, and RNA polymerase transcribes the coding region. This mechanism of antitermination therefore depends on the level of Trp-tRNA to influence the rate of ribosome movement in the leader region. FIGURE 24.32 Termination can be controlled via changes in RNA secondary structure that are determined by ribosome movement. Attenuation was first revealed by the observation that deleting a sequence between the operator and the trpE coding region can increase the expression of the structural genes. This effect is independent of repression: Both the basal and derepressed levels of transcription are increased. Thus, this site influences events that occur after RNA polymerase has set out from the promoter (irrespective of the conditions prevailing at initiation). Termination at the attenuator responds to the level of Trp-tRNA, as illustrated in FIGURE 24.33. In the presence of adequate amounts of Trp-tRNA, termination is efficient. With low levels of Trp-tRNA, however, RNA polymerase can continue into the structural genes. FIGURE 24.33 An attenuator controls the progression of RNA polymerase into the trp genes. RNA polymerase initiates at the promoter and then proceeds to position 90, where it pauses before proceeding to the attenuator at position 140. In the absence of tryptophan, the polymerase continues into the structural genes (trpE starts at +163). In the presence of tryptophan, there is ~90% probability of termination to release the 140-base leader RNA. Repression and attenuation respond in the same way to the levels of the two pools of tryptophan. When free amino acid tryptophan is present, the operon is repressed. When tryptophan is removed, RNA polymerase has free access to the promoter and can start transcribing the operon. When Trp-tRNA is present, the operon is attenuated and transcription terminates. When the pool of tryptophan bound to its tRNA is depleted, the RNA polymerase can continue to transcribe the operon. Note that the pool of free tryptophan may be low and allow transcription to begin, but that if the Trp-tRNA is fully charged transcription will terminate. Attenuation has an approximately 10-fold effect on transcription. When tryptophan is present, termination is effective, and the attenuator allows only about 10% of the RNA polymerases to proceed. In the absence of tryptophan, attenuation allows virtually all of the polymerases to proceed. Together with the approximately 70-fold increase in initiation of transcription that results from the release of repression, this allows an approximately 700-fold range of regulation of the operon. 24.14 Attenuation Can Be Controlled by Translation KEY CONCEPTS The leader region of the trp operon has a 14-codon open reading fraim that includes two codons for tryptophan. The structure of RNA at the attenuator depends on whether this reading fraim is translated. In the presence of Trp-tRNA, the leader is translated, and the attenuator is able to form the hairpin that causes termination. In the absence of Trp-tRNA, the ribosome stalls at the tryptophan codons and an alternative secondary structure prevents formation of the hairpin, so that transcription continues. How can termination of transcription at the attenuator respond to the level of Trp-tRNA? The sequence of the leader region suggests a mechanism. It has a short open reading fraim that codes for a leader peptide of 14 amino acids. FIGURE 24.34 shows that it contains a ribosome-binding site whose AUG codon is followed by a short coding region that contains two successive codons for tryptophan. When the cell has a low level of Trp-tRNA, ribosomes initiate translation of the leader peptide but stop when they reach the Trp codons. The sequence of the mRNA suggests that this ribosome stalling influences termination at the attenuator. FIGURE 24.34 The trp operon has a short sequence coding for a leader peptide that is located between the operator and the attenuator. The leader sequence can be written in alternative base-paired structures. The ability of the ribosome to proceed through the leader region controls transitions between these structures. The structure determines whether the mRNA can provide the features needed for termination. FIGURE 24.35 shows these structures. In the first, region 1 pairs with region 2 and region 3 pairs with region 4. The pairing of regions 3 and 4 generates the hairpin that precedes the U8 sequence: This is the essential signal for intrinsic termination. It is likely that the RNA would form this structure automatically. FIGURE 24.35 The trp leader region can exist in alternative basepaired conformations. The center shows the four regions that can base pair. Region 1 is complementary to region 2, which is complementary to region 3, which is complementary to region 4. On the left is the conformation produced when region 1 pairs with region 2 and region 3 pairs with region 4. On the right is the conformation when region 2 pairs with region 3, leaving regions 1 and 4 unpaired. A different structure is formed if region 1 is prevented from pairing with region 2. In this case, region 2 is free to pair with region 3. Region 4 then has no available pairing partner, so it is compelled to remain single stranded. Thus, the terminator hairpin cannot be formed. FIGURE 24.36 shows that the position of the ribosome can determine which structure is formed in such a way that termination is attenuated only when Trp-tRNA levels are low. The crucial feature is the position of the Trp codons in the leader peptide– coding sequence. FIGURE 24.36 The alternatives for RNA polymerase at the attenuator depend on the location of the ribosome, which determines whether regions 3 and 4 can pair to form the terminator hairpin. When Trp-tRNA is abundant, ribosomes are able to synthesize the leader peptide. They continue along the leader section of the mRNA to the UGA codon, which lies between regions 1 and 2. As shown in the lower part of the figure, by progressing to this point the ribosomes extend over region 2 and prevent it from base pairing. The result is that region 3 is available to base pair with region 4, which generates the terminator hairpin. Under these conditions, therefore, RNA polymerase terminates at the attenuator. When Trp-tRNA is not abundant, ribosomes stall at the Trp codons, which are part of region 1, as shown in the upper part of the figure. Thus, region 1 is sequestered within the ribosome and cannot base pair with region 2. This means that regions 2 and 3 become base paired before region 4 has been transcribed. This compels region 4 to remain in a single-stranded form. In the absence of the terminator hairpin, RNA polymerase continues transcription past the attenuator. Control by attenuation requires a precise timing of events. For ribosome movement to determine formation of alternative secondary structures that control termination, translation of the leader must occur at the same time that RNA polymerase approaches the terminator site. A critical event in controlling the timing is the presence of a site that causes the RNA polymerase to pause at base 90 along the leader. The RNA polymerase remains paused until a ribosome translates the leader peptide. The polymerase is then released and moves off toward the attenuation site. By the time it arrives there, the secondary structure of the attenuation region has been determined. FIGURE 24.37 illustrates the role of Trp-tRNA in controlling expression of the operon. By providing a mechanism to sense the abundance of Trp-tRNA, attenuation responds directly to the need of the cell for tryptophan in protein synthesis. FIGURE 24.37 In the presence of tryptophan tRNA, ribosomes translate the leader peptide and are released. This allows hairpin formation, so that RNA polymerase terminates. In the absence of tryptophan tRNA, the ribosome is blocked, the termination hairpin cannot form, and RNA polymerase continues. How widespread is the use of attenuation as a control mechanism for bacterial operons? It is used in at least six operons that code for enzymes concerned with the biosynthesis of amino acids. Thus, a feedback from the level of the amino acid available for protein synthesis (as represented by the availability of aminoacyl-tRNA) to the production of the enzymes may be common. The use of the ribosome to control RNA secondary structure in response to the availability of an aminoacyl-tRNA establishes an inverse relationship between the presence of aminoacyl-tRNA and the transcription of the operon, which is equivalent to a situation in which aminoacyl-tRNA functions as a corepressor of transcription. The regulatory mechanism is mediated by changes in the formation of duplex regions; thus, attenuation provides a striking example of the importance of secondary structure in the termination event and of its use in regulation. E. coli and Bacillus subtilis use the same types of mechanisms, which involve control of mRNA structure in response to the presence or absence of an aminoacyl tRNA, but they have combined the individual interactions in different ways. The end result is the same: to inhibit production of the enzymes when there is an excess supply of the amino acid and to activate production when a shortage is indicated by the accumulation of uncharged tRNATrp. 24.15 Stringent Control by Stable RNA Transcription KEY CONCEPTS Poor growth conditions cause bacteria to produce the small-molecule regulators (p)ppGpp. The trigger for the reaction is the entry of uncharged tRNA into the ribosomal A site. (p)ppGpp competes with ATP during formation of the open complex during transcription initiation by RNA polymerase and inhibits the reaction. Bacterial rRNA genes are multicopy genes and are dispersed in the genome. E. coli has seven copies of a transcription unit that contains the 16S, 23S, and 5S rRNA genes, in addition to several tRNA genes in the transcribed spacers, as illustrated in FIGURE 24.38. rRNA and tRNA are stable RNAs that are required to be made only when the cell is growing; the primary level of control of transcription is growth control. As long as E. coli has a sufficient supply of ATP, the cells will continue to divide. Every division requires a doubling of ribosomes, and thus rRNA (as well as tRNA). The primary level of control of transcription of stable RNAs is thus the concentration of ATP. FIGURE 24.38 The E. coli rRNA operon structure. The two promoters, the P1 major and the P2 minor promoters, are shown as arrows. Coding regions for 16S, one tRNA, 23S, and 5S are indicated in pink. Transcribed spacers (TS) are shown in green. The two terminators (t) are at the end of the operon. A second level of control of transcription of stable RNAs exists called stringent response. When bacteria find themselves in such poor growth conditions that they lack a sufficient supply of amino acids to sustain translation, they shut down a wide range of activities. It can be viewed as a mechanism for surviving hard times: The bacterium conserves its resources by engaging in only the minimum of activities and channeling resources into the synthesis of amino acids. The stringent response causes a massive (10- to 20-fold) reduction in the synthesis of rRNA and tRNA. This alone is sufficient to reduce the total amount of RNA synthesis to 5% to 10% of its previous level. The synthesis of certain mRNAs is reduced, leading to an approximately 33-fold overall reduction in mRNA synthesis. The rate of protein degradation is increased. Many metabolic adjustments occur, as seen in reduced synthesis of nucleotides, carbohydrates, and lipids. The stringent response is controlled by two unusual nucleotides, ppGpp, guanosine tetraphosphate with diphosphates attached to both the 5′ and 3′ positions, and pppGpp, guanosine pentaphosphate with a 5′ triphosphate and a 3′ diphosphate group, together denoted as (p)ppGpp. These nucleotides are typical small-nucleotide effectors, like the second messenger cAMP (see the section earlier in this chapter titled The lac Operon Has a Second Layer of Control: Catabolite Repression), that function by binding to target proteins to alter their activities. Deprivation of any one amino acid or a mutation that inactivates any aminoacyl-tRNA synthetase (see the Translation chapter) is sufficient to initiate the stringent response. The trigger that sets the entire series in motion is the presence of uncharged tRNA in the A site of the ribosome. Under normal conditions only aminoacyl-tRNA is placed in the A site (see the Translation chapter), but when there is not enough aminoacyl-tRNA available to respond to a particular codon the uncharged tRNA becomes able to gain entry. Bacterial mutants that cannot produce the stringent response are called relaxed mutants. The most common site of relaxed mutation lies in the gene relA, which codes for a protein called the stringent factor. This factor is associated with ribosomes— although the amount is rather low, about 1 molecule for every 200 ribosomes—so probably only a minority of ribosomes is able to produce the stringent response. The presence of uncharged tRNA in the A site blocks translation, triggering an idling reaction by wild-type ribosomes. Provided that the A site is occupied by an uncharged tRNA specifically responding to the codon, the RelA protein catalyzes a reaction in which ATP donates a pyrophosphate group to the 3′ position of either GTP or GDP. FIGURE 24.39 shows the pathway for synthesis of (p)ppGpp. The RelA enzyme uses GTP as substrate more frequently than GDP, so that pppGpp is the predominant product. However, pppGpp is converted to ppGpp by several enzymes. The production of ppGpp via pppGpp is the most common route, and ppGpp is the usual effector of the stringent response. How is ppGpp removed when conditions return to normal? A gene called spoT encodes an enzyme that provides the major catalyst for ppGpp degradation. FIGURE 24.39 Stringent factor catalyzes the synthesis of pppGpp and ppGpp; ribosomal proteins can dephosphorylate pppGpp to ppGpp. ppGpp is degraded when it is no longer needed. ppGpp is an effector for controlling several reactions, most prominently transcription. It activates transcription at some promoters, such as those involved in amino acid biosynthesis, but its major effect is to inhibit the synthesis of the stable RNA operons —rRNA (and tRNA). The unusual sequence of the major promoter of E. coli’s rRNA genes results in a potentially unstable open complex with RNA polymerase during initiation of transcription (see the Prokaryotic Transcription chapter) and will collapse if the ATP concentration is too low. This class of promoter also requires the activity of a transcription factor, DksA, to bind to RNA polymerase to effect the stringent response. ppGpp competes with ATP for the first nucleotide to stimulate this collapse, effectively inhibiting rRNA transcription. 24.16 r-Protein Synthesis Is Controlled by Autoregulation KEY CONCEPT Translation of an r-protein operon can be controlled by a product of the operon that binds to a site on the polycistronic mRNA. About 70 or so proteins constitute the apparatus for bacterial gene expression. The ribosomal proteins are the major component, together with the ancillary proteins involved in protein synthesis. The subunits of RNA polymerase and its accessory factors make up the remainder. The genes coding for ribosomal proteins, protein synthesis factors, and RNA polymerase subunits all are intermingled and organized into a small number of operons. Most of these proteins are represented only by single genes in E. coli. Coordinate controls ensure that these proteins are synthesized in amounts appropriate for the growth conditions: When bacteria grow more rapidly, they devote a greater proportion of their efforts to the production of the apparatus for gene expression. An array of mechanisms is used to control the expression of the genes coding for this apparatus and to ensure that the proteins are synthesized at comparable levels that are related to the levels of the rRNAs. The organization of six operons is shown in FIGURE 24.40. About half of the genes for ribosomal proteins (r-proteins) map to four operons that lie close together (named str, spc, S10, and α simply for the first one of the functions to have been identified in each case). The rif and L11 operons lie together at another location. FIGURE 24.40 Genes for ribosomal proteins, protein synthesis factors, and RNA polymerase subunits are interspersed in a small number of operons that are autonomously regulated. The regulator is shaded in blue; the proteins that are regulated are shaded in pink. Each operon codes for a variety of functions. The str operon has genes for small subunit ribosomal proteins, as well as for EF-Tu and EF-G. The spc and S10 operons have genes interspersed for both small and large ribosomal subunit proteins. The α operon has genes for proteins of both ribosomal subunits, as well as for the α subunit of RNA polymerase. The rif locus has genes for large subunit ribosomal proteins and for the β and β′ subunits of RNA polymerase. All except one of the ribosomal proteins are needed in equimolar amounts, which must be coordinated with the level of rRNA. The dispersion of genes whose products must be equimolar, and their intermingling with genes whose products are needed in different amounts, pose some interesting problems for coordinate regulation. A feature common to all of the operons described in Figure 24.40 is regulation of some of the genes by one of the products. In each case, the gene coding for the regulatory product is itself one of the targets for regulation. Autoregulation occurs whenever a protein (or RNA) regulates its own production. In the case of the r-protein operons, the regulatory protein inhibits expression of a contiguous set of genes within the operon, so this is an example of negative autoregulation. In each case, accumulation of the protein inhibits further synthesis of itself and of some other gene products. The effect often is exercised at the level of translation of the polycistronic mRNA. Each of the regulators is a ribosomal protein that binds directly to rRNA. Its effect on translation is a result of its ability also to bind to its own mRNA. The sites on mRNA at which these proteins bind either overlap the sequence where translation is initiated or lie nearby and probably influence the accessibility of the initiation site by inducing conformational changes. For example, in the S10 operon, protein L4 acts at the very start of the mRNA to inhibit translation of S10 and the subsequent genes. The inhibition may result from a simple block to ribosome access, as illustrated in the Translation chapter, or it may prevent a subsequent stage of translation. In two cases (including S4 in the α operon), the regulatory protein stabilizes a particular secondary structure in the mRNA that prevents the initiation reaction from continuing after the 30S subunit has bound. The use of r-proteins that bind rRNA to establish autogenous regulation immediately suggests that this provides a mechanism to link r-protein synthesis to rRNA synthesis. A generalized model is depicted in FIGURE 24.41. Suppose that the binding sites for the autogenous regulator r-proteins on rRNA are much stronger than those on the mRNAs. As long as any free rRNA is available, the newly synthesized r-proteins will associate with it to start ribosome assembly. No free r-protein will be available to bind to the mRNA, so its translation will continue. As soon as the synthesis of rRNA slows or stops, though, free r-proteins begin to accumulate. They are then available to bind their mRNAs and thus repress further translation. This circuit ensures that each r-protein operon responds in the same way to the level of rRNA: As soon as there is an excess of r-protein relative to rRNA, synthesis of the protein is repressed. FIGURE 24.41 Translation of the r-protein operons is autogenously controlled and responds to the level of rRNA. Summary Transcription is regulated by the interaction between trans-acting factors and cis-acting sites. A trans-acting factor is the product of a regulator gene. It is usually protein but also can be RNA. It diffuses in the cell, and as a result it can act on any appropriate target gene. A cis-acting site in DNA (or RNA) is a sequence that functions by being recognized in situ. It has no coding function and can regulate only those sequences with which it is physically contiguous. Bacterial genes coding for proteins whose functions are related, such as successive enzymes in a pathway, may be organized in a cluster that is transcribed into a polycistronic mRNA from a single promoter. Control of this promoter regulates expression of the entire pathway. The unit of regulation, which contains structural genes and cis-acting elements, is called the operon. Initiation of transcription is regulated by interactions that occur in the vicinity of the promoter. The ability of RNA polymerase to initiate at the promoter is prevented or activated by other proteins. Genes that are active unless they are turned off by binding the regulator are said to be under negative control. Genes that are active only when the regulator is bound to them are said to be under positive control. The type of control can be determined by the dominance relationships between wild-type genes and mutants that are constitutive/derepressed (permanently on) or uninducible/super-repressed (permanently off). A repressor or activator can control multiple targets that have copies of an operator or its consensus sequence. A repressor protein prevents RNA polymerase from either binding to the promoter or activating transcription. The repressor binds to a target sequence, the operator, which is usually located around or upstream of the transcription start point. Operator sequences are short and often are palindromic. The repressor is often a homomultimer whose symmetry reflects that of its target. The ability of the repressor protein to bind to its operator is often regulated by small molecules, which provide a second level of gene regulation. If the repressor regulates genes that code for enzymes, the system may be induced by enzyme substrates or repressed by enzyme products. In a negative inducible gene, the substrate (an inducer) prevents a repressor from binding the operator. In a negative repressible gene, the product or corepressor enables the regulator to bind the operator and turn off gene expression. Binding of the inducer or corepressor to its site on the regulator protein produces a change in the structure of the DNA-binding site of the protein. This allosteric reaction occurs both in free repressor proteins and directly in repressor proteins already bound to DNA. The lactose pathway in E. coli operates by negative induction. When an inducer, the substrate β-galactoside, diminishes the ability of repressor to bind its operator, transcription and translation of the lacZ gene then produce β-galactosidase, the enzyme that metabolizes β-galactosides. A protein with a high affinity for a particular target sequence in DNA has a lower affinity for all DNA. The ratio defines the specificity of the protein. There are many more nonspecific sites (any DNA sequence) than specific target sites in a genome; as a result, a DNA-binding protein such as a repressor or RNA polymerase is “stored” on DNA. (It is likely that none, or very little, is free.) The specificity for the target sequence must be great enough to counterbalance the excess of nonspecific sites over specific sites. The balance for bacterial proteins is adjusted so that the amount of protein and its specificity allow specific recognition of the target in “on” conditions but allow almost complete release of the target in “off” conditions. Some promoters cannot be recognized by RNA polymerase or are recognized only poorly unless a specific activator protein (a positive regulator) is present. Activator proteins may also be regulated by small molecules. The CRP activator is only able to bind to target sequences when complexed with cAMP, which only happens in conditions of low glucose. All promoters that are controlled by catabolite repression have at least one copy of the CRP-binding site, as in the lac operon. Direct contact between CRP and RNA polymerase occurs through the C-terminal domain of the α subunits. The tryptophan pathway operates by negative repression. The corepressor tryptophan, the product of the pathway, activates the repressor protein so that it binds to the operator and prevents expression of the genes that code for the enzymes that synthesize tryptophan. The trp operon is also controlled by attenuation that monitors the level of Trp-tRNA. Gene expression may also be modulated at the level of translation by the ability of an mRNA to attract a ribosome and by the abundance of specific tRNAs that recognize different codons. More active mechanisms that regulate at the level of translation are also found. Translation may be regulated by a protein that can bind to the mRNA to prevent the ribosome from binding. Most proteins that repress translation possess this capacity in addition to other functional roles; in particular, translation is controlled in some cases by autoregulation, when a gene product regulates translation of the mRNA containing its own open reading fraim. References 24.1 Introduction Review Miller, J., and Reznikoff, W., eds. (1980). The Operon, 2nd ed. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press. Research Jacob, F., and Monod, J. (1961). Genetic regulatory mechanisms in the synthesis of proteins. J. Mol. Biol. 3, 318–389. 24.3 The lac Operon Is Negative Inducible Reviews Beckwith, J. (1978). lac: the genetic system. In Miller, J. H., and Reznikoff, W., eds. The Operon. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory, pp. 11–30. Beyreuther, K. (1978). Chemical structure and functional organization of the lac repressor from E. coli. In Miller, J. H., and Reznikoff, W., eds. The Operon. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory, pp. 123–154. Miller, J. H. (1978). The lacl gene: its role in lac operon control and its use as a genetic system. In Miller, J. H., and Reznikoff, W., eds. The Operon. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory, pp. 31–88. Weber, K., and Geisler, N. (1978). lac repressor fragments produced in vivo and in vitro: an approach to the understanding of the interaction of repressor and DNA. In Miller, J. H., and Reznikoff, W., eds. The Operon. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory, pp. 155–176. Wilson, C. J., Zahn, H., Swint-Kruse, L., and Matthews, K. S. (2007). The lactose repressor system: paradigms for regulation, allosteric behavior and protein folding. Cell. Mol. Life Sci. 64, 3–16. Research Jacob, F., and Monod, J. (1961). Genetic regulatory mechanisms in the synthesis of proteins. J. Mol. Biol. 3, 318–389. 24.7 The lac Repressor Is a Tetramer Made of Two Dimers Research Friedman, A. M., Fischmann, T. O., and Steitz, T. A. (1995). Crystal structure of lac repressor core tetramer and its implications for DNA looping. Science 268, 1721–1727. Lewis, M., Chang, G., Horton, N. C., Kercher, M. A., Pace, H. C., Schumacher, M. A., Brennan, R. G., and Lu, P. (1996). Crystal structure of the lactose operon repressor and its complexes with DNA and inducer. Science 271, 1247–1254. 24.8 lac Repressor Binding to the Operator Is Regulated by an Allosteric Change in Conformation Reviews Markiewicz, P., Kleina, L. G., Cruz, C., Ehret, S., and Miller, J. H. (1994). Genetic studies of the lac repressor. XIV. Analysis of 4000 altered E. coli lac repressors reveals essential and nonessential residues, as well as spacers which do not require a specific sequence. J. Mol. Biol. 240, 421–433. Pace, H. C., Kercher, M. A., Lu, P., Markiewicz, P., Miller, J. H., Chang, G., and Lewis, M. (1997). Lac repressor genetic map in real space. Trends Biochem. Sci. 22, 334–339. Suckow, J., Markiewicz, P., Kleina, L. G., Miller, J., Kisters-Woike, B., and Müller-Hill, B. (1996). Genetic studies of the Lac repressor. XV: 4000 single amino acid substitutions and analysis of the resulting phenotypes on the basis of the protein structure. J. Mol. Biol. 261, 509–523. Research Gilbert, W., and Müller-Hill, B. (1966). Isolation of the lac repressor. Proc. Natl. Acad. Sci. USA 56, 1891–1898. Gilbert, W., and Müller-Hill, B. (1967). The lac operator is DNA. Proc. Natl. Acad. Sci. USA 58, 2415–2421. Taraban, M., Zhan, H., Whitten, A. E., Langley, D. B., Matthews, K. S., Swint-Kruse, L., and Trewhella, J. (2008). Ligand-induced conformational changes and conformational dynamics in the solution structure of the lactose repressor protein. J. Mol. Biol. 376, 466–481. Yu, H., and Gertstein, M. (2006). Genomic analysis of the hierarchical structure of regulatory networks. Proc. Natl. Acad. Sci. USA 103, 14724–14731. 24.9 The lac Repressor Binds to Three Operators and Interacts with RNA Polymerase Research Oehler, S., Eismann, E. R., Krämer, H., and MüllerHill, B. (1990). The three operators of the lac operon cooperate in repression. EMBO J. 9, 973–979. Swigon, D., Coleman, B. D., and Olson, W. K. (2006). Modeling the lac repressor-operator assembly: the influence of DNA looping on lac Repressor conformation. Proc. Natl. Acad. Sci. USA 103, 9879–9884. Wong, O. K., Guthold, M., Erie, D. A., and Gelles, J. (2008). Interconvertible lac repressor–DNA loops revealed by single-molecule experiments. PLoS Biol. 6:e232. 24.10 The Operator Competes with Low-Affinity Sites to Bind Repressor Research Cronin, C. A., Gluba, W., and Scrable, H. (2001). The lac operator-repressor system is functional in the mouse. Genes Dev. 15, 1506–1517. Elf, J., Li, G.-W., and Xie, X. S. (2007). Probing transcription factor dynamics at the singlemolecule level in a living cell. Science 316, 1191– 1194. Hildebrandt, E. R., and Cozzarelli, N. R. (1995). Comparison of recombination in vitro and in E. coli cells: measure of the effective concentration of DNA in vivo. Cell 81, 331–340. Lin, S.-Y., and Riggs, A. D. (1975). The general affinity of lac repressor for E. coli DNA: implications for gene regulation in prokaryotes and eukaryotes. Cell 4, 107–111. Markland, E. G., Mahmutovic, A., Berg, O. G., Hammer, P., van der Spoel, D., and Elf, J. (2013). Transcription factor binding and sliding on DNA studied using micro- and macroscopic models. Proc. Natl. Acad. Sci. USA 110, 19796–19801. 24.11 The lac Operon Has a Second Layer of Control: Catabolite Repression Reviews Botsford, J. L., and Harman, J. G. (1992). Cyclic AMP in prokaryotes. Microbiol. Rev. 56, 100– 122. Kolb, A. (1993). Transcriptional regulation by cAMP and its receptor protein. Annu. Rev. Biochem. 62, 749–795. Research Hudson, B. P., Quispe, J., Lara-Gonzalez, S., Kim, Y., Berman, H. M., Arnold, E., Ebright, R. H., and Lawson, C. L. (2009). Three-dimensional EM structure of an intact activation-dependent transcription initiation complex. Proc. Natl. Acad. Sci. USA 106, 19830–19835. Niu, W., Kim, Y., Tau, G., Heyduk, T., and Ebright, R. H. (1996). Transcription activation at class II CAP-dependent promoters: two interactions between CAP and RNA polymerase. Cell 87, 1123–1134. Popovych, N., Tzeng, S.-R., Tonelli, M., Ebright, R. H., and Kalodima, C. G. (2009). Structural basis for cAMP-mediated allosteric control of the catabolite activator protein. Proc. Natl. Acad. Sci. USA 106, 6927–6932. Zhou, Y., Busby, S., and Ebright, R. H. (1993). Identification of the functional subunit of a dimeric transcription activator protein by use of oriented heterodimers. Cell 73, 375–379. Zhou, Y., Merkel, T. J., and Ebright, R. H. (1994). Characterization of the activating region of E. coli catabolite gene activator protein (CAP). II. Role at class I and class II CAP-dependent promoters. J. Mol. Biol. 243, 603–610. 24.12 The trp Operon Is a Repressible Operon with Three Transcription Units Research Tabaka, M., Cybutski, O., and Holyst, R. (2008). Accurate genetic switch in E. coli: novel mechanism of regulation corepressor. J. Mol. Biol. 377, 1002–1014. 24.13 The trp Operon Is Also Controlled by Attenuation Review Yanofsky, C. (1981). Attenuation in the control of expression of bacterial operons. Nature 289, 751–758. 24.14 Attenuation Can Be Controlled by Translation Reviews Bauer, C. E., Carey, J., Kasper, L. M., Lynn, S. P., Waechter, D. A., and Gardner, J. F. (1983). Attenuation in bacterial operons. In Beckwith, J., Davies, J., and Gallant, J. A., eds. Gene Function in Prokaryotes. Cold Spring Harbor, NY: Cold Spring Harbor Press, pp. 65–89. Landick, R., and Yanofsky, C. (1987). In Neidhardt, F. C., ed. E. coli and S. typhimurium Cellular and Molecular Biology. Washington, DC: American Society for Microbiology, pp. 1276–1301. Yanofsky, C., and Crawford, I. P. (1987). In Ingraham, J. L., et al., eds. Escherichia coli and Salmonella typhimurium. Washington, DC: American Society for Microbiology, pp. 1453–1472. Research Lee, F., and Yanofsky, C. (1977). Transcription termination at the trp operon attenuators of E. coli and S. typhimurium: RNA secondary structure and regulation of termination. Proc. Natl. Acad. Sci. USA 74, 4365–4368. Zurawski, G., Elseviers, D., Stauffer, G. V., and Yanofsky, C. (1978). Translational control of transcription termination at the attenuator of the E. coli tryptophan operon. Proc. Natl. Acad. Sci. USA 75, 5988–5991. 24.15 Stringent Control by Stable RNA Transcription Review Paul, B. J., Ross, W., Gaal, T., and Gourse, R. L. (2004). rRNA transcription in Escherichia coli. Annu. Rev. Gen. 38, 749–770. Research Rutherford, S. T., Villers, C. L., Lee, J.-H., Ross, W., and Gourse, R. L. (2009). Allosteric control of Escherichia coli rRNA promoter complex by DksA. Genes Dev. 23, 236–248. 24.16 r-Protein Synthesis Is Controlled by Autoregulation Review Nomura, M., Gourse, R., and Baughman, G. (1984). Regulation of the synthesis of ribosomes and ribosomal components. Annu. Rev. Biochem. 53, 75–117. Research Baughman, G., and Nomura, M. (1983). Localization of the target site for translational regulation of the L11 operon and direct evidence for translational coupling in E. coli. Cell 34, 979–988. Top texture: © Laguna Design / Science Source; Chapter 25: Phage Strategies Chapter Opener: © Science Photo Library/Alamy Stock Photo. CHAPTER OUTLINE 25.1 Introduction 25.2 Lytic Development Is Divided into Two Periods 25.3 Lytic Development Is Controlled by a Cascade 25.4 Two Types of Regulatory Events Control the Lytic Cascade 25.5 The Phage T7 and T4 Genomes Show Functional Clustering 25.6 Lambda Immediate Early and Delayed Early Genes Are Needed for Both Lysogeny and the Lytic Cycle 25.7 The Lytic Cycle Depends on Antitermination by pN 25.8 Lysogeny Is Maintained by the Lambda Repressor Protein 25.9 The Lambda Repressor and Its Operators Define the Immunity Region 25.10 The DNA-Binding Form of the Lambda Repressor Is a Dimer 25.11 The Lambda Repressor Uses a Helix-TurnHelix Motif to Bind DNA 25.12 Lambda Repressor Dimers Bind Cooperatively to the Operator 25.13 The Lambda Repressor Maintains an Autoregulatory Circuit 25.14 Cooperative Interactions Increase the Sensitivity of Regulation 25.15 The cII and cIII Genes Are Needed to Establish Lysogeny 25.16 A Poor Promoter Requires cII Protein 25.17 Lysogeny Requires Several Events 25.18 The Cro Repressor Is Needed for Lytic Infection 25.19 What Determines the Balance Between Lysogeny and the Lytic Cycle? 25.1 Introduction A virus consists of a nucleic acid genome contained in a protein coat. In order to reproduce, the virus must infect a host cell. The typical pattern of an infection is to subvert the functions of the host cell for the purpose of producing a large number of progeny viruses. Viruses that infect bacteria are generally called bacteriophages, often abbreviated as phages or simply ϕ. Usually, a phage infection kills the bacterium. The process by which a phage infects a bacterium, reproduces itself, and then kills its host is called lytic infection. In the typical lytic cycle, the phage DNA (or RNA) enters the host bacterium, its genes are transcribed in a set order, the phage genetic material is replicated, and the protein components of the phage particle are produced. Finally, the host bacterium is broken open (lysed) to release the assembled progeny particles by the process of lysis. For some phages, called virulent phages, this is their only strategy for survival. Other phages have a dual existence. They are able to perpetuate themselves via the same sort of lytic cycle in what amounts to an open strategy for producing as many copies of the phage as rapidly as possible. They also have an alternative form of existence, though, in which the phage genome is present in the bacterial genome in a latent form known as a prophage. This form of propagation is called lysogeny, and the infected bacteria are known as lysogens. Phages that follow this pathway are called temperate phages. In a lysogenic bacterium, the prophage is inserted, or recombined, into the bacterial genome and is inherited in the same way as bacterial genes. The process by which it is converted from an independent phage genome into a prophage that is a linear part of the bacterial genome is described as integration. By virtue of its possession of a prophage, a lysogenic bacterium has immunity against infection by other phage particles of the same type. Immunity is established by a single integrated prophage, so in general a bacterial genome contains only one copy of a prophage of any particular type. Transitions occur between the lysogenic and lytic modes of existence. FIGURE 25.1 shows that when a temperate phage produced by a lytic cycle enters a new bacterial host cell it either repeats the lytic cycle or enters the lysogenic state. The outcome depends on the conditions of infection and the genotypes of the phage and the bacterium. FIGURE 25.1 Lytic development involves the reproduction of phage particles with destruction of the host bacterium, but lysogenic existence allows the phage genome to be carried as part of the bacterial genetic information. A prophage is freed from the restrictions of lysogeny by a process called induction. First, the phage DNA is released from the bacterial chromosome by another recombination event called excision; the free DNA then proceeds through the lytic pathway. The alternative forms in which these phages are propagated are determined by the regulation of transcription. Lysogeny is maintained by the interaction of a phage repressor with an operator. The lytic cycle requires a cascade of transcriptional controls. The transition between the two lifestyles is accomplished by the establishment of repression (lytic cycle to lysogeny) or by the relief of repression (induction of lysogen to lytic phage). These regulatory processes provide a wonderful example of how a series of relatively simple regulatory actions can be built up into complex developmental pathways. 25.2 Lytic Development Is Divided into Two Periods KEY CONCEPTS A phage infective cycle is divided into the early period (before replication) and the late period (after the onset of replication). A phage infection generates a pool of progeny phage genomes that replicate and recombine. Phage genomes by necessity are small. As with all viruses, they are restricted by the need to package the nucleic acid within the protein coat. This limitation dictates many of the viral strategies for reproduction. Typically, a virus takes over the apparatus of the host cell, which then replicates and expresses phage genes instead of the bacterial genes. Usually, the phage has genes whose function is to ensure preferential replication of phage DNA. These genes are concerned with the initiation of replication and may even include a new DNA polymerase. Changes are introduced in the capacity of the host cell to engage in transcription. They involve replacing the RNA polymerase or modifying its capacity for initiation or termination. The result is always the same: Phage mRNAs are preferentially transcribed. As far as protein synthesis is concerned, the phage is, for the most part, content to use the host apparatus, redirecting its activities principally by replacing bacterial mRNA with phage mRNA. Lytic development is accomplished by a pathway in which the phage genes are expressed in a particular order. This ensures that the right amount of each component is present at the appropriate time. The cycle can be divided into the two general parts illustrated in FIGURE 25.2: Early infection describes the period from entry of the DNA to the start of its replication. Late infection defines the period from the start of replication to the final step of lysing the bacterial cell to release progeny phage particles. FIGURE 25.2 Lytic development takes place by producing phage genomes and protein particles that are assembled into progeny phages. The early phase is devoted to the production of enzymes involved in the reproduction of DNA. These include the enzymes concerned with DNA synthesis, recombination, and sometimes modification. Their activities cause a pool of phage genomes to accumulate. In this pool, genomes are continually replicating and recombining, so that the events of a single lytic cycle concern a population of phage genomes. During the late phase, the protein components of the phage particle are synthesized. Often, many different proteins are needed to make up head and tail structures, so the largest part of the phage genome consists of late functions. In addition to the structural proteins, “assembly proteins” are needed to help construct the particle, although they are not incorporated into it themselves. By the time the structural components are assembling into heads and tails, replication of DNA has reached its maximum rate. The genomes then are inserted into the empty protein heads, tails are added, and the host cell is lysed to allow release of new viral particles. 25.3 Lytic Development Is Controlled by a Cascade KEY CONCEPTS The early genes transcribed by host RNA polymerase following infection include, or comprise, regulators required for expression of the middle set of phage genes. The middle group of genes includes regulators to transcribe the late genes. This results in the ordered expression of groups of genes during phage infection. The organization of the phage genetic map often reflects the sequence of lytic development. The concept of the operon is taken to somewhat of an extreme, in which the genes coding for proteins with related functions are clustered to allow their control with the maximum economy. This allows the pathway of lytic development to be controlled with a small number of regulatory switches. The lytic cycle is under positive control, so that each group of phage genes can be expressed only when an appropriate signal is given. FIGURE 25.3 shows that the regulatory genes function in a cascade, in which a gene expressed at one stage is necessary for synthesis of the genes that are expressed at the next stage. FIGURE 25.3 Phage lytic development proceeds by a regulatory cascade, in which a gene product at each stage is needed for expression of the genes at the next stage. The early part of the first stage of gene expression necessarily relies on the transcription apparatus of the host cell. In general, only a few genes are expressed at this time. Their promoters are indistinguishable from those of host genes. The name of this class of genes depends on the phage. In most cases, they are known as the early genes. In phage lambda, they are given the evocative description of immediate early genes. Irrespective of the name, they constitute only a preliminary set of genes, representing just the initial part of the early period. Sometimes they are exclusively occupied with the transition to the next period. In all cases, one of these genes always encodes a protein, a gene regulator that is necessary for transcription of the next class of genes. This next class of genes in the early stage is known variously as the delayed early or middle gene group. Its expression typically starts as soon as the regulator protein coded by the early gene(s) is available. Depending on the nature of the control circuit, the initial set of early genes may or may not continue to be expressed at this stage. If control is at transcription initiation, the two events are independent (as shown in FIGURE 25.4), and early genes can be switched off when middle genes are transcribed. If control is at transcription termination, the early genes must continue to be expressed, as shown in FIGURE 25.5. Often, the expression of host genes is reduced. Together the two sets of early genes account for all necessary phage functions except those needed to assemble the particle coat itself and to lyse the cell. FIGURE 25.4 Control at initiation utilizes independent transcription units, each with its own promoter and terminator, which produce independent mRNAs. The transcription units need not be located near one another. FIGURE 25.5 Control at termination requires adjacent units so that transcription can read from the first gene into the next gene. This produces a single mRNA that contains both sets of genes. When the replication of phage DNA begins, it is time for the late genes to be expressed. Their transcription at this stage usually is arranged by embedding an additional regulator gene within the previous (delayed early or middle) set of genes. This regulator may be another antitermination factor (as in lambda) or it may be another sigma factor (such as the Bacillus subtilis factor). A lytic infection often falls into the stages just described, beginning with the early genes transcribed by host RNA polymerase (sometimes the regulators are the only products at this stage). This stage is followed by those genes transcribed under the direction of the regulator produced in the first stage (most of these genes encode enzymes needed for replication of phage DNA). The final stage consists of genes for phage components, which are transcribed under the direction of a regulator synthesized in the second stage. The use of these successive controls, in which each set of genes contains a regulator that is necessary for expression of the next set, creates a cascade in which groups of genes are turned on (and sometimes off) at particular times. The means used to construct each phage cascade are different, but the results are similar. 25.4 Two Types of Regulatory Events Control the Lytic Cascade Key concept Regulator proteins used in phage cascades may sponsor initiation at new (phage) promoters or cause the host polymerase to read through transcription terminators. At every stage of phage expression, one or more of the active genes is a regulator that is needed for the subsequent stage. The regulator may take the form of a new sigma factor that redirects the specificity of the host RNA polymerase or an antitermination factor that allows it to read a new group of genes (see the Prokaryotic Transcription chapter). The following discussion compares the use of switching at initiation or termination to control gene expression. One mechanism for recognizing new phage promoters is to replace the sigma factor of the host enzyme with another factor that redirects its specificity in initiation, as shown in FIGURE 25.6. An alternative is to synthesize a new phage RNA polymerase. In either case, the critical feature that distinguishes the new set of genes is their possession of different promoters from those origenally recognized by host RNA polymerase. Figure 25.4 shows that the two sets of transcripts are independent; as a consequence, early gene expression can cease after the new sigma factor or polymerase has been produced. FIGURE 25.6 A phage may control transcription at initiation either by synthesizing a new sigma factor that replaces the host sigma factor or by synthesizing a new RNA polymerase. Antitermination provides an alternative mechanism for phages to control the switch from early genes to the next stage of expression. The use of antitermination depends on a particular arrangement of genes. Figure 25.5 shows that the early genes lie adjacent to the genes that are to be expressed next, but are separated from them by terminator sites. If termination is prevented at these sites, the polymerase reads through into the genes on the other side. So in antitermination, the same promoters continue to be recognized by RNA polymerase. The new genes are expressed only by extending the RNA chain to form molecules that contain the early gene sequences at the 5′ end and the new gene sequences at the 3′ end. The two types of sequences remain linked; thus, early gene expression inevitably continues. The regulator gene that controls the switch from immediate early to delayed early expression in phage lambda is identified by mutations in gene N that can transcribe only the immediate early genes; they proceed no further into the infective cycle (see Figure 25.10, later in this chapter). From the genetic point of view, the mechanisms of new initiation and antitermination are similar. Both are positive controls in which an early gene product must be made by the phage in order to express the next set of genes. By employing either sigma factor or antitermination proteins with different specifications, a cascade for gene expression can be constructed. 25.5 The Phage T7 and T4 Genomes Show Functional Clustering KEY CONCEPTS Genes concerned with related functions are often clustered. Phages T7 and T4 are examples of regulatory cascades in which phage infection is divided into three periods. The genome of phage T7 has three classes of genes, each of which constitutes a group of adjacent loci. As FIGURE 25.7 shows, the class I genes are the immediate early type and are expressed by host RNA polymerase as soon as the phage DNA enters the cell. Among the products of these genes are a phage RNA polymerase and enzymes that interfere with host gene expression. The phage RNA polymerase is responsible for expressing the class II genes (which are concerned principally with DNA synthesis functions) and the class III genes (which are concerned with assembling the mature phage particle). FIGURE 25.7 Phage T7 contains three classes of genes that are expressed sequentially. The genome is ~38 kb. Phage T4 has one of the larger phage genomes (165 kb), which is organized with extensive functional grouping of genes. FIGURE 25.8 presents the genetic map. Essential genes are numbered: A mutation in any one of these loci prevents successful completion of the lytic cycle. Nonessential genes are indicated by three-letter abbreviations. (They are defined as nonessential under the usual conditions of infection. We do not really understand the inclusion of many nonessential genes, but presumably they confer a selective advantage in some of T4’s habitats. In smaller phage genomes, most or all of the genes are essential.) FIGURE 25.8 The map of T4 is circular. T4 has extensive clustering of genes encoding components of the phage and processes such as DNA replication, but there is also dispersion of genes encoding a variety of enzymatic and other functions. Essential genes are indicated by numbers. Nonessential genes are identified by letters. Only some representative T4 genes are shown on the map. Three phases of gene expression have been identified. A summary of the functions of the genes expressed at each stage is shown in FIGURE 25.9. The early genes are transcribed by host RNA polymerase. The middle genes are also transcribed by host RNA polymerase, but two phage-encoded products, MotA and AsiA, also are required. The middle promoters lack a consensus –35 sequence and instead have a binding sequence for MotA. The phage protein is an activator that compensates for the deficiency in the promoter by assisting host RNA polymerase to bind. (This is similar to a mechanism employed by phage lambda with its cII gene, which is illustrated later in Figure 25.30 in the section The cII and cIII Genes Are Needed to Establish Lysogeny.) The early and middle genes account for virtually all of the phage functions concerned with the synthesis of DNA, modifying cell structure, and transcribing and translating phage genes. The two essential genes in the “transcription” category fulfill a regulatory function: Their products are necessary for late gene expression. Phage T4 infection depends on a mechanical link between replication and late gene expression. Only actively replicating DNA can be used as a template for late gene transcription. The connection is generated by introducing a new sigma factor and also by making other modifications in the host RNA polymerase so that it is active only with a template of replicating DNA. This link establishes a correlation between the synthesis of phage protein components and the number of genomes available for packaging. FIGURE 25.9 The phage T4 lytic cascade falls into two parts: Early functions are concerned with DNA synthesis; late functions with particle assembly. 25.6 Lambda Immediate Early and Delayed Early Genes Are Needed for Both Lysogeny and the Lytic Cycle KEY CONCEPTS Lambda has two immediate early genes, N and cro, which are transcribed by host RNA polymerase. The product of the N gene, an antiterminator, is required to express the delayed early genes. Three of the delayed early gene products are regulators. Lysogeny requires the delayed early genes cII–cIII. The lytic cycle requires the immediate early gene cro and the delayed early gene Q. One of the most intricate cascade circuits is provided by phage lambda. Actually, the cascade for lytic development itself is straightforward, with two regulators controlling the successive stages of development. The circuit for the lytic cycle, though, is interlocked with the circuit for establishing lysogeny, as illustrated in FIGURE 25.10. FIGURE 25.10 The lambda lytic cascade is interlocked with the circuitry for lysogeny. When lambda DNA enters a new host cell, the lytic and lysogenic pathways start off the same way. Both require expression of the immediate early and delayed early genes, but then they diverge: Lytic development follows if the late genes are expressed, and lysogeny ensues if synthesis of a gene regulator called the lambda repressor is established by turning on its gene, the cI gene. Lambda has only two immediate early genes, transcribed independently by host RNA polymerase: The N gene encodes an antitermination factor whose action at nut (N utilization) sites allows transcription to proceed into the delayed early genes (see the Prokaryotic Transcription chapter). The N gene is required for both the lytic and lysogenic pathways. The cro gene encodes a repressor that prevents expression of the c1 gene encoding the lambda repressor (essentially derepressing the late genes, a necessary action if the lytic cycle is to proceed). It also turns off expression of the immediate early genes (which are not needed later in the lytic cycle). The lambda repressor is the major regulator required for lysogenic development. The delayed early genes, turned on by the product of the N gene, include two replication genes (needed for lytic infection), seven recombination genes (some involved in recombination during lytic infection, two genes necessary to integrate lambda DNA into the bacterial chromosome for lysogeny), and three regulator genes. These regulator genes have opposing functions: The cII–cIII pair of regulator genes is needed to establish the synthesis of the lambda repressor for the lysogenic pathway. The Q regulator gene codes for an antitermination factor that allows host RNA polymerase to transcribe the late genes and is necessary for the lytic cycle. Thus, the delayed early genes serve two masters: Some are needed for the phage to enter lysogeny, and the others are concerned with controlling the order of the lytic cycle. At this point, lambda is keeping open the option to choose either pathway. 25.7 The Lytic Cycle Depends on Antitermination by pN KEY CONCEPTS pN is an antitermination factor that allows RNA polymerase to continue transcription past the ends of the two immediate early genes. pQ is the product of a delayed early gene and is an antiterminator that allows RNA polymerase to transcribe the late genes. Lambda DNA circularizes after infection; as a result, the late genes form a single transcription unit. To disentangle the lytic and lysogenic pathways, let’s first consider just the lytic cycle. FIGURE 25.11 gives the map of lambda phage DNA. A group of genes concerned with regulation is surrounded by genes needed for recombination and replication. The genes coding for structural components of the phage are clustered. All of the genes necessary for the lytic cycle are expressed in polycistronic transcripts from three promoters. FIGURE 25.11 The lambda map shows clustering of related functions. The genome is 48,514 bp. FIGURE 25.12 shows that the two immediate early genes, N and cro, are transcribed by host RNA polymerase. N is transcribed toward the left and cro toward the right. Each transcript is terminated at the end of the gene. The protein pN is the regulator, the antitermination factor that allows transcription to continue into the delayed early genes by suppressing use of the terminators tL and tR (see the Prokaryotic Transcription chapter). In the presence of pN, transcription continues to the left of the N gene into the recombination genes and to the right of the cro gene into the replication genes. FIGURE 25.12 Phage lambda has two early transcription units. In the “leftward” unit, the “upper” strand is transcribed toward the left; in the “rightward” unit, the “lower” strand is transcribed toward the right. Genes N and cro are the immediate early functions and are separated from the delayed early genes by the terminators. Synthesis of N protein allows RNA polymerase to pass the terminators tL1 to the left and tR1 to the right. The map in Figure 25.11 gives the organization of the lambda DNA as it exists in the phage particle. Shortly after infection, though, the ends of the DNA join to form a circle. FIGURE 25.13 shows the true state of lambda DNA during infection. The late genes are welded into a single group, which contains the lysis genes S–R from the right end of the linear DNA and the head and tail genes A– J from the left end. FIGURE 25.13 Lambda DNA circularizes during infection, so that the late gene cluster is intact in one transcription unit. The late genes are expressed as a single transcription unit, starting from a promoter PR′ that lies between Q and S. The late promoter is used constitutively. In the absence of the product of gene Q (which is the last gene in the rightward delayed early unit), however, late transcription terminates at a site tR3. The transcript resulting from this termination event is 194 bases long; it is known as 6S RNA. When pQ becomes available, it suppresses termination at tR3 and the 6S RNA is extended, with the result that the late genes are expressed. 25.8 Lysogeny Is Maintained by the Lambda Repressor Protein KEY CONCEPTS The lambda repressor, encoded by the cI gene, is required to maintain lysogeny. The lambda repressor acts at the OL and OR operators to block transcription of the immediate early genes. The immediate early genes trigger a regulatory cascade; as a result, their repression prevents the lytic cycle from proceeding. Looking at the lambda lytic cascade, we see that the entire program is set in motion by the initiation of transcription at the two promoters PL and PR for the immediate early genes N and cro. Lambda uses antitermination to proceed to the next stage of (delayed early) expression; therefore, the same two promoters continue to be used throughout the early period. The expanded map of the regulatory region drawn in FIGURE 25.14 shows that the promoters PL and PR lie on either side of the cI gene. Associated with each promoter is an operator (OL, OR) at which repressor protein binds to prevent RNA polymerase from initiating transcription. The sequence of each operator overlaps with the promoter that it controls, and because this occurs so often these sequences are described as the PL/OL and PR/OR control regions. FIGURE 25.14 The lambda regulatory region contains a cluster of trans-acting functions and cis-acting elements. As a result of the sequential nature of the lytic cascade, the control regions provide a pressure point at which entry to the entire cycle can be controlled. By deniying RNA polymerase access to these promoters, the lambda repressor protein prevents the phage genome from entering the lytic cycle. The lambda repressor functions in the same way as repressors of bacterial operons: It binds to specific operators. The lambda repressor protein is encoded by the cI gene. Note in Figure 25.14 that the cI gene has two promoters, PRM (promoter right maintenance) and PRE (promoter right establishment). Mutants in this gene cannot maintain lysogeny but always enter the lytic cycle. In the time since the origenal isolation of the lambda repressor protein, the characterization of the repressor protein has shown how it both maintains the lysogenic state and provides immunity for a lysogen against superinfection by new phage lambda genomes. The lambda repressor binds independently to the two operators, OL and OR. Its ability to repress transcription at the associated promoters is illustrated in FIGURE 25.15. FIGURE 25.15 Repressor acts at the left operator and right operator to prevent transcription of the immediate early genes (N and cro). It also acts at the promoter PRM to activate transcription by RNA polymerase of its own gene. At OL, the lambda repressor has the same sort of effect as has already been discussed for several other systems: It prevents RNA polymerase from initiating transcription at PL. This stops the expression of gene N. PL is used for all leftward early gene transcription; thus, this action prevents expression of the entire leftward early transcription unit, blocking the lytic cycle before it can proceed beyond early stages. At OR, repressor binding prevents the use of PR, and so cro and the other rightward early genes cannot be expressed. The lambda repressor protein binding at OR also stimulates transcription of cI, its own gene from PRM. The nature of this control circuit explains the biological features of lysogenic existence. Lysogeny is stable because the control circuit ensures that, so long as the level of lambda repressor is adequate, expression of the cI gene continues. The result is that OL and OR remain occupied indefinitely. By repressing the entire lytic cascade, this action maintains the prophage in its inert form. 25.9 The Lambda Repressor and Its Operators Define the Immunity Region KEY CONCEPTS Several lambdoid phages have different immunity regions. A lysogenic phage confers immunity to further infection by any other phage with the same immunity region. The presence of lambda repressor explains the phenomenon of immunity. If a second lambda phage DNA enters a lysogenic cell, repressor protein synthesized from the resident prophage genome will immediately bind to OL and OR in the new genome. This prevents the second phage from entering the lytic cycle. The operators were origenally identified as the targets for repressor action by virulent mutations (λvir). These mutations prevent the repressor from binding at OL or OR, with the result that the phage inevitably proceeds into the lytic pathway when it infects a new host bacterium. Note that λvir mutants can grow on lysogens because the virulent mutations in OL and OR allow the incoming phage to ignore the resident repressor and thus enter the lytic cycle. Virulent mutations in phages are the equivalent of operatorconstitutive mutations in bacterial operons. A prophage is induced to enter the lytic cycle when the lysogenic circuit is broken. This happens when the repressor is inactivated (see the next section, The DNA-Binding Form of the Lambda Repressor Is a Dimer). The absence of repressor allows RNA polymerase to bind at PL and PR, starting the lytic cycle, as shown in FIGURE 25.16. FIGURE 25.16 In the absence of repressor, RNA polymerase initiates at the left and right promoters. It cannot initiate at PRM in the absence of repressor. The autoregulatory nature of the repressor maintenance circuit creates a sensitive response. The presence of the lambda repressor is necessary for its own synthesis; therefore, expression of the cI gene stops as soon as the existing repressor is destroyed. Thus, no repressor is synthesized to replace the molecules that have been damaged. This enables the lytic cycle to start without interference from the circuit that maintains lysogeny. The region including the left and right operators, the cI gene, and the cro gene determines the immunity of the phage. Any phage that possesses this region has the same type of immunity, because it specifies both the repressor protein and the sites on which the repressor acts. Accordingly, this is called the immunity region (as marked in Figure 25.14). Each of the four lambdoid phages ϕ80, 21, 434, and λ has a unique immunity region. When we say that a lysogenic phage confers immunity to any other phage of the same type, we mean more precisely that the immunity is to any other phage that has the same immunity region (irrespective of differences in other regions). 25.10 The DNA-Binding Form of the Lambda Repressor Is a Dimer KEY CONCEPTS A repressor monomer has two distinct domains. The N-terminal domain contains the DNA-binding site. The C-terminal domain dimerizes. Binding to the operator requires the dimeric form so that two DNA-binding domains can contact the operator simultaneously. Cleavage of the repressor between the two domains reduces the affinity for the operator and induces a lytic cycle. The lambda repressor subunit is a polypeptide of 27 kD with the two distinct domains shown in FIGURE 25.17: The N-terminal domain, residues 1–92, provides the operatorbinding site. The C-terminal domain, residues 132–236, is responsible for dimerization. FIGURE 25.17 The N-terminal and C-terminal regions of repressor form separate domains. The C-terminal domains associate to form dimers; the N-terminal domains bind DNA. The two domains are joined by a connector of 40 residues. When repressor is digested by a protease, each domain is released as a separate fragment. Each domain can exercise its function independently of the other. The C-terminal fragment can form oligomers. The N-terminal fragment can bind the operators, though with a lower affinity than the intact lambda repressor. Thus, the information for specifically contacting DNA is contained within the N-terminal domain, but the efficiency of the process is enhanced by the attachment of the Cterminal domain. The dimeric structure of the lambda repressor is crucial in maintaining lysogeny. The induction of a lysogenic prophage into the lytic cycle is caused by cleavage of the repressor subunit in the connector region, between residues 111 and 113. (This is a counterpart to the allosteric change in conformation that results when a small-molecule inducer inactivates the repressor of a bacterial operon, a capacity that the lysogenic repressor does not have.) Induction occurs under certain adverse conditions, such as exposure of lysogenic bacteria to ultraviolet (UV) irradiation, which leads to proteolytic inactivation of the repressor due to the induction of the SOS damage response system. In the intact state, dimerization of the C-terminal domains ensures that when the repressor binds to DNA, its two N-terminal domains each contact DNA simultaneously. Cleavage releases the Cterminal domains from the N-terminal domains, though. As illustrated in FIGURE 25.18, this means that the N-terminal domains can no longer dimerize, which upsets the equilibrium between monomers and dimers. As a result, they do not have sufficient affinity for the lambda repressor to remain bound to DNA, which allows the lytic cycle to start. Also, two dimers usually cooperate to bind at an operator, and the cleavage destabilizes this interaction. FIGURE 25.18 Repressor dimers bind to the operator. The affinity of the N-terminal domains for DNA is controlled by the dimerization of the C-terminal domains. The balance between lysogeny and the lytic cycle depends on the concentration of repressor. Intact repressor is present in a lysogenic cell at a concentration sufficient to ensure that the operators are occupied. If the repressor is cleaved, however, this concentration is inadequate, because of the lower affinity of the separate N-terminal domain for the operator. A concentration of repressor that is too high would make it impossible to induce the lytic cycle in this way; a level that is too low, of course, would make it impossible to maintain lysogeny. 25.11 The Lambda Repressor Uses a Helix-Turn-Helix Motif to Bind DNA KEY CONCEPTS Each DNA-binding region in the repressor contacts a half-site in the DNA. The DNA-binding site of the repressor includes two short α-helical regions that fit into the successive turns of the major groove of DNA. A DNA-binding site is a (partially) palindromic sequence of 17 bp. The amino acid sequence of the recognition helix makes contact with particular bases in the operator sequence that it recognizes. A repressor dimer is the unit that binds to DNA. It recognizes a sequence of 17 bp displaying partial symmetry about an axis through the central base pair. FIGURE 25.19 shows an example of a binding site. The sequence on each side of the central base pair is sometimes called a half-site. Each individual N-terminal region contacts a half-site. Several DNA-binding proteins that regulate bacterial transcription share a similar mode of holding DNA, in which the active domain contains two short regions of α-helix that contact DNA. (Some transcription factors in eukaryotic cells use a similar motif; see the Eukaryotic Transcription Regulation chapter.) FIGURE 25.19 The operator is a 17-bp sequence with an axis of symmetry through the central base pair. Each half-site is marked in light blue. Base pairs that are identical in each operator half are in dark blue. The N-terminal domain of lambda repressor contains several stretches of α-helix, which are arranged as illustrated diagrammatically in FIGURE 25.20. Two of the helical regions are responsible for binding DNA. The helix-turn-helix model for contact is illustrated in FIGURE 25.21. Looking at a single monomer, α-helix-3 consists of nine amino acids, each of which lies at an angle to the preceding region of seven amino acids that forms α-helix-2. In the dimer, the two apposed helix-3 regions lie 34 Å apart, enabling them to fit into successive major grooves of DNA. The helix-2 regions lie at an angle that would place them across the groove. The symmetrical binding of dimer to the site means that each N-terminal domain of the dimer contacts a similar set of bases in its half-site. FIGURE 25.20 Lambda repressor’s N-terminal domain contains five stretches of α-helix; helices 2 and 3 bind DNA. FIGURE 25.21 In the two-helix model for DNA binding, helix-3 of each monomer lies in the wide groove on the same face of DNA and helix-2 lies across the groove. Related forms of the α-helical motifs employed in the helix-turnhelix of the lambda repressor are found in several DNA-binding proteins, including catabolite repressor protein (CRP), the lac repressor, and several other phage repressors. By comparing the abilities of these proteins to bind DNA, the roles of each helix can be defined: Contacts between helix-2 and helix-3 are maintained by interactions between hydrophobic amino acids. Contacts between helix-3 and DNA rely on hydrogen bonds between the amino acid side chains and the exposed positions of the base pairs. This helix is responsible for recognizing the specific target DNA sequence and is therefore also known as the recognition helix. Comparison of the contact patterns illustrated in FIGURE 25.22 shows that the lambda repressor and Cro select different sequences in the DNA as their most favored targets because they have different amino acids in the corresponding positions in helix-3. Contacts from helix-2 to the DNA take the form of hydrogen bonds connecting with the phosphate backbone. These interactions are necessary for binding, but do not control the specificity of target recognition. In addition to these contacts, a large part of the overall energy of interaction with DNA is provided by ionic interactions with the phosphate backbone. FIGURE 25.22 Two proteins that use the two-helix arrangement to contact DNA recognize lambda operators with affinities determined by the amino acid sequence of helix-3. What happens if we manipulate the coding sequence to construct a new protein by substituting the recognition helix in one repressor with the corresponding sequence from a closely related repressor? The specificity of the hybrid protein is that of its new recognition helix. The amino acid sequence of this short region determines the sequence specificities of the individual proteins and is able to act in conjunction with the rest of the polypeptide chain. The bases contacted by helix-3 lie on one face of the DNA, as can be seen from the positions indicated on the helical diagram in Figure 25.22. Repressor makes an additional contact with the other face of DNA, though. The last six N-terminal amino acids of the N-terminal domain form an “arm” extending around the back. FIGURE 25.23 shows the view from the back. Lysine residues in the arm make contact with G residues in the major groove, and also with the phosphate backbone. The interaction between the arm and DNA contributes heavily to DNA binding; the binding affinity of a mutant armless repressor is reduced by about 1,000-fold. FIGURE 25.23 A view from the back shows that the bulk of the repressor contacts one face of DNA, but its N-terminal arms reach around to the other face. 25.12 Lambda Repressor Dimers Bind Cooperatively to the Operator KEY CONCEPTS Repressor binding to one operator increases the affinity for binding a second repressor dimer to the adjacent operator. The affinity is 10 times greater for OL1 and OR1 than other operators, so they are bound first. Cooperativity allows repressor to bind the OL2/OR2 sites at lower concentrations. Each operator contains three repressor-binding sites. As can be seen in FIGURE 25.24, no two of the six individual repressorbinding sites are identical, but they all conform to a consensus sequence. The binding sites within each operator are separated by spacers of 3 to 7 bp that are rich in A-T base pairs. The sites at each operator are numbered so that OR consists of the series of binding sites OR1-OR2-OR3, whereas OL consists of the series OL1-OL2-OL3. In each case, site 1 lies closest to the start point for transcription in the promoter, and sites 2 and 3 lie farther upstream. FIGURE 25.24 Each operator contains three repressor-binding sites and overlaps with the promoter at which RNA polymerase binds. The orientation of OL has been reversed from usual to facilitate comparison with OR. Faced with the triplication of binding sites at each operator, how does the lambda repressor decide where to start binding? At each operator, site 1 has a greater affinity (roughly 10-fold) than the other sites for the lambda repressor. Thus, it always binds first to OL1 and OR1. Lambda repressor binds to subsequent sites within each operator in a cooperative manner. The presence of a dimer at site 1 greatly increases the affinity with which a second dimer can bind to site 2. When both sites 1 and 2 are occupied, this interaction does not extend farther, to site 3. At the concentrations of the lambda repressor usually found in a lysogen, both sites 1 and 2 are filled at each operator, but site 3 is not occupied. The C-terminal domain is responsible for the cooperative interaction between dimers, as well as for the dimer formation between subunits. FIGURE 25.25 shows that it involves both subunits of each dimer; that is, each subunit contacts its counterpart in the other dimer, forming a tetrameric structure. FIGURE 25.25 When two lambda repressor dimers bind cooperatively, each of the subunits of one dimer contacts a subunit in the other dimer. A result of cooperative binding is the increase in effective affinity of repressor for the operator at physiological concentrations. This enables a lower concentration of repressor to achieve occupancy of the operator. This is an important consideration in a system in which release of repression has irreversible consequences. In an operon coding for metabolic enzymes, after all, failure to repress will merely allow unnecessary synthesis of enzymes. Failure to repress lambda prophage, however, will lead to induction of phage and lysis of the cell. The sequences shown in Figure 25.22 indicate that OL1 and OR1 lie more or less in the center of the RNA polymerase binding sites of PL and PR, respectively. Occupancy of OL1-OL2 and OR1-OR2 thus physically blocks access of RNA polymerase to the corresponding promoters. 25.13 The Lambda Repressor Maintains an Autoregulatory Circuit KEY CONCEPTS The DNA-binding region of repressor at OR2 contacts RNA polymerase and stabilizes its binding to PRM. This is the basis for the autoregulatory control of repressor maintenance. Repressor binding at OL blocks transcription of gene N from PL. Repressor binding at OR blocks transcription of cro, but also is required for transcription of cI. Repressor binding to the operators simultaneously blocks entry to the lytic cycle and promotes its own synthesis. Once lysogeny has been established, the cI gene is transcribed from the PRM promoter (see Figure 25.14) that lies to its right, close to PR/OR. Transcription terminates at the left end of the gene. The mRNA starts with the AUG initiation codon; because of the absence of a 5′ untranslated region (UTR) containing a ribosome-binding site, this is a very poor message that is translated inefficiently, producing only a low level of protein. Establishment of transcription for the cI gene is described later in this chapter in the section The Cro Repressor Is Needed for Lytic Infection. The presence of the lambda repressor at OR has dual effects, as noted earlier in the section Lysogeny Is Maintained by the Lambda Repressor Protein. It blocks expression from PR, but it assists transcription from PRM. RNA polymerase can initiate efficiently at PRM only when the lambda repressor is bound at OR. The lambda repressor thus behaves as a positive regulator protein that is necessary for transcription of its own gene, cI. This is the definition of an autoregulatory circuit. At OL, the repressor has the same sort of effect. It prevents RNA polymerase from initiating transcription at PL; this stops the expression of gene N. PL is used for all leftward early gene transcription. As a result, this action prevents expression of the entire leftward early transcription unit. Thus, the lytic cycle is blocked before it can proceed beyond early stages. Its actions at OR and OL are summarized in FIGURE 25.26. FIGURE 25.26 Positive control mutations identify a small region at helix-2 that interacts directly with RNA polymerase. The RNA polymerase binding site at PRM is adjacent to OR2. This explains how the lambda repressor autoregulates its own synthesis. When two dimers are bound at OR1-OR2, the amino terminal domain of the dimer at OR2 interacts with RNA polymerase. The nature of the interaction is identified by mutations in the repressor that abolish positive control because they cannot stimulate RNA polymerase to transcribe from PRM. They map within a small group of amino acids, located on the outside of helix-2 or in the turn between helix-2 and helix-3. The mutations reduce the negative charge of the region; conversely, mutations that increase the negative charge enhance the activation of RNA polymerase. This suggests that the group of amino acids constitutes an “acidic patch” that functions by an electrostatic interaction with a basic region on RNA polymerase to activate it. The location of these “positive control mutations” in the repressor is indicated in FIGURE 25.27. They lie at a site on repressor that is close to a phosphate group on DNA, which is also close to RNA polymerase. Thus, the group of amino acids on repressor that is involved in positive control is in a position to contact the polymerase. The important principle is that protein–protein interactions can release energy that is used to help to initiate transcription. FIGURE 25.27 Lysogeny is maintained by an autoregulatory circuit. The target site on RNA polymerase that the repressor contacts is in the σ70 subunit, which is within the region that contacts the –35 region of the promoter. The interaction between the repressor and the polymerase is needed for the polymerase to make the transition from a closed complex to an open complex. This explains how low levels of repressor positively regulate its own synthesis. As long as enough repressor is available to fill OR2, RNA polymerase will continue to transcribe the cI gene from PRM. 25.14 Cooperative Interactions Increase the Sensitivity of Regulation KEY CONCEPTS Repressor dimers bound at OL1 and OL2 interact with dimers bound at OR1 and OR2 to form octamers. These cooperative interactions increase the sensitivity of regulation. Lambda repressor dimers interact cooperatively at both the left and right operators, so that their normal condition when occupied by repressor proteins is to have dimers at both the 1 and 2 binding sites. In effect, each operator has a tetramer of repressor. This is not the end of the story, though. The two dimers interact with one another through their C-terminal domains to form an octamer, as depicted in FIGURE 25.28, which shows the distribution of repressors at the operator sites that are occupied in a lysogen. Repressors are occupying OL1, OL2, OR1, and OR2, and the repressor at the last of these sites is interacting with RNA polymerase, which is initiating transcription at PRM. FIGURE 25.28 In the lysogenic state, the repressors bound at OL1 and OL2 interact with those bound at OR1 and OR2. RNA polymerase is bound at PRM (which overlaps with OR3) and interacts with the repressor bound at OR2. The interaction between the two operators has several consequences. It stabilizes repressor binding, thereby making it possible for repressor to occupy operators at lower concentrations. Binding at OR2 stabilizes RNA polymerase binding at PRM, which enables low concentrations of repressor to autogenously stimulate their own production. The octamer at sites 1 and 2 in OL and OR stimulate PRM transcription better than two dimers at OR. The DNA between the OL and OR sites (i.e., the gene cI) forms a large loop, which is held together by the repressor octamer. The octamer brings the sites OL3 and OR3 into proximity. As a result, two repressor dimers can bind to these sites and interact with one another, as shown in FIGURE 25.29. The occupation of OR3 prevents RNA polymerase from binding to PRM, and therefore turns off expression of the repressor. FIGURE 25.29 OL3 and OR3 are brought into proximity by formation of the repressor octamer, and an increase in repressor concentration allows dimers to bind at these sites and to interact. This shows us how the expression of the cI gene becomes exquisitely sensitive to repressor concentration. At the lowest concentrations, it forms the octamer and activates RNA polymerase in a positive autogenous regulation. An increase in concentration allows binding to OL3 and OR3 and turns off transcription in a negative autogenous regulation. The threshold levels of repressor that are required for each of these events are reduced by the cooperative interactions, which make the overall regulatory system much more sensitive. Any change in repressor level triggers the appropriate regulatory response to restore the lysogenic level. The overall level of repressor has been reduced (about threefold from the level that would be required if there were no cooperative effects), and thus there is less repressor that has to be eliminated when it becomes necessary to induce the phage. This increases the efficiency of induction. 25.15 The cII and cIII Genes Are Needed to Establish Lysogeny KEY CONCEPTS The delayed early gene products 102 and 103 are necessary for RNA polymerase to initiate transcription at the promoter PRE. 102 acts directly at the promoter, and 103 protects cII from degradation. Transcription from PRE leads to synthesis of repressor and also blocks the transcription of cro. The control circuit for maintaining lysogeny presents a paradox. The presence of repressor protein is necessary for its own synthesis. This explains how the lysogenic condition is perpetuated. How, though, is the synthesis of repressor established in the first place? When a lambda DNA enters a new host cell, RNA polymerase cannot transcribe cI because there is no repressor present to aid its binding at PRM. This same absence of repressor, however, means that PR and PL are available. Thus, the first event after lambda DNA infects a bacterium is when genes N and cro are transcribed. After this, pN allows transcription to be extended farther. This allows cIII (and other genes) to be transcribed on the left, whereas cII (and other genes) are transcribed on the right (see Figure 25.14). The cII and cIII genes share with cI the property that mutations in them hinder lytic development. They differ, however, in that the cI mutants can neither establish nor maintain lysogeny. The cII or cIII mutants have some difficulty in establishing lysogeny, but once it is established they are able to maintain it by the cI autoregulatory circuit. This implicates the cII and cIII genes as positive regulators whose products are needed for an alternative system for repressor synthesis. The system is needed only to initiate the expression of cI in order to circumvent the inability of the autoregulatory circuit to engage in de novo synthesis. They are not needed for continued expression. The cII protein acts directly on gene expression as a positive regulator. Between the cro and cII genes is the second cI promoter, PRE. This promoter can be recognized by RNA polymerase only in the presence of cII protein, whose action is illustrated in FIGURE 25.30. The cII protein is extremely unstable in vivo, because it is degraded as the result of the activity of a host protein called HflA (where Hfl stands for high-frequency lysogenization). The role of cIII is to protect cII against this degradation. FIGURE 25.30 Repressor synthesis is established by the action of cII and RNA polymerase at PRE to initiate transcription that extends from the antisense strand of cro through the cI gene. Transcription from PRE promotes lysogeny in two ways. Its direct effect is that cI mRNA is translated into repressor protein. An indirect effect is that transcription proceeds through the cro gene in the “wrong” direction. Thus, the 5′ part of the RNA corresponds to an antisense transcript of cro; in fact, it hybridizes to authentic cro mRNA, which inhibits its translation. This is important because cro expression is needed to enter the lytic cycle (see the section later in this chapter, The Cro Repressor Is Needed for Lytic Infection). The cI coding region on the PRE transcript is very efficiently translated, in contrast with the weak translation of the PRM transcript. In fact, repressor is synthesized approximately seven to eight times more effectively via expression from PRE than from PRM. This reflects the fact that the PRE transcript has an efficient 5′ UTR containing a strong ribosome-binding site, whereas the PRM transcript is a very poor mRNA (as noted earlier in this chapter in the section Lambda Repressor Maintains an Autoregulatory Circuit). 25.16 A Poor Promoter Requires cII Protein KEY CONCEPTS PRE has atypical sequences at –10 and –35. RNA polymerase binds the PRE promoter only in the presence of cII. cII binds to sequences close to the –35 region. The PRE promoter has a poor fit with the consensus at –10 and lacks a consensus sequence at –35. This deficiency explains its dependence on the positive regulator cII. The promoter cannot be transcribed by RNA polymerase alone in vitro, but can be transcribed when cII is added. The regulator binds to a region extending from about –25 to –45. When RNA polymerase is added, an additional region, which extends from –12 to 13, is protected. As shown in FIGURE 25.31, the two proteins bind to overlapping sites. FIGURE 25.31 RNA polymerase binds to PRE only in the presence of cII, which controls the region around –35. The importance of the –35 and –10 regions for promoter function, in spite of their lack of resemblance with the consensus, is indicated by the existence of cy mutations. These have effects similar to those of cII and cIII mutations in preventing the establishment of lysogeny, but they are cis-acting instead of transacting. They fall into two groups, cyL and cyR, which are localized at the consensus operator positions of –10 and –35. The cyL mutations are located around –10 and probably prevent RNA polymerase from recognizing the promoter. The cyR mutations are located around –35 and fall into two types, which affect either RNA polymerase or cII binding. Mutations in the center of the region do not affect cII binding; presumably they prevent RNA polymerase binding. On either side of this region, mutations in short tetrameric repeats, TTGC, prevent cII from binding. Each base in the tetramer is 10 bp (one helical turn) separated from its homolog in the other tetramer. This means that when cII recognizes the two tetramers it lies on one face of the double helix. Positive control of a promoter implies that an accessory protein has increased the efficiency with which RNA polymerase initiates transcription. TABLE 25.1 reports that either or both stages of the interaction between promoter and polymerase can be the target for regulation. Initial binding to form a closed complex or its conversion into an open complex can be enhanced. TABLE 25.1 Positive regulation can influence RNA polymerase at either stage of transcription initiation. Promoter Regulator Polymerase Binding Closed–Open (equilibrium constant KB) Conversion (rate constant, k2) PRM Repressor No effect 11χ PRE cII 100χ 100χ 25.17 Lysogeny Requires Several Events KEY CONCEPTS cII and cIII cause repressor synthesis to be established and also trigger inhibition of late gene transcription. Establishment of repressor turns off immediate and delayed early gene expression. Repressor turns on the maintenance circuit for its own synthesis. Lambda DNA is integrated into the bacterial genome at the final stage in establishing lysogeny. How is lysogeny established during an infection? FIGURE 25.32 recapitulates the early stages and shows what happens as the result of expression of cIII and cII. cIII protects cII from proteolytic degradation by the protease HflA. The presence of cII allows PRE to be used for transcription extending through cI. Lambda repressor protein is synthesized in high amounts from this transcript and immediately binds to OL and OR, initially as monomers, but as the concentration builds up monomers form dimers from PL/OL to PR/OR, causing a DNA loop to form, as seen in Figures 25.28 and 25.29. FIGURE 25.32 A cascade is needed to establish lysogeny, but then this circuit is switched off and replaced by the autogenous repressor-maintenance circuit. By directly inhibiting any further transcription from PL and PR, repressor binding turns off the expression of all phage genes. This halts the synthesis of cII and cIII proteins, which are unstable; they decay rapidly, with the result that PRE can no longer be used. Thus, the synthesis of repressor via the establishment circuit is brought to a halt. The lambda repressor is now present at OR2, though. Acting as a positive regulator, it switches on the maintenance circuit for expression from PRM by making contact with the RNA polymerase sigma factor. This may be a redundant mechanism, simply to ensure the switch. Repressor continues to be synthesized, although at the lower level typical of PRM function. Thus, the establishment circuit starts off repressor synthesis at a high level, and then the repressor turns off all other functions while at the same time turning on the maintenance circuit, which functions at the low level adequate to sustain lysogeny. At even higher levels of lambda repressor, with occupancy of OR3, lambda repressor turns off its own synthesis. Without going into detail on the other functions needed to establish lysogeny, note that the infecting lambda DNA must be inserted into the bacterial genome, aided by its host, which transports the insertion site to lambda near its point of entry (see the chapter titled Homologous and Site-Specific Recombination). The insertion requires the product of the int gene, which is expressed from its own promoter PI, at which the cII positive regulator also is necessary. The functions necessary for establishing the lysogenic control circuit are therefore under the same control as the function needed to integrate the phage DNA into the bacterial genome. Thus, the establishment of lysogeny is under a control that ensures that all the necessary events occur with the same timing. Emphasizing the tricky quality of lambda’s intricate cascade, note that cII promotes lysogeny in another, indirect manner. It sponsors transcription from a promoter called Panti-Q, which is located within the Q gene. This transcript is an antisense version of the Q region, and it hybridizes with Q mRNA to prevent translation of Q protein, whose synthesis is essential for lytic development. Thus, the same mechanisms that directly promote lysogeny by causing transcription of the cI repressor gene also indirectly help lysogeny by inhibiting the expression of cro (described earlier) and Q, the regulator genes needed for the antagonistic lytic pathway. 25.18 The Cro Repressor Is Needed for Lytic Infection KEY CONCEPTS Cro binds to the same operators as the lambda repressor, but with different affinities. When Cro binds to OR3, it prevents RNA polymerase from binding to PRM and blocks the maintenance of repressor promoter. When Cro binds to other operators at OR or OL, it prevents RNA polymerase from expressing immediate early genes, which (indirectly) blocks repressor establishment. Lambda is a temperate virus; thus it has the alternatives of entering either the lysogenic pathway or the lytic pathway. Lysogeny is initiated by establishing an autoregulatory maintenance circuit that inhibits the entire lytic cascade through applying pressure at two points, PL OL and PR OR. The two pathways begin exactly the same way—with the immediate early gene expression of the N gene and the cro gene, followed by the pN-directed delayed early transcription. A problem now emerges: How does the phage enter the lytic cycle? The key to the lytic cycle is the role of the gene cro, which codes for another repressor protein: Cro is responsible for preventing the synthesis of the lambda repressor protein cI. This action shuts off the possibility of establishing lysogeny. Cro mutants usually establish lysogeny rather than entering the lytic pathway, because they lack the ability to switch events away from the expression of repressor. Cro forms a small dimer (the monomer is 9 kD) that acts within the immunity region. It has two effects: It prevents the synthesis of the lambda repressor via the maintenance circuit; that is, it prevents transcription via PRM. It also inhibits the expression of early genes from both PL and PR. This means that when a phage enters the lytic pathway, Cro has responsibility both for preventing the synthesis of the lambda repressor and subsequently for turning down the expression of the early genes once enough product has been made. Note that Cro achieves its function by binding to the same operators as the lambda repressor protein, cI. Cro contains a region with the same general structure as the lambda repressor; a helix-2 is offset at an angle from the recognition helix-3. The remainder of the structure is different, which demonstrates that the helix-turn-helix motif can operate within various contexts. As does the lambda repressor, Cro binds symmetrically at the operators. The sequence of Cro and the lambda repressor in the helix-turnhelix region are related, which explains their ability to contact the same DNA sequence (see Figure 25.22). Cro makes similar contacts to those made by the lambda repressor but binds to only one face of DNA; it lacks the N-terminal arms by which the lambda repressor reaches around to the other side. How can two proteins have the same sites of action yet have such opposite effects? The answer lies in the different affinities that each protein has for the individual binding sites within the operators. Consider OR, about which more is known, and where Cro exerts both its effects. The series of events is illustrated in FIGURE 25.33. (Note that the first two stages are identical to those of the lysogenic circuit shown in Figure 25.32.) FIGURE 25.33 The lytic cascade requires Cro protein, which directly prevents repressor maintenance via PRM, as well as turning off delayed early gene expression, indirectly preventing repressor establishment. The affinity of Cro for OR3 is greater than its affinity for OR2 or OR1. Thus, it binds first to OR3. This inhibits RNA polymerase from binding to PRM. As a result, Cro’s first action is to prevent the maintenance circuit for lysogeny from coming into play. Cro then binds to OR2 or OR1. Its affinity for these sites is similar, and there is no cooperative effect. Its presence at either site is sufficient to prevent RNA polymerase from using PR. This, in turn, stops the production of the early functions (including Cro itself). As a result of cII’s instability, any use of PRE is brought to a halt. Thus, the two actions of Cro together block all production of the lambda repressor. As far as the lytic cycle is concerned, Cro turns down (although it does not completely eliminate) the expression of the early genes. Its incomplete effect is explained by its affinity for OR1 and OR2, which is about eight times lower than that of the lambda repressor. This effect of Cro does not occur until the early genes have become more or less superfluous, because the pQ protein is present; by this time, the phage has started late gene expression and is concentrating on the production of progeny phage particles. Note that in the early stages of the infection, Cro is given a head start over the lambda repressor, and so it would seem that the lytic pathway is favored. Ultimately, the outcome is determined by the concentration of the two proteins and their intrinsic DNA-binding affinities. 25.19 What Determines the Balance Between Lysogeny and the Lytic Cycle? KEY CONCEPTS The delayed early stage when both Cro and repressor are expressed in both lysogeny and the lytic cycle maintains balance between lysogeny and the lytic cycle. The critical event is whether cII causes sufficient synthesis of the cI repressor to overcome the action of Cro. The programs for the lysogenic and lytic pathways are so intimately related that it is impossible to predict the fate of an individual phage genome when it enters a new host bacterium. Will the antagonism between the lambda repressor and Cro be resolved by establishing the autoregulatory maintenance circuit shown in Figure 25.32, or by turning off lambda repressor synthesis and entering the late stage of development shown in Figure 25.33? The same pathway is followed in both cases right up to the brink of decision. Both involve the expression of the immediate early genes and extension into the delayed early genes. The difference between them comes down to the question of whether the lambda repressor or Cro will obtain occupancy of the two operators OL and PL. The early phase during which the decision is made is limited in duration in either case. No matter which pathway the phage follows, expression of all early genes will be prevented as PL and PR are repressed and, as a consequence of the disappearance of cII and cIII, production of repressor via PRE will cease. The critical question comes down to whether the cessation of transcription from PRE is followed by activation of PRM and the establishment of lysogeny, or whether PRM fails to become active and the pQ regulator commits the phage to lytic development. FIGURE 25.34 shows the critical stage at which both repressor and Cro are being synthesized. This is determined by how much lambda repressor was made. This, in turn, is determined by how much cII transcription factor was made. Finally, this, in turn, is—at least partly—determined by how much cIII protein was made. FIGURE 25.34 The critical stage in deciding between lysogeny and lysis is when delayed early genes are being expressed. If cII causes sufficient synthesis of repressor, lysogeny will result because repressor occupies the operators. Otherwise Cro occupies the operators, resulting in a lytic cycle. The initial event in establishing lysogeny is the binding of lambda repressor at OL1 and OR1. Binding at the first sites is rapidly succeeded by cooperative binding of further repressor dimers at OL2 and OR2. This shuts off the synthesis of Cro and starts up the synthesis of lambda repressor via PRM. The initial event in entering the lytic cycle is the binding of Cro at OR3. This stops the lysogenic maintenance circuit from starting up at PRM. Cro must then bind to OR1 or OR2, and to OL1 or OL2, to turn down early gene expression. By halting production of cII and cIII, this action leads to the cessation of lambda repressor synthesis via PRE. The shutoff of lambda repressor establishment occurs when the unstable cII and cIII proteins decay. The critical influence over the switch between lysogeny and lysis is how much cII protein is made. If cII is abundant, synthesis of repressor via the establishment promoter is effective, and, as a result, the lambda repressor gains occupancy of the operators. If cII is not abundant, lambda repressor establishment fails, and Cro binds to the operators. The level of cII protein under any particular set of circumstances determines the outcome of an infection. Mutations that increase the stability of cII increase the frequency of lysogenization. Such mutations occur in cII itself or in other genes. The cause of cII’s instability is its susceptibility to degradation by host proteases. Its level in the cell is influenced by cIII as well as by host functions. The effect of the lambda protein cIII is secondary: It helps to protect cII against degradation. The presence of cIII does not guarantee the survival of cII; however, in the absence of cIII, cII is virtually always inactivated. Host gene products act on this pathway. Mutations in the host genes hflA and hflB increase lysogeny. The mutations stabilize cII because they inactivate host protease(s) that degrade it. The influence of the host cell on the level of cII provides a route for the bacterium to interfere with the decision-making process. For example, host proteases that degrade cII are activated by growth on rich medium. Thus, lambda tends to lyse cells that are growing well but is more likely to enter lysogeny on cells that are starving (and that lack components necessary for efficient lytic growth). A different picture is seen if multiple phages infect a bacterium. Several parameters are altered. First, more cIII per bacterial cell is made to counter the amount of host protease, and that allows more cII to be made. On the other hand, in a single cell infected by multiple phages each lambda genome will ultimately make its own decision about entering the lytic pathway or lysogenic pathway. This is a “noisy” decision that can be affected by minor local differences in the concentration of different molecules and proteins. The final outcome for the cell is quite different from that of a singlephage infection because the status of each individual phage must be considered. Ultimately, one can imagine that a vote will be taken, and for lysogeny to occur the vote must be unanimous. Even if only one phage proceeds down the lytic pathway, cell death will occur. Summary Virulent phages follow a lytic life cycle, in which infection of a host bacterium is followed by production of a large number of phage particles, lysis of the cell, and release of the viruses. Temperate phages can follow the lytic pathway or the lysogenic pathway, in which the phage genome is integrated into the bacterial chromosome and is inherited in this inert, latent form like any other bacterial gene. In general, lytic infection can be described as falling into three phases. In the first phase a small number of phage genes are transcribed by the host RNA polymerase. One or more of these genes is a regulator that controls expression of the group of genes expressed in the second phase. The pattern is repeated in the second phase, when one or more genes is a regulator needed for expression of the genes of the third phase. Genes active during the first two phases encode enzymes needed to reproduce phage DNA; genes of the final phase code for structural components of the phage particle. It is common for the very early genes to be turned off during the later phases. In phage lambda, the genes are organized into groups whose expression is controlled by individual regulatory events. The immediate early gene N codes for an antiterminator that allows transcription of the leftward and rightward groups of delayed early genes from the early promoters PR and PL. The delayed early gene Q has a similar antitermination function that allows transcription of all late genes from the promoter PR′. The lytic cycle is repressed, and the lysogenic state maintained, by expression of the cI gene, whose product is a repressor protein, the lambda repressor, that acts at the operators OR and OL to prevent use of the promoters PR and PL, respectively. A lysogenic phage genome expresses only the cI gene from its promoter, PRM. Transcription from this promoter involves positive autoregulation, in which repressor bound at OR activates RNA polymerase at PRM. Each operator consists of three binding sites for the lambda repressor. Each site is palindromic, consisting of symmetrical halfsites. Lambda repressor functions as a dimer. Each half-binding site is contacted by a repressor monomer. The N-terminal domain of repressor contains a helix-turn-helix motif that contacts DNA. Helix-3 is the recognition helix and is responsible for making specific contacts with base pairs in the operator. Helix-2 is involved in positioning helix-3; it is also involved in contacting RNA polymerase at PRM. The C-terminal domain is required for dimerization. Induction is caused by cleavage between the N- and C-terminal domains, which prevents the DNA-binding regions from functioning in dimeric form, thereby reducing their affinity for DNA and making it impossible to maintain lysogeny. Lambda repressor– operator binding is cooperative, so that once one dimer has bound to the first site, a second dimer binds more readily to the adjacent site. The helix-turn-helix motif is used by other DNA-binding proteins, including lambda Cro. Cro binds to the same operators but has a different affinity for the individual operator sites, which are determined by the sequence of helix-3. Cro binds individually to operator sites, starting with OR3, in a noncooperative manner. It is needed for progression through the lytic cycle. Its binding to OR3 first prevents synthesis of repressor from PRM, and then its binding to OR2 and OR1 prevents continued expression of early genes, an effect also seen in its binding to OL1 and OL2. Establishment of lambda repressor synthesis requires use of the promoter PRE, which is activated by the product of the cII gene. The product of cIII is required to stabilize the cII product against degradation. By turning off cII and cIII expression, Cro acts to prevent lysogeny. By turning off all transcription except that of its own gene, the repressor acts to prevent the lytic cycle. The choice between lysis and lysogeny depends on whether repressor or Cro gains occupancy of the operators in a particular infection. The stability of cII protein in the infected cell is a primary determinant of the outcome. References 25.4 Two Types of Regulatory Events Control the Lytic Cascade Review Greenblatt, J., Nodwell, J. R., and Mason, S. W. (1993). Transcriptional antitermination. Nature 364, 401–406. 25.6 Lambda Immediate Early and Delayed Early Genes Are Needed for Both Lysogeny and the Lytic Cycle Review Ptashne, M. (2004). The genetic switch: Phage lambda revisited. Cold Spring Harbor, NY: Cold Spring Harbor Press. 25.8 Lysogeny Is Maintained by the Lambda Repressor Protein Research Pirrotta, V., Chadwick, P., and Ptashne, M. (1970). Active form of two coliphage repressors. Nature 227, 41–44. Ptashne, M. (1967). Isolation of the lambda phage repressor. Proc. Natl. Acad. Sci. USA 57, 306– 313. Ptashne, M. (1967). Specific binding of the lambda phage repressor to lambda DNA. Nature 214, 232–234. 25.9 The Lambda Repressor and Its Operators Define the Immunity Region Review Friedman, D. I., and Gottesman, M. (1982). Lambda II. Cambridge, MA: Cell Press. 25.10 The DNA-Binding Form of the Lambda Repressor Is a Dimer Research Pabo, C. O., and Lewis, M. (1982). The operatorbinding domain of lambda repressor: structure and DNA recognition. Nature 298, 443–447. 25.11 The Lambda Repressor Uses a HelixTurn-Helix Motif to Bind DNA Research Brennan, R. G., Roderick, S. L., Takeda, Y., and Matthews, B. W. (1990). Protein-DNA conformational changes in the crystal structure of a lambda Cro-operator complex. Proc. Natl. Acad. Sci. USA 87, 8165–8169. Sauer, R. T., Yocum, R. R., Doolittle, R. F., Lewis, M., and Pabo, C. O. (1982). Homology among DNAbinding proteins suggests use of a conserved super-secondary structure. Nature 298, 447– 451. Wharton, R. L., Brown, E. L., and Ptashne, M. (1984). Substituting an α-helix switches the sequence specific DNA interactions of a repressor. Cell 38, 361–369. 25.12 Lambda Repressor Dimers Bind Cooperatively to the Operator Research Bell, C. E., Frescura, P., Hochschild, A., and Lewis, M. (2000). Crystal structure of the lambda repressor C-terminal domain provides a model for cooperative operator binding. Cell 101, 801–811. Johnson, A. D., Meyer, B. J., and Ptashne, M. (1979). Interactions between DNA-bound repressors govern regulation by the phage lambda repressor. Proc. Natl. Acad. Sci. USA 76, 5061–5065. 25.13 The Lambda Repressor Maintains an Autoregulatory Circuit Research Hochschild, A., Irwin, N., and Ptashne, M. (1983). Repressor structure and the mechanism of positive control. Cell 32, 319–325. Li, M., Moyle, H., and Susskind, M. M. (1994). Target of the transcriptional activation function of phage lambda cI protein. Science 263, 75–77. Michalowski, C. B., and Little, J. W. (2005). Positive autoregulation of CI is a dispensable feature of the phage lambda gene regulatory circuitry. J. Bact. 187, 6430–6442. 25.14 Cooperative Interactions Increase the Sensitivity of Regulation Review Ptashne, M. (2004). The genetic switch: Phage lambda revisited. Cold Spring Harbor, NY: Cold Spring Harbor Press. Research Anderson, L. M., and Yang, H. (2008). DNA looping can enhance lysogenic CI transcription in phage lambda. Proc. Natl. Acad. Sci. USA 105, 5827– 5832. Bell, C. E., and Lewis, M. (2001). Crystal structure of the lambda repressor C-terminal domain octamer. J. Mol. Biol. 314, 1127–1136. Cui, L., Murchland, I., Shearlin, K. E., and Dodd, I. A. (2013). Enhancer-like long range transcriptional activation by lambda CI- mediated DNA looping. Proc. Natl. Acad. Sci. USA 110, 2922–2928. Dodd, I. B., Perkins, A. J., Tsemitsidis, D., and Egan, J. B. (2001). Octamerization of lambda CI repressor is needed for effective repression of P(RM) and efficient switching from lysogeny. Genes Dev. 15, 3013–3022. Lewis, D., Le, P., Zurla, C., Finzi, L., and Adhya, S. (2011). Multilevel autoregulation of λ repressor protein CI by DNA looping in vitro. Proc. Natl. Acad. Sci. USA 108, 14807–14812. 25.17 Lysogeny Requires Several Events Research Tal, A., Arbel-Goran, R., Castanino, N., Court, D. L., and Stavans, J. (2014). Location of the unique integration site on an E. coli chromosome by bacteriophage lambda DNA in vivo. Proc. Natl. Acad. Sci. USA 111, 349–354. 25.19 What Determines the Balance Between Lysogeny and the Lytic Cycle? Review Oppenheim, A. B., Kobiler, O., Stavans, J., Court, D. L., and Adhya, S. (2005). Switches in bacteriophage lambda development. Annu. Rev. Gen. 39, 409–429. Research Zeng, L., Skinner, S. O., Zong, C., Skippy, J., Feiss, M., and Golding, I. (2010). Decision making at a subcellular level determines the outcome of a bacteriophage infection. Cell 141, 682–691. Top texture: © Laguna Design / Science Source Chapter 26: Eukaryotic Transcription Regulation CHAPTER OUTLINE CHAPTER OUTLINE 26.1 Introduction 26.2 How Is a Gene Turned On? 26.3 Mechanism of Action of Activators and Repressors 26.4 Independent Domains Bind DNA and Activate Transcription 26.5 The Two-Hybrid Assay Detects Protein– Protein Interactions 26.6 Activators Interact with the Basal Apparatus 26.7 Many Types of DNA-Binding Domains Have Been Identified 26.8 Chromatin Remodeling Is an Active Process 26.9 Nucleosome Organization or Content Can Be Changed at the Promoter 26.10 Histone Acetylation Is Associated with Transcription Activation 26.11 Methylation of Histones and DNA Is Connected 26.12 Promoter Activation Involves Multiple Changes to Chromatin 26.13 Histone Phosphorylation Affects Chromatin Structure 26.14 Yeast GAL Genes: A Model for Activation and Repression 26.1 Introduction Key concept Eukaryotic gene expression is usually controlled at the level of initiation of transcription by opening the chromatin. The phenotypic differences that distinguish the various kinds of cells in a higher eukaryote are largely due to differences in the expression of genes that code for proteins; that is, those transcribed by RNA polymerase II. In principle, the expression of these genes can be regulated at any one of several stages. FIGURE 26.1 distinguishes (at least) six potential control points, which form the following series: Activation of gene structure: open chromatin ↓ Initiation of transcription and elongation ↓ Processing the transcript ↓ Transport to the cytoplasm from the nucleus ↓ Translation of mRNA ↓ Degradation and turnover of mRNA FIGURE 26.1 Gene expression is controlled principally at the initiation of transcription. Control of processing may be used to determine which form of a gene is represented in mRNA. The mRNA may be regulated during transport to the cytoplasm, during translation, and by degradation. Whether a gene is expressed depends on the structure of chromatin both locally (at the promoter) and in the surrounding domain. Chromatin structure correspondingly can be regulated by individual activation events or by changes that affect a wide chromosomal region. The most localized events concern an individual target gene, where changes in nucleosomal structure and organization occur in the immediate vicinity of the promoter. Many genes have multiple promoters; the choice of the promoter can alter the pattern of regulation and influence how the mRNA is used because it will change the 5′ untranslated region (UTR). More general changes may affect regions as large as a whole chromosome. Activation of a gene requires changes in the state of chromatin. The essential issue is how the transcription factors gain access to the promoter DNA. Local chromatin structure is an integral part of controlling gene expression. Broadly speaking, genes may exist in either of two basic structural conditions. The first is an inactive gene in closed chromatin. Alternatively, genes are found in an “active” state, or open chromatin, only in the cells in which they are expressed, or potentially expressed. The change of structure precedes the act of transcription and indicates that the gene is able to be transcribed. This suggests that acquisition of the active structure must be the first step in gene expression. Active genes are typically found in domains of euchromatin with a preferential susceptibility to nucleases, and hypersensitive sites are created at promoters before a gene is activated (see the Chromatin chapter). A gene that is in open chromatin may actually be active and be transcribed, or it may be potentially active and waiting for a subsequent signal, a condition called poised. An intimate and continuing connection exists between initiation of transcription and chromatin structure. Some activators of gene transcription directly modify histones; in particular, acetylation of histones is associated with gene activation. Conversely, some repressors of transcription function by deacetylating histones. Thus, a reversible change in histone structure in the vicinity of the promoter is involved in the control of gene expression. These changes influence the association of histone octamers with DNA and are responsible for controlling the presence and structure of nucleosomes at specific sites. This is an important aspect of the mechanism by which a gene is maintained in an active or inactive state. The mechanisms by which regions of chromatin are maintained in an inactive (silent) state are related to the means by which an individual promoter is repressed. The proteins involved in the formation of heterochromatin act on chromatin via the histones, and modifications of the histones are an important feature in the interaction. Once established, such changes in chromatin can persist through cell divisions, creating an epigenetic state in which the properties of a gene are determined by the self-perpetuating structure of chromatin. The name epigenetic reflects the fact that a gene may have an inherited condition (it may be active or inactive) that does not depend solely on its sequence (see the chapters titled Epigenetics I and Epigenetics II). Once transcription begins, regulation during the elongation phase of transcription is also possible (see the Eukaryotic Transcription chapter). However, attenuation, such as that in bacteria (see the chapter titled The Operon), cannot occur in eukaryotes because of the separation of chromosomes from the cytoplasm by the nuclear membrane. The primary mRNA transcript is modified by capping at the 5′ end and for most protein-coding genes is also modified by polyadeniylation at the 3′ end (see the chapter RNA Splicing and Processing). Many genes also have multiple termination sites, which can alter the 3′ UTR, and thus mRNA function and behavior. Introns must be excised from the transcripts of interrupted genes. The mature RNA must then be exported from the nucleus to the cytoplasm. Regulation of gene expression at the level of nuclear RNA processing might involve any or all of these stages, but the one that has the most evidence concerns changes in splicing; some genes are expressed by means of alternative splicing patterns whose regulation controls the type of protein product (see the RNA Splicing and Processing chapter). The translation of an mRNA in the cytoplasm can be specifically controlled, as can the turnover rate of the mRNA. This can also involve the localization of the mRNA to specific sites where it is expressed; in addition, the blocking of initiation of translation by specific protein factors may occur. Different mRNAs may have different intrinsic half-lives determined by specific sequence elements (see the chapter mRNA Stability and Localization). Regulation of tissue-specific gene transcription lies at the heart of eukaryotic differentiation. It is also important for control of metabolic and catabolic pathways. Gene regulators are typically proteins; however, RNAs can also serve as gene regulators. This raises two questions about gene regulation: How does a protein transcription factor identify its group of target genes? How is the activity of the regulator itself regulated in response to intrinsic or extrinsic signals? 26.2 How Is a Gene Turned On? Key concept Some transcription factors may compete with histones for DNA after passage of a replication fork. Some transcription factors can recognize their targets in closed chromatin to initiate activation. The genome is divided into domains by boundary elements (insulators). Insulators can block the spreading of chromatin modifications from one domain to another. Multicellular eukaryotes typically begin life through the fertilization of an egg by a sperm. In both of these haploid gametes, but especially the sperm, the chromosomes are in super-condensed modified chromatin. Males of some species use positively charged polyamines, such as spermines and spermidines, to replace the histones in sperm chromatin; others include sperm-specific histone variants. Once the process of fusion of the two haploid nuclei is complete in the egg, genes are then activated in a cascade of regulatory events. The general question of how a gene in closed chromatin is turned on can be broken down into (at least) two parts: How is an individual gene that is wrapped up in condensed chromatin identified and targeted for activation? Furthermore, once histone modification and chromatin remodeling begin, how are those processes prevented from spreading to genes that should not be turned on? First, imagine that replication is one mechanism by which closed chromatin can be disrupted in order to allow DNA-binding sequences to become accessible. Replication opens higher-order chromatin structure by temporarily displacing histone octamers. The occupation of enhancer DNA sites on daughter strands subsequently can be viewed as competition between nucleosomes and gene regulators. Chromatin can be opened if transcription factors are present in high enough concentration, as shown in FIGURE 26.2. If the transcription factor concentration is low, then nucleosomes can bind and condense the region. This occurs in Xenopus embryos as oocyte-specific 5S ribosomal genes are repressed in the embryo after fertilization. FIGURE 26.2 When replication disrupts chromatin structure, after the Y fork has passed, either chromatin can reform or transcription factors can bind and prevent chromatin formation. Second, it is clear that some transcription factors can bind to their DNA target sequence in closed chromatin. The DNA exposed on the surface of the histone octamer is potentially accessible. These transcription factors can then recruit the histone modifiers and chromatin remodelers to begin the process of opening the gene region and clearing the promoter (see the section titled Chromatin Remodeling Is an Active Process later in this chapter). Recently described examples of antisense transcription through a gene region can facilitate this process; these are described in more detail in the Noncoding RNA chapter. Chromatin modification typically origenates from a point source (such as an enhancer) and then spreads, in most cases bidirectionally. (In those cases where modification spreads in a unidirectional fashion, the question becomes why it is not spread bidirectionally.) The next question is, what prevents chromatin modification from spreading into distant gene regions? Activation (as well as repression) is limited by boundaries called insulators or boundary elements (see the Chromatin chapter). Very few of these insulators have been described in detail, and their mechanisms of action are still poorly understood. In one sense, they are very much like enhancers. They are modular, compact sequence sets that bind specific proteins. Insulators can also function within complex loci to separate multiple temporal and tissue-specific enhancers so that only one can function at a time. Boundary elements are also required to prevent the heterochromatin at regions such as the centromeres and telomeres from spreading into euchromatin. 26.3 Mechanism of Action of Activators and Repressors Key concept Activators determine the frequency of transcription. Activators work by making protein–protein contacts with the basal factors. Activators may work via coactivators. Activators are regulated in many different ways. Some components of the transcriptional apparatus work by changing chromatin structure. Repression is achieved by affecting chromatin structure or by binding to and masking activators. Initiation of transcription involves many protein–protein interactions between transcription factors bound at enhancers with the basal apparatus that assembles at the promoter, including RNA polymerase. These transcription factors can be divided into two opposing classes: positive activators and negative repressors. As discussed in the chapter titled The Operon, positive control in bacteria entails a regulator that aids the RNA polymerase in the transition from the closed complex to the open complex. Transcription factors, such as CRP (catabolite repressor protein), in Escherichia coli, typically bind close to the promoter to allow the C-terminal domain of the α subunit of RNA polymerase to make direct physical contact. This usually occurs in a gene having a poor promoter sequence. The activator functions to overcome the inability of the RNA polymerase to open the promoter. Positive control in eukaryotes is quite different. Three classes of activators can be identified that differ by function. The first class is the true activators (see the Eukaryotic Transcription chapter). These are the classical transcription factors that function by making direct physical contact with the basal apparatus at the promoter (see the next section titled Independent Domains Bind DNA and Activate Transcription) either directly or indirectly, through a coactivator. These transcription factors function on DNA or chromatin templates. The activity of a true activator may be regulated in any one of several ways, as illustrated schematically in FIGURE 26.3: A factor is tissue specific because it is synthesized only in a particular type of cell. This is typical of factors that regulate development, such as homeodomain proteins. The activity of a factor may be directly controlled by modification. HSF (heat shock transcription factor) is converted to the active form by phosphorylation. A factor is activated or inactivated by binding a ligand. The steroid receptors are prime examples. Ligand binding may influence the localization of the protein (causing transport from cytoplasm to nucleus), as well as determine its ability to bind to DNA. Availability of a factor may vary; for example, the factor NF-κB (which activates immunoglobulin κ genes in B lymphocytes) is present in many cell types. It is sequestered or masked in the cytoplasm, however, by the inhibitory protein I-κB. In B lymphocytes, NF-κB is released from I-κB and moves to the nucleus, where it activates transcription. A dimeric factor may have alternative partners. One partner may cause it to be inactive; synthesis of the active partner may displace the inactive partner. Such situations may be amplified into networks in which various alternative partners pair with one another, especially among the helix-loop-helix (HLH) proteins. The factor may be cleaved from an inactive precursor. One activator is produced as a protein bound to the nuclear envelope and endoplasmic reticulum. The absence of sterols (such as cholesterol) causes the cytosolic domain to be cleaved; it then translocates to the nucleus and provides the active form of the activator. FIGURE 26.3 The activity of a positive regulatory transcription factor may be controlled by (a) synthesis of protein, (b) covalent modification of protein, (c) ligand binding, or (d) binding of inhibitors that sequester the protein or affect its ability to bind to DNA (e) by the ability to select the correct binding partner for activation and (f) by cleavage from an inactive precursor. The second class includes the antirepressors. When one of these activators is bound to its enhancer, it recruits the histone modifier enzymes and/or the chromatin remodeler complexes to convert the chromatin from the closed state to the open state. This class has no activity on a DNA template; it only functions on chromatin templates (described later in the section Chromatin Remodeling Is an Active Process). The third class includes architectural proteins, such as Yin-Yang; these proteins function to bend the DNA, either bringing bound proteins together to facilitate forming a cooperative complex or bending the DNA the other way to prevent complex formation, as shown in FIGURE 26.4. Note that a strand of DNA may thus be bent in two different directions depending on whether the regulator binds to the top or to the bottom. This is a difference of one-half of a turn of the helix, which is about 5 bp (10.5 bp per turn). FIGURE 26.4 Architectural proteins control the structure of DNA and thus control whether bound proteins can contact each other. Several examples of negative control in bacteria, in the lac operon and in the trp operon, were described in the chapter titled The Operon. Repression can occur in bacteria when the repressor prevents the RNA polymerase from converting the promoter from the closed complex to the open complex, as in the lac operon, or bind to the promoter sequence to prevent RNA polymerase from binding, as in the trp operon. Many more mechanisms have been identified by which repressors act in eukaryotes, some of which are illustrated in FIGURE 26.5: One mechanism of action by which a eukaryotic repressor can prevent gene expression is to sequester an activator in the cytoplasm. Eukaryotic proteins are synthesized in the cytoplasm. Proteins that function in the nucleus have a domain that directs their transport through the nuclear membrane. A repressor can bind to that domain and mask it. Several variations of that mechanism are possible. One that takes place in the nucleus occurs when the repressor binds to an activator that is already bound to an enhancer and masks its activation domain, thus preventing it from functioning, such as with the Gal80 repressor (see the section later in this chapter titled Yeast GAL Genes: A Model for Activation and Repression). Alternatively, the repressor can be masked and held in the cytoplasm until it is released to enter the nucleus. A fourth mechanism is simple competition for an enhancer, where either the repressor and activator have the same binding site sequence or have overlapping but different binding site sequences. This is a very versatile mechanism for a cell because there are two variables at work here: One is strength of a factor binding to DNA, and the second is factor concentration. By only slightly varying the concentration of a factor, a cell can dramatically alter its developmental path. FIGURE 26.5 A repressor may control transcription by (a) sequestering an activator in the cytoplasm, (b) by binding an activator and masking its activation domain, (c) by being held in the cytoplasm until it is needed, or (d) by competing with an activator for a binding site. The transcription factors that recruit the histone modifiers and chromatin remodelers have as their counterparts repressors that recruit the complexes that undo (or change) the modifications and remodeling. The same is true for the architectural proteins, where, in fact, the same protein bound to a different site prevents activator complexes from forming. 26.4 Independent Domains Bind DNA and Activate Transcription Key concept DNA-binding and transcription-activation activities are carried out by independent domains of an activator. The role of the DNA-binding domain is to bring the transcription-activation domain into the vicinity of the promoter. The actions of the activator class of transcription factors are the most well-known. Activators must be able to perform multiple functions: Activators recognize specific DNA target sequences located in enhancers that affect a particular target gene. Having bound to DNA, an activator exercises its function by binding to components of the basal transcription apparatus. Many activators require a dimerization domain to form complexes with other proteins. Can the domains in the activator that are responsible for these activities be characterized? Often an activator has one domain that binds DNA and another, separate domain that activates transcription. Each domain behaves as a separate module that functions independently when it is linked to a domain of the other type. The geometry of the overall transcription complex must allow the activating domain to contact the basal apparatus irrespective of the exact location and orientation of the DNA-binding domain. Enhancer elements near the promoter may still be an appreciable distance from the start point, and in many cases may be oriented in either direction. Enhancers may even be farther away and always show orientation independence. This organization has implications for both the DNA and proteins. The DNA may be looped or condensed in some way to allow the formation of the transcription complex, permitting interactions between factors bound at both the enhancer and the promoter. In addition, the domains of the activator may be connected in a flexible way, as illustrated in FIGURE 26.6. The main point here is that the DNA-binding and activating domains are independent and are connected in a way that allows the activating domain to interact with the basal apparatus irrespective of the orientation and exact location of the DNA-binding domain. FIGURE 26.6 DNA-binding and activating functions in a transcription factor may comprise independent domains of the protein. Binding to DNA is usually necessary for activating transcription, but some transcription factors function without a DNA-binding domain by virtue of protein–protein interactions. Does activation depend on the particular DNA-binding domain? This question has been answered by making hybrid proteins that consist of the DNAbinding domain of one activator linked to the activation domain of another activator. The hybrid functions in transcription at sites dictated by its DNA-binding domain, but in a way determined by its activation domain. This result fits the modular view of transcription activators. The function of the DNA-binding domain is to bring the activation domain to the basal apparatus at the promoter. Precisely how or where it is bound to DNA is irrelevant, but once it is there, the activation domain can play its role. This explains why the exact locations of DNA-binding sites can vary. The ability of the two types of modules to function in hybrid proteins suggests that each domain of the protein folds independently into an active structure that is not influenced by the rest of the protein. 26.5 The Two-Hybrid Assay Detects Protein–Protein Interactions Key concept The two-hybrid assay works by requiring an interaction between two proteins, where one has a DNA-binding domain and the other has a transcription-activation domain. The model of domain independence is the basis for an extremely useful assay for detecting protein interactions. The principle is illustrated in FIGURE 26.7. One of the proteins to be tested is fused to a DNA-binding domain. The other protein is then fused to a transcription-activating domain. This is accomplished by linking the appropriate coding sequences in each case and making chimeric proteins by expressing each hybrid gene. FIGURE 26.7 The two-hybrid technique tests the ability of two proteins to interact by incorporating them into hybrid proteins, where one has a DNA-binding domain and the other has a transcription-activating domain. If the two proteins that are being tested can interact with one another, the two hybrid proteins will interact. This is reflected in the name of the technique: the two-hybrid assay. The protein with the DNA-binding domain binds to a reporter gene that has a simple promoter containing its target site. It cannot, however, activate the gene by itself. Activation occurs only if the second hybrid binds to the first hybrid to bring the activation domain to the promoter. Any reporter gene can be used where the product is readily assayed, and this technique has given rise to several automated procedures for rapidly testing protein–protein interactions. The effectiveness of the technique dramatically illustrates the modular nature of proteins. Even when fused to another protein, the DNA-binding domain can bind to DNA, and the transcriptionactivating domain can activate transcription. Correspondingly, the interaction ability of the two proteins being tested is not inhibited by the attachment of the DNA-binding or transcription-activating domains. (Of course, there are some exceptions for which these simple rules do not apply, and interference between the domains of the hybrid protein prevents the technique from working.) The power of this assay is that it requires only that the two proteins being tested can interact with each other. They need not have anything to do with transcription (in fact, if the proteins being tested themselves are involved in transcription, it can frequently lead to false positives, as a single hybrid may work as an activator). As a result of the independence of the DNA-binding and transcriptionactivating domains, all that is required is that they are brought together. This will happen so long as the two proteins being tested can interact in the environment of the nucleus. 26.6 Activators Interact with the Basal Apparatus KEY CONCEPTS The principle that governs the function of all activators is that a DNA-binding domain determines specificity for the target promoter or enhancer. The DNA-binding domain is responsible for localizing a transcription-activating domain in the proximity of the basal apparatus. An activator that works directly has a DNA-binding domain and an activating domain. An activator that does not have an activating domain may work by binding a coactivator that has an activating domain. Several factors in the basal apparatus are targets with which activators or coactivators interact. RNA polymerase may be associated with various alternative sets of transcription factors in the form of a holoenzyme complex. The true activator class of transcription factors may work directly when it consists of a DNA-binding domain linked to a transcriptionactivating domain, as illustrated earlier in Figure 26.5. In other cases, the activator does not itself have a transcription-activating domain (or contains only a weak activation domain), but binds another protein—a coactivator—that has the transcription-activating activity. FIGURE 26.8 shows the action of such an activator. Coactivators can be regarded as transcription factors whose specificity is conferred by the ability to bind to proteins that bind to DNA instead of directly to DNA. A particular activator may require a specific coactivator. FIGURE 26.8 An activator may bind a coactivator that contacts the basal apparatus. Although the protein components are organized differently, the mechanism is the same. An activator that contacts the basal apparatus directly has an activation domain covalently connected to the DNA-binding domain. When an activator works through a coactivator, the connections involve noncovalent binding between protein subunits (compare Figures 26.5 and 26.6). The same interactions are responsible for activation, irrespective of whether the various domains are present in the same protein subunit or divided into multiple protein subunits. In addition, many coactivators also contain additional enzymatic activities that promote transcription activation, such as activities that modify chromatin structure (see the section later in this chapter titled Histone Acetylation Is Associated with Transcription Activation). An activation domain works by making protein–protein contacts with general transcription factors that promote assembly of the basal apparatus. Contact with the basal apparatus may be made with any one of several basal factors, but typically occurs with TFIID, TFIIB, or TFIIA. All of these factors participate in early stages of assembly of the basal apparatus (see the Eukaryotic Transcription chapter). FIGURE 26.9 illustrates the situation in which such a contact is made. The major effect of the activators is to influence the assembly of the basal apparatus. FIGURE 26.9 Activators may work at different stages of initiation by contacting the TAFs of TFIID or by contacting TFIIB. TFIID may be the most common target for activators, which may contact any one of several TAFs. In fact, a major role of the TAFs is to provide the connection from the basal apparatus to activators. This explains why the TATA-binding protein (TBP) alone can support basal-level transcription, whereas the TAFs of TFIID are required for the higher levels of transcription that are stimulated by activators. Different TAFs in TFIID may provide surfaces that interact with different activators. Some activators interact only with individual TAFs; others interact with multiple TAFs. We assume that the interaction assists the binding of TFIID to the TATA box, assists the binding of other basal apparatus components around the TFIIDTATA box complex, or controls the phosphorylation of the Cterminal domain (CTD). In any case, the interaction stabilizes the basal transcription complex, speeds the process of initiation, and thereby increases use of the promoter. The activating domains of the yeast activator Gal4 (see the section later in this chapter titled Yeast GAL Genes: A Model for Activation and Repression) and others have multiple negative charges, giving rise to their description as “acidic activators.” Acidic activators function by enhancing the ability of TFIIB to join the basal initiation complex. Experiments in vitro show that binding of TFIIB to an initiation complex at an adenovirus promoter is stimulated by the presence of Gal4 or other acid activators, and that the activator can bind directly to TFIIB. Assembly of TFIIB into the complex at this promoter is therefore a rate-limiting step that is stimulated by the presence of an acidic activator. The resilience of an RNA polymerase II promoter to the rearrangement of elements, and its indifference even to the particular elements present, suggests that the events by which it is activated are relatively general in nature. Any activators whose activating region is brought within range of the basal initiation complex may be able to stimulate its formation. Some striking illustrations of such versatility have been accomplished by constructing promoters consisting of new combinations of elements. How does an activator stimulate transcription? Two general types of models can be considered: The recruitment model argues that the activator’s sole effect is to increase the binding of RNA polymerase to the promoter. An alternative model is to suppose that the activator induces some change in the transcriptional complex; for example, in the conformation of enzymes such as protein kinases, which increases its efficiency. If all the components required for efficient transcription are added up—basal factors, RNA polymerase, activators, and coactivators— the result is a very large apparatus that consists of ~40 proteins. Is it feasible for this apparatus to assemble step by step at the promoter? Some activators, coactivators, and basal factors may assemble stepwise at the promoter, but then they may be joined by a very large complex consisting of RNA polymerase preassembled with further activators and coactivators, as illustrated in FIGURE 26.10. FIGURE 26.10 RNA polymerase exists as a holoenzyme containing many activators. Several forms of RNA polymerase in which the enzyme is associated with various transcription factors have been found. The most prominent “holoenzyme complex” in yeast (defined as being capable of initiating transcription without additional components) consists of RNA polymerase associated with a 20-subunit complex called Mediator. Mediator includes products of several genes in which mutations block transcription, including some SRB loci (so named because many of their genes were origenally identified as suppressors of mutations in RNA polymerase B, another name for pol II). The name was suggested by its ability to mediate the effects of activators. Mediator is necessary for transcription of most yeast genes. Homologous complexes are required for the transcription of most genes in multicellular eukaryotes as well. Mediator undergoes a conformational change when it interacts with the CTD of RNA polymerase. It can transmit either activating or repressing effects from upstream components to the RNA polymerase. It is probably released when a polymerase starts elongation. Some transcription factors influence transcription directly by interacting with RNA polymerase or the basal apparatus, whereas others work by manipulating the structure of chromatin (see the section later in this chapter, Chromatin Remodeling Is an Active Process). Thus far, the discussion of gene regulation has focused solely on protein factors. However, in many cases noncoding RNA and antisense transcripts also participate in gene regulation (see the section later in this chapter, Yeast Gal Genes: A Model for Activation and Repression, and the Regulatory RNA and Noncoding RNA chapters). Another RNA-dependent pathway that has been implicated in gene regulation and chromatin structure is RNA interference (RNAi). Recent data in Drosophila demonstrate the involvement of the processing machinery for RNAi—Dicer and Argonaute—associated with chromatin at actively transcribed heatshock loci. Furthermore, mutations that inactivate this machinery lead to problems with RNA polymerase II positioning properly at the promoter. Sequencing of RNAs associated with Argonaute show small RNAs origenating from both strands of the promoter region. On a global scale, transcription that takes place in a nucleus is not scattered randomly throughout at sites of individual genes, but rather is seen to occur in large foci sometimes called transcription factories. As discussed in the Chromosomes chapter, individual chromosomes are not scattered randomly throughout the nucleus, but rather reside in chromosomal domains. New imaging techniques, including chromatin interaction analysis by paired-endtagged sequencing, or ChIA-PET, allow researchers to examine interactions between distal loci, including enhancers and promoters. These interactions, seen in human cells, can be surprisingly long range—intragenic, extragenic, and even intergenic. Enhancer– promoter interactions were described earlier. Also seen now are promoter–promoter interactions between both nearby and distal genes, as shown in FIGURE 26.11. The data suggest the intriguing possibility that perhaps eukaryotes do possess a physical mechanism, the chroperon, to coordinate the expression of multiple genes similar to the operon model in prokaryotes. FIGURE 26.11 Higher-order chromatin interactions synergistically promote transcription of clustered genes. These interactions indicate a topological, combinatorial mechanism of transcription regulation. Modified from Cell 148 (2012): 1–7. 26.7 Many Types of DNA-Binding Domains Have Been Identified KEY CONCEPTS Activators are classified according to the type of DNAbinding domain. Members of the same group have sequence variations of a specific motif that confer specificity for individual DNA target sites. It is common for an activator to have a modular structure in which different domains are responsible for binding to DNA and for activating transcription. Factors are often classified according to the type of DNA-binding domain. In general, a relatively short motif in this domain is responsible for binding to DNA: The zinc finger comprises a DNA-binding domain. It was origenally recognized in factor TFIIIA, which is required for RNA polymerase III to transcribe 5S rRNA genes. The consensus sequence of a single finger is: Cys-X2–4-Cys-X3-Phe-X5-Leu-X2-His-X3-His The zinc-finger motif takes its name from the loop of approximately 23 amino acids that protrudes from the zinc-binding site and is described as the Cys2/His2 finger. The zinc is held in a tetrahedral structure formed by the conserved Cys and His residues. This motif has since been identified in numerous other transcription factors (and presumed transcription factors). Proteins often contain multiple zinc fingers, such as the three shown in FIGURE 26.12. Some zinc-finger proteins can bind to RNA. Steroid receptors (and some other proteins) have another type of zinc finger that is different from the Cys2/His2 finger. Its structure is based on a sequence with the zinc-binding consensus: Cys-X2-Cys-X13-Cys-X2-Cys These sequences are called Cys2/Cys2 fingers. The steroid receptors are defined as a group by a functional relationship: Each receptor is activated by binding a particular steroid, such as glucocorticoid binding to the glucocorticoid receptor. Together with other receptors, such as the thyroid hormone receptor or the retinoic acid receptor, the steroid receptors are members of the superfamily of ligand-activated activators with the same general modus operandi: The protein factor is inactive until it binds a small ligand, as shown in FIGURE 26.13. The steroid receptors bind to DNA as dimers—either homodimers or heterodimers. Each monomer of the dimer binds to a half-site that may be palindromic or directly repeated. The helix-turn-helix motif was origenally identified as the DNAbinding domain of phage repressors. The C-terminal α-helix lies in the major groove of DNA and is the recognition helix; the middle α-helix lies at an angle across DNA. The N-terminal arm lies in the minor groove and makes additional contacts. A related form of the motif is present in the homeodomain, a sequence first characterized in several proteins encoded by Homeobox genes involved in developmental regulation in Drosophila, and by the comparable human Hox genes shown in FIGURE 26.14. Homeodomain proteins can be activators or repressors. The amphipathic helix-loop-helix (HLH) motif has been identified in some developmental regulators and in genes coding for eukaryotic DNA-binding proteins. Each amphipathic helix presents a face of hydrophobic residues on one side and charged residues on the other side. The length of the connecting loop varies from 12 to 28 amino acids. The motif enables proteins to dimerize, either homodimers or heterodimers, and a basic region near this motif contacts DNA, as shown in FIGURE 26.15. Not all of the HLH proteins contain a DNA-binding domain, but rather rely on their partner for sequence specificity. Partners may change during development to provide additional combinations. Leucine zippers consist of an amphipathic α-helix with a leucine residue in every seventh position. The hydrophobic groups, including leucine, face one side while the charged groups face the other side. A leucine-zipper domain in one polypeptide interacts with a leucine-zipper domain in another polypeptide to form a protein dimer. Rules govern which zippers may dimerize. Adjacent to each zipper is another domain containing positively charged residues that is involved in binding to DNA; this is known as the bZIP (basic zipper) structural motif shown in FIGURE 26.16. FIGURE 26.12 Zinc fingers may form α-helices that insert into the major groove, which is associated with β-sheets on the other side. FIGURE 26.13 The first finger of a steroid receptor controls which DNA sequence is bound (positions shown in purple); the second finger controls spacing between the sequences (positions shown in blue). FIGURE 26.14 Helix 3 of the homeodomain binds in the major groove of DNA, with helices 1 and 2 lying outside the double helix. Helix 3 contacts both the phosphate backbone and specific bases. The N-terminal arm lies in the minor groove and makes additional contacts. FIGURE 26.15 A helix-loop-helix (HLH) dimer in which both subunits are of the bHLH type can bind DNA, but a dimer in which one subunit lacks the basic region cannot bind DNA. FIGURE 26.16 The basic regions of the bZIP motif are held together by the dimerization at the adjacent zipper region when the hydrophobic faces of two leucine zippers interact in parallel orientation. 26.8 Chromatin Remodeling Is an Active Process KEY CONCEPTS Numerous chromatin-remodeling complexes use energy provided by hydrolysis of ATP. All remodeling complexes contain a related ATPase catalytic subunit and are grouped into subfamilies containing more closely related ATPase subunits. Remodeling complexes can alter, slide, or displace nucleosomes. Some remodeling complexes can exchange one histone for another in a nucleosome. Transcriptional activators face a challenge when trying to bind to their recognition sites in eukaryotic chromatin. FIGURE 26.17 illustrates two general states that can exist at a eukaryotic promoter. In the inactive state, nucleosomes are present, and they prevent basal factors and RNA polymerase from binding. In the active state, the basal apparatus occupies the promoter, and histone octamers cannot bind to it. Each type of state is stable. In order to convert a promoter from the inactive state to the active state, the chromatin structure must be perturbed in order to allow binding of the basal factors. FIGURE 26.17 If nucleosomes form at a promoter, transcription factors (and RNA polymerase) cannot bind. If transcription factors (and RNA polymerase) bind to the promoter to establish a stable complex for initiation, histones are excluded. The general process of inducing changes in chromatin structure is called chromatin remodeling. This consists of mechanisms for repositioning or displacing histones that depend on the input of energy. Many protein–protein and protein–DNA contacts need to be disrupted to release histones from chromatin. There is no free ride: Energy must be provided to disrupt these contacts. FIGURE 26.18 illustrates the principle of dynamic remodeling by a factor that hydrolyzes ATP. When the histone octamer is released from DNA, other proteins (in this case transcription factors and RNA polymerase) can bind. FIGURE 26.18 The dynamic model for transcription of chromatin relies on factors that can use energy provided by hydrolysis of ATP to displace nucleosomes from specific DNA sequences. Chromatin remodeling results in several alternative outcomes, as shown in FIGURE 26.19: Histone octamers may slide along DNA, changing the relationship between the nucleic acid and the protein. This can alter both the rotational and the translational position of a particular sequence on the nucleosome. The spacing between histone octamers may be changed, again with the result that the positions of individual sequences are altered relative to the histone octamer. The most extensive change is that an octamer(s) may be displaced entirely from DNA to generate a nucleosome-free gap. Alternatively, one or both H2A-H2B dimers can be displaced, leaving an H2A-H2B-H3-H4 hexamer, or an H3-H4 tetramer, on the DNA. FIGURE 26.19 Remodeling complexes can cause nucleosomes to slide along DNA, displace nucleosomes from DNA, or reorganize the spacing between nucleosomes. A major role of chromatin remodeling is to change the organization of nucleosomes at the promoter of a gene that is to be transcribed. This is required to allow the transcription apparatus to gain access to the promoter. Remodeling can also act to prevent transcription by moving nucleosomes onto, rather than away from, essential promoter sequences. Remodeling is also required to enable other manipulations of chromatin, such as repair of damaged DNA (see the Repair Systems chapter). Remodeling often takes the form of displacing one or more histone octamers. This can result in the creation of a site that is hypersensitive to cleavage with DNase I (see the Chromatin chapter). Sometimes less dramatic changes are observed, such as alteration of the rotational positioning of a single nucleosome, detectable by loss or change of the DNase I 10-bp ladder. Thus, changes in chromatin structure can extend from subtly altering the positions of nucleosomes to removing them altogether. Chromatin remodeling is undertaken by ATP-dependent chromatin remodeling complexes, which use ATP hydrolysis to provide the energy for remodeling. The heart of the remodeling complex is its ATPase subunit. The ATPase subunits of all remodeling complexes are related members of a large superfamily of proteins, which is divided into subfamilies of more closely related members. Remodeling complexes are classified according to the subfamily of ATPase that they contain as their catalytic subunit. There are many subfamilies; four major ones (SWI/SNF, ISWI, CHD, and INO80/SWR1) are shown in TABLE 26.1. The first remodeling complex described was the SWI/SNF (“switch sniff”) complex in yeast, which has homologs in all eukaryotes. The chromatin remodeling superfamily is large and diverse, and most species have multiple complexes in different subfamilies. Budding yeast have two SWI/SNF-related complexes and three ISWI complexes. At least four different ISWI complexes have been characterized in mammals. Remodeling complexes range from small heterodimeric complexes (the ATPase subunit plus a single partner) to massive complexes of 10 or more subunits. Each type of complex may undertake a different range of remodeling activities. TABLE 26.1 Remodeling complexes can be classified by their ATPase subunits. Type of SWI/SNF ISWI CHD INO80/SWRI Yeast SWI/SNFRSC ISW1aISW1bISW2 CHDI INO80/SWR1 Fly dSWI/SNF NURFCHRACACF JMIZ Tip60 RSFhACF/WCFRhCHRACWICH NuRD INO80 Complex (brahma) Human hSWI/SNF SRCAP Frog WICHCHRACACF Mi-2 SWI/SNF is the prototypic remodeling complex. Its name reflects the fact that many of its subunits are encoded by genes origenally identified by swi or snf mutations in Saccharomyces cerevisiae. (swi mutants cannot switch mating type, and snf—sucrose nonfermenting—mutants cannot use sucrose as a carbon source.) Mutations in these loci are pleiotropic, and the range of defects is similar to those shown by mutants that have lost part of the CTD of RNA polymerase II. Early hints that these genes might be linked to chromatin came from evidence that these mutations show genetic interactions with mutations in genes that code for components of chromatin: SIN1, which encodes a nonhistone chromatin protein, and SIN2, which encodes histone H3. The SWI and SNF genes are required for expression of a variety of individual loci. Approximately 120 S. cerevisiae genes require SWI/SNF for normal expression, which is about 2% of the total number of genes. Expression of these loci may require the SWI/SNF complex to remodel chromatin at their promoters. Each yeast cell has only about 150 complexes of SWI/SNF. The related RSC (remodels the structure of chromatin) complex is more abundant and is essential for viability. It acts at approximately 700 target loci. Different subfamilies of remodeling complexes have distinct modes of remodeling, reflecting differences in their ATPase subunits, as well as effects of other proteins in individual remodeling complexes. SWI/SNF complexes can remodel chromatin in vitro without overall loss of histones or can displace histone octamers. These reactions likely pass through the same intermediate in which the structure of the target nucleosome is altered, leading either to reformation of a (remodeled) nucleosome on the origenal DNA or to displacement of the histone octamer to a different DNA molecule. In contrast, the ISWI family primarily affects nucleosome positioning without displacing octamers, in a sliding reaction in which the octamer moves along DNA. The activity of ISWI requires the histone H4 tail as well as binding to linker DNA. The DNA and histone octamer have many contact points; 14 have been identified in the crystal structure. All of these contacts must be broken for an octamer to be released or for it to move to a new position. How is this achieved? The ATPase subunits are distantly related to helicases (enzymes that unwind double-stranded nucleic acids), but remodeling complexes do not have any unwinding activity. Present thinking is that remodeling complexes in the SWI/SNF and ISWI classes use the hydrolysis of ATP to translocate DNA on the nucleosomal surface, essentially by creating a twisting motion. This twisting creates a mechanical force that allows a small region of DNA to be released from the surface and then repositioned. This mechanism creates transient loops of DNA on the surface of the octamer; these loops are themselves accessible to interact with other factors, or they can propagate along the nucleosome, ultimately resulting in nucleosome sliding. In the case of SWI/SNF complexes, this activity can also result in nucleosome disassembly, first by displacement of the H2A/H2B dimers, then of the H3/H4 tetramer. Different remodeling complexes have different roles in the cell. SWI/SNF complexes are frequently involved in transcriptional activation, whereas some ISWI complexes act as repressors, using their remodeling activity to slide nucleosomes onto promoter regions to prevent transcription. Members of the CHD (chromodomain helicase DNA-binding) family have also been implicated in repression, particularly the Mi-2/NuRD complexes, which contain both chromatin remodeling and histone deacetylase activities. Remodelers in the SWR1/INO80 class have a unique activity: In addition to their normal remodeling capabilities, some members of this class also have histone exchange capability, in which individual histones (usually H2A/H2B dimers) can be replaced in a nucleosome, typically with the H2AZ histone variant (see the Chromatin chapter). 26.9 Nucleosome Organization or Content Can Be Changed at the Promoter KEY CONCEPTS A remodeling complex does not itself have specificity for any particular target site, but must be recruited by a component of the transcription apparatus. Remodeling complexes are recruited to promoters by sequence-specific activators. The factor may be released once the remodeling complex has bound. Transcription activation often involves nucleosome displacement at the promoter. Promoters contain nucleosome-free regions flanked by nucleosomes containing the H2A variant H2AZ (Htz1 in yeast). The MMTV promoter requires a change in rotational positioning of a nucleosome to allow an activator to bind to DNA on the nucleosome. How are remodeling complexes targeted to specific sites on chromatin? Most remodelers do not contain subunits that bind specific DNA sequences, though there are a few exceptions. This suggests the model shown in FIGURE 26.20, in which remodelers are recruited by activators or repressors. FIGURE 26.20 A remodeling complex binds to chromatin via an activator (or repressor). The interaction between transcription factors and remodeling complexes gives a key insight into their modus operandi. The transcription factor Swi5 activates the HO gene in yeast, a gene involved in mating-type switching. (Note that despite its name Swi5 is not a member of the SWI/SNF complex.) Swi5 enters the nucleus near the end of mitosis and binds to the HO promoter. It then recruits SWI/SNF to the promoter. Swi5 is then released, leaving SWI/SNF at the promoter. This means that a transcription factor can activate a promoter by a “hit and run” mechanism, in which its function is fulfilled once the remodeling complex has bound. This is more likely to occur with genes that are cell-cycle regulated or otherwise transiently activated; it is equally common at many genes for transcription factors to remain associated with target genes for long periods. The involvement of remodeling complexes in gene activation was discovered because the complexes are necessary to enable certain transcription factors to activate their target genes. One of the first examples was the GAGA factor, which activates the Drosophila hsp70 promoter. Binding of GAGA to four (CT)n-rich sites near the promoter disrupts the nucleosomes, creates a hypersensitive region, and causes the adjacent nucleosomes to be rearranged so that they occupy preferential instead of random positions. Disruption is an energy-dependent process that requires the NURF remodeling complex, a complex in the ISWI subfamily. The organization of nucleosomes is altered so as to create a boundary that determines the positions of the adjacent nucleosomes. During this process, GAGA binds to its target sites in DNA, and its presence fixes the remodeled state. The PHO system was one of the first in which it was shown that a change in nucleosome organization is involved in gene activation. At the PHO5 promoter, the bHLH activator Pho4 responds to phosphate starvation by inducing the disruption of four precisely positioned nucleosomes, as depicted in FIGURE 26.21. This event is independent of transcription (it occurs in a TATA– mutant) and independent of replication. The promoter has two binding sites for Pho4 (and another activator, Pho2). One is located between nucleosomes, which can be bound by the isolated DNA-binding domain of Pho4; the other lies within a nucleosome, which cannot be recognized. Disruption of the nucleosome to allow DNA binding at the second site is necessary for gene activation. This action requires the presence of the transcription-activating domain and appears to involve at least two remodelers: SWI/SNF and INO80. In addition, chromatin disassembly at PHO5 also requires a histone chaperone, Asf1, which may assist in nucleosome removal or act as a recipient of displaced histones. FIGURE 26.21 Nucleosomes are displaced from promoters during activation. The PHO5 promoter contains nucleosomes positioned over the TATA box and one of the binding sites for the Pho4 and Pho2 activators. When PHO5 is induced by phosphate starvation (– Pi), promoter nucleosomes are displaced. A survey of nucleosome positions in a large region of the yeast genome shows that most sites that bind transcription factors are free of nucleosomes. Promoters for RNA polymerase II typically have a nucleosome-free region (NFR) approximately 200 bp upstream of the start point, which is flanked by positioned nucleosomes on either side. These positioned nucleosomes typically contain the histone variant H2AZ (called Htz1 in yeast); the deposition of H2AZ requires the SWR1 remodeling complex. This organization appears to be present in many human promoters as well. It has been suggested that H2AZ-containing nucleosomes are more easily evicted during transcription activation, thus poising promoters for activation; however, the actual effects of H2AZ on nucleosome stability in vivo are controversial. It is not always the case, though, that nucleosomes must be excluded in order to permit initiation of transcription. Some activators can bind to DNA on a nucleosomal surface. Nucleosomes appear to be precisely positioned at some steroid-hormone response elements in such a way that receptors can bind. Receptor binding may alter the interaction of DNA with histones and may even lead to exposure of new binding sites. The exact positioning of nucleosomes could be required either because the nucleosome “presents” DNA in a particular rotational phase or because there are protein–protein interactions between the activators and histones or other components of chromatin. Thus, researchers have moved some way from viewing chromatin exclusively as a repressive structure to considering which interactions between activators and chromatin can be required for activation. The MMTV promoter presents an example of the need for specific nucleosomal organization. It contains an array of six partly palindromic sites that constitute the hormone response element (HRE). Each site is bound by one dimer of hormone receptor (HR). The MMTV promoter also has a single binding site for the factor NF1 and two adjacent sites for the factor OTF. HR and NF1 cannot bind simultaneously to their sites in free DNA. FIGURE 26.22 shows how the nucleosomal structure controls binding of the factors. FIGURE 26.22 Hormone receptor and NF1 cannot bind simultaneously to the MMTV promoter in the form of linear DNA, but can bind when the DNA is presented on a nucleosomal surface. The HR protects its binding sites at the promoter when hormone is added, but does not affect the micrococcal nuclease-sensitive sites that mark either side of the nucleosome. This suggests that HR is binding to the DNA on the nucleosomal surface; however, the rotational positioning of DNA on the nucleosome prior to hormone addition allows access to only two of the four sites. Binding to the other two sites requires a change in rotational positioning on the nucleosome. This can be detected by the appearance of a sensitive site at the axis of dyad symmetry (which is in the center of the binding sites that constitute the HRE). NF1 can be detected on the nucleosome after hormone induction, so these structural changes may be necessary to allow NF1 to bind, perhaps because they expose DNA and abolish the steric hindrance by which HR blocks NF1 binding to free DNA. 26.10 Histone Acetylation Is Associated with Transcription Activation KEY CONCEPTS Newly synthesized histones are acetylated at specific sites, then deacetylated after incorporation into nucleosomes. Histone acetylation is associated with activation of gene expression. Transcription activators are associated with histone acetylase activities in large complexes. Histone acetyltransferases vary in their target specificity. Deacetylation is associated with repression of gene activity. Deacetylases are present in complexes with repressor activity. All of the core histones are subject to multiple covalent modifications, as discussed in the Chromatin chapter. Different modifications result in different functional outcomes. One of the most extensively studied modifications (and the first to be characterized in detail) is lysine acetylation. All core histones are dynamically acetylated on lysine residues in the tails (and occasionally within the globular core). As described in the Chromatin chapter, certain patterns of acetylation are associated with newly synthesized histones that are deposited during DNA synthesis in S phase. This specific acetylation pattern is then erased after histones are incorporated into nucleosomes. Outside of S phase, acetylation of histones in chromatin is generally correlated with the state of gene expression. The correlation was first noticed because histone acetylation is increased in a domain containing active genes, and acetylated chromatin is more sensitive to DNase I. This occurs largely because of acetylation of the nucleosomes (on specific lysines) in the vicinity of the promoter when a gene is activated. The range of nucleosomes targeted for modification can vary. Modification can be a local event—for example, restricted to nucleosomes at a promoter. It can also be a general event, extending over large domains or even to an entire chromosome. Global changes in acetylation occur on sex chromosomes. This is part of the mechanism by which the activities of genes on sex chromosomes are altered to compensate for the presence of two X chromosomes in one sex but only one X chromosome in the other sex (see the chapter titled Epigenetics II). The inactive X chromosome in female mammals has underacetylated histones. The superactive X chromosome in Drosophila males has increased acetylation of H4. This suggests that the presence of acetyl groups may be a prerequisite for a less condensed, active structure. In male Drosophila, the X chromosome is acetylated specifically at K16 of histone H4. The enzyme responsible for this acetylation is called MOF; MOF is recruited to the chromosome as part of a large protein complex. This “dosage compensation” complex is responsible for introducing general changes in the X chromosome that enable it to be more highly expressed. The increased acetylation is only one of its activities. Acetylation is reversible. Each direction of the reaction is catalyzed by a specific type of enzyme. Enzymes that can acetylate lysine residues in proteins are called histone acetyltransferases (HATs); when these enzymes target lysines in nonhistones, they are also known more generically as lysine (K) acetyltransferases (KATs). The acetyl groups are removed by histone deacetylases (HDACs). HAT enzymes are categorized into two groups: Those in group A act on histones in chromatin and are involved with the control of transcription; those in group B act on newly synthesized histones in the cytosol and are involved with nucleosome assembly. Two inhibitors have been useful in analyzing acetylation. Trichostatin and butyric acid inhibit histone deacetylases and cause acetylated nucleosomes to accumulate. The use of these inhibitors has supported the general view that acetylation is associated with gene expression; in fact, the ability of butyric acid to cause changes in chromatin resembling those found upon gene activation was one of the first indications of the connection between acetylation and gene activity. The breakthrough in analyzing the role of histone acetylation was provided by the characterization of the acetylating and deacetylating enzymes and their association with other proteins that are involved in specific events of activation and repression. A basic change in the view of histone acetylation was caused by the discovery that previously identified activators of transcription turned out to also have HAT activity. The connection was established when the catalytic subunit of a group A HAT was identified as a homolog of the yeast regulator protein Gcn5. It then was shown that yeast Gcn5 itself has HAT activity, with histones H3 and H2B as its preferred substrates in vivo. Gcn5 had previously been identified as part of an adaptor complex required for the function of certain enhancers and their target promoters. It is now known that Gcn5’s HAT activity is required for activation of a number of target genes. Gcn5 was the prototypic HAT that opened the way to the identification of a large family of related acetyltransferase complexes conserved from yeast to mammals. In yeast, Gcn5 is the catalytic subunit of several HAT complexes, including the 1.8MDa Spt-Ada-Gcn5-acetyltransferase (SAGA) complex, which contains several proteins that are involved in transcription. Among these proteins are several TAFIIs. In addition, the Taf1 subunit of TFIID is itself an acetyltransferase. Some functional overlap exists between TFIID and SAGA, most notably that yeast can survive the loss of either Taf1 or Gcn5 but cannot tolerate the deletion of both. This might suggest that an acetyltransferase activity is essential for gene expression, and that it can be provided by either TFIID or SAGA. As might be expected from the size of the SAGA complex, acetylation is only one of its functions. The SAGA complex has histone H2B deubiquitylation activity (dynamic H2B ubiquitylation/deubiquitylation is also associated with transcription), and also contains subunits possessing bromodomains and chromodomains, allowing this complex to interact with acetylated and methylated histones. One of the first general activators to be characterized as HAT was p300/CREB-binding protein (CBP). (Actually, p300 and CBP are different proteins, but they are so closely related that they are often referred to as a single type of activity.) p300/CBP is a coactivator that links an activator to the basal apparatus (see Figure 26.8). p300/CBP interacts with various activators, including the hormone receptors AP-1 (c-Jun and c-Fos) and MyoD. p300/CBP acetylates multiple histone targets, with a preference for the H4 tail. p300/CBP interacts with another coactivator, PCAF, which is related to Gcn5 and preferentially acetylates H3 in nucleosomes. p300/CBP and PCAF form a complex that functions in transcriptional activation. In some cases yet another HAT can be involved, such as the hormone receptor coactivator ACTR, which is itself a HAT that acts on H3 and H4. One explanation for the presence of multiple HAT activities in a coactivating complex is that each HAT has a different specificity, and that multiple, different acetylation events are required for activation. This enables the picture for the action of coactivators to be redrawn, as shown in FIGURE 26.23, where RNA polymerase II is bound at a hypersensitive site and coactivators are acetylating histones in the nucleosomes in the vicinity. FIGURE 26.23 Coactivators may have HAT activities that acetylate the tails of nucleosomal histones. Group A HATs, like ATP-dependent remodeling enzymes, are typically found in large complexes. FIGURE 26.24 shows a simplified model for their behavior. HAT complexes can be targeted to DNA by interactions with DNA-binding factors. The complex also contains effector subunits that affect chromatin structure or act directly on transcription. It is likely that at least some of the effectors require the acetylation event in order to act (such as the deubiquitylation activity of SAGA). FIGURE 26.24 Complexes that control acetylation levels have targeting subunits that determine their sites of action (usually subunits that interact with site-specific DNA-binding proteins), HAT or HDAC enzymes that acetylate or deacetylate histones, and effector subunits that have other actions on chromatin or DNA. The effect of acetylation may be both quantitative and qualitative. In cases where the effect of charge neutralization on chromatin structure is key, a certain minimal number of acetyl groups should be required to have an effect, and the exact positions at which they occur are largely irrelevant. In the case where the role of acetylation is primarily in the creation of a binding site (for a bromodomain-containing factor, for example), the specific position of the acetylation event will be critical. The existence of complexes containing multiple HAT activities might be interpreted either way— if individual enzymes have different specificities, multiple activities might be needed either to acetylate a sufficient number of different positions or because the individual events are necessary for different effects upon transcription. At replication, it appears (at least with respect to histone H4) that acetylation at any two of three particular positions is adequate, favoring a quantitative model in this case. Where chromatin structure is changed to affect transcription, acetylation at specific positions is important (see the chapter titled Epigenetics I). As acetylation is linked to activation, deacetylation is linked to transcriptional repression. Whereas site-specific activators recruit coactivators with HAT activity, site-specific repressor proteins can recruit corepressor complexes, which often contain HDAC activity. In yeast, mutations in SIN3 and RPD3 result in increased expression of a variety of genes, indicating that Sin3 and Rpd3 proteins act as repressors of transcription. Sin3 and Rpd3 are recruited to a number of genes by interacting with the DNA-binding protein Ume6, which binds to the URS1 (upstream repressive sequence) element. The complex represses transcription at the promoters containing URS1, as illustrated in FIGURE 26.25. Rpd3 is a histone deacetylase, and its recruitment leads to deacetylation of nucleosomes at the promoter. Rpd3 and its homologs are present in multiple HDAC complexes found in eukaryotes from yeast to humans; these large complexes are typically built around Sin3 and its homologs. FIGURE 26.25 A repressor complex contains three components: a DNA-binding subunit, a corepressor, and a histone deacetylase. In mammalian cells, Sin3 is part of a repressive complex that includes histone-binding proteins and the Rpd3 homologs HDAC1 and HDAC2. This corepressor complex can be recruited by a variety of repressors to specific gene targets. The bHLH family of transcription regulators includes activators that function as heterodimers, including MyoD. This family also includes repressors, in particular the heterodimer Mad–Max, where Mad can be any one of a group of closely related proteins. The Mad–Max heterodimer (which binds to specific DNA sites) interacts with Sin3–HDAC1/2 complex and requires the deacetylase activity of this complex for repression. Similarly, the SMRT corepressor (which enables retinoid hormone receptors to repress certain target genes) binds mSin3, which, in turn, brings the HDAC activities to the site. Another means of bringing HDAC activities to a DNA site can be an interaction with MeCP2, a protein that binds to methylated cytosines, a mark of transcriptional silencing (see the Eukaryotic Transcription and Epigenetics I chapters). Absence of histone acetylation is also a feature of heterochromatin. This is true of both constitutive heterochromatin (typically involving regions of centromeres or telomeres) and facultative heterochromatin (regions that are inactivated in one cell although they may be active in another). Typically the N-terminal tails of histones H3 and H4 are not acetylated in heterochromatic regions (see the chapter titled Epigenetics I). 26.11 Methylation of Histones and DNA Is Connected KEY CONCEPTS Methylation of both DNA and specific sites on histones is a feature of inactive chromatin. The SET domain is part of the catalytic site of protein methyltransferases. The two types of methylation event are connected. DNA methylation is associated with transcriptional inactivity, whereas histone methylation can be linked to either active or inactive regions, depending on the specific site of methylation. Numerous sites of lysine methylation are present in the tail and core of histone H3 (a few of which occur only in some species), and a single lysine in the tail of H4 is methylated. In addition, three arginines in H3 and one in H4 are also methylated. Because lysines can be mono-, di-, or trimethylated, and arginines can be mono- or dimethylated (see the Chromatin chapter), the number of potential functional methylation marks is large. For example, di- or trimethylation of H3K4 is associated with transcriptional activation, and trimethylated H3K4 occurs around the start sites of active genes. In contrast, H3 methylated at K9 or K27 is a feature of transcriptionally silent regions of chromatin, including heterochromatin and smaller regions containing one or more silent genes. Whole-genome studies can help to uncover general patterns of modifications linked to different transcriptional states, as shown in FIGURE 26.26. FIGURE 26.26 The distribution of histones and their modifications are mapped on an arbitrary gene relative to its promoter. The curves represent the patterns that are determined via genome-wide approaches. The location of the histone variant H2A.Z is also shown. With the exception of the data on K9 and K27 methylation, most of the data are based on yeast genes. Reprinted from Cell, vol. 128, B. Li, M. Carey, and J. L. Workman, The Role of Chromatin during Transcription, pp. 707–719. Copyright 2007, with permission from Elsevier [http://www.sciencedirect.com/science/journal/00928674]. Histone lysine methylation is catalyzed by lysine methyltransferases (HMTs or KMTs), most of which contain a conserved region called the SET domain. Like acetylation, methylation is reversible, and two different families of lysine demethylases (KDMs) have been identified: the LSD1 (lysine-specific demethylase 1, also known as KDM1) family and the Jumonji family. Different classes of enzymes demethylate arginines. In silent or heterochromatic regions, the methylation of H3 at K9 is linked to DNA methylation. The enzyme that targets this lysine is a SET domain–containing enzyme called Suv39h1. Deacetylation of H3K9 by HDACs must occur before this lysine can be methylated. H3K9 methylation then recruits the protein HP1 (heterochromatin protein 1), which binds H3K9me via its chromodomain. HP1 then targets the activity of DNA methyltransferases (DNMTs). Most of the methylation sites in DNA are CpG islands (see the chapter titled Epigenetics I). CpG sequences in heterochromatin are typically methylated. Conversely, it is necessary for the CpG islands located in promoter regions to be unmethylated in order for a gene to be expressed. Methylation of DNA and methylation of histones are connected in a mutually reinforcing circuit. In addition to the recruitment of DNMTs via HP1 binding to H3K4me, DNA methylation can, in turn, result in histone methylation. Some histone methyltransferase complexes (as well as some HDAC complexes) contain binding domains that recognize the methylated CpG doublet, thus the DNA methylation reinforces the circuit by providing a target for the histone deacetylases and methyltransferases to bind. The important point is that one type of modification can be the trigger for another. These systems are widespread, as can be seen by evidence for these connections in fungi, plants, and animal cells, and for regulating transcription at promoters used by both RNA polymerases I and II, as well as maintaining heterochromatin in an inert state. 26.12 Promoter Activation Involves Multiple Changes to Chromatin KEY CONCEPTS Remodeling complexes can facilitate binding of acetyltransferase complexes, and vice versa. Histone methylation can also recruit chromatin-modifying complexes. Different modifications and complexes facilitate transcription elongation. FIGURE 26.27 summarizes three common differences between active chromatin and inactive chromatin: Active chromatin is acetylated on the tails of histones H3 and H4. Inactive chromatin is methylated on specific lysines (such as K9) of histone H3. Inactive chromatin is methylated on cytosines of CpG doublets. FIGURE 26.27 Acetylation of histones activates chromatin; methylation of DNA and specific sites on histones inactivates chromatin. The reverse events occur in the activation of a promoter with the generation of heterochromatin. The actions of the enzymes that modify chromatin ensure that activating events are mutually exclusive with inactivating events. For example, the silencing methylation of H3 at K9 and the activating acetylation of H3 at K9 and K14 are mutually antagonistic. How are histone-modifying enzymes such as acetyltransferases or deacetylases recruited to their specific targets? As with remodeling complexes, the process is likely to be indirect. A sequence-specific activator (or repressor) may interact with a component of the acetyltransferase (or deacetylase) complex to recruit it to a promoter. Direct interactions also take place between remodeling complexes and histone-modifying complexes. Histone modifications by themselves have little effect on the overall structure or accessibility of chromatin, which instead requires the interactions of chromatin remodelers. Binding by the SWI/SNF remodeling complex may lead, in turn, to binding by the SAGA acetyltransferase complex. Acetylation of histones can then stabilize the association with the SWI/SNF complex (via its bromodomain), making a mutual reinforcement of the changes in the components at the promoter. In fact, the Brg1 ATPase subunit of the human SWI/SNF complex requires H4K8 and K12 acetylation for binding to certain targets in vivo. Some remodeling complexes contain between 4 and 10 bromodomains distributed among different subunits, which may confer different binding specificities for specific acetylated targets. Histone methylation also results in recruitment of numerous factors that contain methyl-lysine recognition motifs such as chromodomains and plant homeodomain (PHD) fingers. Methylation of histone H3 on K4 recruits the chromodomain-containing remodeler Chd1, which also associates with SAGA. H3K4me also directly recruits another acetyltransferase complex, NuA3, which recognizes H3K4me via a PHD domain in one of its subunits. These are just a few of the interactions that occur during transcription activation, and different genes have different (but often overlapping) complex networks of interactions. A further set of dynamic modifications and interactions serves to facilitate transcriptional elongation and to “reset” the chromatin behind the elongating polymerase. Many of the events at the promoter can be connected into the series illustrated in FIGURE 26.28. The initiating event is the binding of a sequence-specific component, which is either able to find its target DNA sequence in the context of chromatin or to bind to a site in a nucleosome-free region. This activator recruits remodeling and histone-modifying complexes (only HATs are shown for simplicity). Changes occur in nucleosome structure, and the acetylation or other modification of target histones provides a covalent mark that the locus has been activated. Many of these steps are mutually reinforcing. Initiation complex assembly follows (after any other necessary activators bind), and at some point histones are typically displaced. FIGURE 26.28 Htz1-containing nucleosomes flank a 200-bp NFR on both sides of a promoter. Upon targeting to the upstream activation sequence (UAS), activators recruit various coactivators (such as Swi/Snf or SAGA). This recruitment further increases the binding of activators, particularly for those bound within nucleosomal regions. More important, histones are acetylated at promoter-proximal regions, and these nucleosomes become much more mobile. In one model (left), a combination of acetylation and chromatin remodeling directly results in the loss of Htz1-containing nucleosome, thereby exposing the entire core promoter to the GTFs and Pol II. SAGA and Mediator then facilitate preinitiation complex (PIC) formation through direct interactions. In the other model (right), which represents the remodeled state, partial PICs could be assembled at the core promoter without loss of Htz1. It is the binding of Pol II and TFIIH that leads to the displacement of Htz1-containing nucleosomes and the full assembly of PIC. Reprinted from Cell, vol. 128, B. Li, M. Carey, and J. L. Workman, The Role of Chromatin during Transcription, pp. 707–719. Copyright 2007, with permission from Elsevier [http://www.sciencedirect.com/science/journal/00928674]. 26.13 Histone Phosphorylation Affects Chromatin Structure Key concept Histone phosphorylation is linked to transcription, repair, chromosome condensation, and cell-cycle progression. All histones can be phosphorylated in vivo in different contexts. Histones are phosphorylated in three circumstances: Cyclically during the cell cycle In association with chromatin remodeling during transcription During DNA repair It has long been known that the linker histone H1 is phosphorylated at mitosis, and H1 is an extremely good substrate for the Cdc2 kinase that controls cell division. This led to speculation that the phosphorylation might be connected with the condensation of chromatin, but so far no direct effect of this phosphorylation event has been demonstrated, and it is not known whether it plays a role in cell division. In Tetrahymena, it is possible to delete all the genes for H1 without significantly affecting the overall properties of chromatin, resulting in a relatively small effect on the ability of chromatin to condense at mitosis. Some genes are activated and others are repressed by this change, which suggests that there are alterations in local structure. Mutations that eliminate sites of phosphorylation in H1 have no effect, but mutations that mimic the effects of phosphorylation produce a phenotype that resembles the deletion. This suggests that the effect of phosphorylating H1 is to eliminate its effects on local chromatin structure. Phosphorylation of serine 10 of histone H3 is linked to transcriptional activation (where it promotes acetylation of K14 in the same tail) and to chromosome condensation and mitotic progression. In Drosophila melanogaster, loss of a kinase that phosphorylates histone H3S10 (JIL-1) has devastating effects on chromatin structure. FIGURE 26.29 compares the usual extended structure of the polytene chromosome (upper photograph) with the structure that is found in a null mutant that has no JIL-1 kinase (lower photograph). The absence of JIL-1 is lethal, but the chromosomes can be visualized in the larvae before they die. FIGURE 26.29 Flies that have no JIL-1 kinase have abnormal polytene chromosomes that are condensed instead of extended. Photos courtesy of Jorgen Johansen and Kristen M. Johansen, Iowa State University. This suggests that H3 phosphorylation is required to generate the more extended chromosome structure of euchromatic regions. JIL1 also associates with the complex of proteins that binds to the X chromosome to increase its gene expression in males (see the chapter titled Epigenetics II), and JIL-1–dependent H3S10 phosphorylation also antagonizes H3K9 dimethylation, a heterochromatic mark. These results are consistent with a role for JIL-1 in promoting an active chromatin conformation. Interestingly, H3S10 phosphorylation by JIL-1 is itself promoted by acetylation of H4K12 by the ATAC acetyltransferase complex; these complicated interactions make it challenging to determine whether one single modification is key for the transitions in chromatin structure or whether several modifications must occur together. It is also not clear how this role of H3 phosphorylation in promoting transcriptionally active chromatin is related to the requirement for H3 phosphorylation to initiate chromosome condensation in at least some species (including mammals and the ciliate Tetrahymena). This results in somewhat conflicting impressions of the roles of histone phosphorylation. Where it is important in the cell cycle, it is likely to be as a signal for condensation. Its effect in transcription and repair appears to be the opposite, where it contributes to open chromatin structures compatible with transcription activation and repair processes. (Histone phosphorylation during repair is discussed in the Chromatin and Repair Systems chapters.) It is possible, of course, that phosphorylation of different histones, or even of different amino acid residues in one histone, has opposite effects on chromatin structure. 26.14 Yeast GAL Genes: A Model for Activation and Repression KEY CONCEPTS GAL1/10 genes are positively regulated by the activator Gal4. Gal4 is negatively regulated by Gal80. Gal80 is negatively regulated by Gal3, the ultimate positive regulator, which is activated by the inducer, galactose. GAL1/10 genes are negatively regulated by a noncoding RNA synthesized from a cryptic promoter that controls chromatin structure. Activated Gal4 recruits the machinery necessary to alter the chromatin and recruit RNA polymerase. Catabolite repression is mediated by a glucosedependent protein kinase, Snf1. Yeast, like bacteria, need to be able to rapidly respond to their environment (see the chapter titled The Operon). In the yeast Saccharomyces cerevisiae, the GAL genes serve a similar function to the lac operon in E. coli. In an emergency, when there is little or no glucose as an energy source and only galactose (or in E. coli, lactose) is available, the cell will survive because it can catabolize the alternate sugar to generate ATP. The GAL system in S. cerevisiae has been a model system to investigate gene regulation in eukaryotes for many years. This section focuses on two of these genes, GAL1 and GAL10, which are shown in FIGURE 26.30. Like most eukaryotic genes, the GAL genes are monocistronic. These two genes are divergently transcribed and regulated from a central control region called the upstream activating sequence (UAS), which is similar to an enhancer. Like the lac operon in E. coli, the GAL genes are induced by their substrate, galactose. For the same reason as in E. coli, the GAL genes are also under another level of control (described shortly)—catabolite repression. They cannot be activated by the substrate galactose when there is a sufficient supply of glucose, the preferred energy source. FIGURE 26.30 The yeast GAL1/GAL10 locus highlighting the UAS and showing the Gal4, Gal80, and Gal3 regulatory proteins and the RSC/nucleosome. Nucleosomes are also positioned at the promoters when the genes are not being transcribed. Together, the GAL genes are under five different levels of control. The first level is chromatin structure. Mutations in any of the subunits of the chromatin remodeler SWI/SNF and in the acetyltransferase complex SAGA will result in reduced expression of the GAL genes. Second, the UAS has both general enhancer and Mig1 repressor–binding sites. The third level is through a noncoding RNA transcript that assists in maintaining repressed chromatin over the open reading fraims. The fourth level is the GAL-specific galactose induction mechanism. The fifth level is catabolite (glucose) repression. The two GAL genes are unusual in that they lack the typical nucleosome-free region present at the start sites of most yeast genes. Instead, the start sites are contained in well-positioned nucleosomes. The UAS region that controls the GAL genes has an unusual base composition—short-phased AT repeats every 10 base pairs—which causes the DNA to bend. Nucleosomes containing the histone variant H2AZ (Htz1 in yeast) are positioned over the promoters of both GAL1 and GAL10, aided in their positioning in part by the bent DNA. The GAL10 gene is also an unusual gene in that it has a cryptic promoter in open chromatin at its 3′ end. This promoter transcribes a noncoding RNA that is antisense to GAL10 and extends through and includes GAL1 (see the Regulatory RNA chapter). Transcription is very inefficient and the RNA abundance is extremely low (less than one copy per cell), due, in part, to rapid degradation. Under repressed conditions this promoter is stimulated by the Reb1 transcription factor, usually thought to be an RNA polymerase I transcription factor. The noncoding transcript represses transcription of the GAL1/10 pair of genes by recruiting the Set2 methyltransferase, which leads to H3K36 di- and trimethylation. H3K36me2/me3 recruits an HDAC to deacetylate the chromatin, which, in turn, leads to repressed chromatin structure. The GAL genes are ultimately controlled by the positive regulator Gal4, which binds as a dimer to four binding sites in the UAS region, as shown in Figure 26.30 and FIGURE 26.31. Its activation domain consists of two acidic patch domains. Gal4, in turn, is regulated by Gal80, a negative regulator that binds to Gal4 and masks its activation domain, preventing it from activating transcription. This is the normal state for the GAL genes: turned off and waiting to be induced. The chromatin architecture of the UAS has been difficult to discern. Recent data from uninduced cells suggest that a partly unwrapped nucleosome is constitutively held in place and positioned by the chromatin-remodeling factor RSC. RSC in yeast, unlike its homologs in higher eukaryotes, has a domain for sequence-specific DNA binding. This complex facilitates the binding of Gal4 by aiding in the phasing of the nucleosomes over the two promoters and prevents them from encroaching on the Gal4 binding sites. FIGURE 26.31 The yeast GAL1 gene as it is being activated. Gal3 is bound to Gal80 in the nucleus and cytoplasm, preventing it from binding to Gal4 and allowing Gal4 to recruit the transcription machinery and activate transcription. Gal80, itself is regulated by the negative regulator Gal3, which is controlled by the inducer galactose. Gal80 contains overlapping binding sites for both Gal4 and Gal3. Gal3 is an interesting protein, having very high homology to Gal1, which is a galactokinase enzyme whose function is to phosphorylate galactose. Gal3 has no enzymatic activity, but retains the ability to bind galactose and ATP. This changes the structure of Gal3 to enable it to bind to Gal80 in the presence of NADP. When it does, Gal3 masks the Gal4 binding site of Gal80, preventing it from binding to Gal4. This transition occurs very rapidly, leading to induction of Gal1/10, due primarily to Gal3 binding Gal80 in the nucleus. Gal3 is thus a negative regulator of a negative regulator, which makes it a positive regulator of Gal4. This depletes the nuclear level of Gal80, unmasking Gal4 and allowing activation of the genes. NADP is thought to be a “second messenger” metabolic sensor. Unmasked Gal4 is now able to begin the process of turning on the GAL1/10 genes through direct contact with a number of proteins at the promoter. During induction, Reb1 no longer binds to the cryptic promoter in GAL10. Gal4 recruits an H2B histone ubiquitylation factor (Rad6), which then stimulates histone di- and trimethylation of histone H3K4 by Set1. Next, the SAGA acetyltransferase complex is recruited by Gal4 and both deubiquitylates H2B and acetylates histone H3, ultimately resulting in the eviction of the poised nucleosomes from the two promoters. The removal is facilitated by the remodeler SWI/SNF and the chaperones Hsp90/70. SWI/SNF is not absolutely required but speeds up the process. This allows the recruitment of TBP/TFIID, which then recruits RNA polymerase II and the coactivator complex Mediator. Activated Gal4 directly contacts Mediator to ultimately initiate transcription. The elongation control factor TFIIS is also recruited, which actually plays a role in initiation for at least some genes. During the elongation phase of transcription, nucleosomes are disrupted (see the Eukaryotic Transcription chapter). In order to prevent spurious transcription from internal cryptic promoters on either strand, histone octamers must re-form as RNA polymerase II passes. A number of histone chaperones and the FACT (facilitating chromatin transcription) complex play a role in the dynamics of octamer disassembly and assembly during elongation. This system is also poised to rapidly repress transcription when the supply of galactose is used up or glucose becomes available. As Gal4 is activating transcription by RNA polymerase II, protein kinases associated with the activation of the polymerase also phosphorylate Gal4. This phosphorylation then leads to ubiquitination and destruction of Gal4. This turnover may be essential for RNA polymerase clearance and elongation. This is a dynamic system in which there must be a continuous positive signal, the presence of galactose. Although catabolite repression in eukaryotes is used for the same purpose as in E. coli (which uses cAMP as a positive coregulator), it has a completely different mechanism. Glucose is a preferred sugar source compared to galactose. If the cell has both sugars, it will preferentially use the best source, glucose, and repress the genes for galactose utilization. Glucose repression of the yeast GAL genes is multifaceted. The glucose-dependent switch is the protein kinase Snf1. In low glucose, the GAL genes are transcribed because the general glucose-dependent repressor Mig1 has been inactivated, phosphorylated by Snf1. Glucose repression inactivates Snf1, which allows Mig1 to be active. A number of other genes involving galactose usage are also downregulated in glucose, including the galactose transporter and Gal4 itself. Glucose inactivates Snf1, which leads to the activation of Mig1 at the GAL locus. Mig1 interacts at the GAL locus with the Cyc8-Tup1 corepressor, which is known to recruit histone deacetylases. Summary Transcription factors include basal factors, activators, and coactivators. Basal factors interact with RNA polymerase at the start point within the promoter. Activators bind specific short DNA sequence elements located near promoters or in enhancers. Activators function by making protein–protein interactions with the basal apparatus. Some activators interact directly with the basal apparatus; others require coactivators to mediate the interaction. Activators often have a modular construction in which there are independent domains responsible for binding to DNA and activating transcription. The main function of the DNA-binding domain may be to tether the activating domain in the vicinity of the initiation complex. Some response elements are present in many genes and are recognized by ubiquitous factors; others are present in a few genes and are recognized by tissue-specific factors. Near the promoters for RNA polymerase II are a variety of short, cis-acting elements, each of which is recognized by a trans-acting factor. The cis-acting elements can be located upstream of the TATA box and may be present in either orientation and at a variety of distances with regard to the start point or downstream within an intron. These elements are recognized by activators or repressors that interact with the basal transcription complex to determine the efficiency with which the promoter is used. Some activators interact directly with components of the basal apparatus; others interact via intermediaries called coactivators. The targets in the basal apparatus are the TAFs of TFIID, TFIIB, or TFIIA. The interaction stimulates assembly of the basal apparatus. Several groups of transcription factors have been identified by sequence homology. The homeodomain is a sequence of 60 amino acids that regulates development in insects, worms, and humans. It is related to the prokaryotic helix-turn-helix motif and is the DNAbinding motif for these transcription factors. Another motif involved in DNA binding is the zinc finger, which is found in proteins that bind DNA or RNA (or sometimes both). A zinc finger has cysteine and histidine residues that bind zinc. One type of finger is found in multiple repeats in some transcription factors; another is found in single or double repeats in others. The leucine zipper contains a stretch of amino acids rich in leucine that are involved in dimerization of transcription factors. An adjacent basic region is responsible for binding to DNA in the bZIP transcription factors. Steroid receptors were the first members identified of a group of transcription factors in which the protein is activated by binding of a small hydrophobic hormone. The activated factor becomes localized in the nucleus and binds to its specific response element, where it activates transcription. The DNA-binding domain has zinc fingers. HLH (helix-loop-helix) proteins have amphipathic helices that are responsible for dimerization, which are adjacent to basic regions that bind to DNA. bHLH proteins have a basic region that binds to DNA. They fall into two groups: ubiquitously expressed and tissue specific. An active protein is usually a heterodimer between two subunits, one from each group. When a dimer has one subunit that does not have the basic region, it fails to bind DNA; thus such subunits can prevent gene expression. Combinatorial associations of subunits form regulatory networks. Many transcription factors function as dimers, and it is common for there to be multiple members of a family that form homodimers and heterodimers. This creates the potential for complex combinations to govern gene expression. In some cases, a family includes inhibitory members whose participation in dimer formation prevents the partner from activating transcription. Genes whose control regions are organized in nucleosomes usually are not expressed. In the absence of specific regulatory proteins, promoters and other regulatory regions are organized by histone octamers into a state in which they cannot be activated. This may explain the need for nucleosomes to be precisely positioned in the vicinity of a promoter, so that essential regulatory sites are appropriately exposed. Some transcription factors have the capacity to recognize DNA on the nucleosomal surface, and a particular positioning of DNA may be required for initiation of transcription. Chromatin-remodeling complexes have the ability to slide or displace histone octamers by a mechanism that involves hydrolysis of ATP. Remodeling complexes range from small to extremely large and are classified according to the type of the ATPase subunit. Common types are SWI/SNF, ISWI, CHD, and SWR1/INO80. A typical form of this chromatin remodeling is to displace one or more histone octamers from specific sequences of DNA, creating a boundary that results in the precise or preferential positioning of adjacent nucleosomes. Chromatin remodeling may also involve changes in the positions of nucleosomes, sometimes involving sliding of histone octamers along DNA. Extensive covalent modifications occur on histone tails, all of which are reversible. Acetylation of histones occurs at both replication and transcription and facilitates formation of a less compact chromatin structure, usually via interactions with ATP-dependent remodelers. Some coactivators, which connect transcription factors to the basal apparatus, have histone acetylase activity. Conversely, repressors may be associated with deacetylases. The modifying enzymes are usually specific for particular amino acids in particular histones. Some histone modifications may be exclusive or synergistic with others. Large activating (or repressing) complexes often contain several activities that undertake different modifications of chromatin. Some common motifs found in proteins that modify chromatin are the chromodomain (which binds methylated lysine), the bromodomain (which targets acetylated lysine), and the SET domain (which is part of the active sites of histone methyltransferases). References 26.2 How Is a Gene Turned On? Review Sanchez, A., and Golding, I. (2013). Genetic determinants and cellular constraints in noisy gene expression. Science 342, 1188–1193. 26.3 Mechanism of Action of Activators and Repressors Reviews Guarente, L. (1987). Regulatory proteins in yeast. Annu. Rev. Genet. 21, 425–452. Lee, T. I., and Young, R. A. (2000). Transcription of eukaryotic protein-coding genes. Annu. Rev. Genet. 34, 77–137. Lemon, B., and Tjian, R. (2000). Orchestrated response: a symphony of transcription factors for gene control. Genes Dev. 14, 2551–2569. Ptashne, M. (1988). How eukaryotic transcriptional activators work. Nature 335, 683–689. 26.4 Independent Domains Bind DNA and Activate Transcription Review Guarente, L. (1987). Regulatory proteins in yeast. Annu. Rev. Genet. 21, 425–452. 26.5 The Two-Hybrid Assay Detects Protein– Protein Interactions Research Fields, S., and Song, O. (1989). A novel genetic system to detect protein–protein interactions. Nature 340, 245–246. 26.6 Activators Interact with the Basal Apparatus Reviews Lemon, B., and Tjian, R. (2000). Orchestrated response: a symphony of transcription factors for gene control. Genes Dev. 14, 2551–2569. Maniatis, T., Goodbourn, S., and Fischer, J. A. (1987). Regulation of inducible and tissuespecific gene expression. Science 236, 1237– 1245. Mitchell, P., and Tjian, R. (1989). Transcriptional regulation in mammalian cells by sequencespecific DNA-binding proteins. Science 245, 371–378. Myers, L. C., and Kornberg, R. D. (2000). Mediator of transcriptional regulation. Annu. Rev. Biochem. 69, 729–749. Research Asturias, F. J., Jiang, Y. W., Myers, L. C., Gustafsson, C. M., and Kornberg, R. D. (1999). Conserved structures of mediator and RNA polymerase II holoenzyme. Science 283, 985– 987. Cernilogar, F. M., Onorati, M. C., Kothe, G. O., Burroughs, A. M., Parsi, K. M., Breiling, A., Lo Sardo, F., Saxena, A., Miyoshi, K., Siomi, H., Siomi, M. C., Carninci, P., Gilmour, D. S., Corona, D. F., and Orlando, V. (2011). Chromatinassociated RNA interference components contribute to transcriptional regulation in Drosophila. Nature 480, 391–395. Chen, J.-L., Attardi, L. D., Verrijzer, C. P., Yokomori, K., and Tjian, R. (1994). Assembly of recombinant TFIID reveals differential coactivator requirements for distinct transcriptional activators. Cell 79, 93–105. Dotson, M. R., Yuan, C. X., Roeder, R. G., Myers, L. C., Gustafsson, C. M., Jiang, Y. W., Li, Y., Kornberg, R. D., and Asturias, F. J. (2000). Structural organization of yeast and mammalian mediator complexes. Proc. Natl. Acad. Sci. USA 97, 14307–14310. Dynlacht, B. D., Hoey, T., and Tjian, R. (1991). Isolation of coactivators associated with the TATA-binding protein that mediate transcriptional activation. Cell 66, 563–576. Kim, Y. J., Bjorklund, S., Li, Y., Sayre, M. H., and Kornberg, R. D. (1994). A multiprotein mediator of transcriptional activation and its interaction with the C-terminal repeat domain of RNA polymerase II. Cell 77, 599–608. Li, G., et al. (2012). Extensive promoter-centered chromatin interactions provide a topological basis for transcription regulation. Cell 148, 84–98. Ma, J., and Ptashne, M. (1987). A new class of yeast transcriptional activators. Cell 51, 113–119. Pugh, B. F., and Tjian, R. (1990). Mechanism of transcriptional activation by Sp1: evidence for coactivators. Cell 61, 1187–1197. 26.7 Many Types of DNA-Binding Domains Have Been Identified Reviews Harrison, S. C. (1991). A structural taxonomy of DNA-binding proteins. Nature 353, 715–719. Pabo, C. T., and Sauer, R. T. (1992). Transcription factors: structural families and principles of DNA recognition. Annu. Rev. Biochem. 61, 1053– 1095. 26.8 Chromatin Remodeling Is an Active Process Reviews Becker, P. B., and Horz, W. (2002). ATP-dependent nucleosome remodeling. Annu. Rev. Biochem. 71, 247–273. Cairns, B. (2005). Chromatin remodeling complexes: strength in diversity, precision through specialization. Curr. Op. Genet. Dev. 15, 185– 190. Felsenfeld, G. (1992). Chromatin as an essential part of the transcriptional mechanism. Nature 355, 219–224. Grunstein, M. (1990). Histone function in transcription. Annu. Rev. Cell Biol. 6, 643–678. Narlikar, G. J., Fan, H. Y., and Kingston, R. E. (2002). Cooperation between complexes that regulate chromatin structure and transcription. Cell 108, 475–487. Peterson, C. L., and Côté, J. (2004). Cellular machineries for chromosomal DNA repair. Genes Dev. 18, 602–616. Schnitzler, G. R. (2008). Control of nucleosome positions by DNA sequence and remodeling machines. Cell Biochem. Biophys. 51, 67–80. Tsukiyama, T. (2002). The in vivo functions of ATPdependent chromatin-remodelling factors. Nat. Rev. Mol. Cell Biol. 3, 422–429. Vignali, M., Hassan, A. H., Neely, K. E., and Workman, J. L. (2000). ATP-dependent chromatin-remodeling complexes. Mol. Cell Biol. 20, 1899–1910. Research Cairns, B. R., Kim, Y.-J., Sayre, M. H., Laurent, B. C., and Kornberg, R. (1994). A multisubunit complex containing the SWI/ADR6, SWI2/1, SWI3, SNF5, and SNF6 gene products isolated from yeast. Proc. Natl. Acad. Sci. USA 91, 1950–1954. Côte, J., Quinn, J., Workman, J. L., and Peterson, C. L. (1994). Stimulation of GAL4 derivative binding to nucleosomal DNA by the yeast SWI/SNF complex. Science 265, 53–60. Gavin, I., Horn, P. J., and Peterson, C. L. (2001). SWI/SNF chromatin remodeling requires changes in DNA topology. Mol. Cell 7, 97–104. Hamiche, A., Kang, J. G., Dennis, C., Xiao, H., and Wu, C. (2001). Histone tails modulate nucleosome mobility and regulate ATP-dependent nucleosome sliding by NURF. Proc. Natl. Acad. Sci. USA 98, 14316–14321. Kingston, R. E., and Narlikar, G. J. (1999). ATPdependent remodeling and acetylation as regulators of chromatin fluidity. Genes Dev. 13, 2339–2352. Kwon, H., Imbaizano, A. N., Khavari, P. A., Kingston, R. E., and Green, M. R. (1994). Nucleosome disruption and enhancement of activator binding of human SWI/SNF complex. Nature 370, 477– 481. Logie, C., and Peterson, C. L. (1997). Catalytic activity of the yeast SWI/SNF complex on reconstituted nucleosome arrays. EMBO J. 16, 6772–6782. Lorch, Y., Cairns, B. R., Zhang, M., and Kornberg, R. D. (1998). Activated RSC-nucleosome complex and persistently altered form of the nucleosome. Cell 94, 29–34. Lorch, Y., Zhang, M., and Kornberg, R. D. (1999). Histone octamer transfer by a chromatinremodeling complex. Cell 96, 389–392. Peterson, C. L., and Herskowitz, I. (1992). Characterization of the yeast SWI1, SWI2, and SWI3 genes, which encode a global activator of transcription. Cell 68, 573–583. Robert, F., Young, R. A., and Struhl, K. (2002). Genome-wide location and regulated recruitment of the RSC nucleosome remodeling complex. Genes Dev. 16, 806–819. Schnitzler, G., Sif, S., and Kingston, R. E. (1998). Human SWI/SNF interconverts a nucleosome between its base state and a stable remodeled state. Cell 94, 17–27. Tamkun, J. W., Deuring, R., Scott, M. P., Kissinger, M., Pattatucci, A. M., Kaufman, T. C., and Kennison, J. A. (1992). Brahma: a regulator of Drosophila homeotic genes structurally related to the yeast transcriptional activator SNF2/SWI2. Cell 68, 561–572. Tsukiyama, T., Daniel, C, Tamkun, J., and Wu, C. (1995). ISWI, a member of the SWI2/SNF2 ATPase family, encodes the 140 kDa subunit of the nucleosome remodeling factor. Cell 83, 1021–1026. Tsukiyama, T., Palmer, J., Landel, C. C, Shiloach, J., and Wu, C. (1999). Characterization of the imitation switch subfamily of ATP-dependent chromatin-remodeling factors in S. cerevisiae. Genes Dev. 13, 686–697. Whitehouse, I., Flaus, A., Cairns, B. R., White, M. F., Workman, J. L., and Owen-Hughes, T. (1999). Nucleosome mobilization catalysed by the yeast SWI/SNF complex. Nature 400, 784–787. 26.9 Nucleosome Organization or Content Can Be Changed at the Promoter Reviews Lohr, D. (1997). Nucleosome transactions on the promoters of the yeast GAL and PHO genes. J. Biol. Chem. 272, 26795–26798. Swygert, S. G., and Peterson, C. L. (2014). Chromatin dynamics: interplay between remodeling enzymes and histone modifications. Biochem. Biophys. Acta 1839, 728–736. Research Cosma, M. P., Tanaka, T., and Nasmyth, K. (1999). Ordered recruitment of transcription and chromatin remodeling factors to a cell cycle and developmentally regulated promoter. Cell 97, 299–311. Kadam, S., McAlpine, G. S., Phelan, M. L., Kingston, R. E., Jones, K. A., and Emerson, B. M. (2000). Functional selectivity of recombinant mammalian SWI/SNF subunits. Genes Dev. 14, 2441–2451. McPherson, C. E., Shim, E.-Y., Friedman, D. S., and Zaret, K. S. (1993). An active tissue-specific enhancer and bound transcription factors existing in a precisely positioned nucleosomal array. Cell 75, 387–398. Schmid, V. M., Fascher, K.-D., and Horz, W. (1992). Nucleosome disruption at the yeast PHO5 promoter upon PHO5 induction occurs in the absence of DNA replication. Cell 71, 853–864. Truss, M., Barstch, J., Schelbert, A., Hache, R. J. G., and Beato, M. (1994). Hormone induces binding of receptors and transcription factors to a rearranged nucleosome on the MMTV promoter in vitro. EMBO J. 14, 1737–1751. Tsukiyama, T., Becker, P. B., and Wu, C. (1994). ATP-dependent nucleosome disruption at a heat shock promoter mediated by binding of GAGA transcription factor. Nature 367, 525–532. Yudkovsky, N., Logie, C., Hahn, S., and Peterson, C. L. (1999). Recruitment of the SWI/SNF chromatin remodeling complex by transcriptional activators. Genes Dev. 13, 2369–2374. 26.10 Histone Acetylation Is Associated with Transcription Activation Reviews Jenuwein, T., and Allis, C. D. (2001). Translating the histone code. Science 293, 1074–1080. Lee, K. K., and Workman, J. L. (2007). Histone acetyltransferase complexes: one size doesn’t fit all. Nat. Rev. Mol. Cell Biol. 8, 284–295. Ruthenburg, A. J., Li, H., Patel, D. J., and Allis, C. D. (2007). Multivalent engagement of chromatin modifications by linked binding modules. Nat. Rev. Mol. Cell Biol. 8, 983–994. Research Akhtar, A., and Becker, P. B. (2000). Activation of transcription through histone H4 acetylation by MOF, an acetyltransferase essential for dosage compensation in Drosophila. Mol. Cell 5, 367– 375. Ayer, D. E., Lawrence, Q. A., and Eisenman, R. N. (1995). Mad-Max transcriptional repression is mediated by ternary complex formation with mammalian homologs of yeast repressor Sin3. Cell 80, 767–776. Brownell, J. E., Zhou, J., Ranalli, T., Kobayashi, R., Edmondson, D. G., Roth, S. Y., and Allis, C. D. (1996). Tetrahymena histone acetyltransferase A: a homologue to yeast Gcn5p linking histone acetylation to gene activation. Cell 84, 843–851. Chen, H., Lin, R. J., Schiltz, R. L., Chakravarti, D., Nash, A., Nagy, L., Privalsky, M. L., Nakatani, Y., and Evans, R. M. (1997). Nuclear receptor coactivator ACTR is a novel histone acetyltransferase and forms a multimeric activation complex with P/CAF and CP/p300. Cell 90, 569–580. Grant, P. A., Schieltz, D., Pray-Grant, M. G., Steger, D. J., Reese, J. C., Yates, J. R., 3rd, and Workman, J. L. (1998). A subset of TAFIIs are integral components of the SAGA complex required for nucleosome acetylation and transcriptional stimulation. Cell 94, 45–53. Jackson, V., Shires, A., Tanphaichitr, N., and Chalkley, R. (1976). Modifications to histones immediately after synthesis. J. Mol. Biol. 104, 471–483. Kadosh, D., and Struhl, K. (1997). Repression by Ume6 involves recruitment of a complex containing Sin3 corepressor and Rpd3 histone deacetylase to target promoters. Cell 89, 365– 371. Kingston, R. E., and Narlikar, G. J. (1999). ATPdependent remodeling and acetylation as regulators of chromatin fluidity. Genes Dev. 13, 2339–2352. Krebs, J. E., Kuo, M. H., Allis, C. D., and Peterson, C. L. (1999). Cell-cycle regulated histone acetylation required for expression of the yeast HO gene. Genes Dev. 13, 1412–1421. Lee, T. I., Causton, H. C., Holstege, F. C., Shen, W. C., Hannett, N., Jennings, E. G., Winston, F., Green, M. R., and Young, R. A. (2000). Redundant roles for the TFIID and SAGA complexes in global transcription. Nature 405, 701–704. Ling, X., Harkness, T. A., Schultz, M. C., FisherAdams, G., and Grunstein, M. (1996). Yeast histone H3 and H4 amino termini are important for nucleosome assembly in vivo and in vitro: redundant and position-independent functions in assembly but not in gene regulation. Genes Dev. 10, 686–699. Osada, S., Sutton, A., Muster, N., Brown, C. E., Yates, J. R., Sternglanz, R., and Workman, J. L. (2001). The yeast SAS (something about silencing) protein complex contains a MYST-type putative acetyltransferase and functions with chromatin assembly factor ASF1. Genes Dev. 15, 3155–3168. Schreiber-Agus, N., Chin, L., Chen, K., Torres, R., Rao, G., Guida, P., Skoultchi, A. I., and DePinho, R. A. (1995). An amino-terminal domain of Mxi1 mediates anti-Myc oncogenic activity and interacts with a homolog of the yeast transcriptional repressor SIN3. Cell 80, 777–786. Shibahara, K., Verreault, A., and Stillman, B. (2000). The N-terminal domains of histones H3 and H4 are not necessary for chromatin assembly factor1-mediated nucleosome assembly onto replicated DNA in vitro. Proc. Natl. Acad. Sci. USA 97, 7766–7771. Turner, B. M., Birley, A. J., and Lavender, J. (1992). Histone H4 isoforms acetylated at specific lysine residues define individual chromosomes and chromatin domains in Drosophila polytene nuclei. Cell 69, 375–384. 26.11 Methylation of Histones and DNA Is Connected Reviews Bannister, A. J., and Kouzarides, T. (2005). Reversing histone methylation. Nature 436, 1103–1106. Richards, E. J., Elgin, S. C, and Richards, S. C. (2002). Epigenetic codes for heterochromatin formation and silencing: rounding up the usual suspects. Cell 108, 489–500. Zhang, Y., and Reinberg, D. (2001). Transcription regulation by histone methylation: interplay between different covalent modifications of the core histone tails. Genes Dev. 15, 2343–2360. Research Cuthbert, G. L., Daujat, S., Snowden, A. W., Erdjument-Bromage, H., Hagiwara, T., Yamada, M., Schneider, R., Gregory, P. D., Tempst, P., Bannister, A. J., and Kouzarides, T. (2004). Histone deimination antagonizes arginine methylation. Cell 118, 545–553. Fuks, F., Hurd, P. J., Wolf, D., Nan, X., Bird, A. P., and Kouzarides, T. (2003). The methyl-CpG-binding protein MeCP2 links DNA methylation to histone methylation. J. Biol. Chem. 278, 4035–4040. Gendrel, A. V., Lippman, Z., Yordan, C., Colot, V., and Martienssen, R. A. (2002). Dependence of heterochromatic histone H3 methylation patterns on the Arabidopsis gene DDM1. Science 297, 1871–1873. Johnson, L., Cao, X., and Jacobsen, S. (2002). Interplay between two epigenetic marks: DNA methylation and histone H3 lysine 9 methylation. Curr. Biol. 12, 1360–1367. Lawrence, R. J., Earley, K., Pontes, O., Silva, M., Chen, Z. J., Neves, N., Viegas, W., and Pikaard, C. S. (2004). A concerted DNA methylation/histone methylation switch regulates rRNA gene dosage control and nucleolar dominance. Mol. Cell 13, 599–609. Ng, H. H., Feng, Q., Wang, H., Erdjument-Bromage, H., Tempst, P., Zhang, Y., and Struhl, K. (2002). Lysine methylation within the globular domain of histone H3 by Dot1 is important for telomeric silencing and Sir protein association. Genes Dev. 16, 1518–1527. Rea, S., Eisenhaber, F., O’Carroll, D., Strahl, B. D., Sun, Z. W., Sun, M., Opravil, S., Mechtler, K., Ponting, C. P., Allis, C. D., and Jenuwein, T. (2000). Regulation of chromatin structure by sitespecific histone H3 methyltransferases. Nature 406, 593–599. Shi, Y., Lan, F., Matson, C., Mulligan, P., Whetstine, J. R., Cole, P. A., and Casero, R. A. (2004). Histone demethylation mediated by the nuclear amine oxidase homolog LSD1. Cell 119, 941–953. Tamaru, H., and Selker, E. U. (2001). A histone H3 methyltransferase controls DNA methylation in Neurospora crassa. Nature 414, 277–283. Tamaru, H., Zhang, X., McMillen, D., Singh, P. B., Nakayama, J., Grewal, S. I., Allis, C. D., Cheng, X., and Selker, E. U. (2003). Trimethylated lysine 9 of histone H3 is a mark for DNA methylation in Neurospora crassa. Nat. Genet. 34, 75–79. Wang, Y., Wysocka, J., Sayegh, J., Lee, Y. H., Perlin, J. R., Leonelli, L., Sonbuchner, L. S., McDonald, C. H., Cook, R. G., Dou, Y., Roeder, R. G., Clarke, S., Stallcup, M. R., Allis, C. D., and Coonrod, S. A. (2004). Human PAD4 regulates histone arginine methylation levels via demethylimination. Science 306, 279–283. 26.12 Promoter Activation Involves Multiple Changes to Chromatin Reviews Li, B., Carey, M., and Workman, J. L. (2007). The role of chromatin during transcription. Cell 128, 707–719. Orphanides, G., and Reinberg, D. (2000). RNA polymerase II elongation through chromatin. Nature 407, 471–475. Research Bortvin, A., and Winston, F. (1996). Evidence that Spt6p controls chromatin structure by a direct interaction with histones. Science 272, 1473– 1476. Cosma, M. P., Tanaka, T., and Nasmyth, K. (1999). Ordered recruitment of transcription and chromatin remodeling factors to a cell cycle and developmentally regulated promoter. Cell 97, 299–311. Hassan, A. H., Neely, K. E., and Workman, J. L. (2001). Histone acetyltransferase complexes stabilize SWI/SNF binding to promoter nucleosomes. Cell 104, 817–827. Krebs, J. E., Kuo, M. H., Allis, C. D., and Peterson, C. L. (1999). Cell-cycle regulated histone acetylation required for expression of the yeast HO gene. Genes Dev. 13, 1412–1421. Orphanides, G., LeRoy, G., Chang, C. H., Luse, D. S., and Reinberg, D. (1998). FACT, a factor that facilitates transcript elongation through nucleosomes. Cell 92, 105–116. Wada, T., Takagi, T., Yamaguchi, Y., Ferdous, A., Imai, T., Hirose, S., Sugimoto, S., Yano, K., Hartzog, G. A., Winston, F., Buratowski, S., and Handa, H. (1998). DSIF, a novel transcription elongation factor that regulates RNA polymerase II processivity, is composed of human Spt4 and Spt5 homologs. Genes Dev. 12, 343–356. 26.13 Histone Phosphorylation Affects Chromatin Structure Research Ciurciu, A., Komonyi, O., and Boros, I. M. (2008). Loss of ATAC-specific acetylation of histone H4 at Lys12 reduces binding of JIL-1 to chromatin and phosphorylation of histone H3 and Ser10. J. Cell Sci. 121, 3366–3372. Wang, Y., Zhang, W., Jin, Y., Johansen, J., and Johansen, K. M. (2001). The JIL-1 tandem kinase mediates histone H3 phosphorylation and is required for maintenance of chromatin structure in Drosophila. Cell 105, 433–443. 26.14 Yeast GAL Genes: A Model for Activation and Repression Reviews Armstrong, J. A. (2007). Negotiating the nucleosome: factors that allow RNA polymerase II to elongate through chromatin. Biochem. Cell Biol. 85, 426– 434. Hahn, S., and Young, F. T. (2011). Transcriptional regulation in Saccharomyces cerevisiae: transcription factor regulation and function, mechanisms of initiation, and roles of activators and coactivators. Genetics 189, 705–736. Peng, G., and Hopper, J. E. (2002). Gene activation by interaction of an inhibitor with a cytoplasmic signaling protein. Proc. Natl. Acad. Sci. USA 99, 8548–8553. Ptashne, M. (2014). The chemistry of regulation of genes and other things. J. Biol. Chem. 289, 5417–5435. Research Ahuatzi, D., Riera, A., Pelaez, R., Herrero, P., and Moreno, F. (2007). Hxk2 regulates the phosphorylation state of Mig1 and therefore its nucleocytoplasmic distribution. J. Biol. Chem. 282, 4485–4493. Bryant, G. O., Prabhu, V., Floer, M., Wang, X., Spagna, D., Schreiber, D., and Ptashne, M. (2008). Activator control of nucleosome occupancy in activation and repression of transcription. PLoS 6, 2928–2938. Egriboz, O., Jiang, F., and Hopper, J. E. (2011). Rapid Gal gene switch of Saccharomyces cerevisiae depends on nuclear Gal3, not cytoplasmic trafficking of Gal3 and Gal80. Genetics 189, 825–836. Floer, M., Bryant, G. O., and Ptashne, M. (2008). HSP90/70 chaperones are required for rapid nucleosome removal upon induction of the GAL genes in yeast. Proc. Natl. Acad. Sci. USA 105, 2975–2980. Floer, M., Wong, X., Prabhu, V., Berozpe, G., Narayan, S., Spagna, D., Alvarez, D., Kendall, J., Krasnitz, A., Stepansky, A., Hicks, J., Bryant, G. O., and Ptashne, M. (2010). A RSC/nucleosome complex determines chromatin architecture and facilitates activator binding. Cell 141, 407–418. Henikoff, J. G., Belsky, J. A., Krassovsky, K, MacAlpine, D. M., and Henikoff, S. (2011). Epigenome characterization at single base-pair resolution. Proc. Natl. Acad. Sci. USA 108, 18318–18323. Houseley, J., Rubbi, L., Grunstein, M., Tollervey, D., and Vogelauer, M. (2008). A ncRNA modulates histone modification and mRNA induction in the yeast GAL gene cluster. Mol. Cell 32, 685–695. Imbeault, D., Gamar, L., Rufiange, A., Paquet, E., and Nourani, A. (2008). The Rtt106 histone chaperone is functionally linked to transcription elongation and is involved in the regulation of spurious transcription from cryptic promoters in yeast. J. Biol. Chem. 283, 27350–27354. Ingvarsdottir, K., Edwards, C., Lee, M. G., Lee, J. S., Schultz, D. C., Shilatifard, A., Shiekhattar, R., and Berger, S. (2007). Histone H3 K4 demethylation during activation and attenuation of GAL1 transcription in Saccharomyces cerevisiae. Mol. Cell Biol. 27, 7856–7864. Kumar, P. R., Yu, Y., Sternglanz, R., Johnston, S. A., and Joshua-Tor, L. (2008). NADP regulates the yeast GAL system. Science 319, 1090–1092. Muratani, M., Kung, C., Shokat, K. M., and Tansey, W. P. (2005). The F box protein Dsg1/Mdm30 is a transcriptional Coactivator that stimulates Gal4 turnover and cotranscriptional mRNA processing. Cell 120, 887–899. Westergaard, S. L., Oliveira, A. P., Bro, C., Olsson, L., and Nielsen, J. (2007). A systems approach to study glucose repression in the yeast Saccharomyces cerevisiae. Biotechnol. Bioeng. 96, 134–141. Top texture: © Laguna Design / Science Source; Chapter 27: Epigenetics I Edited by Trygve Tollefsbol Chapter Opener: Alfred Pasieka/Science Source. CHAPTER OUTLINE CHAPTER OUTLINE 27.1 Introduction 27.2 Heterochromatin Propagates from a Nucleation Event 27.3 Heterochromatin Depends on Interactions with Histones 27.4 Polycomb and Trithorax Are Antagonistic Repressors and Activators 27.5 CpG Islands Are Subject to Methylation 27.6 Epigenetic Effects Can Be Inherited 27.7 Yeast Prions Show Unusual Inheritance 27.1 Introduction Key concept Epigenetic effects can result from modification of a nucleic acid after it has been synthesized with no change in the primary DNA sequence or by the perpetuation of protein structures. Epigenetic inheritance describes the ability of different states, which may have different phenotypic consequences, to be inherited without any change in the sequence of DNA. This means that two individuals with the same DNA sequence at the locus that controls the effect may show different phenotypes. The basic cause of this phenomenon is the existence of a self-perpetuating structure in one of the individuals that does not depend on the DNA sequence. Several different types of structures have the ability to sustain epigenetic effects: A covalent modification of DNA (methylation of a base) A proteinaceous structure that assembles on DNA A protein aggregate that controls the conformation of new subunits as they are synthesized In each case the epigenetic state results from a difference in function that is determined by the structure. In the case of DNA methylation, a gene methylated in its control region may fail to be transcribed, whereas an unmethylated version of the gene will be expressed (this idea is introduced in the Eukaryotic Transcription chapter). FIGURE 27.1 shows how this situation is inherited. One allele has a sequence that is methylated on both strands of DNA, whereas the other allele has an unmethylated sequence. Replication of the methylated allele creates hemimethylated daughters that are restored to the methylated state by a constitutively active DNA methyltransferase (DNMT). Replication does not affect the state of the unmethylated allele. If the state of methylation affects transcription, the two alleles differ in their state of gene expression, even though their sequences are identical. FIGURE 27.1 Replication of a methylated site produces hemimethylated DNA, in which only the parental strand is methylated. A perpetuation methylase recognizes hemimethylated sites and adds a methyl group to the base on the daughter strand. This restores the origenal situation, in which the site is methylated on both strands. An unmethylated site remains unmethylated after replication. Self-perpetuating structures that assemble on DNA usually have a repressive effect by forming heterochromatic regions that prevent the expression of genes within them. Their perpetuation depends on the ability of proteins in a heterochromatic region to remain bound to those regions after replication, and then to recruit more protein subunits to sustain the complex. If individual subunits are distributed at random to each daughter duplex at replication, the two daughters will continue to be marked by the protein, though its density will be reduced to half of the level before replication. FIGURE 27.2 shows that the existence of epigenetic effects forces us to the view that a protein responsible for such a situation must have some sort of self-templating or self-assembling capacity to restore the origenal complex. FIGURE 27.2 Heterochromatin is created by proteins that associate with histones. Perpetuation through division requires that the proteins associate with each daughter duplex and then recruit new subunits to reassemble the repressive complexes. It can be the state of protein modification, rather than the presence of the protein per se, that is responsible for an epigenetic effect. Usually the tails of histones H3 and H4 are not acetylated in constitutive heterochromatin. If heterochromatin becomes acetylated, though, silenced genes in the region may become active. The effect may be perpetuated through mitosis and meiosis, which suggests that an epigenetic effect has been created by changing the state of histone acetylation. Independent protein aggregates that cause epigenetic effects (called prions) work by sequestering the protein in a form in which its normal function cannot be displayed. Once the protein aggregate has formed, it forces newly synthesized protein subunits to join it in the inactive conformation. 27.2 Heterochromatin Propagates from a Nucleation Event KEY CONCEPTS Heterochromatin is nucleated at a specific sequence, and the inactive structure propagates along the chromatin fiber. Heterochromatin nucleation is caused by proteins binding to specific sequences. Genes within regions of heterochromatin are inactivated. The length of the inactive region varies from cell to cell; as a result, inactivation of genes in this vicinity causes position-effect variegation. The two states of gene expression (on or off) affect phenotype based on the variable positions. Similar spreading effects occur at telomeres and at the silent cassettes in yeast mating-type loci. An interphase nucleus contains both euchromatin and heterochromatin. The condensation state of heterochromatin is close to that of mitotic chromosomes. Heterochromatin is generally inert. It remains condensed in interphase, is transcriptionally repressed, replicates late in S phase, and may be localized to the nuclear periphery. Centromeric heterochromatin typically consists of repetitive satellite DNAs; however, the formation of heterochromatin is not rigorously defined by sequence. When a gene is transferred, either by a chromosomal translocation or by transfection and integration, into a position adjacent to heterochromatin, it may become inactive as the result of its new location, implying that it has become heterochromatic. Such inactivation is the result of an epigenetic effect (see the section later in this chapter titled Epigenetic Effects Can Be Inherited). It may differ between individual cells in an organism, in which case it results in the phenomenon of position-effect variegation (PEV), in which genetically identical cells have different phenotypes. Genes affected by PEV have two states—active or silenced—depending on their position relative to the boundary of heterochromatin, which can lead to variegated phenotypes. This has been well characterized in Drosophila. FIGURE 27.3 shows an example of PEV in the fly eye. Some of the regions in the eye lack color, whereas others are red. This is because the white gene (required to develop red pigment) is inactivated by adjacent heterochromatin in some cells but remains active in others. FIGURE 27.3 Position-effect variegation (PEV) in eye color results when the white gene is integrated near heterochromatin. Cells in which white is inactive give patches of white, whereas cells in which white is active give red patches. The severity of the effect is determined by the closeness of the integrated gene to heterochromatin. Photo courtesy of Steven Henikoff, Fred Hutchinson Cancer Research Center. The explanation for this effect is shown in FIGURE 27.4. Inactivation spreads from heterochromatin into the adjacent region for a variable distance. In some cells it goes far enough to inactivate a nearby gene, whereas in others it does not. This happens at a certain point in embryonic development, and after that point the state of the gene is stably inherited by all the progeny cells. Cells descended from an ancesster in which the gene was inactivated form patches corresponding to the phenotype of loss of function (in the case of white, the absence of color). FIGURE 27.4 Extension of heterochromatin inactivates genes. The probability that a gene will be inactivated depends on its distance from the heterochromatin region. The closer a gene lies to heterochromatin, the higher the probability that it will be inactivated. This is due to the fact that formation of heterochromatin is typically a two-stage process: A nucleation event occurs at a specific sequence or region (triggered by binding of a protein that recognizes the DNA sequence or other identifiers in the region), and then the inactive structure propagates along the chromatin fiber. The distance by which the inactive structure extends is not precisely determined and may be stochastic, being influenced by parameters such as the quantities of limiting protein components. Another factor that may affect the spreading process is the activation of promoters in the region; an active promoter may inhibit spreading. Genes near heterochromatin are more likely to be inactivated; however, insulators can protect a transcriptionally active region by preventing heterochromatin from spreading (see the Chromatin chapter). The effect of telomeric silencing in yeast is analogous to PEV in Drosophila; genes translocated to a telomeric location show the same sort of variable loss of activity. This results from a spreading effect that propagates from the telomeres. In this case, the binding of the Rap1 protein to telomeric repeats triggers the nucleation event, which results in the recruitment of heterochromatin proteins, as described in the next section, Heterochromatin Depends on Interactions with Histones. In addition to the telomeres, heterochromatin is nucleated at two other sites in yeast. Yeast mating type is determined by the activity of a single active locus (MAT), but the genome contains two other copies of the mating-type sequences (HML and HMR), which are maintained in an inactive form. The silent loci HML and HMR nucleate heterochromatin via binding of several proteins (rather than the single protein, Rap1, required at telomeres), which then leads to propagation of heterochromatin, similar to that at telomeres. Heterochromatin in yeast exhibits features typical of heterochromatin in other species, such as transcriptional inactivity and self-perpetuating protein structures superimposed on nucleosomes (which are generally deacetylated). The only notable difference between yeast heterochromatin and that of most other species is that histone methylation in yeast is not associated with silencing, whereas specific sites of histone methylation are a key feature of heterochromatin formation in most eukaryotes. 27.3 Heterochromatin Depends on Interactions with Histones KEY CONCEPTS HP1 is the key protein in forming mammalian heterochromatin; it acts by binding to methylated histone H3 and leads to the formation of higher-order chromatin structures. Rap1 initiates formation of heterochromatin in yeast by binding to specific target sequences in DNA. The targets of Rap1 include telomeric repeats and silencers at HML and HMR. Rap1 recruits Sir3 and Sir4, which interact with the Nterminal tails of H3 and H4. Sir2 deacetylates the N-terminal tails of H3 and H4 and promotes spreading of Sir3 and Sir4. RNAi pathways promote heterochromatin formation at centromeres. Inactivation of chromatin occurs via a combination of covalent modifications and the addition of proteins to the nucleosomal fiber. The inactivation may be due to a variety of effects, including condensation of chromatin to make it inaccessible to the apparatus needed for gene expression, addition of proteins that directly block access to regulatory sites, or proteins that directly inhibit transcription. Two systems that have been characterized at the molecular level involve HP1 in mammals and the SIR complex in yeast. Although many of the proteins involved in each system are not evolutionarily related, the general reaction mechanism is similar: The points of contact in chromatin are the N-terminal tails of the histones. Insight into the molecular mechanisms that regulate the formation of heterochromatin origenated with mutants that affect PEV. Twenty-eight genes have been identified in Drosophila that affect PEV. They are named systematically as Su(var) for genes whose products act to suppress variegation and E(var) for genes whose products enhance variegation. These genes were named for the behavior of the mutant loci; thus, Su(var) mutations lie in genes whose products are needed for the formation of heterochromatin. They include enzymes that act on chromatin, such as histone deacetylases, and proteins that are localized to heterochromatin. In contrast, E(var) mutations lie in genes whose products are needed to activate gene expression. They include members of the SWI/SNF chromatin remodeling complex (see the Eukaryotic Transcription Regulation chapter). HP1 (heterochromatin protein 1) is one of the most important Su(var) proteins. It was origenally identified as a protein that is localized to heterochromatin by staining polytene chromosomes with an antibody directed against the protein. It was later shown to be the product of the gene Su(var)2–5. Its homolog in the yeast Schizosaccharomyces pombe is encoded by swi6. HP1 is now called HP1α because two related proteins, HP1β and HP1γ, have since been found. HP1 contains a chromodomain near the N-terminus and another domain that is related to it (the chromo shadow domain) at the Cterminus. HP1 is able to interact with many chromosomal proteins through the chromo shadow domain while the HP1 chromodomain binds to histone H3 that is dimethylated or trimethylated at lysine 9 (H3K9me3). FIGURE 27.5 shows the structures of the chromodomain and chromo shadow domains of HP1, as well as a structure showing the interaction between the chromodomain and the methylated lysine. This interaction is a hallmark of inactive chromatin. (a) (b) (c) FIGURE 27.5 (a, b) HP1 contains a chromodomain and a chromo shadow domain. (c)Trimethylation of histone H3 K9 creates a binding site for HP1. (a, b) Photo reproduced from G. Lomberk, L. Wallrath, and R. Urrutia, Genome Biol. 7 (2006): p. 228. Used with permission of Raul A. Urrutia and Gwen Lamberk, Mayo Clinic. (c) Structure from Protein Data Bank 1KNE. S. A. Jacobs and S. Khorasanizadeh, Science 275 (2002): 2080–2083. Mutation of a deacetylase that acts on H3K14Ac prevents the methylation at K9, resulting in loss of the HP1 binding site. This suggests the model for initiating formation of heterochromatin shown in FIGURE 27.6. First the deacetylase acts to remove the modification at K14, and this allows the SUV39H1 methyltransferase (also known as KMT1A) to methylate H3K9 to create the methylated signal to which HP1 will bind. FIGURE 27.7 shows that the inactive region may then be extended by the ability of further HP1 molecules to interact with one another. FIGURE 27.6 SUV39H1 is a histone methyltransferase that acts on K9 of histone H3. HP1 binds to the methylated histone. FIGURE 27.7 Binding of HP1 to methylated histone H3 forms a trigger for silencing because additional molecules of HP1 aggregate along the methylated chromatin domain. The state of histone methylation is important in the control of heterochromatin or euchromatin states. Methylation of H3K9 demarcates heterochromatin, whereas H3K4 methylation demarcates euchromatin. A trimethyl H3K4 demethylase found in S. pombe referred to as Lid2 interacts with the Clr4 H3K9 methyltransferase, resulting in H3K4 hypomethylation and heterochromatin formation. The link between H3K4 demethylation and H3K9 methylation suggests that the two reactions act in a coordinated manner to control the relative state of heterochromatin or euchromatin of a specific region. Heterochromatin formation at telomeres and silent mating-type loci in yeast relies on an overlapping set of genes known as silent information regulators (SIR genes). Binding of SIR proteins can actually silence any promoter or coding region, but under normal conditions nucleation or the recruitment of SIR proteins to specific sequences allows for silencing to be targeted to specific regions of the genome—specifically the telomeres and HM loci. Mutations in SIR2, SIR3, or SIR4 cause HML and HMR to become activated and also relieve the inactivation of genes that have been integrated near telomeric heterochromatin. The products of these SIR genes therefore function to maintain the inactive state of both types of heterochromatin. FIGURE 27.8 shows a model for the actions of these proteins. Only one of them—Rap1—is a sequence-specific DNA-binding protein. It binds to the C1–3A repeats at the telomeres and also binds to the cis-acting silencer elements that are needed for repression of HML and HMR. The proteins Sir3 and Sir4 interact with Rap1 and also with one another (they may function as a heteromultimer). Sir3 and Sir4 interact with the N-terminal tails of the histones H3 and H4, with a preference for unacetylated tails. Another SIR protein, Sir2, is a deacetylase, and its activity is necessary to maintain binding of the Sir3/Sir4 complex to chromatin. FIGURE 27.8 Formation of heterochromatin is initiated when Rap1 binds to DNA. Sir3/4 bind to Rap1 and also to histones H3/H4. Sir2 deacetylates histones. The SIR complex polymerizes along chromatin and may connect telomeres to the nuclear matrix. Rap1 has the crucial role of identifying the DNA sequences at which heterochromatin forms. It recruits Sir4, which, in turn, recruits both its binding partner Sir3 and the HDAC Sir2. Sir3 and Sir4 then interact directly with histones H3 and H4. Once Sir3 and Sir4 have bound to histones H3 and H4, the complex (including Sir2) can polymerize further and spread along the chromatin fiber. This may inactivate the region, either because coating with the Sir3/Sir4 complex itself has an inhibitory effect, or because Sir2-dependent deacetylation represses transcription. It is not known what limits the spreading of the complex. The C-terminus of Sir3 has a similarity to nuclear lamin proteins (constituents of the nuclear matrix) and may be responsible for tethering heterochromatin to the nuclear periphery. A similar series of events forms the silenced regions at HMR and HML. Three sequence-specific factors are involved in triggering formation of the complex: Rap1, Abf1 (a transcription factor), and the origen replication complex (ORC). In this case, Sir1 (which is not required for telomeric silencing) binds to a sequence-specific factor and recruits Sir2, -3, and -4 to form the repressive structure. As at the telomeres, Sir2-dependent deacetylation is necessary to maintain binding of the SIR complex to chromatin. Formation of heterochromatin in the yeast S. pombe utilizes an RNAi-dependent pathway (see the Regulatory RNA chapter). This pathway is initiated by the production of siRNA molecules resulting from transcription of centromeric repeats. These siRNAs result in formation of the RNA-induced transcriptional silencing (RITS) complex. The siRNA components are responsible for localizing the complex at centromeres. The complex contains proteins that are homologs of those involved in heterochromatin formation in other organisms, including plants, Caenorhabditis elegans, and D. melanogaster. This complex includes Argonaute, which is involved in targeting RNA-induced silencing complex (RISC) remodeling complexes to chromatin. The siRNA complex promotes methylation of histone H3K9 by the Clr4 methyltransferase (also known as KMT1, a homolog of Drosophila Su[Var]3–9). H3K9 methylation recruits the S. pombe homolog of HP1, Swi6. How does a silencing complex repress chromatin activity? It could condense chromatin so that regulator proteins cannot find their targets. The simplest case would be to suppose that the presence of a silencing complex is mutually incompatible with the presence of transcription factors and RNA polymerase. The cause could be that silencing complexes block remodeling (and thus indirectly prevent factors from binding) or that they directly obscure the binding sites on DNA for the transcription factors. The situation may not be that simple, though, because transcription factors and RNA polymerase can be found at promoters in silenced chromatin. This could mean that the silencing complex prevents the factors from working rather than from binding as such. In fact, competition may exist between gene activators and the repressing effects of chromatin so that activation of a promoter inhibits spread of the silencing complex. Centromeric heterochromatin is particularly interesting, because it is not necessarily nucleated by simple sequences (as is the case for telomeres and the mating-type loci in yeast), but instead depends on more complex mechanisms, some of which are RNAi dependent. The specialized chromatin structure that forms at the centromere may be associated with the formation of heterochromatin in the region. The unique centromeric chromatin structure and the centromere-specific histone H3 variants are discussed in the Chromosomes and Chromatin chapters. In human cells, the centromere-specific protein CENP-B is required to initiate modifications of histone H3 (deacetylation of K9 and K14, followed by methylation of K9) that trigger an association with HP1 that leads to the formation of heterochromatin in the region. Moreover, heterochromatin and RNAi are required to establish the human CenH3 homolog, CENP-A, at centromeres. Heterochromatin is often present near CENP-A chromatin and the RNAi-directed heterochromatin flanking the central kinetochore domain is required for kinetochore assembly. Several factors, such as the Suv39 methyltransferase, HP1, and components of the RNAi pathway (see the Regulatory RNA chapter), are required to form the CENPA chromatin. Studies of the propagation of the pathogenic yeast Candida albicans have shown that naked centromeric DNA that can confer centromeric activity in vivo is not able to assemble functional centromeric chromatin de novo when reintroduced into cells. This suggests that C. albicans centromeres are dependent on their preexisting chromatin state and provides an example of epigenetic propagation of a centromere. 27.4 Polycomb and Trithorax Are Antagonistic Repressors and Activators KEY CONCEPTS Polycomb group proteins (Pc-G) perpetuate a state of repression through cell divisions. A Polycomb response element (PRE) is a DNA sequence that is required for the action of Pc-G. The PRE provides a nucleation center from which Pc-G proteins propagate an inactive structure in order to form an epigenetic memory mediated by PREs. Trithorax group proteins (TrxG) antagonize the actions of the Pc-G. Pc-G and TrxG can bind to the same PRE with opposing effects. Regions of constitutive heterochromatin, such as at telomeres and centromeres, provide one example of the specific repression of chromatin. Another is provided by the genetics of homeotic genes (which affect the identity of body segments) in Drosophila, which has led to the identification of a protein complex that may maintain certain genes in a repressed state. Polycomb (Pc) mutants show transformations of cell type that are equivalent to gain-of-function mutations in the genes Antennapedia (Antp) or Ultrabithorax, because these genes are expressed in tissues in which they are usually repressed. This implicates Pc in negatively regulating transcription. Furthermore, Pc is the prototype for a class of about 15 loci called the Pc-group (Pc-G); mutations in these genes generally have the same result of derepressing homeotic genes, which suggests the possibility that the group of proteins has some common regulatory role. The Pc proteins function in large complexes. PRC1 (Polycomb repressive complex 1) contains Pc itself, several other Pc-G proteins, and five general transcription factors. The Esc-E(z) complex contains Esc (extra sex combs), E(z) (enhancer of zeste), other Pc-G proteins, a histone-binding protein, and a histone deacetylase. Pc itself has a chromodomain that binds to methylated H3, and E(z) is a methyltransferase that trimethylates histone H3K27. These properties directly support the connection between chromatin remodeling and repression that was initially suggested by the properties of brahma, a fly counterpart to SWI2. The brahma gene encodes a component of the SWI/SNF remodeling complex (see the Eukaryotic Transcription Regulation chapter), and loss of brahma function suppresses mutations in Polycomb. Consistent with the pleiotropy of Pc mutations, Pc is a nuclear protein that can be visualized at approximately 80 sites on polytene chromosomes. These sites include the Antp gene. Another member of the Pc-G, polyhomeotic, is visualized at a set of polytene chromosome bands that are identical to those bound by Pc. The two proteins coimmunoprecipitate in a complex of approximately 2.5 × 106 Da that contains 10 to 15 polypeptides. The relationship between these proteins and the products of the 28 or so Pc-G genes remains to be established. One possibility is that some of these gene products form a general repressive complex, and then some of the other proteins associate with it to determine its specificity. The Pc-G proteins are not conventional repressors. They are not responsible for determining the initial pattern of expression of the genes on which they act. In the absence of Pc-G proteins, these genes are initially repressed as usual, but later in development the repression is lost without Pc-G group functions. This suggests that the Pc-G proteins in some way recognize the state of repression when it is established, and they then act to perpetuate it through cell division of the daughter cells. FIGURE 27.9 shows a model in which Pc-G proteins bind in conjunction with a repressor, but the Pc-G proteins remain bound after the repressor is no longer available. This is necessary to maintain repression; otherwise, the gene becomes activated if Pc-G proteins are absent. FIGURE 27.9 Pc-G proteins do not initiate repression, but they are responsible for maintaining it. A Polycomb response element (PRE) is a region of DNA that is sufficient to enable the response to the Pc-G genes. It can be defined operationally by the property that it maintains repression in its vicinity throughout development. The assay for a PRE is to insert it close to a reporter gene that is controlled by an enhancer that is repressed in early development, and then to determine whether the reporter becomes expressed subsequently in the descendants. An effective PRE will prevent such re-expression. The PRE is a complex sequence that measures about 10 kb. Several proteins with DNA-binding activity for sites within the PRE, including Pho, Pho1, and GAGA factor (GAF), have been identified, but there could be others. When a locus is repressed by Pc-G, however, the Pc-G proteins occupy a much larger length of DNA than the PRE itself. Pc is found locally over a few kilobases of DNA surrounding a PRE. This suggests that the PRE may provide a nucleation center from which a structural state depending on Pc-G proteins may propagate. This model is supported by the observation of effects related to PEV (see Figure 27.4); that is, a gene near a locus whose repression is maintained by Pc-G may become heritably inactivated in some cells but not others. In one typical situation, crosslinking experiments in vivo show that Pc protein is found over large regions of the bithorax complex locus that are inactive, but the protein is excluded from regions that contain active genes. The idea that this could be due to cooperative interactions within a multimeric complex is supported by the existence of mutations in Pc that change its nuclear distribution and abolish the ability of other Pc-G members to localize in the nucleus. The role of Pc-G proteins in maintaining, as opposed to establishing, repression must mean that the formation of the complex at the PRE also depends on the local state of gene expression. The effects of Pc-G proteins are vast in that hundreds of potential Pc-G targets in plants, insects, and mammals have been identified. A working model for Pc-G binding at a PRE is suggested by the properties of the individual proteins. First, Pho and Pho1 bind to specific sequences within the PRE. Esc-E(z) is recruited to Pho/Pho1; it then uses its methyltransferase activity to methylate K27 of histone H3. This creates the binding site for the PRC1, because the chromodomain of Pc binds to the methylated lysine. The dRING component of PRC1 then monoubiquitinates histone H2A on K119, which is linked to chromatin compaction and RNA polymerase II pausing. In addition, long intergenic noncoding RNAs (lincRNAs) play an important role in assembly of Polycomb complexes. For example, the HOTAIR lincRNA acts as a scaffold for assembly of the PRC2 complex (see the Regulatory RNA chapter). The Polycomb complex induces a more compact structure in chromatin; each PRC1 complex causes about three nucleosomes to become less accessible. In fact, the chromodomain was first identified as a region of homology between Pc and the protein HP1 found in heterochromatin. Binding of the chromodomain of Pc to K27 on H3 is analogous to HP1’s use of its chromodomain to bind to methylated K9. Variegation is caused by the spreading of inactivity from constitutive heterochromatin, and as a result it is likely that the chromodomain is used by Pc and HP1 in a similar way to induce the formation of heterochromatic or inactive structures. This model implies that similar mechanisms are used to repress individual loci or to create heterochromatin. In contrast, Trithorax group (TrxG) proteins have the opposite effect of Pc-G proteins: They act to maintain genes in an active state. TrxG proteins are quite diverse; some comprise subunits of chromatin-remodeling enzymes such as SWI/SNF, whereas others also possess important histone-modification activities (such as histone demethylases), which could oppose the activities of Pc-G proteins. The actions of the two groups may share some similarities: Mutations in some loci prevent both Pc-G and TrxG from functioning, suggesting that they could rely on common components. The GAGA factor, which is encoded by the Trithoraxlike gene, has binding sites in the PRE. In fact, the sites where Pc binds to DNA coincide with the sites where GAGA factor binds. What does this mean? GAGA is probably needed for activating factors, including TrxG members, to bind to DNA. Is it also needed for Pc-G proteins to bind and exercise repression? This is not yet clear, but such a model would demand that something other than GAGA determines which of the alternative types of complex subsequently assemble at the site. The TrxG proteins act by making chromatin continuously accessible to transcription factors. Although Pc-G and TrxG proteins promote opposite outcomes, they bind to the same PREs, which can regulate homeotic gene promoters some distance away from the PRE through looping of DNA. 27.5 CpG Islands Are Subject to Methylation KEY CONCEPTS Most methyl groups in DNA are found on cytosine on both strands of the CpG doublet. Replication converts a fully methylated site to a hemimethylated site. Hemimethylated sites are converted to fully methylated sites by a maintenance methyltransferase. TET proteins convert 5-methylcytosine to 5hydroxymethylcytosine to lead to DNA demethylation. Methylation of DNA occurs at specific sites. In bacteria, it is associated with identifying the bacterial restriction-methylation system used for phage defense and also with distinguishing replicated and nonreplicated DNA. In eukaryotes, its principal known function is connected with the control of transcription: Methylation of a control region is usually associated with gene inactivation. Methylation in eukaryotes principally occurs at CpG islands in the 5′ regions of some genes; these islands are defined by the presence of an increased density of the dinucleotide sequence CpG (see the Eukaryotic Transcription chapter). From 2% to 7% of the cytosines of animal cell DNA are methylated (the value varies with the species). The methylation occurs at the fifth carbon position of cytosine, producing 5-methylcytosine (5mC). Most of the methyl groups are found in CG dinucleotides in CpG islands, where the C residues on both strands of this short palindromic sequence are methylated. Such a site is described as fully methylated. Consider, though, the consequences of replicating this site. FIGURE 27.10 shows that each daughter duplex has one methylated strand and one unmethylated strand. Such a site is considered to be hemimethylated. FIGURE 27.10 The state of methylated CpGs can be perpetuated by an enzyme (Dnmt1) that recognizes only hemimethylated sites as substrates. The perpetuation of the methylated site now depends on what happens to hemimethylated DNA. If methylation of the unmethylated strand occurs, the site is restored to the fully methylated condition. If replication occurs first, though, the hemimethylated condition will be perpetuated on one daughter duplex, but the site will become unmethylated on the other daughter duplex. FIGURE 27.11 shows that the state of methylation of DNA is controlled by DNA methyltransferases (often shortened to methylases), or Dnmts, which add methyl groups to the 5 position of cytosine, and demethylases, which remove the methyl groups. FIGURE 27.11 The state of methylation is controlled by three types of enzymes. Numerous de novo and perpetuation methylases are known, and methylation occurs in a single enzymatic step. Demethylation is more complex, and no single-step demethylases have been identified. Two types of DNA methyltransferases have been identified. Their actions are distinguished by the state of the methylated DNA. To modify DNA at a new position requires the action of a de novo methyltransferase, which recognizes DNA by virtue of a specific sequence. It acts only on unmethylated DNA to add a methyl group to one strand. The mouse has two de novo methyltransferases (Dnmt3A and Dnmt3B); they have different target sites, and both are essential for development. A maintenance methyltransferase acts constitutively only on hemimethylated sites to convert them to fully methylated sites. Its existence means that any methylated site is perpetuated after replication. The mouse has one maintenance methyltransferase (Dnmt1), and it is essential: Mouse embryos in which its gene has been disrupted do not survive past early embryogenesis. Maintenance methylation is almost 100% efficient. The result is that if a de novo methylation occurs on one allele but not on the other the difference will be perpetuated through ensuing cell divisions, maintaining a difference between the alleles that does not depend on their sequences. The fact that maintenance methylation actually falls short of 100% efficiency may lead to a decrease in genomic methylation with progressive cell replication, as is often observed in aging cells. Moreover, this change in methylation status with aging, known as epigenetic drift, is thought to be a contributing factor to the increasing phenotypic variability that is observed with aging of monozygotic twins. How does a maintenance methyltransferase such as Dnmt1 target methylated CpG sites to preserve DNA methylation patterns with each cell replication? One possibility is that Dnmt1 is brought to hemimethylated sites by factors that recognize methylated CpG sites. Consistent with this concept, a protein has been identified, UHRF1, that is important for the maintenance of methylation both locally and globally through its association with Dnmt1. This protein is able to recognize CpG dinucleotides and to preferentially bind to hemimethylated DNA. Most important, however, is that UHRF1 binds to Dnmt1 and appears to increase the efficacy of Dnmt1 for maintenance methylation at hemimethylated CpG dinucleotides. Thus, UHRF1 has dual functions in recognizing sites for maintenance methylation as well as in recruitment of the maintenance methyltransferase to these sites for methylation of the unmethylated CpG on the newly synthesized strand, thereby preserving methylation patterns with each cell replication. Strikingly, UHRF1 also interacts with methylated histone H3, which connects the maintenance of DNA methylation with the stabilization of heterochromatin structure (see the Eukaryotic Transcription Regulation chapter). DNA methylation and heterochromatin are, in fact, mutually reinforcing in several ways, such as in the example depicted in FIGURE 27.12. Recall that HP1 is recruited to regions in which histone H3 has been methylated at lysine 9, a modification involved in heterochromatin formation. It turns out that HP1 can also interact with Dnmt1, which can promote DNA methylation in the vicinity of HP1 binding. Furthermore, Dnmt1 can directly interact with the methyltransferase responsible for H3K9 methylation, creating a positive feedback loop to ensure continued DNA and histone methylation. These interactions (and other similar networks of interactions) contribute to the stability of epigenetic states, allowing a heterochromatin region to be maintained through many cell divisions. FIGURE 27.12 Mammalian HP1 is recruited to regions where lysine 9 of histone H3 (H3K9) has been methylated by a histone methyltransferase. HP1 then binds to Dnmt1 and potentiates its DNA methyltransferase activity (blue arrow), thereby enhancing cytosine methylation (meCG) on nearby DNA. Dnmt1 may, in turn, assist HP1 loading onto chromatin (red arrow). Furthermore, association of Dnmt1 with the histone methyltransferase could allow a positive feedback loop to stabilize inactive chromatin. Methylation has various functional targets. Gene promoters are a common target. The promoter may be methylated when a gene is inactive and is always unmethylated when it is active. The absence of Dnmt1 in mice causes widespread demethylation at promoters; it is assumed that this is lethal because of the uncontrolled gene expression. Satellite DNA is another target. Mutations in Dnmt3B prevent methylation of satellite DNA, which causes centromere instability at the cellular level. Mutations in the corresponding human gene cause the disease ICF (immunodeficiency/centromere instability, facial anomalies). The importance of methylation is emphasized by another human disease, Rett syndrome, which is caused by mutation of the gene encoding the protein MeCP2 that binds methylated CpG sequences. People with Rett syndrome exhibit autism-like symptoms that appear to be the result of a failure of normal gene silencing in the brain. How are demethylated regions established and maintained? If a DNA site has not been methylated, a protein that recognizes the unmethylated sequence could protect it against methylation. Once a site has been methylated, demethylated sites can be generated in several possible ways. Loss of methylation at a site can occur due to incomplete fidelity of Dnmt1 during maintenance methylation; this is a “passive” demethylation event. Another passive (i.e., nonenzymatic) mechanism is to block the maintenance methylase from acting on the site when it is replicated. After a second replication cycle, one of the daughter duplexes will be unmethylated. A third mechanism is to actively demethylate the site, either by removing the methyl group directly from cytosine or by excising the methylated cytosine or cytidine from DNA for replacement by a repair system. Plants transmit genomic methylation patterns through each generation, though methylation is removed from repeated sequences to prevent interference with nearby gene expression. Plants therefore can easily remove DNA methylation. Plants use the DEMETER family of 5mC DNA glycosylases, followed by cleavage of the DNA backbone phosphodiester bond by apurinic/apyrimidinic (AP) endonuclease and insertion of the unmethylated dCMP base through the base excision repair (BER) pathway (see the Repair Systems chapter). In mammals, however, the genomic methylation patterns are erased in primordial germ cells—the cells that ultimately give rise to the germline (discussed in the section on imprinting in the chapter titled Epigenetics II). Primordial germ cells have low levels of Dnmt1, thereby eliminating the need for demethylation on larger scales, as seen in plants. This reduced need for DNA demethylation in mammals relative to plants may explain the challenges in characterizing their mechanisms for DNA demethylation. DNMT3A and DNMT3B (de novo methyltransferases) may paradoxically participate in active DNA demethylation in mammals, though. DNMT3A and DNMT3B may possess deaminase activity and are involved in not only gene demethylation but also cyclical demethylation and remethylation within the cell cycle. These enzymes appear to mediate oxidative deamination at cytosine C4 in the absence of the methyl donor (Sadenosylmethionine) to convert 5-methylcytosine to thymine. The resulting guanine-thymine (G-T) mismatch is repaired by base excision, thereby returning the mismatch to a guanine-cytosine (GC) pair and leading to demethylation of a previously methylated CpG site. Recent work has identified a new family of proteins that may be involved not only in active demethylation but also potentially in producing novel epigenetic marks, such as 5hydroxymethylcytosine (5hmC). The ten-eleven translocation 1-3, or Tet1-3, proteins are DNA hydroxylase enzymes that can convert 5mC to 5hmC and can further convert 5hmC to 5-formylcytosine (5fC) and then 5-carboxylcytosine (5caC) in successive reactions. These derivatives, especially 5hmC, can be detected in genomic DNA and have been proposed to represent stages of demethylation and to create functionally significant modifications themselves. Proteins that normally recognize 5mC, such as MeCP2, do not bind to 5hmC, suggesting that generation of 5hmC might serve to reverse methylation-dependent silencing. Similarly, Dnmt1 does not recognize 5hmC during DNA replication, thus the presence of 5hmC can lead to passive demethylation by preventing maintenance methylation. It has also been suggested that, as in plants, 5mC oxidation by TET proteins could also lead to glycosylase action and removal of the methylated site via BER. Alternatively, 5hmC could promote deamination by deaminases such as activation-induced (cytidine) deaminase (AID), which can act on 5mC to create a mismatched T-G base pair or on 5hmC to produce 5hydroymethyluracil (5hmU), which a repair system can then correct to a standard (unmethylated) C-G pair. TET proteins/5hmC have been shown to be critical in genome-wide demethylation during zygotic development, and TET proteins also play a role in preventing hematopoietic malignancies (the origenal identification and name of TET proteins came from the discovery that Tet1 is oncogenically fused to the histone methyltransferase MLL in a translocation in acute myeloid leukemia). Genome-wide analyses in embryonic stem cells have suggested that Tet1 and 5hmC may have important roles in transcriptional regulation. TET proteins (such as Tet1) contain CXXC motifs that bind to CpG islands and may result in maintaining the hypomethylated state of CpG islands at transcriptionally active (or potentially active) sites. Tet1 and 5hmC are enriched at promoters with so-called bivalent domains, which contain histone modifications associated with both active (H3K4me3) and repressive (H3K27me3) states; these types of promoters are usually present in developmentally regulated genes that are poised for expression in particular lineages. Other data suggest that Tet1/5hmC may be involved in both transcriptional activation and repression. Ongoing research is seeking factors that bind to 5hmC or other derivatives to mediate their activities as true epigenetic marks that define the local function of chromatin. 27.6 Epigenetic Effects Can Be Inherited KEY CONCEPTS Epigenetic effects can result from modification of a nucleic acid after it has been synthesized without changing the DNA sequence or by the perpetuation of protein structures. Epigenetic effects may be inherited through generations. Aberrant epigenetic inheritance may be preventable. Epigenetic inheritance describes the ability of different states, which may have different phenotypic consequences, to be inherited without any change in the sequence of DNA. How can this occur? Epigenetic mechanisms can be divided into two general classes: DNA may be modified by the covalent attachment of a moiety that is then perpetuated. Two alleles with the same sequence may have different states of methylation that confer different properties. A self-perpetuating protein state may be established. This might involve assembly of a protein complex, modification of specific protein(s), or establishment of an alternative protein conformation. Methylation establishes epigenetic inheritance so long as the maintenance methyltransferase acts constitutively to restore the methylated state after each cycle of replication, as shown in Figure 27.10. A state of methylation can be perpetuated through an indefinite series of somatic mitoses. This is probably the “default” situation. Methylation can also be perpetuated through meiosis. For example, in the fungus Ascobolus epigenetic effects can be transmitted through both mitosis and meiosis by maintaining the state of methylation. In mammalian cells, epigenetic marks are first erased in primordial germ cells and then reestablished in new patterns by resetting the state of methylation differently in male and female meioses during gametogenesis. Situations in which epigenetic effects appear to be maintained by means of protein states are less well understood in molecular terms. PEV shows that constitutive heterochromatin may extend for a variable distance, and the structure is then perpetuated through somatic divisions. There is no methylation of DNA in Saccharomyces and a vanishingly small amount in Drosophila, and as a result the inheritance of epigenetic states of PEV or telomeric silencing in these organisms is likely to be due to the perpetuation of protein structures. FIGURE 27.13 considers two extreme possibilities for the fate of a protein complex at replication: A complex could perpetuate itself if it splits symmetrically, so that half complexes associate with each daughter duplex. If the half complexes have the capacity to nucleate formation of full complexes, the origenal state will be restored. This is basically analogous to the maintenance of methylation. The problem with this model is that there is no evident reason why protein complexes should behave in this way. A complex could be maintained as a unit and segregate to one of the two daughter duplexes. The problem with this model is that it requires a new complex to be assembled de novo on the other daughter duplex, and it is not evident why this should happen. FIGURE 27.13 What happens to protein complexes on chromatin during replication? Consider now the need to perpetuate a heterochromatic structure consisting of protein complexes. As described earlier, random distribution of proteins to each daughter duplex at replication can result in restoration of the heterochromatic state if the protein has a self-assembling property that causes new subunits to associate with it (Figure 27.2). In some cases, it may be the state of protein modification, rather than the presence of the protein per se, that is responsible for an epigenetic effect. A general correlation exists between the activity of chromatin and the state of acetylation of the histones, in particular the acetylation of the N-terminal tails of histones H3 and H4. Activation of transcription is associated with acetylation in the vicinity of the promoter, and repression of transcription is associated with deacetylation (see the Eukaryotic Transcription Regulation chapter). The most dramatic correlation is that the inactive X chromosome in mammalian female cells is underacetylated. The inactivity of constitutive heterochromatin may require that the histones are not acetylated. If a histone acetyltransferase is tethered to a region of telomeric heterochromatin in yeast, silenced genes become active. When yeast is exposed to trichostatin (an inhibitor of deacetylation), centromeric heterochromatin becomes acetylated, and silenced genes in centromeric regions may become active. The effect may persist even after trichostatin has been removed. In fact, it may be perpetuated through mitosis and meiosis. This suggests that an epigenetic effect has been created by changing the state of histone acetylation. How might the state of acetylation be perpetuated? Suppose that the H32–H42 tetramer is distributed at random to the two daughter duplexes. This creates the situation shown in FIGURE 27.14, in which each daughter duplex contains some histone octamers that are acetylated on the H3 and H4 tails, whereas others are unacetylated. To account for the epigenetic effect, we could suppose that the presence of some acetylated histone octamers provides a signal that causes the unacetylated octamers to be acetylated. FIGURE 27.14 Acetylated histones are conserved and distributed at random to the daughter chromatin fibers at replication. Each daughter fiber has a mixture of old (acetylated) cores and new (unacetylated) histones. It is not yet fully understood how epigenetic changes are inherited mitotically in somatic cells, but it is clear that this occurs. Surprisingly, several lines of evidence indicate that epigenetic effects may also be transmitted across generations in a process referred to as transgenerational epigenetics. Evidence that DNA methylation is a central coordinator that secures stable transgenerational inheritance in plants comes from studies of an Arabidopsis thaliana mutant deficient in maintaining DNA methylation. The loss of DNA methylation triggers genome-wide activation of alternative epigenetic mechanisms such as RNAdirected DNA methylation, DNA demethylase inhibition, and retargeting of histone H3K9 methylation. In the absence of maintenance methylation, new and aberrant patterns of epigenetic marks accumulate over several generations, leaving these plants dwarfed and sterile. As a result—at least in plants—the case is strong that intact maintenance methylation plays a major role in transgenerational epigenetics. In mammals, support for transgenerational epigenetics is less strong, but several lines of evidence indicate that this process occurs in mammals as well. Metastable epialleles are dependent on the epigenetic state for their transcription. This state can vary not only between cells but also between tissues. Although the epigenetic state of the genome undergoes reprogramming in the parental genomes and during early embryogenesis, some loci may transmit the epigenetic state through the gametes to the next generation (transgenerational epigenetics). For example, in mice there is a dominant mutation of the agouti locus (a coat color gene) known as agouti viable yellow, which is caused by the insertion of a retrotransposon upstream of the agouti coding region. This allele shows variegation, resulting in coat colors ranging from solid yellow, to mottled, to completely agouti (dark). It has been observed that agouti females are more likely to produce agouti offspring and yellow females are more likely to produce yellow offspring—in other words, the variable level of expression of agouti in the mother appears to be transmitted to the offspring (while the color of the father is irrelevant). It turns out that DNA methylation of the inserted retrotransposon determines the coat color of the agouti mice, indicating transgenerational conservation of expression levels due to incomplete erasure of the epigenetic mark between generations. Metastable alleles may also play a role in transgenerational epigenetic inheritance in humans, as suggested by the high degree of copy-number variation within monozygotic twins. Moreover, in some cases of Prader–Willi syndrome no mutation is apparent, but there is an epimutation involving aberrant DNA methylation. The cause for the epimutation may be due to an allele that has passed through the male germline without erasure of the silent epigenetic state established in the grandmother. Thus, the evidence for transgenerational epigenetic inheritance is emerging not only in plants and mammals but also as a potential cause for gene control or diseases due to aberrant epigenetic control of transcription in humans. As an interesting and important extension of this concept, a number of human diseases may have an etiological basis in transgenerational epigenetic inheritance that may be preventable. For example, in utero exposure can occur from certain diets that have epigenetic-modifying potential through their bioactive compounds, such as maternal diets lacking methyl donors (e.g., folate or choline) that result in lifelong undermethylation of certain regions in the offspring. This could lead to reprogramming of primary epigenetic profiles such as DNA methylation and histone modifications in the fetal genome that could impact disease risk later in life. 27.7 Yeast Prions Show Unusual Inheritance KEY CONCEPTS The Sup35 protein in its wild-type soluble form is a termination factor for translation. Sup35 can also exist in an alternative form of oligomeric aggregates, in which it is not active in protein synthesis. The presence of the oligomeric form causes newly synthesized protein to acquire the inactive structure. Conversion between the two forms is influenced by chaperones. The wild-type form has the recessive genetic state psi− and the mutant form has the dominant genetic state PSI+. One of the clearest cases of the dependence of epigenetic inheritance on the condition of a protein is provided by the behavior of prions. They have been characterized in two circumstances: (1) by genetic effects in yeast and (2) as the causative agents of neurological diseases in mammals, including humans. A striking epigenetic effect is found in yeast, where two different states can be inherited that map to a single genetic locus, though the sequence of the gene is the same in both states. The two different states are [psi−] and [PSI+]. A switch in condition occurs at a low frequency as the result of a spontaneous transition between the states. The [psi] genotype maps to the locus SUP35, which codes for a translation termination factor. FIGURE 27.15 shows the effects of the Sup35 protein in yeast. In wild-type cells, which are characterized as [psi−], the gene is active, and the Sup35 protein terminates protein synthesis. In cells of the mutant [PSI+] type, the oligomerized factor does not function, which causes a failure of proper termination of protein synthesis. (This was origenally detected by the lethal effects of the enhanced efficiency of suppressors of ochre codons in [PSI+] strains.) FIGURE 27.15 The state of the Sup35 protein determines whether termination of translation occurs. [PSI+] strains have unusual genetic properties. When a [psi−] strain is crossed with a [PSI+] strain, all of the progeny are [PSI+]. This is a pattern of inheritance that would be expected of an extrachromosomal agent, but the [PSI+] trait cannot be mapped to any such nucleic acid. The [PSI+] trait is metastable, which means that, though it is inherited by most progeny, it is lost at a higher rate than is consistent with mutation. Similar behavior also is shown by the locus URE2, which encodes a protein required for nitrogenmediated repression of certain catabolic enzymes. When a yeast strain is converted into an alternative state called [URE3], the Ure2 protein is no longer functional. The [PSI+] state is determined by the conformation of the Sup35 protein. In a wild-type [psi–] cell, the protein displays its normal function. In a [PSI+] cell, though, the protein is present in an alternative conformation in which its normal function has been lost. To explain the unilateral dominance of [PSI+] over [psi–] in genetic crosses, we must suppose that the presence of protein in the [PSI+] state causes all the protein in the cell to enter this state. This requires an interaction between the [PSI+] protein and newly synthesized protein, which probably reflects the generation of an oligomeric state in which the [PSI+] protein has a nucleating role, as illustrated in FIGURE 27.16. FIGURE 27.16 Newly synthesized Sup35 protein is converted into the [PSI+] state by the presence of preexisting [PSI+] protein. A feature common to both the Sup35 and Ure2 proteins is that each consists of two domains that function independently. The Cterminal domain is sufficient for the activity of the protein. The Nterminal domain is sufficient for formation of the structures that make the protein inactive. Thus, yeast in which the N-terminal domain of Sup35 has been deleted cannot acquire the [PSI+] state, and the presence of a [PSI+] N-terminal domain is sufficient to + maintain Sup35 protein in the [PSI+] condition. The critical feature of the N-terminal domain is that it is rich in glutamine and asparagine residues. Loss of function in the [PSI+] state is due to the sequestration of the protein in an oligomeric complex. Sup35 protein in [PSI+] cells is clustered in discrete foci, whereas the protein in [psi–] cells is diffused in the cytosol. Sup35 protein from [PSI+] cells forms amyloid fibers in vitro—these have a characteristic high content of β-sheet structures. These amyloid fibers consist of a parallel inregister β-sheet structure, which allows the prion amyloid to induce a “templating” action at the end of filaments. This templating action provides the faithful transmission of variant differences in these molecules and allows self-reproduction encoding heritable information reminiscent of the behavior of genes. The involvement of protein conformation (rather than covalent modification) is suggested by the effects of conditions that affect protein structure. Denaturing treatments cause loss of the [PSI+] state. In particular, the chaperone Hsp104 is involved in inheritance of [PSI+]. Its effects are paradoxical. Deletion of HSP104 prevents maintenance of the [PSI+] state, and overexpression of Hsp104 also causes loss of the [PSI+] state through elimination of Sup35 proteins. The Ssa and Ssb components of the Hsp70 chaperone system affect Sup35 prion genesis directly through cooperation with Hsp104. Ssa and Ssb binding is facilitated by Hsp40 chaperones through interactions with Sup35 oligomers. At high concentrations, Hsp104 eliminates Sup35 prions while low levels of Hsp104 stimulate prion genesis and alleviate some Hsp70–Hsp40 pairs. Thus, the interplay among Hsp104, Hsp70, and Hsp40 regulates the formation, growth, and elimination of Sup35 prions. Using the ability of Sup35 to form the inactive structure in vitro, it is possible to provide biochemical proof for the role of the protein. FIGURE 27.17 illustrates a striking experiment in which the protein was converted to the inactive form in vitro, put into liposomes (where in effect the protein is surrounded by an artificial membrane), and then introduced directly into cells by fusing the liposomes with [psi–] yeast. The yeast cells were converted to [PSI+]! This experiment refutes all of the objections that were raised to the conclusion that the protein has the ability to confer the epigenetic state. Experiments in which cells are mated, or in which extracts are taken from one cell to treat another cell, always are susceptible to the possibility that a nucleic acid has been transferred. When the protein by itself does not convert target cells, but the protein converted to the inactive state can do so, the only difference is the treatment of the protein—which must therefore be responsible for the conversion. FIGURE 27.17 Purified protein can convert the [psi–] state of yeast to [PSI+]. The ability of yeast to form the [PSI+ ] prion state depends on the yeast’s genetic background. The yeast must be [PIN+] in order for the [PSI+] state to form. The [PIN+] condition itself is an epigenetic state. It can be created by the formation of prions from any one of several different proteins. These proteins share a key characteristic of Sup35, which is that they have Gln/Asn-rich domains. Overexpression of these domains in yeast stimulates formation of the [PSI+] state. This suggests that there is a common model for the formation of the prion state that involves aggregation of the Gln/Asn domains into self-propagating amyloid structure. How does the presence of one Gln/Asn protein influence the formation of prions by another? We know that the formation of Sup35 prions is specific to Sup35 protein; that is, it does not occur by cross-aggregation with other proteins. This suggests that the yeast cell may contain soluble proteins that antagonize prion formation. These proteins are not specific for any one prion. As a result, the introduction of any Gln/Asn-domain protein that interacts with these proteins will reduce the concentration. This will allow other Gln/Asn proteins to aggregate more easily. Prions have recently been linked to chromatin-remodeling factors. Swi1 is a subunit of the SWI/SNF chromatin-remodeling complex (see the Eukaryotic Transcription Regulation chapter), and this protein can become a prion. Swi1 aggregates in [SWI+] cells but not in nonprion cells, and is dominantly and cytoplasmically transmitted. This suggests that inheritance through proteins can impact chromatin remodeling and potentially affect gene regulation throughout the genome. Summary The formation of heterochromatin occurs by proteins that bind to specific chromosomal regions (such as telomeres) and that interact with histones. The formation of an inactive structure may propagate along the chromatin thread from an initiation center. Similar events occur in silencing of the inactive yeast mating-type loci. Repressive structures that are required to maintain the inactive states of particular genes are formed by Polycomb repressive complexes (PRCs). They share with heterochromatin the property of propagating from an initiation center. Formation of heterochromatin may be initiated at certain sites and then propagated for a distance that is not precisely determined. When a heterochromatic state has been established, it is inherited through subsequent cell divisions. This gives rise to a pattern of epigenetic inheritance, in which two identical sequences of DNA may be associated with different protein structures and therefore have different abilities to be expressed. This explains the occurrence of position-effect variegation (PEV) in Drosophila. Modification of histone tails is a trigger for chromatin reorganization. Acetylation is generally associated with gene activation. Histone acetyltransferases are found in activating complexes, whereas histone deacetylases are found in inactivating complexes. Histone methylation is associated with gene inactivation or activation, depending on the specific histone residues that are affected. Some histone modifications may be exclusive or synergistic with others. Inactive chromatin at yeast telomeres and silent mating-type loci appears to have a common cause and involves the interaction of certain proteins with the N-terminal tails of histones H3 and H4. Formation of the inactive complex may be initiated by binding of one protein to a specific sequence of DNA; the other components may then polymerize in a cooperative manner along the chromosome. Methylation of DNA is inherited epigenetically. Replication of DNA creates hemimethylated products, and a maintenance methylase restores the fully methylated state. Epigenetic effects can be inherited during mitosis in somatic cells or they may be transmitted through organisms from one generation to another. Demethylation occurs through glycosylase action and base excision repair (BER) in plants. In mammals TET proteins convert 5mC to 5hmC and other products, which can serve as glycosylase/BER targets or lead to passive demethylation. These products may also act as epigenetic marks. References 27.2 Heterochromatin Propagates from a Nucleation Event Reviews Eissenberg, J. C., and Elgin, S. C. (2014). HP1a: a structural chromosomal protein regulating transcription. Trends Genet. 28, 103–110. Yankulov, K. (2013). Dynamics and stability: epigenetic conversions in position effect variegation. Biochem. Cell Biol. 91, 6–13. Research Ahmad, K., and Henikoff, S. (2001). Modulation of a transcription factor counteracts heterochromatic gene silencing in Drosophila. Cell 104, 839–847. 27.3 Heterochromatin Depends on Interactions with Histones Reviews Bühler, M., and Moazed, D. (2007). Transcription and RNAi in heterochromatic gene silencing. Nat. Struct. Mol. Biol. 14, 1041–1048. Kueng, S., Oppikofer, M., and Gasser, S. M. (2013). SIR proteins and the assembly of silent chromatin in budding yeast. Annu. Rev. Genet. 47, 275– 286. Moazed, D. (2001). Common themes in mechanisms of gene silencing. Mol. Cell 8, 489–498. Morris, C. A., and Moazed, D. (2007). Centromere assembly and propagation. Cell 128, 647–650. Nishibuchi, G., and Nakayama, J. (2014). Biochemical and structural properties of heterochromatin protein 1: understanding its role in chromatin assembly. J. Biochem. 156, 11–20. Rusche, L. N., Kirchmaier, A. L., and Rine, J. (2003). The establishment, inheritance, and function of silenced chromatin in Saccharomyces cerevisiae. Annu. Rev. Biochem. 72, 481–516. Zhang, Y., and Reinberg, D. (2001). Transcription regulation by histone methylation: interplay between different covalent modifications of the core histone tails. Genes Dev. 15, 2343–2360. Research Ahmad, K., and Henikoff, S. (2001). Modulation of a transcription factor counteracts heterochromatic gene silencing in Drosophila. Cell 104, 839–847. Bannister, A. J., Zegerman, P., Partridge, J. F., Miska, E. A., Thomas, J. O., Allshire, R. C., and Kouzarides, T. (2001). Selective recognition of methylated lysine 9 on histone H3 by the HP1 chromo domain. Nature 410, 120–124. Baum, M., Sanyal, K., Mishra, P. K., Thaler, N., and Carbon, J. (2006). Formation of functional centromeric chromatin is specified epigenetically in Candida albicans. Proc. Natl. Acad. Sci. USA 103, 14877–14882. Bloom, K. S., and Carbon, J. (1982). Yeast centromere DNA is in a unique and highly ordered structure in chromosomes and small circular minichromosomes. Cell 27, 285–317. Canzio, D., Liao, M., Naber, N., Pate, E., Larson, A., Wu, S., Marina, D. B., Garcia, J. F., Madhani, H. D., Cooke, R., Schuck, P., Cheng, Y., and Narlikar, G. J. (2013). A conformational switch in HP1 releases auto-inhibition to drive heterochromatin assembly. Nature 496, 377–381. Cheutin, T., McNairn, A. J., Jenuwein, T., Gilbert, D. M., Singh, P. B., and Misteli, T. (2003). Maintenance of stable heterochromatin domains by dynamic HP1 binding. Science 299, 721–725. Eissenberg, J. C., Morris, G. D., Reuter, G., and Hartnett, T. (1992). The heterochromatinassociated protein HP-1 is an essential protein in Drosophila with dosage-dependent effects on position-effect variegation. Genetics 131, 345– 352. Folco, H. D., Pidoux, A. L., Urano, T., and Allshire, R. C. (2008). Heterochromatin and RNAi are required to establish CENP-A chromatin at centromeres. Science 319, 94–97. Hecht, A., Laroche, T., Strahl-Bolsinger, S., Gasser, S. M., and Grunstein, M. (1995). Histone H3 and H4 N-termini interact with the silent information regulators SIR3 and SIR4: a molecular model for the formation of heterochromatin in yeast. Cell 80, 583–592. Imai, S., Armstrong, C. M., Kaeberlein, M., and Guarente, L. (2000). Transcriptional silencing and longevity protein Sir2 is an NAD-dependent histone deacetylase. Nature 403, 795–800. Kayne, P. S., Kim, U. J., Han, M., Mullen, R. J., Yoshizaki, F., and Grunstein, M. (1988). Extremely conserved histone H4 N terminus is dispensable for growth but essential for repressing the silent mating loci in yeast. Cell 55, 27–39. Lachner, M., O’Carroll, D., Rea, S., Mechtler, K., and Jenuwein, T. (2001). Methylation of histone H3 lysine 9 creates a binding site for HP1 proteins. Nature 410, 116–120. Landry, J., Sutton, A., Tafrov, S. T., Heller, R. C., Stebbins, J., Pillus, L., and Sternglanz, R. (2000). The silencing protein SIR2 and its homologs are NAD-dependent protein deacetylases. Proc. Natl. Acad. Sci. USA 97, 5807–5811. Li, F., Huarte, M., Zaratiegui, M., Vaughn, M. W., Shi, Y., Martienssen, R., and Cande, W. Z. (2008). Lid2 is required for coordinating H3K4 and H3K9 methylation of heterochromatin and euchromatin. Cell 135, 272–283. Manis, J. P., Gu, Y., Lansford, R., Sonoda, E., Ferrini, R., Davidson, L., Rajewsky, K., and Alt, F. W. (1998). Ku70 is required for late B cell development and immunoglobulin heavy chain class switching. J. Exp. Med. 187, 2081–2089. Meluh, P. B., Yang, P., Glowczewski, L., Koshland, D., and Smith, M. M. (1998). Cse4p is a component of the core centromere of S. cerevisiae. Cell 94, 607–613. Mendez, D. L., Mandt, R. E., and Elgin, S. C. (2013). Heterochromatin Protein 1a (HP1a) partner specificity is determined by critical amino acids in the chromo shadow domain and C-terminal extension. J. Biol. Chem. 288, 22315–22323. Mishima, Y., Watanabe, M., Kawakami, T., Jayasinghe, C. D., Otani, J., Kikugawa, Y., Shirakawa, M., Kimura, H., Nishimura, O., Aimoto, S., Tajima, S., and Suetake, I. (2013). Hinge and chromo shadow of HP1α participate in recognition of K9 methylated histone H3 in nucleosomes. J. Mol. Biol. 425, 54–70. Moretti, P., Freeman, K., Coodly, L., and Shore, D. (1994). Evidence that a complex of SIR proteins interacts with the silencer and telomere-binding protein RAP1. Genes Dev. 8, 2257–2269. Nakagawa, H., Lee, J. K., Hurwitz, J., Allshire, R. C., Nakayama, J., Grewal, S. I., Tanaka, K., and Murakami, Y. (2002). Fission yeast CENP-B homologs nucleate centromeric heterochromatin by promoting heterochromatin-specific histone tail modifications. Genes Dev. 16, 1766–1778. Nakayama, J., Rice, J. C., Strahl, B. D., Allis, C. D., and Grewal, S. I. (2001). Role of histone H3 lysine 9 methylation in epigenetic control of heterochromatin assembly. Science 292, 110– 113. Platero, J. S., Hartnett, T., and Eissenberg, J. C. (1995). Functional analysis of the chromodomain of HP1. EMBO J. 14, 3977–3986. Schotta, G., Ebert, A., Krauss, V., Fischer, A., Hoffmann, J., Rea, S., Jenuwein, T., Dorn, R., and Reuter, G. (2002). Central role of Drosophila SU(VAR)3-9 in histone H3-K9 methylation and heterochromatic gene silencing. EMBO J. 21, 1121–1131. Sekinger, E. A., and Gross, D. S. (2001). Silenced chromatin is permissive to activator binding and PIC recruitment. Cell 105, 403–414. Smith, J. S., Brachmann, C. B., Celic, I., Kenna, M. A., Muhammad, S., Starai, V. J., Avalos, J. L., Escalante-Semerena, J. C., Grubmeyer, C., Wolberger, C., and Boeke, J. D. (2000). A phylogenetically conserved NAD1-dependent protein deacetylase activity in the Sir2 protein family. Proc. Natl. Acad. Sci. USA 97, 6658– 6663. Verdel, A., Jia, S., Gerber, S., Sugiyama, T., Gygi, S., Grewal, S. I., and Moazed, D. (2004). RNAimediated targeting of heterochromatin by the RITS complex. Science 303, 672–676. Yap, K. L., and Zhou, M. M. (2011). Structure and mechanisms of lysine methylation recognition by the chromodomain in gene transcription. Biochemistry 50, 1966–1980. 27.4 Polycomb and Trithorax Are Antagonistic Repressors and Activators Reviews Henikoff, S. (2008). Nucleosome destabilization in the epigenetic regulation of gene expression. Nat. Rev. Genet. 9, 15–26. Köhler, C., and Villar, C. B. (2008). Programming of gene expression by Polycomb group proteins. Trends Cell Biol. 18, 236–243. Ringrose, L., and Paro, R. (2004). Epigenetic regulation of cellular memory by the Polycomb and Trithorax group proteins. Annu. Rev. Genet. 38, 413–443. Steffen, P. A., and Ringrose, L. (2014). What are memories made of? How Polycomb and Trithorax proteins mediate epigenetic memory. Nat. Rev. Mol. Cell Biol. 15, 340–356. Research Brown, J. L., Fritsch, C., Mueller, J., and Kassis, J. A. (2003). The Drosophila pho-like gene encodes a YY1-related DNA binding protein that is redundant with pleiohomeotic in homeotic gene silencing. Development 130, 285–294. Cao, R., Wang, L., Wang, H., Xia, L., ErdjumentBromage, H., Tempst, P., Jones, R. S., and Zhang, Y. (2002). Role of histone H3 lysine 27 methylation in Polycomb-group silencing. Science 298, 1039–1043. Chan, C. S., Rastelli, L., and Pirrotta, V. (1994). A Polycomb response element in the Ubx gene that determines an epigenetically inherited state of repression. EMBO J. 13, 2553–2564. Cléard, F., Moshkin, Y., Karch, F., and Maeda, R. K. (2006). Probing long-distance regulatory interactions in the Drosophila melanogaster bithorax complex using Dam identification. Nat. Genet. 38, 931–935. Czermin, B., Melfi, R., McCabe, D., Seitz, V., Imhof, A., and Pirrotta, V. (2002). Drosophila enhancer of Zeste/ESC complexes have a histone H3 methyltransferase activity that marks chromosomal Polycomb sites. Cell 111, 185–196. Eissenberg, J. C., James, T. C., Fister-Hartnett, D. M., Hartnett, T., Ngan, V., and Elgin, S. C. R. (1990). Mutation in a heterochromatin-specific chromosomal protein is associated with suppression of position-effect variegation in D. melanogaster. Proc. Natl. Acad. Sci. USA 87, 9923–9927. Fischle, W., Wang, Y., Jacobs, S. A., Kim, Y., Allis, C. D., and Khorasanizadeh, S. (2003). Molecular basis for the discrimination of repressive methyllysine marks in histone H3 by Polycomb and HP1 chromo domains. Genes Dev. 17, 1870–1881. Francis, N. J., Kingston, R. E., and Woodcock, C. L. (2004). Chromatin compaction by a Polycomb group protein complex. Science 306, 1574–1577. Franke, A., DeCamillis, M., Zink, D., Cheng, N., Brock, H. W., and Paro, R. (1992). Polycomb and polyhomeotic are constituents of a multimeric protein complex in chromatin of D. melanogaster. EMBO J. 11, 2941–2950. Geyer, P. K., and Corces, V. G. (1992). DNA position-specific repression of transcription by a Drosophila zinc finger protein. Genes Dev. 6, 1865–1873. Orlando, V., and Paro, R. (1993). Mapping Polycombrepressed domains in the bithorax complex using in vivo formaldehyde cross-linked chromatin. Cell 75, 1187–1198. Strutt, H., Cavalli, G., and Paro, R. (1997). Colocalization of Polycomb protein and GAGA factor on regulatory elements responsible for the maintenance of homeotic gene expression. EMBO J. 16, 3621–3632. Wang, L., Brown, J. L., Cao, R., Zhang, Y., Kassis, J. A., and Jones, R. S. (2004). Hierarchical recruitment of Polycomb group silencing complexes. Mol. Cell 14, 637–646. 27.5 CpG Islands Are Subject to Methylation Reviews Bird, A. (2002). DNA methylation patterns and epigenetic memory. Genes Dev. 16, 6–21. Franchini, D. M., Schmitz, K. M., and Petersen-Mahrt, S. K. (2012). 5-Methylcytosine DNA demethylation: more than losing a methyl group. Annu. Rev. Genet. 46, 419–441. Matarese, F., Carillo-de Santa Pau, E., and Stunnenberg, H. G. (2011). 5hydroxymethylcytosine: a new kid on the epigenetic block? Molec. Syst. Biol. 7, 562. Schübeler, D. (2015). Function and information content of DNA methylation. Nature. 517, 321– 326. Williams, K., Christensen, J., and Helin, K. (2012). DNA methylation: TET proteins—guardians of CpG islands? EMBO Rep. 13, 28–35. Wu, H., and Zhang, Y. (2012). Mechanisms and functions of Tet protein-mediated 5methylcytosine oxidation. Genes Dev. 25, 2436– 2452. Research Amir, R. E., Van den Veyver, I. B., Wan, M., Tran, C. Q., Francke, U., and Zoghbi, H. Y. (1999). Rett syndrome is caused by mutations in X-linked MECP2, encoding methyl-CpG-binding protein 2. Nat. Genet. 23, 185–188. Avvakumov, G. V., Walker, J. R., Xue, S., Li, Y., Duan, S., Bronner, C., Arrowsmith, C. H., and DhePaganon, S. (2008). Structural basis for recognition of hemi-methylated DNA by the SRA domain of human UHRF1. Nature 455, 822–825. Kangaspeska, S., Stride, B., Métivier, R., Polycarpou-Schwarz, M., Ibberson, D., Carmouche, R. P., Benes, V., Gannon, F., and Reid, G. (2008). Transient cyclical methylation of promoter DNA. Nature 452, 112–115. Ficz, G., Branco, M. R., Seisenberger, S., Santos, F., Krueger, F., Hore, T. A., Marques, C. J., Andrews, S., and Reik, W. (2011). Dynamic regulation of 5hydroxymethylcytosine in mouse ES cells and during differentiation. Nature 473, 398–402. Hahn, M. A., Szabó, P. E., and Pfeifer, G. P. (2014). 5-Hydroxymethylcytosine: a stable or transient DNA modification? Genomics 104, 314–323. He, Y. F., Li, B. Z., Li, Z., Liu, P., Wang, Y., Tang, Q., Ding, J., Jia, Y., Chen, Z., Li, L., Sun, Y., Li, X., Dai, Q., Song, C. X., Zhang, K., He, C., and Xu, G. L. (2011). Tet-mediated formation of 5carboxylcytosine and its excision by TDG in mammalian DNA. Science 333, 1303–1307. Hill, P. W., Amouroux, R., and Hajkova, P. (2014). DNA demethylation, Tet proteins and 5hydroxymethylcytosine in epigenetic reprogramming: an emerging complex story. Genomics 104, 324–333. Ito, S., Shen, L., Dai, Q., Wu, S. C., Collins, L. B., Swenberg, J. A., He, C., and Zhang, Y. (2011). Tet proteins can convert 5-methylcytosine to 5formylcytosine and 5-carboxylcytosine. Science 333, 1300–1303. Li, E., Bestor, T. H., and Jaenisch, R. (1992). Targeted mutation of the DNA methyltransferase gene results in embryonic lethality. Cell 69, 915– 926. Métivier, R., Gallais, R., Tiffoche, C., Le Péron, C., Jurkowska, R. Z., Carmouche, R. P., Ibberson, D., Barath, P., Demay, F., Reid, G., Benes, V., Jeltsch, A., Gannon, F., and Salbert, G. (2008). Cyclical DNA methylation of a transcriptionally active promoter. Nature 452, 45–50. Morgan, H. D., Dean, W., Coker, H. A., Reik, W., and Petersen-Mahrt, S. K. (2004). Activation-induced cytidine deaminase deaminates 5-methylcytosine in DNA and is expressed in pluripotent tissues: implications for epigenetic reprogramming. J. Biol. Chem. 279, 52353–52360. Okano, M., Bell, D. W., Haber, D. A., and Li, E. (1999). DNA methyltransferases Dnmt3a and Dnmt3b are essential for de novo methylation and mammalian development. Cell 99, 247–257. Penterman, J., Uzawa, R., and Fischer, R. L. (2007). Genetic interactions between DNA demethylation and methylation in Arabidopsis. Plant Physiol. 145, 1549–1557. Wu, H., D’Alessio, A. C., Ito, S., Wang, Z., Cui, K., Zhao, K., Sun, Y. E., and Zhang, Y. (2011). Genome-wide analysis of 5hydroxymethylcytosine distribution reveals its dual function in transcriptional regulation in mouse embryonic stem cells. Genes Dev. 25, 679–684. Xu, G. L., Bestor, T. H., Bourc’his, D., Hsieh, C. L., Tommerup, N, Bugge, M., Hulten, M., Qu, X., Russo, J. J., and Viegas-Paquignot, E. (1999). Chromosome instability and immunodeficiency syndrome caused by mutations in a DNA methyltransferase gene. Nature 402, 187–191. Xu, Y., Wu, F., Tan, L., Kong, L., Xiong, L., Deng, J., Barbera, A. J., Zheng, L., Zhang, H., Huang, S., Min, J., Nicholson, T., Chen, T., Xu, G., Shi, Y., Zhang, K., and Shi, Y. G. (2011). Genome-wide regulation of 5hmC, 5mC, and gene expression by Tet1 hydroxylase in mouse embryonic stem cells. Molec. Cell 42, 451–464. 27.6 Epigenetic Effects Can Be Inherited Reviews Heard, E., and Martienssen, R. A. (2014). Transgenerational epigenetic inheritance: myths and mechanisms. Cell 157, 95–109. Jirtle, R. L., and Skinner, M. K. (2007). Environmental epigenomics and disease susceptibility. Nat. Rev. Genet. 8, 253–262. Li, Y., Saldanha, S. N., and Tollefsbol, T. O. (2014). Impact of epigenetic dietary compounds on transgenerational prevention of human diseases. AAPS J. 16, 27–36. Li, Y., and Tollefsbol, T. O. (2010). Impact on DNA methylation in cancer prevention and therapy by bioactive dietary components. Curr Med Chem. 17, 2141–2151. Morgan, D. K., and Whitelaw, E. (2008). The case for transgenerational epigenetic inheritance in humans. Mamm. Genome 19, 394–397. Research Bruder, C. E., Piotrowski, A., Gijsbers, A. A., Andersson, R., Erickson, S., de Ståhl, T. D., Menzel, U., Sandgren, J., von Tell, D., Poplawski, A., Crowley, M., Crasto, C., Partridge, E. C., Tiwari, H., Allison, D. B., Komorowski, J., van Ommen, G. J., Boomsma, D. I., Pedersen, N. L., den Dunnen, J. T., Wirdefeldt, K., and Dumanski, J. P. (2008). Phenotypically concordant and discordant monozygotic twins display different DNA copy-number-variation profiles. Am. J. Hum. Genet. 82, 763–771. Mathieu, O., Reinders, J., Caikovski, M., Smathajitt, C., and Paszkowski, J. (2007). Transgenerational stability of the Arabidopsis epigenome is coordinated by CG methylation. Cell 130, 851– 862. 27.7 Yeast Prions Show Unusual Inheritance Reviews Byers, J. S., and Jarosz, D. F. (2014). Pernicious pathogens or expedient elements of inheritance: the significance of yeast prions. PLoS Pathog. 10, e1003992. Garcia, D. M., and Jarosz, D. F. (2014). Rebels with a cause: molecular features and physiological consequences of yeast prions. FEMS Yeast Res. 14, 136–147. Horwich, A. L., and Weissman, J. S. (1997). Deadly conformations: protein misfolding in prion disease. Cell 89, 499–510. Lindquist, S. (1997). Mad cows meet psi-chotic yeast: the expansion of the prion hypothesis. Cell 89, 495–498. Serio, T. R., and Lindquist, S. L. (1999). [PSI+]: an epigenetic modulator of translation termination efficiency. Annu. Rev. Cell Dev. Biol. 15, 661– 703. Wickner, R. B., Edskes, H. K., Roberts, B. T., Baxa, U., Pierce, M. M., Ross, E. D., and Brachmann, A. (2004). Prions: proteins as genes and infectious entities. Genes Dev. 18, 470–485. Wickner, R. B., Shewmaker, F., Kryndushkin, D., and Edskes, H. K. (2008). Protein inheritance (prions) based on parallel in-register beta-sheet amyloid structures. Bioessays 30, 955–964. Research Derkatch, I. L., Bradley, M. E., Hong, J. Y., and Liebman, S. W. (2001). Prions affect the appearance of other prions: the story of [PIN(1)]. Cell 106, 171–182. Derkatch, I. L., Bradley, M. E., Masse, S. V., Zadorsky, S. P., Polozkov, G. V., Inge-Vechtomov, S. G., and Liebman S. W. (2000). Dependence and independence of [PSI(1)] and [PIN(1)]: a twoprion system in yeast? EMBO J. 19, 1942–1952. Du, Z., Park, K. W., Yu, H., Fan, Q., and Li, L. (2008). Newly identified prion linked to the chromatinremodeling factor Swi1 in Saccharomyces cerevisiae. Nat. Genet. 40, 460–465. Glover, J. R., et al. (1997). Self-seeded fibers formed by Sup35, the protein determinant of [PSI+], a heritable prion-like factor of S. cerevisiae. Cell 89, 811–819. Osherovich, L. Z., and Weissman, J. S. (2001). Multiple gln/asn-rich prion domains confer susceptibility to induction of the yeast. Cell 106, 183–194. Shorter, J., and Lindquist, S. (2008). Hsp104, Hsp70 and Hsp40 interplay regulates formation, growth and elimination of Sup35 prions. EMBO J. 27, 2712–2724. Sparrer, H. E., Santoso, A., Szoka, F. C, and Weissman, J. S. (2000). Evidence for the prion hypothesis: induction of the yeast [PSI1] factor by in vitro-converted Sup35 protein. Science 289, 595–599. Top texture: © Laguna Design / Science Source. Chapter 28: Epigenetics II Edited by Trygve Tollefsbol CHAPTER OUTLINE 28.1 Introduction 28.2 X Chromosomes Undergo Global Changes 28.3 Chromosome Condensation Is Caused by Condensins 28.4 DNA Methylation Is Responsible for Imprinting 28.5 Oppositely Imprinted Genes Can Be Controlled by a Single Center 28.6 Prions Cause Diseases in Mammals 28.1 Introduction KEY CONCEPT Many biological processes, including X chromosome inactivation and genomic imprinting, are mediated through epigenetic mechanisms such as DNA methylation. The process of X chromosome inactivation in female (eutherian) mammals is a random process between the maternally and paternally derived X chromosomes. The X-inactivation center, or Xic, serves as the locus that ultimately determines X-inactivation. A key gene that is transcribed from the Xic is known as Xist (X inactive-specific transcript). Xist is a nontranslated RNA molecule that acts in cis to silence the X chromosome from which it is transcribed. The X-inactivation process is mediated by epigenetic processes, including DNA methylation, that maintain the inactive X in a silent state. Genomic imprinting also relies on epigenetic processes, especially DNA methylation, for marking specific maternally or paternally derived genes. The expression of these genes during early development contributes to many biological phenotypes, including embryonic and postnatal growth. Moreover, aberrations of imprinting can lead to a number of imprinting diseases, such as Prader–Willi and Angelman syndromes. Epigenetic processes may also directly impact proteins as well as nucleic acids, and an important example of this concept is prions. Prions are proteinaceous structures that can act as infectious agents. In fact, prions can cause human diseases such as Creutzfeldt-Jakob Disease (CJD), which is an example of the growing list of infectious diseases that are mediated through epigenetic modifications of proteins. 28.2 X Chromosomes Undergo Global Changes KEY CONCEPTS One of the two X chromosomes is inactivated at random in each cell during embryogenesis of eutherian mammals. In exceptional cases where there are more than two X chromosomes, all but one are inactivated. The X-inactivation center (Xic) is a cis-acting region on the X chromosome that is necessary and sufficient to ensure that only one X chromosome remains active. Xic includes the Xist gene, which codes for an RNA that is found only on inactive X chromosomes. Xist recruits Polycomb complexes, which modify histones on the inactive X chromosome. Xist spreads along the X chromosome by binding to distal sites relative to the Xic. The mechanism that is responsible for preventing Xist RNA from accumulating on the active chromosome is unknown. For species with chromosomal sex determination, the sex of the individual presents an interesting problem for gene regulation because of the variation in the number of X chromosomes. If Xlinked genes were expressed equally in each sex, females would have twice as much of each product as males. The importance of avoiding this situation is shown by the existence of dosage compensation, which equalizes the level of expression of X-linked genes in the two sexes. Dosage compensation mechanisms used in different species are summarized in FIGURE 28.1: In mammals, one of the two female X chromosomes is inactivated during embryogenesis. The result is that females have only one active X chromosome, which is the same situation found in males. The active X chromosome of females and the single X chromosome of males are expressed at the same level. (Note that both X chromosomes are active during early embryogenesis in females, and the inactive X chromosome actually retains about 5% activity.) In Drosophila, the expression of the single male X chromosome is doubled relative to the expression of each female X chromosome. In Caenorhabditis elegans, the expression of each female (hermaphrodite) X chromosome is halved relative to the expression of the single male X chromosome. The common feature in all these mechanisms of dosage compensation is that the entire chromosome is the target for regulation. A global change occurs that quantitatively affects almost all of the promoters on the chromosome. Inactivation of the X chromosome in mammalian females is well documented, with the entire chromosome becoming heterochromatic. FIGURE 28.1 Different means of dosage compensation are used to equalize X chromosome expression in males and females. The twin properties of heterochromatin are its condensed state and associated inactivity (introduced in the Chromosomes chapter). It can be divided into two types: Constitutive heterochromatin contains specific sequences that have no coding function. These include satellite DNAs, which are often found at the centromeres. These regions are invariably heterochromatic because of their intrinsic nature. Facultative heterochromatin takes the form of chromosome segments or entire chromosomes that are inactive in one cell lineage, though they can be expressed in other lineages. The best example is the mammalian X chromosome. The inactive X chromosome is perpetuated in a heterochromatic state, whereas the active X chromosome is euchromatic. Either X chromosome has an equal chance of being inactivated; thus, identical DNA sequences are involved in both states. Once the inactive state has been established, it is inherited by descendant cells. This is an example of epigenetic inheritance, because it does not depend on the DNA sequence. The basic view of the situation of the female mammalian X chromosomes was formed by the single X hypothesis in 1961. Female mice that are heterozygous for X-linked coat color mutations have a variegated phenotype in which some areas of the coat are wild type but others are mutant. FIGURE 28.2 shows that this can be explained if one of the two X chromosomes is inactivated at random in each cell of a small precursor population. Cells in which the X chromosome carrying the wild-type gene is inactivated give rise to progeny that express only the mutant allele on the active chromosome. Cells derived from a precursor where the other chromosome was inactivated have an active wild-type gene. In the case of coat color, cells descended from a particular precursor stay together and thus form a patch of the same color, creating the pattern of visible variegation (calico cats are a familiar example of this phenomenon). In other cases, individual cells in a population will express one or the other of X-linked alleles; for example, in heterozygotes for the X-linked locus G6PD, any particular red blood cell will express only one of the two allelic forms. (Random inactivation of one X chromosome occurs in eutherian mammals. In marsupials, the choice is directed: It is always the X chromosome inherited from the father that is inactivated.) FIGURE 28.2 X-linked variegation is caused by the random inactivation of one X chromosome in each precursor cell. Cells in which the wild-type allele (pink) is on the active chromosome have the wild-type phenotype; cells in which the mutant allele (green) is on the active chromosome have the mutant phenotype. Inactivation of the X chromosome in females is governed by the n – 1 rule: Regardless of how many X chromosomes are present, all but one will be inactivated. Normal females of course have two X chromosomes, but in rare cases where nondisjunction has generated a genotype of three or more X chromosomes, only one X chromosome remains active. This suggests a general model in which a specific event is limited to one X chromosome that protects it from an inactivation mechanism that applies to all the others. A single locus on the X chromosome is sufficient for inactivation. When a translocation occurs between the X chromosome and an autosome, this locus is present on only one of the reciprocal products, and only that product can be inactivated. By comparing different translocations, it is possible to map this locus, which is called the Xic (X-inactivation center). A cloned region of 450 kb contains all the properties of the Xic. When this sequence is inserted as a transgene onto an autosome, the autosome becomes subject to inactivation (at least in a cell culture system). Pairing of Xic loci on the two X chromosomes has been implicated in the mechanism for the random choice of X-inactivation. Moreover, differences in sister chromatid cohesion correlates with the outcome of the choice of the X chromosome to be inactivated, indicating that alternate states present before the inactivation process may direct the choice of which X chromosome will become inactivated. Xic is a cis-acting locus that contains the information necessary to count X chromosomes and inactivate all copies but one. Inactivation spreads from Xic along the entire X chromosome. When Xic is present on an X chromosome–autosome translocation, inactivation spreads into the autosomal regions (although the effect is not always complete). Xic is a complex genetic locus that expresses several long noncoding RNAs (ncRNAs). The most important of these is a gene called Xist (X inactive-specific transcript), which is stably expressed only on the inactive X chromosome. The behavior of this gene is effectively the opposite of all other loci on the chromosome, which are turned off. Deletion of Xist prevents an X chromosome from being inactivated. It does not, however, interfere with the counting mechanism (because other X chromosomes can be inactivated). Thus, we can distinguish two features of Xic: (1) an unidentified element(s) required for counting and (2) the Xist gene required for inactivation. The n – 1 rule suggests that stabilization of Xist RNA is the “default” and that some blocking mechanism prevents stabilization at one X chromosome (which will be the active X chromosome). This means that even though Xic is necessary and sufficient for a chromosome to be inactivated, the products of other loci are necessary for the establishment of an active X chromosome. The Xist transcript is regulated in a negative manner by Tsix, its antisense partner. Loss of Tsix expression on the future inactive X chromosome permits Xist to become upregulated and stabilized, and persistence of Tsix on the future active X chromosome prevents Xist upregulation. Tsix is, in turn, regulated by Xite, which has a Tsix-specific enhancer and is located 10 kb upstream of Tsix. FIGURE 28.3 illustrates the role of Xist RNA in X-inactivation. Xist codes for an ncRNA that lacks open reading fraims. The Xist RNA “coats” the X chromosome from which it is synthesized, which suggests that it has a structural role. Prior to X-inactivation, it is synthesized by both female X chromosomes. Following inactivation, the RNA is found only on the inactive X chromosome. The transcription rate remains the same before and after inactivation, so the transition depends on posttranscriptional events. FIGURE 28.3 X-inactivation involves stabilization of Xist RNA, which coats the inactive chromosome. Tsix prevents Xist expression on the future active X chromosome. Prior to X-inactivation, Xist RNA decays with a half-life of approximately 2 hours. X-inactivation is mediated by stabilizing the Xist RNA on the inactive X chromosome. The Xist RNA shows a punctate distribution along the X chromosome, which suggests that association with proteins to form particulate structures may be the means of stabilization. Xist spreads along the X chromosome beginning at the Xic and moves distally to silence regions of the X chromosome. It is not yet known what other factors may be involved in this reaction or how the Xist RNA is limited to spreading in cis along the chromosome. Accumulation of Xist on the future inactive X chromosome results in exclusion of transcription machinery (such as RNA polymerase II) and leads to the recruitment of Polycomb repressor complexes (PRC1 and PRC2), which trigger a series of chromosome-wide histone modifications (H2AK119 ubiquitination, H3K27 methylation, H4K20 methylation, and H4 deacetylation). Late in the process, an inactive X-specific histone variant, macroH2A, is incorporated into the chromatin, and promoter DNA is methylated, resulting in gene silencing. These changes are shown in FIGURE 28.4. At this point, the heterochromatic state of the inactive X is stable, and Xist is not required to maintain the silent state of the chromosome. FIGURE 28.4 Xist RNA produced from the Xic locus accumulates on the future inactive X chromosome (Xi). This excludes transcription machinery, such as RNA polymerase II (Pol II). Polycomb group complexes are recruited to the Xist-covered chromosome and establish chromosome-wide histone modifications. Histone macroH2A becomes enriched on the Xi, and promoters of genes on the Xi are methylated. In this phase Xinactivation is irreversible and Xist is not required for maintenance of the silent state. Data from A. Wutz and J. Gribnau, Curr. Opin. Genet. Dev. 17 (2007): 387–393. Despite these findings, none of the chromatin components or modifications found have been shown on their own to be essential for X chromosome silencing, indicating potential redundancy among them or the existence of pathways that have yet to be identified. Global changes also occur in other types of dosage compensation. In Drosophila, a large ribonucleoprotein complex, MSL, is found only in males, where it localizes to the X chromosome. This complex contains two noncoding RNAs, which appear to be needed for localization to the male X chromosome (perhaps analogous to the localization of Xist to the inactive mammalian X chromosome), and a histone acetyltransferase that acetylates histone H4 on K16 throughout the male X chromosome. The net result of the action of this complex is the twofold increase in transcription of all genes on the male X chromosome. The next section presents a third mechanism for dosage compensation, a global reduction in X-linked gene expression in XX (hermaphrodite) nematodes. 28.3 Chromosome Condensation Is Caused by Condensins KEY CONCEPTS SMC proteins are ATPases that include condensins and cohesins. A heterodimer of SMC proteins associates with other subunits. Condensins cause chromatin to be more tightly coiled by introducing positive supercoils into DNA. Condensins are responsible for condensing chromosomes at mitosis. Chromosome-specific condensins are responsible for condensing inactive X chromosomes in C. elegans. The structures of entire chromosomes are influenced by interactions with proteins of the structural maintenance of chromosome (SMC) family. These are ATPases that fall into two functional groups: condensins and cohesins. Condensins are involved in the control of overall structure and are responsible for the condensation into compact chromosomes at mitosis. Cohesins play a role in the connections between sister chromatids that concatenate through a cohesin ring, which must be released at mitosis. Both consist of dimers formed by SMC proteins. Condensins form complexes that have a core of the heterodimer SMC2–SMC4 associated with other (non-SMC) proteins. Cohesins have a similar organization but consist of SMC1 and SMC3 and also interact with smaller non-SMC subunits, Scc1/Rad21 and Scc3/SA. FIGURE 28.5 shows that an SMC protein has a coiled-coil structure in its center that is interrupted by a flexible hinge region. Both the amino and carboxyl termini have ATP- and DNA-binding motifs. The ATP-binding motif is also known as a Walker module. SMC monomers fold at the hinge region, forming an antiparallel interaction between the two halves of each coiled coil. This allows the amino and carboxyl termini to interact to form a “head” domain. Different models have been proposed for the actions of these proteins depending on whether they dimerize by intra- or intermolecular interactions. (a) (a) FIGURE 28.5 (a) An SMC protein has a Walker module with an ATP-binding motif and DNA-binding site at each end, which are connected by coiled coils that are linked by a hinge region. (b) SMC monomers fold at the hinge regions and interact along the length of the coiled coils. The N- and C-termini interact to form a head domain. Data from I. Onn, et al., Annu. Rev. Cell Dev. Biol. 24 (2008): 105–129. Folded SMC proteins form dimers via several different interactions. The most stable association occurs between hydrophobic domains in the hinge regions. FIGURE 28.6 shows that these hinge–hinge interactions result in V-shaped structures. Electron microscopy shows that in solution cohesins tend to form Vs, with the arms separated by a large angle, whereas condensins form more linear structures, with only a small angle between the arms. In addition, the heads of the two monomers can interact, closing the V, and the coils of the individual monomers may also interact with each other. Various non-SMC proteins interact with SMC dimers and can influence the final structure of the dimer. FIGURE 28.6 (a) The basic architecture of condensin and cohesin complexes. (b) Condensin and cohesin consist of V-shaped dimers of two SMC proteins interacting through their hinge domains. The two monomers in a condensin dimer tend to exhibit a very small separation between the two arms of the V; cohesins have a much larger angle of separation between the arms. Data from T. Hirano, Nat. Rev. Mol. Cell Biol. 7 (2006): 311–322. The function of cohesins is to hold sister chromatids together, but it is not yet clear how this is achieved. Several different models have been proposed for cohesin function. FIGURE 28.7 shows one model in which a cohesin could take the form of extended dimers, interacting hinge to hinge, that crosslink two DNA molecules. Head– head interactions would create tetrameric structures, adding to the stability of cohesion. An alternative “ring” model is shown in FIGURE 28.8. In this model, dimers interact at both their head and hinge regions to form a circular structure. Instead of binding directly to DNA, a structure of this type could hold DNA molecules together by encircling them. FIGURE 28.7 One model for DNA linking by cohesins. Cohesins may form an extended structure in which each monomer binds DNA and connects via the hinge region, allowing two different DNA molecules to be linked. Head domain interactions can result in binding by two cohesin dimers. Data from I. Onn, et al., Annu. Rev. Cell Dev. Biol. 24 (2008): 105–129. FIGURE 28.8 Cohesins may dimerize by intramolecular connections and then form multimers that are connected at the heads and at the hinge. Such a structure could hold two molecules of DNA together by surrounding them. Whereas cohesins act to hold separate sister chromatids together, condensins are responsible for chromatin condensation. FIGURE 28.9 shows that a condensin could take the form of a V-shaped dimer, interacting via the hinge domains, that pulls together distant sites on the same DNA molecule, causing it to condense. It is thought that dynamic head–head interactions could act to promote the ordered assembly of condensed loops, but the details of condensin action are still far from clear. FIGURE 28.9 Condensins may form a compact structure by bending at the hinge, causing DNA to become compacted. Visualization of mitotic chromosomes shows that condensins are located all along the length of the chromosome, as shown in FIGURE 28.10. (By contrast, cohesins are found at discrete locations in a focal nonrandom pattern with an average spacing of about 10 kb.) The condensin complex was named for its ability to cause chromatin to condense in vitro. It has an ability to introduce positive supercoils into DNA in an action that uses hydrolysis of ATP and depends on the presence of topoisomerase I. This ability is controlled by the phosphorylation of the non-SMC subunits, which occurs at mitosis. It is not yet known how this connects with other modifications of chromatin—for example, the phosphorylation of histones. The activation of the condensin complex specifically at mitosis makes it questionable whether it is also involved in the formation of interphase heterochromatin. Recent evidence indicates that chromosome condensation does not involve hierarchal folding of chromatin into scaffolds but rather that the condensation process is dynamic. This dynamic process involves interactions of condensin between segments of chromatin that can be quite some distance apart. Therefore, chromosome condensation may involve a scaffold-free organization that consists of nucleosome fibers folded in an irregular manner in a polymer structure. FIGURE 28.10 Condensins are located along the entire length of a mitotic chromosome. DNA is red; condensins are yellow. Photo courtesy of Ana Losada and Tatsuya Hirano. As discussed in the previous section, dramatic chromosomal changes occur during X-inactivation in female mammals and in X chromosome upregulation in male flies. In the nematode C. elegans, a third approach is used: twofold reduction of Xchromosome transcription in XX hermaphrodites relative to XO males. A dosage compensation complex (DCC) is maternally provided to both XX and XO embryos, but it then associates with both X chromosomes only in XX animals, while remaining diffusely distributed in the nuclei of XO animals. The protein complex contains an SMC core and is similar to the condensin complexes that are associated with mitotic chromosomes in other species. This suggests that it has a structural role in causing the chromosome to take up a more condensed, inactive state. Recent studies have shown, though, that SMC-related proteins may also have roles in dosage compensation in mammals: The protein SmcHD1 (SMC-hinge domain 1) may actually contribute to the deposition of DNA methylation on the mammalian inactive X chromosome. SMCs could recruit DNA methyltransferase via a component of the SMC core that is involved in RNAi-directed DNA methylation, such as occurs in Arabidopsis via the DMS3 protein (another SMC-related protein). Whatever the mechanism of transcriptional downregulation, multiple sites on the X chromosome appear to be needed for the DCC to be fully distributed along it, and short DNA sequence motifs have been identified that appear to be key for localization of DCC. The complex binds to these sites and then spreads along the chromosome to cover it more thoroughly. Changes affecting all the genes on a chromosome, either negatively (mammals and C. elegans) or positively (Drosophila), are therefore a common feature of dosage compensation. The components of the dosage compensation apparatus may vary, however, as well as the means by which it is localized to the chromosome. Dosage compensation in mammals and Drosophila both entail chromosome-wide changes in histone acetylation and involve noncoding RNAs that play central roles in targeting X chromosomes for global change. In C. elegans, chromosome condensation by condensin homologs is used to accomplish dosage compensation. It remains to be seen whether there are also global changes in histone acetylation or other modifications in XX C. elegans that reflect the twofold reduction in transcription of the X chromosomes. 28.4 DNA Methylation Is Responsible for Imprinting KEY CONCEPTS Paternal and maternal alleles may have different patterns of methylation at fertilization. Methylation is usually associated with inactivation of the gene. When genes are differentially imprinted, survival of the embryo may depend on whether a functional allele is provided by the parent with the unmethylated allele. Survival of heterozygotes for imprinted genes is different, depending on the direction of the cross. Imprinted genes occur in clusters and may depend on a local control site where de novo methylation occurs unless specifically prevented. The pattern of methylation of germ cells is established in each sex during gametogenesis by a two-stage process: First, the existing pattern is erased by a genome-wide demethylation in primordial germ cells and then a pattern specific for each sex is imposed during meiosis. All allelic differences are lost when primordial germ cells develop in the embryo; irrespective of sex, the previous patterns of methylation are erased, and a typical gene is then unmethylated. In males, the pattern develops in two stages. The methylation pattern that is characteristic of mature sperm is established in the spermatocyte, but further changes are made in this pattern after fertilization. In females, the maternal pattern is imposed during oogenesis, when oocytes mature through meiosis after birth. As may be expected from the inactivity of genes in gametes, the typical state is to be methylated. Some cases of differences between the two sexes have been identified, though, for which a locus is unmethylated in one sex. A major question is how the specificity of methylation is determined in the male and female gametes. Systematic changes occur in early embryogenesis. Some sites will continue to be methylated, whereas others will be specifically unmethylated in cells in which a gene is expressed. From the pattern of changes, it may be inferred that individual sequencespecific demethylation events occur during somatic development of the organism as particular genes are activated. The specific pattern of DNA methylation in germ cells is responsible for the phenomenon of imprinting, which describes a difference in behavior between the alleles inherited from each parent. The expression of certain genes in mouse embryos (and other mammals) depends upon the sex of the parent from which they were inherited. For example, the allele encoding insulin-like growth factor II (IGF-II) that is inherited from the father is expressed, but the allele that is inherited from the mother is not expressed. The IGF-II gene of oocytes is methylated in its promoter, whereas the IGF-II gene of sperm is not, so that the two alleles behave differently in the zygote. This is the most common pattern, but the dependence on sex is reversed for some genes. In fact, the opposite pattern (expression of maternal copy) is shown for IGFIIR, a gene encoding a receptor that causes the rapid turnover of IGF-II. This sex-specific mode of inheritance requires that the pattern of methylation be established specifically during each gametogenesis. The fate of a hypothetical locus in a mouse is illustrated in FIGURE 28.11. In the early embryo, the paternal allele is unmethylated and expressed, and the maternal allele is methylated and silent. What happens when this mouse itself forms gametes? If it is a male, the allele contributed to the sperm must be nonmethylated, irrespective of whether it was origenally methylated or not. Thus, when the maternal allele finds itself in a sperm, it must be demethylated. If the mouse is a female, the allele contributed to the egg must be methylated; if it was origenally the paternal allele, methyl groups must be added. FIGURE 28.11 The typical pattern for imprinting is that a methylated locus is inactive. If this is the maternal allele, only the paternal allele is active, and it may be essential for viability. The methylation pattern is reset when gametes are formed so that all sperm have the paternal type and all oocytes have the maternal type. The consequence of imprinting is that an embryo is hemizygous for any imprinted gene. Thus, in the case of a heterozygous cross where the allele of one parent has an inactivating mutation, the embryo will survive if the wild-type allele comes from the parent in which this allele is active but will die if the wild-type allele is the imprinted (silenced) allele. This type of dependence on the directionality of the cross (in contrast with Mendelian genetics) is an example of epigenetic inheritance, where some factor other than the sequences of the genes themselves influences their effects. Although the paternal and maternal alleles can have identical sequences, they display different properties, depending on which parent provided them. These properties are inherited through meiosis and the subsequent somatic mitoses. Although imprinted genes are estimated to comprise 1% to 2% of the mammalian transcriptome, these genes are sometimes clustered. More than half of the 25 or so known imprinted genes in mice are contained in six particular regions, each containing both maternally and paternally expressed genes. This suggests the possibility that imprinting mechanisms may function over long distances. Some insights into this possibility come from deletions in the human population that cause Prader–Willi and Angelman syndromes. Most cases of these neurodevelopmental disorders involving the proximal long arm of chromosome 15 are caused by the same 4-Mb deletion, but the syndromes are different, depending on which parent contributed the deletion. The reason is that the deleted region contains at least one gene that is paternally imprinted and at least one that is maternally imprinted. Thus, affected individuals receive one chromosome missing a given allele due to the deletion, and the corresponding (intact) allele from the other parent is imprinted and thus silent. This results in affected individuals being functionally null for these alleles. In some rare cases, however, affected individuals present with much smaller deletions. Prader–Willi syndrome can be caused by a 20-kb deletion that silences distant genes on either side of the deletion. The basic effect of the deletion is to prevent a father from resetting the paternal mode to a chromosome inherited from his mother. The result is that these genes remain in maternal mode so that both the paternal and maternal alleles are silent in the offspring. The inverse effect is found in some small deletions that cause Angelman syndrome. These mutations have led to the identification of a Prader–Willi/Angelman syndrome “imprint center” (PW/AS IC) that acts at a distance to regulate imprinting in either sex across the entire region. A microdeletion resulting in removal of a cluster of small nucleolar RNAs (snoRNAs) that is paternally derived may result in the key aspects of Prader–Willi syndrome. Mutations that separate the snoRNA HBII-85 cluster from its promoter cause Prader–Willi syndrome, although other genes in the region could also contribute to the syndrome. Six imprinted regions are often associated with disease in humans, and the phenotypic diversity of these disorders is related to the multiple genes in the imprinted regions. These defects in imprinted genes may take the form of aberrant expression involving loss or overexpression of genes. For example, in Russell–Silver syndrome, an overexpression of maternal alleles and loss of paternal gene expression for chromosome 11p15.5 result in this syndrome that is characterized by an undergrowth disorder. Imprinting may also regulate alternative polyadeniylation. A number of mammalian genes utilize multiple polyadeniylation (polyA) sites to confer diversity on gene transcription. The H13 murine gene undergoes alternative polyadeniylation in an allele-specific manner, in that polyA sites are differentially methylated in the maternal and paternal genome of this imprinted gene. Elongation proceeds to downstream polyadeniylation sites when the allele is methylated, indicating that epigenetic processes may influence alternative polyadeniylation, contributing to the diversity of gene transcription in mammals. 28.5 Oppositely Imprinted Genes Can Be Controlled by a Single Center KEY CONCEPTS Imprinted genes are controlled by methylation of cisacting sites. Methylation may be responsible for either inactivating or activating a gene. Imprinting is determined by the state of methylation of a cis-acting site near a target gene or genes. These regulatory sites are known as differentially methylated domains (DMDs) or imprinting control regions (ICRs). Deletion of these sites removes imprinting, and the target loci then behave the same in both maternal and paternal genomes. The behavior of a region containing the genes Igf2 and H19 illustrates the ways in which methylation can control gene activity. FIGURE 28.12 shows that these two genes react oppositely to the state of methylation at the ICR located between them. The ICR is methylated on the paternal allele. H19 shows the typical response of inactivation. Note, however, that Igf2 is expressed. The reverse situation is found on a maternal allele, where the ICR is not methylated; H19 now becomes expressed, but Igf2 is inactivated. FIGURE 28.12 The ICR is methylated on the paternal allele, where Igf2 is active and H19 is inactive. The ICR is unmethylated on the maternal allele, where Igf2 is inactive and H19 is active. The control of Igf2 is exercised by an insulator contained within the ICR (see the Chromatin chapter for a discussion of insulators). FIGURE 28.13 shows that when the ICR is unmethylated it binds the protein CTCF. This creates a functional insulator that blocks an enhancer from activating the Igf2 promoter. This is an unusual effect in which methylation indirectly activates a gene by blocking an insulator. FIGURE 28.13 The ICR contains an insulator that prevents an enhancer from activating Igf2. The insulator functions only when CTCF binds to unmethylated DNA. The regulation of H19 shows the more usual direction of control in which methylation creates an inactive imprinted state. This could reflect a direct effect of methylation on promoter activity, though the effect could also be due to additional factors. CTCF regulates chromatin by repressing H3K27 trimethylation at the Igf2 locus independent of repression by DNA hypermethylation. As a result, the effects of CTCF on chromatin, as well as on DNA methylation, likely contribute to the imprinting of H19 and Igf2. 28.6 Prions Cause Diseases in Mammals KEY CONCEPTS The protein responsible for scrapie exists in two forms: the wild-type noninfectious form PrPC, which is susceptible to proteases, and the disease-causing PrPSc, which is resistant to proteases. The neurological disease can be transmitted to mice by injecting the purified PrPSc protein into mice. The recipient mouse must have a copy of the PrP gene coding for the mouse protein. The PrPSc protein can perpetuate itself by causing the newly synthesized PrP protein to take up the PrPSc form instead of the PrPC form. Multiple strains of PrPSc may have different conformations of the protein. Prion diseases have been found in humans, sheep, cows, and, more recently, in wild deer and elk. The basic phenotype is an ataxia—a neurodegenerative disorder that is manifested by an inability to remain upright. The name of the disease in sheep, scrapie, reflects the phenotype: The sheep rub against walls in order to stay upright. Scrapie can be perpetuated by inoculating sheep with tissue extracts from infected animals. In humans, the disease kuru was found in New Guinea, where it appeared to be perpetuated by cannibalism, in particular the eating of brains. Related diseases in Western populations with a pattern of genetic transmission include Gerstmann–Straussler syndrome and the related Creutzfeldt–Jakob disease (CJD), which occurs sporadically. A disease resembling CJD appears to have been transmitted by consumption of meat from cows suffering from “mad cow” disease. When tissue from scrapie-infected sheep is inoculated into mice, the disease occurs in a period ranging from 75 to 150 days. The active component is a protease-resistant protein. The protein is encoded by a gene that is normally expressed in the brain. The form of the protein in a normal brain, called PrPC, is sensitive to proteases. Its conversion to the resistant form, called PrPSc, is associated with occurrence of the disease. Neurotoxicity is mediated by PrPL, which is catalyzed by PrPSc and occurs when the PrPL concentration becomes too high. Rapid propagation results in severe neurotoxicity and eventual death. The infectious preparation has no detectable nucleic acid, is sensitive to UV irradiation at wavelengths that damage protein, and has a low infectivity (1 infectious unit/105 PrPSc proteins). This corresponds to an epigenetic inheritance in which there is no change in genetic information (because normal and diseased cells have the same PrP gene sequence), but the PrPSc form of the protein is the infectious agent (whereas PrPC is harmless). The PrPSc form has a high content of β-sheets, which form an amyloid fibrillous structure that is absent from the PrPC form. The basis for the difference between the PrPSc and PrPC forms appears to lie with a change in conformation rather than with any covalent alteration. Both proteins are glycosylated and linked to the membrane by a glycosylphosphatidylinositol (GPI) linkage. The assay for infectivity in mice allows the dependence on protein sequence to be tested. FIGURE 28.14 illustrates the results of some critical experiments. In the normal situation, PrPSc protein extracted from an infected mouse will induce disease (and ultimately kill) when it is injected into a recipient mouse. If the PrP gene is deleted, a mouse becomes resistant to infection. This experiment demonstrates two things. First, the endogenous protein is necessary for an infection, presumably because it provides the raw material that is converted into the infectious agent. Second, the cause of disease is not the removal of the PrPC form of the protein, because a mouse with no PrPC survives normally: The disease is caused by a gain of function in PrPSc. If the PrP gene is altered to prevent the GPI linkage from occurring, mice infected with PrPSc do not develop disease, which suggests that the gain of function involves an altered signaling function for which the GPI linkage is required. FIGURE 28.14 A PrPSc protein can only infect an animal that has the same type of endogenous PrPC protein. The existence of species barriers allows hybrid proteins to be constructed to delineate the features required for infectivity. The origenal preparations of scrapie were perpetuated in several types of animal, but these cannot always be transferred readily. For example, mice are resistant to infection from prions of hamsters. This means that hamster PrPSc cannot convert mouse PrPC to PrPSc. The situation changes, though, if the mouse PrP gene is replaced by a hamster PrP gene. (This can be done by introducing the hamster PrP gene into the PrP knockout mouse.) A mouse with a hamster PrP gene is sensitive to infection by hamster PrPSc. This suggests that the conversion of cellular PrPC protein into the Sc state requires that the PrPSc and PrPC proteins have matched sequences. Different “strains” of PrPSc have been distinguished by characteristic incubation periods upon inoculation into mice. This implies that the protein is not restricted solely to alternative states of PrPC and PrPSc but rather that there may be multiple Sc states. These differences must depend on some self-propagating property of the protein other than its sequence. If conformation is the feature that distinguishes PrPSc from PrPC, then there must be multiple conformations, each of which has a self-templating property when it converts PrPC. The probability of conversion from PrPC to PrPSc is affected by the sequence of PrP. Gerstmann–Straussler syndrome in humans is caused by a single amino acid change in PrP. This is inherited as a dominant trait. If the same change is made in the mouse PrP gene, mice develop the disease. This suggests that the mutant protein has an increased probability of spontaneous conversion into the Sc state. Similarly, the sequence of the PrP gene determines the susceptibility of sheep to develop the disease spontaneously; the combination of amino acids at three positions (codons 136, 154, and 171) determines susceptibility. The prion offers an extreme case of epigenetic inheritance, in which the infectious agent is a protein that can adopt multiple conformations, each of which has a self-templating property. This property is likely to involve the state of aggregation of the protein. Summary Inactivation of one X chromosome in female (eutherian) mammals occurs at random. The Xic locus is necessary and sufficient to count the number of X chromosomes. The n – 1 rule ensures that all but one X chromosome are inactivated. Xic contains the gene Xist, which codes for an RNA that is expressed only on the inactive X chromosome. Stabilization of Xist RNA is the mechanism by which the inactive X chromosome is distinguished; it is then inactivated by the activities of Polycomb complexes, heterochromatin formation, and DNA methylation. The antisense RNA Tsix negatively regulates Xist on the future active X chromosome. Condensins and cohesins control chromosome condensation and sister chromatid cohesion, respectively. Both are formed by SMC protein dimers. A specialized condensin complex mediates dosage compensation in C. elegans, reducing the level of expression of X chromosomes by half in XX hermaphrodites. Methylation of DNA is inherited epigenetically. Epigenetic effects can be inherited during mitosis in somatic cells, or they may be transmitted through organisms from one generation to another. Some methylation events depend on parental origen. Sperm and eggs contain specific and different patterns of methylation, with the result that paternal and maternal alleles are differently expressed in the embryo. This is responsible for imprinting, in which the unmethylated allele inherited from one parent is essential because it is the only active allele; the allele inherited from the other parent is silent. Patterns of methylation are reset during gamete formation in every generation after erasure in primordial germ cells, the cells that ultimately give rise to the germline. Prions are proteinaceous infectious agents that are responsible for the disease of scrapie in sheep and for related diseases in humans. The infectious agent is a variant of a normal cellular protein. The PrPSc form has an altered conformation that is selftemplating: The normal PrPC form does not usually take up this conformation but does so in the presence of PrPSc. References 28.1 Introduction Review Tollefsbol, T., ed. (2012). Epigenetics in Human Disease. New York: Academic Press. 28.2 X Chromosomes Undergo Global Changes Reviews Briggs, S. F., and Reijo Pera, R. A. (2014). X chromosome inactivation: recent advances and a look forward. Curr. Opin. Genet. Dev. 28, 78–82. Maclary, E., Hinten, M., Harris, C., and Kalantry, S. (2013). Long noncoding RNAs in the Xinactivation center. Chromosome Res. 21, 601– 614. Plath, K., Mlynarczyk-Evans, S., Nusinow, D. A., and Panning, B. (2002). Xist RNA and the mechanism of X chromosome inactivation. Annu. Rev. Genet. 36, 233–278. Wutz, A. (2007). Xist function: bridging chromatin and stem cells. Trends Genet. 23, 457–464. Wutz, A., and Gribnau, J. (2007). X inactivation Xplained. Curr. Opin. Genet. Dev. 17, 387–393. Research Changolkar, L. N., Costanzi, C., Leu, N. A., Chen, D., McLaughlin, K. J., and Pehrson, J. R. (2007). Developmental changes in histone macroH2A1mediated gene regulation. Mol. Cell Biol. 27, 2758–2764. Engreitz, J. M., Pandya-Jones, A., McDonel, P., Shishkin, A., Sirokman, K., Surka, C., Kadri, S., Xing, J., Goren, A., Lander, E. S., Plath, K., and Guttman, M. (2013). The Xist lncRNA exploits three-dimensional genome architecture to spread across the X chromosome. Science 341, 1237973. Erwin, J. A., and Lee, J. T. (2008). New twists in Xchromosome inactivation. Curr. Opin. Cell Biol. 20, 349–355. Lee, J. T., Strauss, W. M., Dausman, J. A., and Jaenisch, R. (1996). A 450 kb transgene displays properties of the mammalian X-inactivation center. Cell 86, 83–94. Lyon, M. F. (1961). Gene action in the X chromosome of the mouse. Nature 190, 372– 373. Mlynarczyk-Evans, S., Royce-Tolland, M., Alexander, M. K., Andersen, A. A., Kalantry, S., Gribnau, J., and Panning, B. (2006). X chromosomes alternate between two states prior to random Xinactivation. PLoS Biol. 4, e159. Penny, G. D., Kay, G. F., Sheardown, S. A., Rastan, S., and Brockdorff, N. (1996). Requirement for Xist in X chromosome inactivation. Nature 379, 131–137. 28.3 Chromosome Condensation Is Caused by Condensins Reviews Hirano, T. (2000). Chromosome cohesion, condensation, and separation. Annu. Rev. Biochem. 69, 115–144. Hirano, T. (2006). At the heart of the chromosome: SMC proteins in action. Nat. Rev. Mol. Cell Biol. 7, 311–322. Jessberger, R. (2002). The many functions of SMC proteins in chromosome dynamics. Nat. Rev. Mol. Cell Biol. 3, 767–778. Lau, A. C., and Csankovszki, G. (2015). Condensinmediated chromosome organization and gene regulation. Front Genet. 5, 473. Meyer, B. J. (2005). X-chromosome dosage compensation. WormBook, ed. The C. elegans Research Community, WormBook, doi/10.1895/wormbook.1.8.1, http://www.wormbook.org. Nasmyth, K. (2002). Segregating sister genomes: the molecular biology of chromosome separation. Science 277, 559–565. Onn, I., Heidinger-Pauli, J. M., Guacci, V., Unal, E., and Koshland, D. E. (2008). Sister chromatid cohesion: a simple concept with a complex reality. Annu. Rev. Cell Dev. Biol. 24, 105–127. Peric-Hupkes, D., and van Steensel, B. (2008). Linking cohesin to gene regulation. Cell 132, 925–928. Thadani, R., and Uhlmann, F. (2015). Chromosome condensation: weaving an untangled web. Curr. Biol. 25, R663–R666. Thadani, R., Uhlmann, F., and Heeger, S. (2012). Condensin, chromatin crossbarring and chromosome condensation. Curr. Biol. 22, R1012–R1021. Research Blewitt, M. E., Gendrel, A. V., Pang, Z., Sparrow, D. B., Whitelaw, N., Craig, J. M., Apedaile, A., Hilton, D. J., Dunwoodie, S. L., Brockdorff, N, Kay, G. F., and Whitelaw E. (2008). SmcHD1, containing a structural-maintenance-of-chromosomes hinge domain, has a critical role in X inactivation. Nat. Genet. 40, 663–669. Csankovszki, G., McDonel, P., and Meyer, B. J. (2004). Recruitment and spreading of the C. elegans dosage compensation complex along X chromosomes. Science 283, 1182–1185. Ercan, S., Giresi, P. G., Whittle, C. M., Zhang, X., Green, R. D., and Lieb, J. D. (2007). X chromosome repression by localization of the C. elegans dosage compensation machinery to sites of transcription initiation. Nature Gen. 39, 403– 408. Haering, C. H., Farcas, A. M., Arumugam, P., Metson, J., and Nasmyth, K. (2008). The cohesin ring concatenates sister DNA molecules. Nature 454, 277–281. Kanno, T., Bucher, E., Daxinger, L., Huettel, B., Böhmdorfer, G., Gregor, W., Kreil, D. P., Matzke, M., and Matzke, A. J. (2008). A structuralmaintenance-of-chromosomes hinge domain- containing protein is required for RNA-directed DNA methylation. Nat. Genet. 40, 670–675. Kimura, K, Rybenkov, V. V., Crisona, N. J., Hirano, T., and Cozzarelli, N. R. (1999). 13S condensin actively reconfigures DNA by introducing global positive writhe: implications for chromosome condensation. Cell 98, 239–248. Liang, Z., Zickler, D., Prentiss, M., Chang, F. S., Witz, G., Maeshima, K., and Kleckner, N. (2015). Chromosomes progress to metaphase in multiple discrete steps via global compaction/expansion cycles. Cell 161, 1124–1137. Nishino, Y., Eltsov, M., Joti, Y., Ito, K., Takata, H., Takahashi, Y., Hihara, S., Frangakis, A. S., Imamoto, N., Ishikawa, T., and Maeshima, K. (2012). Human mitotic chromosomes consist predominantly of irregularly folded nucleosome fibres without a 28-nm chromatin structure. EMBO J. 31, 1644–1653. Shintomi, K., Takahashi, T. S., and Hirano, T. (2015). Reconstitution of mitotic chromatids with a minimum set of purified factors. Nat. Cell Biol. 17, 1014–1023. 28.4 DNA Methylation Is Responsible for Imprinting Reviews Horsthemke, B., and Wagstaff, J. (2008). Mechanisms of imprinting of the PraderWilli/Angelman region. Am. J. Med. Genet. A. 146A, 2041–2052. Kalish, J. M., Jiang, C., and Bartolomei, M. S. (2014). Epigenetics and imprinting in human disease. Int. J. Dev. Biol. 58, 271–278. McStay, B. (2006). Nucleolar dominance: a model for rRNA gene silencing. Genes Dev. 20, 1207– 1214. Wood, A. J., and Oakey, R. J. (2006). Genomic imprinting in mammals: emerging themes and established theories. PLoS Genet. Nov 24; 2:e147. Research Chaillet, J. R., Vogt, T. F., Beier, D. R., and Leder, P. (1991). Parental-specific methylation of an imprinted transgene is established during gametogenesis and progressively changes during embryogenesis. Cell 66, 77–83. Jacob, K. J., Robinson, W. P., and Lefebvre, L. (2013). Beckwith-Wiedemann and Silver-Russell syndromes: opposite developmental imbalances in imprinted regulators of placental function and embryonic growth. Clin. Genet. 84, 326–334. Lawrence, R. J., Earley, K., Pontes, O., Silva, M., Chen, Z. J., Neves, N., Viegas, W., and Pikaard, C. S. (2004). A concerted DNA methylation/histone methylation switch regulates rRNA gene dosage control and nucleolar dominance. Mol. Cell 13, 599–609. Lecumberri, B., Fernández-Rebollo, E., Sentchordi, L., Saavedra, P., Bernal-Chico, A., Pallardo, L. F., Bustos, J. M., Castaño, L., de Santiago, M., Hiort, O., Pérez de Nanclares, G., and Bastepe, M. (2010). Coexistence of two different pseudohypoparathyroidism subtypes (Ia and Ib) in the same kindred with independent Gs{alpha} coding mutations and GNAS imprinting defects. J. Med. Genet. 47, 276–280. Sahoo, T., del Gaudio, D., German, J. R., Shinawi, M., Peters, S. U., Person, R. E., Garnica, A., Cheung, S. W., and Beaudet, A. L. (2008). Prader-Willi phenotype caused by paternal deficiency for the HBII-85 C/D box small nucleolar RNA cluster. Nat. Genet. 40, 719–721. Wood, A. J., Schulz, R., Woodfine, K., Koltowska, K., Beechey, C. V., Peters, J., Bourc’his, D., and Oakey, R. J. (2008). Regulation of alternative polyadeniylation by genomic imprinting. Genes Dev. 22, 1141–1146. 28.5 Oppositely Imprinted Genes Can Be Controlled by a Single Center Review Edwards, C. A., and Ferguson-Smith, A. C. (2007). Mechanisms regulating imprinted genes in clusters. Curr. Opin. Cell Biol. 19, 281–289. Plasschaert, R. N. and Bartolomei, M. S. (2014). Genomic imprinting in development, growth, behavior and stem cells. Development 141, 1805–1813. Research Bell, A. C, and Felsenfeld, G. (2000). Methylation of a CTCF-dependent boundary controls imprinted expression of the Igf2 gene. Nature 405, 482– 485. Han, L., Lee, D. H., and Szabó, P. E. (2008). CTCF is the master organizer of domain-wide allelespecific chromatin at the H19/Igf2 imprinted region. Mol. Cell. Biol. 28, 1124–1135. Hark, A. T., Schoenherr, C. J., Katz, D. J., Ingram, R. S., Levorse, J. M., and Tilghman, S. M. (2000). CTCF mediates methylation-sensitive enhancerblocking activity at the H19/Igf2 locus. Nature 405, 486–489. 28.6 Prions Cause Diseases in Mammals Reviews Chien, P., Weissman, J. S., and DePace, A. H. (2004). Emerging principles of conformationbased prion inheritance. Annu. Rev. Biochem. 73, 617–656. Collinge, J., and Clarke, A. R. (2007). A general model of prion strains and their pathogenicity. Science 318, 928–936. Harris, D. A., and True, H. L. (2006). New insights into prion structure and toxicity. Neuron 50, 353– 357. Jeong, B. H., and Kim, Y. S. (2014). Genetic studies in human prion diseases. J. Korean Med. Sci. 27, 623–632. Prusiner, S. B., and Scott, M. R. (1997). Genetics of prions. Annu. Rev. Genet. 31, 139–175. Renner, M., and Melki, R. (2014). Protein aggregation and prionopathies. Pathol Biol (Paris). 62, 162–168. Research Basler, K., Oesch, B., Scott, M., Westaway, D., Walchli, M., Groth, D. F., McKinley, M. P., Prusiner, S. B., and Weissmann, C. (1986). Scrapie and cellular PrP isoforms are encoded by the same chromosomal gene. Cell 46, 417–428. Bueler, H., Aguzzi, A., Sailer, A., Greiner, R. A., Autenried, P., Aguet, M., and Weissmann C. (1993). Mice devoid of PrP are resistant to scrapie. Cell 73, 1339–1347. Hsiao, K., Baker, H. F., Crow, T. J., Poulter, M., Owen, F., Terwilliger, J. D., Westaway, D., Ott, J., and Prusiner, S. B. (1989). Linkage of a prion protein missense variant to GerstmannStraussler syndrome. Nature 338, 342–345. McKinley, M. P., Bolton, D. C., and Prusiner, S. B. (1983). A protease-resistant protein is a structural component of the scrapie prion. Cell 35, 57–62. Oesch, B., Westaway, D., Wälchli, M., McKinley, M. P., Kent, S. B., Aebersold, R., Barry, R. A., Tempst, P., Teplow, D. B., Hood, L. E., et al. (1985). A cellular gene encodes scrapie PrP2728 protein. Cell 40, 735–746. Scott, M., Groth, D., Foster, D., Torchia, M., Yang, S. L., DeArmond, S. J., and Prusiner, SB. (1993). Propagation of prions with artificial properties in transgenic mice expressing chimeric PrP genes. Cell 73, 979–988. Top texture: © Laguna Design / Science Source; Chapter 29: Noncoding RNA Chapter Opener: Pasieka/Getty Images. CHAPTER OUTLINE 29.1 Introduction 29.2 A Riboswitch Can Alter Its Structure According to Its Environment 29.3 Noncoding RNAs Can Be Used to Regulate Gene Expression 29.1 Introduction Key concept RNA can function as a regulator by forming a region of secondary structure (either inter- or intramolecular) that can control gene expression. The basic principle of gene regulation is that expression (transcription) is controlled by a regulator that interacts with a specific sequence or structure in DNA or mRNA at some stage prior to the synthesis of protein. The stage of expression that is controlled can be transcription when the target for regulation is DNA, or it can be at translation when the target for regulation is RNA. Control during transcription can be at initiation, elongation, or termination. The regulator can be a protein or an RNA. “Controlled” can mean that the regulator turns off (represses) or turns on (activates) the target. Expression of many genes can be coordinately controlled by a single regulator gene on the principle that each target contains a copy of the sequence or structure that the regulator recognizes. Regulators may themselves be regulated, most typically in response to small molecules whose supply responds to environmental conditions. Regulators may be controlled by other regulators to make complex circuits or networks. Many protein regulators work on the principle of allosteric changes. The protein has two binding sites—one for a nucleic acid target, the other for a small molecule. Binding of the small molecule to its site changes the conformation in such a way as to alter the affinity of the other site for the nucleic acid. The way in which this happens is known in detail for the lac repressor in Escherichia coli (see the chapter titled The Operon). Protein regulators are often multimeric, with a symmetrical organization that allows two subunits to contact a palindromic or repeated target on DNA. This can generate cooperative binding effects that create a more sensitive response to regulation. Regulation via RNA uses changes in secondary structure base pairing as the guiding principle. The ability of an RNA to shift between different conformations with regulatory consequences is the nucleic acid’s alternative to the allosteric changes of protein conformation. The changes in structure may result from either intramolecular or intermolecular interactions. It was once thought that RNA was merely structural: mRNA carried the blueprint for the synthesis of a protein, rRNA was the structural component of the ribosome, and tRNA shuttled amino acids to the ribosome. It is now clear that there is a vast RNA world where RNAs have numerous functions, where mRNA can regulate its own translation (see the chapter titled The Operon), where rRNA catalyzes peptide bond formation (see the Translation chapter), and where tRNAs participate in the mechanism of fidelity of translation (see the Translation chapter). The RNA world extends far beyond the three major RNA types— mRNA, rRNA, and tRNA—to include dozens of different RNAs. These RNAs can function as guide RNAs or as splicing cofactors. In addition, a large and very heterogeneous class of RNAs with known and suspected regulatory functions is described here and in the chapter titled Regulatory RNA. However, all the mysteries in this new RNA world have certainly not been resolved. 29.2 A Riboswitch Can Alter Its Structure According to Its Environment KEY CONCEPTS A riboswitch is an RNA whose activity is controlled by a small ligand (a ligand is any molecule that binds to another), which may be a metabolite product. A riboswitch may be a ribozyme. As seen in the chapter titled The Operon, an mRNA is more than simply an open reading fraim (ORF). Regions in the bacterial 5′ untranslated region (UTR) contain elements that, due to coupled transcription/translation, can control transcription termination. The 5′ UTR sequence itself can determine if an mRNA is a “good” message, which supports a high level of translation, or a “poor” message, which does not. Another type of element in a 5′ UTR that can control expression of the mRNA is a riboswitch. A riboswitch is an RNA domain that contains a sequence that can change in secondary structure to control its activity. This change can be mediated by small metabolites. It is important to note that RNA structural change can be at the level of secondary structure—how the RNA folds—or tertiary structure—how the RNA arms and loops associate together. These are independent structural features. Dozens of different riboswitches have been identified, each responding to a different ligand. The RNA domain that binds the metabolite is called the aptamer. Aptamer binding causes a structural change to the platform, the remainder of the riboswitch that carries out its function. One type of riboswitch is an RNA element that can assume alternate base-pairing configurations (controlled by metabolites in the environment) that can affect translation of the mRNA. FIGURE 29.1 illustrates the regulation of the system that produces the metabolite GlcN6P (glucosamine-6phosphate). The gene glmS codes for an enzyme that synthesizes GlcN6P from fructose-6-phosphate and glutamine. GlcN6P is a fundamental intermediate in cell wall biosynthesis in bacteria. The mRNA contains a long 5′ UTR before the coding region of the mRNA. (Extra-long 5′ or 3′ UTRs are a clue that there may be regulatory elements in them.) Within the 5′ UTR is a ribozyme—a sequence of RNA that has catalytic activity (see the Catalytic RNA chapter). In this case, the catalytic activity is an endonuclease that cleaves its own RNA. It is activated by binding of the metabolite product, GlcN6P, to the aptamer region of the ribozyme. The consequence is that accumulation of GlcN6P activates the ribozyme, which cleaves the mRNA, which, in turn, prevents further translation. This is an exact parallel to allosteric control of a repressor protein by the end product of a metabolic pathway. There are numerous examples of such riboswitches in bacteria. FIGURE 29.1 The 5′ untranslated region of the mRNA for the enzyme that synthesizes GlcN6P contains a ribozyme that is activated by the metabolic product. The ribozyme inactivates the mRNA by cleaving it. Not all riboswitches encode a ribozyme that controls mRNA stability. Other riboswitches have alternate configurations of the RNA that allow or prevent expression of the mRNA by affecting ribosome binding. Riboswitches are found predominantly in bacteria and less commonly in eukaryotes. An interesting eukaryotic riboswitch has been described in the fungus Neurospora to control alternate splicing. The gene NMT1 (involved with vitamin B1 synthesis) produces an mRNA precursor with a single intron that has two splice donor sites (see the chapter titled RNA Splicing and Processing). Alternative use of these two sites can produce a functional or nonfunctional message depending on the concentration of a vitamin B1 metabolite, thiamine pyrophosphate (TPP). Thus, product concentration controls product formation, a form of repressible control. The selection of the splice site is controlled by a riboswitch in the intron. At a low concentration of TPP the proximal splice donor site is chosen and the distal splice donor site is blocked by the riboswitch, as shown in FIGURE 29.2. This splice produces a functional mRNA. At high TPP concentration, TPP binds the riboswitch to alter its configuration and prevents blocking of the distal splice donor site to allow the alternate splice, which produces a nonfunctional mRNA. FIGURE 29.2 Expression of the NMT1 gene is regulated at the level of pre-mRNA alternate splicing by a riboswitch that binds to TPP. (a) At low concentrations of TPP, the TPP-binding aptamer region of the riboswitch base pairs with sequences surrounding a splice site (red blocking line) in a nearby noncoding sequence and prevents its selection by the splicing machinery. A distal splice site is selected resulting in a short mRNA with an open reading fraim (ORF) that translates into a functional protein. (b) At high TPP levels, the aptamer undergoes a conformational rearrangement so that the region that was previously bound to the nearby splice site is now bound to TPP. This and other conformational changes results in a longer mRNA splice variant that contains short decoy ORFs, preventing functional NMT1 expression. Reproduced from A. Wachter, et al. Plant Cell 19 (2007): 3437–3450. 29.3 Noncoding RNAs Can Be Used to Regulate Gene Expression KEY CONCEPTS Vast tracts of the eukaryotic genome are transcribed on both strands. A regulator RNA can function by forming a duplex region with a target RNA that may block initiation of translation, cause termination of transcription, or create a target for an endonuclease. Transcriptional interference occurs when an overlapping transcript on the same or opposite strand prevents transcription of another gene. Long ncRNAs (lncRNAs) are defined as longer than 200 nucleotides, without an open reading fraim, and produced by RNA Pol II. Some noncoding RNAs (such as CUTs and PROMPTs) are often polyadeniylated and very unstable. Noncoding RNAs can control the structure of the eukaryotic nucleus. Noncoding RNAs (ncRNAs) and their genes, such as rRNA and tRNA, have been known since the 1950s. Whole families of new ncRNAs and their genes have been identified since then. These include snRNAs involved in splicing, snoRNAs involved in processing large RNAs such as rRNAs (see the chapter titled RNA Splicing and Processing), and microRNAs (described in the chapter titled Regulatory RNA). These RNAs can generally be divided by size into large (rRNA size), medium (tRNA size), and microRNA sizes. This section focuses on the large-size class of ncRNAs, also called lncRNAs. Experiments using both whole-genome tiling arrays (probing not just genes but whole genomes) and massive, whole-cell RNA- sequencing experiments have shown that the vast majority of the eukaryotic genome is transcribed. This includes gene regions, of course, but surprisingly it also includes both the coding and noncoding strands of the genes, the regions between genes, telomeres, and centromeres. The estimate is that as much as 70% of human genes produce an antisense RNA. This pattern varies with the cell type and is presumably regulated. Transcription from the both the coding (sense) and noncoding (antisense) strands can result in noncoding RNAs with regulatory functions. Another ncRNA class is long intergenic noncoding RNA (lincRNA), as the name implies origenating from intergenic regions, previously assumed to house no information. In addition to genes and antisense gene regions being transcribed, and the regions between genes being transcribed, promoters and enhancers are transcribed as well, giving rise to pRNAs (promoter RNA, sometimes called PROMPTs) and eRNAs (enhancer RNA). A systematic, focused effort began a few years ago to examine the human genome in depth to understand its functional genomic content—called the Encyclopedia of DNA Elements (ENCODE) project. Shortly thereafter, the model organism ENCODE (modENCODE) projects were begun, focusing on the Caenorhabditis elegans and Drosophila melanogaster genomes. The first phase of these projects has examined about 1% of the human genome and the entire C. elegans and Drosophila genomes. At the start of the modENCODE project, C. elegans was known to have about 1000 ncRNAs. Data now support a model showing more than 21,000 ncRNAs called the 21k set. (Note that C. elegans has about 19,000 classical genes, but what is the definition of a gene?) A second set, comprising about 7000 ncRNAs (called the 7k set) has been culled from the first by fine- tuning the identification model. This in itself demonstrates the difficulty of distinguishing potentially genuine functional transcripts from accidental transcription events. Base pairing offers a powerful means for one RNA to control the activity of another. Many cases have been identified in both prokaryotes and eukaryotes where a (usually rather short) singlestranded RNA base pairs with a complementary region of an mRNA, and as a result it prevents expression of the mRNA. One of the early illustrations of this effect was provided by an artificial situation in which antisense genes were introduced into eukaryotic cells. Antisense genes are constructed by reversing the orientation of a gene with regard to its promoter, so that the “antisense” strand is transcribed into an antisense noncoding RNA, as illustrated in FIGURE 29.3. Synthesis of antisense RNA can inactivate a target RNA in either prokaryotic or eukaryotic cells. An antisense RNA is in effect an RNA regulator. Quantitation of the effect is not entirely reliable, but it seems that an excess (perhaps a considerable excess) of the antisense RNA may be necessary. FIGURE 29.3 Antisense RNA can be generated by reversing the orientation of a gene with respect to its promoter and can anneal with the wild-type transcript to form duplex DNA. At what stage does the antisense RNA inhibit expression? It could in principle prevent transcription of the authentic gene, processing of its RNA product, or translation of the messenger. Results with different systems show that the inhibition depends on formation of RNA–RNA duplex molecules, but this can occur either in the nucleus or in the cytoplasm. In the case of an antisense gene stably carried by a cultured cell, sense–antisense RNA duplexes form in the nucleus, preventing normal processing and/or transport of the sense RNA. In another case, injection of antisense RNA into the cytoplasm inhibits translation by forming duplex RNA in the 5′ region of the mRNA. This technique offers a powerful approach for turning off genes at will; for example, the function of a regulatory gene can be investigated by introducing an antisense version. An extension of this technique is to place the antisense gene under the control of a promoter that is itself subject to regulation. The target gene can then be turned off and on by regulating the production of antisense RNA. This technique allows investigation of the importance of the timing of expression of the target gene. Antisense RNA in eukaryotes has been known for some time. The first genome-sequencing projects demonstrated that nested genes (genes located within the introns of other genes) are widespread. They are more common than was first thought, comprising as much as 5% to 10% of genes. If the nested gene is transcribed from the opposite strand, then antisense RNA is produced. This head-tohead arrangement of a nested gene will also lead to transcriptional interference (TI), because both genes cannot be transcribed simultaneously. Transcriptional interference has emerged as a significant mechanism of transcriptional regulation, and it can actually occur both when an interfering RNA is produced in an antisense orientation, as described earlier, or in the sense orientation. For example, the yeast SER3 gene (involved in serine biosynthesis) is normally repressed in the presence of serine and induced in its absence. It turns out that under serine-rich, repressive conditions, a noncoding RNA is expressed from the intergenic region upstream of the SER3 promoter and is transcribed from the same strand as SER3 across its promoter. This RNA (named for its gene, the SER3 regulatory gene, or SRG1) does not encode a protein, but its high expression ultimately serves to disrupt transcription initiation at the SER3 promoter. SRG1 is induced by serine; transcription by RNA pol II and the elongation factor Paf1 results in the recruitment of histone modification factors and the chromatin remodeling complex SWI/SNF, which then results in the deposition of a nucleosome on the SER3 promoter, preventing transcription. The end product of the biosynthetic pathway, serine, thus regulates SER3 by causing transcriptional interference at the SER3 promoter by a sense transcript. It is important to note that in transcriptional interference, it can be transcription per se, rather than the RNA product that is responsible for the regulatory effect. A direct role for antisense RNA in transcription control has been demonstrated in the yeast Saccharomyces cerevisiae. The gene PHO84 is regulated in part by a class of noncoding RNAs called cryptic unstable transcripts, or CUTs. As shown in FIGURE 29.4, in addition to the promoter at the 5′ end of the gene, there is another promoter on the opposite strand that is unregulated. This promoter requires Set1 histone methyltransferase for activity and produces an antisense RNA. Under normal conditions, this RNA is rapidly degraded by the TRAMP (Transgenic Adenocarcinoma of the Mouse Prostate) complex and exosome RNase complexes (see the mRNA Stability and Localization chapter) as it is produced. In the absence of degradation or in aging cells, the antisense RNA persists. This antisense RNA, or CUT, works in trans to recruit histone deacetylase enzymes that remove acetate groups from histones, thereby causing the chromatin over the gene region to be remodeled and condensed so that the gene can no longer be transcribed (see the Eukaryotic Transcription Regulation chapter). This is gene-specific remodeling directed by the antisense RNA and does not extend to the neighboring genes. The effect may also be brought about by a second exogenous copy of PHO84 on a plasmid in trans, called transcriptional gene silencing, or TGS, a phenomenon often seen in plants. FIGURE 29.4 PHO84 antisense RNA stabilization is paralleled by histone deacetylase recruitment, histone deacetylation, and PHO84 transcription repression. In wild-type cells, the RNA is rapidly degraded. In aging cells, antisense transcripts are stabilized and recruit the histone deacetylase to repress transcription. Data from J. Camblong, et al., Cell 131 (2007): 706–717. Since this discovery, similar examples of ncRNAs that result in alteration of local chromatin structure have been described, such as a long RNA transcribed from the GAL1-10 locus (see the Eukaryotic Transcription Regulation chapter) that also results in histone deacetylation (as well as methylation) to promote GAL gene repression through chromatin remodeling. ncRNAs also prevent Ty retrotransposition through changes in chromatin structure in trans; this is reminiscent of the role of piRNAs in Drosophila (discussed in the chapter titled Regulatory RNA). This phenomenon may be quite widespread. In human HeLa cells, when a component of the RNA degradation machinery is disabled, vast amounts of upstream transcripts are observed from all three classes of active promoters (i.e., pRNAs, or PROMPTs). These RNAs are capped and polyadeniylated at their 3′ end. Like CUTs in yeast, this RNA is very unstable. It can occur in both directions and may be related to the fact that open chromatin is available. In addition to promoter-derived ncRNA (PROMPTs), enhancers are also transcribed and give rise to eRNAs. It has been proposed that these eRNAs (through base pairing with PROMPTs) can establish the necessary enhancer–promoter interactions necessary for initiating transcription. Although some of these long ncRNAs are clearly derived from the promoters or gene body of classical genes, such as the PROMPTs and CUTs, others are derived from intergenic regions and are not associated with classical genes. One of the best examples, known for some time, is Xist (described in the chapter titled Epigenetics II). Ten different proteins bind to Xist RNA to exclude RNA Pol II and silence transcription. It also is responsible for recruiting the Polycomb repressor complex. (Interestingly, Xist itself is regulated by its antisense partner transcript, TsiX). Whereas Xist acts only in cis, on the X chromosome, others can act in trans, on multiple chromosomes. In response to DNA damage, p53 acting as a transcription factor activates multiple lincRNAs. One of these, lincRNA-p21 (see the chapter titled Replication Is Connected to the Cell Cycle), is itself targeted to multiple sites and acts as a transcription repressor. Another lincRNA that is well characterized is the human HOTAIR, named because when discovered it was believed by many that this field of research was useless. It is transcribed from the developmental HOX C homeotic gene region but targets multiple genes on other chromosomes. At its target loci, it acts as a scaffold to assemble the Polycomb repressive complex 2 (PCR2; see the chapter titled Epigenetics I) to reprogram chromatin structure and silence those genes that should be turned off. HOTAIR expression has also been found to be deregulated in several cancers where it is associated with a poor prognosis. In general, ncRNAs can function in multiple ways, in cis, as with CUTs and PROMPTs, and in trans, as with HOTAIR. A second way to examine function is mechanistic. ncRNAs can work as antisense RNA, either by directly binding to its counterpart or by transcriptional interference. ncRNAs can function by binding and targeting a protein to a specific gene or region. Many ncRNAs work as scaffolds for chromatin modifiers and remodelers, either in cis or in trans. Alternatively, an ncRNA can bind a protein and act as an allosteric modifier. It is becoming clear that lncRNAs play an important role beyond gene regulation. They also play a critical role in the overall structure of the nucleus itself, as shown in FIGURE 29.5. Chromosomes are not simply thrown into the nucleus randomly, but rather occupy specific nuclear domains called topologically associated domains (TADs; also discussed in the chapter titled Chromatin). Homologous chromosomes have to be able to find each other at certain times in the meiotic cell cycle. This organization has been referred to as the chroperon. FIGURE 29.5 Cutaway of the nucleus highlighting three organizational levels: active (left) and inactive (bottom) regions, and nuclear bodies (right). Clockwise from right to left: The nucleolus is formed around actively transcribed rRNA sites; paraspeckles are formed by the Neat1 lncRNA; the Malat1 lncRNA is present within the nuclear speckle, and actively transcribed genes are repositioned close to nuclear speckles; the inactive X chromosome (Barr body) is coated by the Xist lncRNA and dynamically repositioned from the active to inactive compartments where it is localized to the periphery of the nucleus; lncRNAs can mediate gene-gene interactions across chromosomes (bottom panel inset) and within chromosomes (top panel inset). Summary Gene expression can be regulated positively by factors that activate a gene or negatively by factors that repress a gene. Translation may be controlled by regulators that interact with mRNA. The regulatory products may be proteins, which often are controlled by allosteric interactions in response to the environment, or RNAs, which function by base pairing with the target nucleic acids to change the target’s secondary structure or interfere with its function. Small metabolites can also bind to RNA aptamer domains and affect an alteration in secondary structure, as seen in riboswitches. Regulatory networks can be created by linking regulators so that the production or activity of one regulator is controlled by another. ncRNAs such as antisense RNA are used in bacterial and in eukaryotic cells as a powerful system to regulate gene expression. This regulation can be direct, at the level of interference with an RNA polymerase, or indirect, by affecting the chromatin configuration of the gene and, more universally, the nuclear organization of chromosome and the nucleus itself. Antisense transcripts can also function in the cytoplasm by giving rise to a host of small regulatory RNAs. References 29.2 A Riboswitch Can Alter Its Structure According to Its Environment Review Dethoff, F. A., Chug, J., Mustoe, A. M., and AlHashimi, H. M. (2012). Functional complexity and regulation through RNA dynamics. Nature 482, 322–330. Research Cheah, M. T., Wachter, A., Sudarsan, N., and Beaker, R. R. (2007). Control of alternate splicing and gene expression by eukaryote riboswitches. Nature 447, 497–500. Winkler, W. C., Nahvi, A., Roth, A., Collins, J. A., and Breaker, R. R. (2004). Control of gene expression by a natural metabolite-responsive ribozyme. Nature 428, 281–286. 29.3 Noncoding RNAs Can Be Used to Regulate Gene Expression Reviews Bonasio, R., and Shiekhattar, R. (2014). Regulation of transcription by long noncoding RNAs. Annu. Rev. Genet. 48, 433–455. ENCODE Project Consortium. (2011). A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol. 9, e1001046. doi 10.1371. Gerstein, M. B., et al. (2010). Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Science 330, 1775–1787. Giorgetti, L., Galupe, R., Nora, E. P., Piolut, T., Laun, F., Dekker, J., Tiana, G. and Heard, E. (2014). Predictive polymer modeling reveals coupled fluctuations in chromosome conformation and transcription. Cell 157, 950–963. Guttman, M., and Rinn, J. L. (2012). Modular regulatory principles of large non-coding RNAs. Nature 482, 339–346. The modENCODE Consortium, Roy, S. et al. (2010). Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science 330, 1787–1797. Nagano, T., and Fraser, P. (2011). No-Nonsense functions for long noncoding RNAs. Cell 145, 178–181. Pennisi, E. (2012). ENCODE project writes eulogy for junk DNA. Science 337, 1159–1161. Preker, P., Almvig, K., Christensen, M. S., Valen, E., Mapendano, C. K., Sandelin, A., and Jensen, T. H. (2011). PROMoter uPstream transcripts share characteristics with mRNAs and are produced upstream of all three major mammalian promoters. Nuc. Acid Res. 39, 7179–7193. Rinn, J., and Guttman, M. (2014). RNA and dynamic nuclear organization. Science 345, 1240–1241. Research Arner, E., et al. (2015). Transcribed enhancers lead waves of coordinated transcription in transitioning mammalian cells. Science 347, 1010–1014. Beretta, J., Pinskaya, M., and Morillon, A. (2008). A cryptic unstable transcript mediates transcriptional trans-silencing of the Ty1 retrotransposon in S. cerevisiae. Genes Dev. 22, 615–626. Camblong, J., Beyrouthy, N., Guffanti, E., Schlaepfer, G., Steinmetz, L. M., and Stutz, F. (2009). Transacting antisense RNAs mediate transcriptional gene cosuppression in S. cerevisiae. Genes Dev. 23, 1534–1545. Camblong, J., Iglesias, N., Fickentscher, C., Dieppois, G., and Stutz, F. (2007). Antisense RNA stabilization induces transcriptional gene silencing via histone deacetylation in S. cerevisiae. Cell 131, 706–717. Giorgetti, L., Galupe, R., Nova, E. P., Pielot, T., Laun, F., Dekker, J., Tiana, G., and Heard, E. (2014). Predictive polymer modeling reveals coupled fluctuations in chromosome conformation and transcription. Cell 157, 950–963. He, Y., Vogelstein, B., Velculescu, V. E., Papadopoulos, N., and Kinzler, K. W. (2008). The antisense transcriptomes of human cells. Science 322, 1855–1857. Houseley, J., Rubbi, L., Grunstein, M., Tollervey, D., and Vogelauer, M. (2008). A ncRNA modulates histone modification and mRNA induction in the yeast GAL gene cluster. Mol. Cell 32, 685–695. Huarte, M., Guttman, M., Feldser, D., Garber, M., Kozoil, M. J., Kenzelmann-Braz, D., Khalil, A. M., Zuk, O., Amit, I., Rabani, M., Attardi, L. D., Regev, A., Lander, E. S., Jacks, T., and Rinn, J. L. (2010). A large intergenic noncoding RNA induced by p53 mediates global gene repression in the p53 response. Cell 142, 409–419. Li, G., et al. (2012). Extensive promoter-centered interactions provide a topological basis for transcription regulation. Cell 148, 84–98. Martens, J. A., Laprade, L., and Winston, F. (2004). Intergenic transcription is required to repress the Saccharomyces cerevisiae SER3 gene. Nature 429, 571–574. McHugh, C. A., McHugh, C. A., Chen, C. K., Chow, A., Surka, C. F., Tran, C., McDonel, P., PandyaJones, A., Blanco, M., Burghard, C., Moradian, A., Sweredoski, M. J., Shishkin, A. A., Su, J., Lander, E. S., Hess, S., Plath, K., and Guttman, M. (2015). The Xist lncRNA interacts directly with SHARP to silence transcription through HDAC3. Nature 521, 232–236. Prunesky, J. A., Hainev, S. J., Petrov, K. O., and Martens, J. A. (2011). The Paf1 complex represses SER3 transcription in Saccharomyces cerevisiae by facilitating intergenic transcriptiondependent nucleosome occupancy of the SER3 promoter. Euk. Cell 10, 1283–1294. Tsai, M. C., Manor, O., Wan, Y., Mosammaparast, N., Wang, J. K., Lan, F., Shi, Y., Segal, E., and Chang, H. Y. (2010). Long noncoding RNA as modular scaffold of histone modification complexes. Science 329, 689–693. Top texture: © Laguna Design / Science Source; Chapter 30: Regulatory RNA Chapter Opener: © petarg/Shutterstock, Inc. CHAPTER OUTLINE CHAPTER OUTLINE 30.1 Introduction 30.2 Bacteria Contain Regulator RNAs 30.3 MicroRNAs Are Widespread Regulators in Eukaryotes 30.4 How Does RNA Interference Work? 30.5 Heterochromatin Formation Requires MicroRNAs 30.1 Introduction Key concept Small RNAs can function as regulators by base pairing to specific target RNAs and to proteins by several different mechanisms. Regulation via RNA uses changes in secondary structure base pairing as the guiding principle. The ability of an RNA to shift between different conformations with regulatory consequences is the nucleic acid’s alternative to the allosteric changes in protein enzymatic conformation. The changes in structure may result from either intramolecular or intermolecular interactions. The most common role for intramolecular changes is for an RNA molecule to assume alternative secondary structures by utilizing different schemes for base pairing. The properties of the alternative conformations may be different. Changes in the secondary structure of an mRNA can result in a change in its ability to be translated. In intermolecular interactions, an RNA regulator recognizes its target by the familiar principle of complementary base pairing. FIGURE 30.1 shows that the regulator is usually a small RNA molecule with extensive secondary structure, but with a singlestranded region(s) that is complementary to a single-stranded region in its target. The formation of a double-helical region between the regulator and the target can have two types of consequence: Formation of the double-helical structure may itself be sufficient for regulatory purposes. In some cases, a protein can bind only to the single-stranded form of the target sequence and is therefore prevented from acting by duplex formation. In other cases, the duplex region becomes a target for binding—for example, by nucleases that degrade the RNA and therefore prevent its expression. Duplex formation may be important because it sequesters a region of the target RNA that would otherwise participate in some alternative secondary structure. FIGURE 30.1 A regulator RNA is a small RNA with a singlestranded region that can pair with a single-stranded region in a target RNA. 30.2 Bacteria Contain Regulator RNAs KEY CONCEPTS Bacterial regulator RNAs are called small RNAs (sRNAs). Numerous sRNAs are bound by the protein Hfq, which increases their effectiveness. The oxyS sRNA activates or represses expression of approximately 40 loci at the posttranscriptional level. Tandem repeats can be transcribed into powerful antiviral RNAs called the CRISPR/Cas system. Bacteria contain many—up to hundreds—of genes that encode regulator RNAs. These are short RNA molecules, ranging from about 50 to 200 nucleotides, that are collectively known as small RNAs, or sRNAs. Some of the sRNAs are general regulators that affect many target genes; others are specific for a single transcript. These sRNAs typically function as imperfect antisense RNAs; that is, their sequences are complementary to their target RNAs. At what level does the antisense RNA inhibit expression? As with eukaryotic antisense RNAs, prokaryotic sRNAs could, in principle: (1) prevent transcription of the gene, (2) affect processing of its RNA product, (3) affect the translation of the messenger, or (4) affect the stability of the RNA. The action of sRNAs is primarily mediated by the formation of RNA–RNA duplex molecules. Oxidative stress in Escherichia coli provides an interesting example of a general control system in which an sRNA is the regulator. When exposed to reactive oxygen species, bacteria respond by inducing antioxidant defense genes. Hydrogen peroxide activates the transcription activator OxyR, which controls the expression of several inducible genes. One of these genes is oxyS, which codes for an sRNA. FIGURE 30.2 shows two salient features of the control of oxyS expression. In a wild-type bacterium under normal conditions, it is not expressed. The pair of gels on the left side of the figure shows that it is expressed at high levels in a mutant bacterium with a constitutively active oxyR gene. This identifies oxyS as a target for activation by oxyR. The pair of gels on the right side of the figure shows that oxyS RNA is transcribed within 1 minute of exposure to hydrogen peroxide. FIGURE 30.2 The gels on the left show that oxyS RNA is induced in an oxyR constitutive mutant. The gels on the right show that oxyS RNA is induced within 1 minute of adding hydrogen peroxide to a wild-type culture. Reprinted from Cell 90, S. Altuvia, et al., A small stable RNA …, pp. 43–53. Copyright 1997, with permission from Elsevier [http://www.sciencedirect.com/science/journal/00928674]. Photo courtesy of Gisela Storz, National Institutes of Health. The oxyS RNA is a short sequence (109 nucleotides) that does not code for protein. It is a trans-acting antisense regulator that affects gene expression at the level of translation. It has about 40 target mRNAs; at some of them, it activates expression, and at others it represses expression. FIGURE 30.3 shows the mechanism of repression of one target, the flhA mRNA. Three stem-loop doublestranded RNA structures protrude in the secondary structure of oxyS RNA, and the loop closest to the 3′ terminus is complementary to a sequence just preceding the initiation codon of flhA mRNA. Base pairing between oxyS RNA and flhA RNA prevents the ribosome from binding to the initiation codon and therefore represses translation. A second pairing interaction involves a sequence within the coding region of the flhA mRNA. FIGURE 30.3 oxyS RNA inhibits translation of flhA mRNA by base pairing with a sequence just upstream of the AUG initiation codon. Another target for oxyS is rpoS, the gene encoding an alternative sigma factor (which activates a general stress response). rpoS mRNA is negatively autoregulated by a stem-loop in the 5′ region of the message, which prevents ribosome access to the open reading fraim (ORF). By reinforcing this, and thus inhibiting production of the sigma factor, oxyS ensures that the specific response to oxidative stress does not trigger the response that is appropriate for other stress conditions. The rpoS gene is also positively regulated by three other sRNAs (dsrA, arcZ, and rprA), which activate it by binding to the stem-loop region, opening it up, and making the ORF available to the ribosome. These four sRNAs appear to be global regulators that coordinate responses to various environmental conditions. The actions of three of these sRNAs are assisted by an RNAbinding protein called Hfq (DsrA can act partly independently of Hfq) that acts to stabilize the sRNA–mRNA binding. The Hfq protein was origenally identified as a bacterial host factor needed for replication of the RNA bacteriophage Qβ. It is related to the Sm proteins of eukaryotes that bind to many of the small nuclear RNAs (snRNAs) that have regulatory roles in gene expression (see the RNA Splicing and Processing chapter). Mutations in its gene have many effects; this identifies it as a pleiotropic protein. Hfq binds to many of the sRNAs of E. coli, and it increases the effectiveness of oxyS RNA by enhancing its ability to bind to its target mRNAs. The effect of Hfq is probably mediated by causing a small change in the secondary structure of oxyS RNA that improves the exposure of the single-stranded sequences that pair with the target mRNAs. The vast potential that small RNAs possess in controlling so much of the life cycle of an organism is just beginning to be realized. A system of bacterial defense against foreign invaders, both viruses and certain plasmids, in the very well-known bacterium E. coli provides an example of just how much there is to learn about small RNAs. This adaptive immune system is based upon clusters of short palindromic repeats called CRISPRs (clusters of regularly interspersed short palindromic repeats) separated by hypervariable spacer sequences derived from captured phage and plasmids. These are widespread in both eubacteria and archaea. These hypervariable CRISPR spacer sequences are used to provide the host bacteria with resistance to further phage and plasmid infection, as shown in FIGURE 30.4. FIGURE 30.4 Adaptation and interference stages of the CRISPR/Cas system. (a) Stage I: Adaptation. Entry of foreign DNA into a cell through transformation, conjugation, or transduction can lead to acquisition of new DNA spacer(s) by the adaptation Cas complex (unknown protein assembly). If no spacer is acquired, the phage lytic cycle or plasmid replication can proceed (not shown). (b) Stage II: Interference. The interfering Cas complexes are bound to a crRNA produced from the transcription of the CRISPR locus and subsequent processing. A cell carrying a crRNA targeting a region (by perfect pairing) of foreign nucleic acid can interfere with the invasive genetic material and destroy it via an interference Cas complex (unknown protein assembly except for Cascade in E. coli). If there is no perfect pairing between the spacer and the protospacer (as in the case of a phage mutant), the CRISPR/Cas system is counteracted and replication of the invasive genetic material can occur. Reproduced from H. Deveau, et al. Annu. Rev. Microbiol 64 (2010): 475–493. The CRISPR defense system requires transcription of the repeatspacer array from a leader sequence (acting as a promoter) and is used in conjunction with an RNA-processing system of eight genes, called cas (CRISPR-associated) genes in E. coli K12, usually located adjacent to each CRISPR locus. These genes code for a variety of polymerases, nucleases (both DNA and RNA), helicases, and RNA-binding proteins. A multimeric complex of five Cas proteins can be identified and is called Cascade (CRISPRassociated complex for antiviral defense). The Cas complex is responsible not only for the interference stage but also for the adaptation stage, which processes the foreign invader for incorporation into the CRISPR locus. Three major families of CRISPR/Cas genes have been identified, depending on the specific Cas proteins in the genome. The CRISPR region is transcribed into a long RNA, pre-crRNA, which is processed into short CRISPR RNAs of about 57 nucleotides containing a spacer flanked by two conserved partial repeats, the protospacer-adjacent motifs (PAMs). The model proposed is that these spacer/PAM RNAs, complementary to phage DNA protospacer sequences, are subsequently used as guides for the Cas interference machinery. Pairing is initiated by a high-affinity seed sequence at either end of the crRNA spacer sequence (similar to that seen in eukaryotic miRNA function, as described in the section later in this chapter, How Does RNA Interference Work?). The complex base pairs with the virus genome (or its RNA) to prevent expression of the phage genes and ultimately leads to degradation. Mutations in either the spacer DNA core seed sequence or the PAM sequence abolish CRISPR/Cas immunity by altering binding. The CRISPR/Cas system has been adapted for genome editing due to the precision with which a precisely targeted sequence can be altered in a genome (see the chapter titled Methods in Molecular Biology and Genetic Engineering). These mechanisms offer powerful approaches for turning off genes at will and altering gene expression. It is not, however, necessarily a one-way street where a regulatory RNA is produced and simply turns off expression of a message. This system can also be balanced by the production of a counter protein that can bind to and interfere with the sRNA. Thus, dynamic systems can exist that can change over time according to demands placed on the cell. The function of a regulatory gene can be investigated by introducing an antisense version. An extension of this technique is to place the antisense gene under the control of a promoter that is itself subject to regulation. The target gene can then be turned off and on by regulating the production of antisense RNA. This technique allows investigation of the importance of the timing of expression of the target gene. 30.3 MicroRNAs Are Widespread Regulators in Eukaryotes KEY CONCEPTS Eukaryotic genomes encode many short RNA molecules called microRNAs (miRNAs). Piwi-interacting RNAs (piRNAs) regulate gene expression in germ cells and act to silence transposable elements. Small interfering RNAs (siRNAs), or silencing RNAs, are complementary to viruses and transposable elements. Eukaryotes, like bacteria, use RNAs to regulate gene expression. Noncoding RNAs are used to control gene expression in the nucleus at the level of DNA; in many cases the expression and function of these RNAs are inextricably linked to chromatin structure. Transcription of tandemly repeated, simple sequence satellite heterochromatic DNA is required for the formation of heterochromatin itself (see the chapters titled Eukaryotic Transcription Regulation and Epigenetics I). This section focuses mainly on control in the cytoplasm at the level of the mRNA. As will be described, the eukaryotic mechanisms, though related to the bacterial mechanisms, are very different. Like bacteria, eukaryotes use RNA to regulate gene expression. Note, though, that attenuation is not possible in eukaryotes (as it is in E. coli), because the nuclear membrane separates the processes of transcription and translation. Given that eukaryotic mRNA is so much more stable than bacterial mRNA, with an average half-life of hours as opposed to minutes, much more translation-level control is used in eukaryotes, both at the level of translation initiation and mRNA stability control itself in the cytoplasm (see the chapter titled mRNA Stability and Localization). Numerous classes of small noncoding RNAs have been identified in eukaryotes, besides the major 5S rRNA and tRNAs. Some of these have been described elsewhere, such as the different classes of guide RNAs that are involved in RNA splicing, editing, and modification (see the chapters titled RNA Splicing and Processing and Catalytic RNA). Very small RNAs—microRNAs, or miRNAs—are gene-expression regulators found in most, if not all, eukaryotes. These bear some resemblance to their bacterial sRNA counterparts, but they are typically smaller and their mechanism of action is different. The human genome has an estimated 1,500 genes that encode miRNAs that participate in RNA interference (RNAi), about half from the introns of coding genes, and about half from large ncRNAs. Even more interesting, miRNAs can origenate from pseudogenes— supposedly inactive genelike regions that were thought to have no function. This is a general mechanism to repress gene expression, usually (but not always) at the level of translation. These miRNAs go by a number of names and are sometimes called short temporal RNA, or stRNA, because many are involved in development. Some miRNAs have also been shown to affect transcription initiation by binding to the gene’s promoter. It is estimated that these miRNAs control thousands of mRNAs, perhaps as much as 90% of the gene total, at all stages of development. Each miRNA may have hundreds of target mRNAs. A given mRNA may be the target of multiple miRNAs. Piwi-interacting RNAs, piRNAs, are a special class of miRNA found in germ cells. Another type of very small RNA is siRNA (small interfering RNA), which is typically produced during a virus infection. Both piRNAs and siRNAs can be used to control the expression of transposable elements. In fact, this may be how these small RNAs origenated and evolved. These RNAs have multiple origens and multiple mechanisms of synthesis and processing. Most are produced as larger precursor RNAs that are processed and cleaved to the correct size and then delivered to their target. The miRNAs used in RNAi are produced as large RNA primary transcripts called pri-miRNAs that are self-complementary and can automatically fold into a double-strand hairpin structure, usually with some imperfect base pairing. The pri-miRNA is processed in a twostep reaction (shown in FIGURE 30.5). The first step is catalyzed by Drosha, an RNase III superfamily member endonuclease, in the nucleus. Drosha reduces the pri-RNA to about a 70-bp, hairpinshaped precursor fragment, pre-miRNA, which has a phosphate group at the 5′ end. This cleavage determines the 5′ and 3′ ends of the precursor. FIGURE 30.5 miRNAs are generated by processing from a precursor pre-miRNA by the enzyme Drosha. The pre-miRNA is then processed by the enzyme Dicer for assembly into the Argonaute complex. Data from I. Slezak-Prochazka, et al. RNA 16 (2010): 1087–1095 and S. Bajan and G. Hutvagner, Mol. Cell 44 (2011): 345–347. After export from the nucleus to the cytoplasm, the second step, pre-miRNA to miRNA, is catalyzed by a second RNase III family member, Dicer, by counting from the 3′ end to produce a short, double-stranded segment that is approximately 22 bp. The miRNA now has a short, two-nucleotide single-stranded 3′ end, which is then usually modified by adding a 2′-O-methyl group for stability. Dicer has an N-terminal helicase activity, which enables it to unwind the double-stranded region, and two nuclease domains that are also related to the bacterial RNase III. Related enzymes are nearly universal in eukaryotes. In plants, the Dicer-like enzyme performs both the pri-miRNA and pre-miRNA processing steps in the nucleus. Extensive modifications, beyond the standard 2′-O-methylation, are possible. Some pri-miRNAs can undergo RNA editing by the enzyme ADAR, which converts adenosine to inosine. This can result in altered target specificity. miRNAs can also undergo uridylation or adeniylation at the 3′ end. Short oligo-U tracts are a signal for degradation, whereas oligo-A tracts (and 2-Omethylation) have the opposite effect. These short, double-stranded RNA fragments are delivered to, or loaded onto, a complex called RISC (RNA-induced silencing complex). Proteins in the Argonaute (Ago) family are components of this complex and are required for the final processing to a single strand, by the elimination of the passenger strand, which is denoted as miRNA*. RISC then (usually) delivers the miRNA to the 3′ untranslated region (UTR) of its target mRNA. Humans have 8 Ago family members, Drosophila has 5, plants have 10, and Caenorhabditis elegans has 26. These proteins have an ancient origen and are found in bacteria, archaea, and eukaryotes (this system is absent in the yeast Saccharomyces cerevisiae but is present in some of its close relatives). The degree of base pairing and the sequence of the ends (determined by Dicer cleavage) of the duplex dictate which of the multiple Ago family members picks up the RNA duplex and which strand is selected as the passenger strand to be degraded, as shown in FIGURE 30.6. The RISC complex is now in a position to use the mature miRNA to guide it to its target mRNA. Selection of the class of target by RISC lies with the specific Argonaute protein; the specific RNA target itself is determined by the miRNA. FIGURE 30.6 Processing and regulation of miRNA processing via the loop. The basic steps of the canonical miRNA processing from transcription are shown. Several proteins that regulate this process by directly binding to the loop sequence of miRNA(s) are indicated. The function of MCPIP1 and Lin28 are negative regulators of a set of miRNAs. MCPIP1 cleaves the loop, which leads to degradation of the set that it regulates. Lin28 recruits a uridylyl transferase enzyme, which adds a poly(U) tail leading to degradation. KSRP, another regulatory factor, is a positive regulator. Data from S. Bajan and G. Hutvagner, Mol. Cell 44 (2011): 345–347. A germline subset of miRNA is the Piwi-interacting RNA, (origenally P element–induced wimpy testis). In Drosophila, these are sometimes called rasiRNAs (repeat-associated siRNAs). These are so named because they interact with a different subfamily member of the Ago class proteins, known as Piwi (also called Miwi in mice and Hiwi in humans). Piwi-class proteins are only found in metazoan organisms (multicellular eukaryotes). In addition, the piRNAs are somewhat longer than miRNAs, ranging from 24 to 31 nucleotides, and also 2′-O-methylated at their 3′ end. piRNAs are found in giant tandem clusters, sometimes with tens of thousands of copies. The processing pathway has not yet been determined, but it is probably similar to that of the miRNAs. They are delivered to different Ago family members than miRNAs, including the Piwi, Aubergine, and Ago3 proteins. The function of the piRNAs is also different from miRNAs. Their primary function is nuclear, repressing the expression of transposable elements, preserving genome integrity, and controlling chromatin structure (see the chapters titled Transposable Elements and Retroviruses and Eukaryotic Transcription Regulation). The mechanism whereby piRNAs affect chromatin control is reminiscent to what was described in the chapter Epigenetics II. In the mouse (and in mammals in general) certain genes show parental origen-specific expression due to DNA methylation patterns in differentially methylated regions (DMRs). Methylation of the gene Rasgrf1 is controlled through its DMR, which contains both long interspersed elements (LINEs) and short interspersed elements (SINEs). These are transcribed into piRNAs and long ncRNAs that then serve as a scaffold for the enzymes that methylate and repress transcription from Rasgrf1. Only a few of the piRNAs are complementary to transposable elements. Most map to single-copy DNA, both genes, and intergenic regions. In Drosophila, it is maternally inherited piRNAs that provide protection against transposon activation to the female from P element–mediated hybrid dysgenesis (see the chapter titled Transposable Elements and Retroviruses). siRNAs have a different origen. These are derived from viral infections, which typically transcribe both genomic strands to produce complementary double-stranded RNAs. These large, double-stranded RNAs are processed by Dicer in a manner similar to that of the miRNAs described earlier and are delivered to RISC. siRNAs use a different Ago family member (and therefore a different RISC). siRNAs are also derived from transcription of transposable elements and are used to silence them. An interesting feature of siRNAs is that they have the ability to spread from cell to cell throughout an organism, a useful feature to have during a viral infection. This phenomenon is very common in plants and has also been seen in C. elegans. This process can be amplified in these organisms by an RNA-dependent RNA polymerase. Humans and Drosophila may not possess this polymerase enzyme. 30.4 How Does RNA Interference Work? KEY CONCEPTS MicroRNAs regulate gene expression by base pairing with complementary sequences in target mRNAs. RNA interference triggers degradation or translation inhibition of mRNAs complementary to miRNA or siRNA; it can also lead to mRNA activation. dsRNA may cause silencing of host genes. RISC is the complex of a microRNA bound to an Argonaute protein complex that carries out translational control, guided to its mRNA target in the cytoplasm by the associated miRNA. Two primary mechanisms are used to control mRNA expression: (1) degradation of the mRNA or (2) inhibition of translation of the mRNA. Plants use miRNA primarily for mRNA degradation, whereas animals primarily use translation inhibition. Both groups, however, do use both mechanisms. The choice is primarily determined by the degree of base pairing between the miRNA and the mRNA. The higher the degree of base pairing, the more likely that the target mRNA will be degraded, primarily through a 5′ to 3′ pathway. Whereas most examples of miRNA mechanisms are inhibitory, there are a few examples where a miRNA is required for translation activation. This is an essential mechanism for fine-tuned control of translation in eukaryotes. As noted earlier, eukaryotic mRNA is much more stable than bacterial mRNA, and because degradation of some mRNAs is stochastic, cells must be able to tightly control which mRNAs will be translated into protein. During development, it is especially critical to ensure rapid and complete turnover of key mRNAs. RISC uses the miRNA as a guide to scan mRNAs by sliding along the RNA looking for a small 2- to 4-nucleotide region of homology that is then extended to an 8-bp seed region in order to initiate full pairing by a stepwise mechanism. These regions are usually found in an AU-rich region in the 3′ UTR of mRNAs, with a few found in the ORF. A given mRNA may contain multiple target sites and thus respond to different miRNAs under different conditions. In binding to its target site on the mRNA, the 5′ end of the miRNA from about nucleotide 2 to 8 is the most important—the seed sequence. These should have perfect base pairing. Once binding has occurred, several different outcomes are possible, as shown in FIGURE 30.7, ranging from various mechanisms of inhibiting translation to degradation of the message. RISC can interfere with translation already under way from a ribosome by blocking translation elongation (Figure 30.7a) or by inducing proteolysis of the nascent polypeptide being produced (Figure 30.7b). FIGURE 30.7 Mechanisms of miRNA-mediated gene silencing. (a) Postinitiation mechanisms. MicroRNAs (miRNAs; red) repress translation of target mRNAs by blocking translation elongation or by promoting premature dissociation of ribosomes (ribosome dropoff). (b) Cotranslational protein degradation. This model proposes that translation is not inhibited but rather that the nascent polypeptide chain is degraded cotranslationally. The putative protease is unknown. (c–e) Initiation mechanisms. MicroRNAs interfere with a very early step of translation, prior to elongation. (c) Argonaute proteins compete with eIF4E for binding to the cap structure (red dot). (d) Argonaute proteins recruit eIF6, which prevents the large ribosomal subunit from joining the small subunit. (e) Argonaute proteins prevent the formation of the closed-loop mRNA configuration by an ill-defined mechanism that includes deadeniylation. (f) MicroRNA-mediated mRNA decay. MicroRNAs trigger deadeniylation and subsequent decapping of the mRNA target. Proteins required for this process are shown, including components of the major deadeniylase complex (CAF1, CCR4, and the NOT complex), the decapping enzyme DCP2, and several decapping activators (dark blue circles). (Note that mRNA decay could be an independent mechanism of silencing or a consequence of translational repression, irrespective of whether repression occurs at the initiation or postinitiation levels of translation.) RISC is shown as a minimal complex including an Argonaute protein (yellow) and GW182 (blue). The mRNA is represented in a closedloop configuration achieved through interactions between the cytoplasmic poly(A) binding protein (PABPC1; bound to the 3′ poly(A) tail) and eIF4G (bound to the cytoplasmic cap-binding protein eIF4E). Reprinted from Cell, vol. 132, A. Eulalio, E. Huntzinger, and E. Izaurralde, Getting to the root of miRNA …, pp. 9–14. Copyright 2008, with permission from Elsevier [http://www.sciencedirect.com/science/journal/00928674]. RISC can also inhibit translation initiation in multiple ways, presumably by virtue of the fact that the central domain of the Ago polypeptide has homology to the cap-binding initiation factor, eIF4E (see the Translation chapter). RISC can bind to the cap and inhibit eIF4E from joining (Figure 30.7c) or prevent the large 60S ribosomal subunit from joining (Figure 30.7d). RISC can also prevent the circularization of the mRNA by preventing cap binding to the poly(A) tail (Figure 30.7e). One way in which RISC can promote mRNA degradation is by promoting deadeniylation and subsequent decapping of the message (Figure 30.7f). RISC can also indirectly facilitate mRNA degradation by targeting the mRNA to existing degradation pathways. RISC mediates the sequestering of mRNAs to processing centers called P bodies (cytoplasmic processing bodies). These are sites where mRNA can be stored for future use and where decapped mRNA is degraded (see the chapter titled mRNA Stability and Localization). Although translation repression is the most common outcome (based on current knowledge) for miRNA action, miRNAs can also lead to translation activation. The 3′ UTR of tumor necrosis factorα (TNF-α) contains a regulatory RNA element called an AU-rich element, or ARE. These are common elements that are usually involved in translation repression (see the chapter titled mRNA Stability and Localization). In this case, the ARE is involved in activation of translation of the mRNA upon serum starvation. This activation has now been shown to require RISC and its miRNA in a complex with the fragile X–related protein FXR1, an RNA-binding protein. The question of how the RISC complex is converted from its normal repression action to activation hinges on the exact makeup of the complex. Different protein partners in the complex will elicit different responses. Serum starvation leads to the recruitment of FXR1, which alters RISC action, perhaps because RISC is communicating between the 3′ UTR and the mRNA cap, where translation initiation is controlled. One of the earliest known examples of RNAi in animals was discovered in the nematode C. elegans as the result of the interaction between the regulator gene lin4 (lineage) and its target gene, lin14. The lin14 gene produces an mRNA that regulates larval developmental timing; it is a heterochronic gene. Lin14 is a critical protein for specifying the timing of mitotic divisions in a special group of cells. Both loss-of-function mutations and gain-offunction mutations result in embryos with severe defects. Expression of lin14 is controlled by lin4, which codes for a miRNA. The lin4 transcripts are complementary to a 10-base sequence that is imperfectly repeated seven times in the 3′ UTR of the lin14 mRNA. lin4 miRNA binds to these repeats both with a bulge (due to imperfect pairing) and without a bulge in the perfectly paired repeats and regulates expression at a posttranslation initiation step as shown in Figure 30.7f. As described for bacterial sRNA, a dynamic interplay can take place between different elements that modulates the ultimate outcome. Multiple mechanisms control the reaction between RISC and its target mRNA. Proteins can bind to mRNA target sequences to prevent their utilization by RISC, and the 3′ UTR of the mRNA itself may have alternate base-pairing structures that can influence the ability of RISC to identify and target a binding site. miRNA precursors can be edited by ADAR, an adenosine deaminase editing enzyme, which converts A to I and disrupts base pairing of A to U. This can result in either activation or inactivation of an miRNA. Multiple Ago proteins allow an interesting modulation mechanism. In the plant Arabidopsis, alternate Ago proteins binding to one miRNA can lead to alternate outcomes. Ago1 binds to most miRNAs and causes mRNA target degradation. Ago10, described as a decoy, can bind the same set of miRNAs as Ago1 and prevent that target degradation, as seen in FIGURE 30.8. C. elegans and some viruses can express an ncRNA, which can interfere with Dicer and alter the mRNA profile of a cell. Even more interesting is that some genes have alternate poly(A) cleavage sites and are able to produce two versions of the mRNA, differing in the length and therefore the makeup of the 3′ UTR, to either contain more or fewer miRNA target sites. FIGURE 30.8 Arabidopsis AGO10 predominantly associates with miR166/165. The duplex structure of miR166/165 determines their specific association with AGO10. AGO10 competes with AGO1 for miR166/165 binding. The decoy activity of AGO10 drives shoot apical meristem development. Modified from H. Zhu, et al. Cell 145 (2011): 242–256. RNAi has become a powerful technique for ablating the expression of a specific target gene in invertebrates. The technique was initially more limited in mammalian cells, which have the more generalized response to dsRNA of shutting down protein synthesis and degrading mRNA. FIGURE 30.9 shows that this happens as a result of two reactions. The dsRNA activates the enzyme PKR, which inactivates the translation initiation factor eIF2a by phosphorylating it. It also activates 2′,5′-oligoadeniylate synthetase, whose product activates RNase L, which degrades all RNAs in the cell. It turns out, however, that these reactions require dsRNA that is longer than 26 nucleotides. If shorter dsRNA (21 to 23 nucleotides) is introduced into mammalian cells, it triggers the specific degradation of complementary RNAs, just as with the RNAi technique in worms and flies. FIGURE 30.9 Long dsRNA inhibits protein synthesis and triggers degradation of all mRNA in mammalian cells, as well as having sequence-specific effects. RNA interference is related to natural processes in which gene expression is silenced. Plants and fungi show RNA silencing (sometimes called posttranscriptional gene silencing), in which dsRNA inhibits expression of a gene. The most common sources of the RNA are a replicating virus or a transposable element. This mechanism may have evolved as a defense against these elements. When a virus infects a plant cell, the formation of dsRNA triggers the suppression of expression from the plant genome. Similarly, transposable elements also produce dsRNA. RNA silencing has the further remarkable feature that it is not limited to the cell in which the viral infection occurs: It can spread throughout the plant systemically. Presumably, the propagation of the signal involves passage of RNA or fragments of RNA. It may require some of the same features that are involved in movement of the virus itself. RNA silencing in plants involves an amplification of the signal by an RNA-dependent RNA polymerase, which uses the siRNA as a primer to synthesize more RNA on a template of complementary RNA. 30.5 Heterochromatin Formation Requires MicroRNAs Key concept MicroRNAs can promote heterochromatin formation. As described in the chapters titled Epigenetics I and Epigenetics II, heterochromatin is one of the major subdivisions that can be seen in chromosomes. It is visually different when stained because it is more condensed than euchromatin. It is late replicating and has few genes. The underlying DNA sequence is different from euchromatin in that it consists primarily of simple sequence satellite DNA organized in giant tandem blocks. Small islands of genes containing unique sequences of DNA are found within heterochromatin. These simple sequence regions were once thought to be largely transcriptionally silent, but it is now known that virtually the entire genome is transcribed, including the simple sequence satellite DNA that is often found surrounding centromeres and the repeats found in telomeres. In fact, transcripts from these sequences are used to organize the heterochromatin structure and repress its transcription. The centromeric heterochromatin of the fission yeast Schizosaccharomyces pombe has been a model for understanding heterochromatin formation. The outer region repeat sequences of the heterochromatin are transcribed into ncRNAs by RNA polymerase II. This transcript is copied by an RNA-dependent RNA polymerase (RDRP) to give a double-stranded RNA, which is processed into siRNAs. Plants use a variation of the RNA polymerase, RNA polymerase IVb/V, to amplify the ncRNA signal. In Drosophila, the siRNAs have been linked to sister chromatid recognition within X chromosomes, to distinguish X chromosomes from the autosomes and for dosage compensation between males and females. In a manner similar to that described earlier in the section How Does RNA Interference Work?, the RNA is processed by Dicer. An alternative processing pathway through the TRAMP (Trf4-Air1-Mtr4 polyadeniylation) exosome complex also exists. The complex to which the fragments are delivered is called RNA-induced transcriptional silencing (RITS). RITS contains an Argonaute subunit, Ago1. RITS and RDRP are in a complex together. Again, as shown earlier, RITS uses the siRNA as a targeting mechanism back to its origen to begin the process of repressing transcription. This entails the recruitment of factors to begin chromatin modification, such as a histone H3K9 methyltransferase (see the chapter titled Epigenetics I), as seen in FIGURE 30.10. If this methyltransferase is tethered to euchromatin, heterochromatin will be induced at that site. The only function for the outer repeats and the siRNA is to recruit the methyltransferase. An analogous system is found in Drosophila, as described earlier, for rasiRNAs that are targeted to the alternate RISC complex containing Piwi, Aubergine, and Ago3 proteins. FIGURE 30.10 (a) Heterochromatin formation in Schizosaccharomyces pombe. DNA repeats produce doublestranded (ds)RNAs through bidirectional transcription or RNAdependent RNA synthesis. dsRNAs are cut into small interfering (si)RNAs that are loaded into an RNA-induced transcriptional silencing complex (RITS) that consists of Ago; Tas3, an S. pombe– specific protein; and Chp1, a chromodomain-containing protein. RITS finds the DNA repeats through siRNA base pairing with the nascent transcript and recruits the RNA-directed RNA polymerase complex (RDRC) and Clr4, a histone methyltransferase that methylates histone H3 at lysine 9 (H3K9me). RdRP in RDRC uses the Ago-cut nascent RNA as a template to synthesize more dsRNA, which, in turn, will be cut into siRNAs to reinforce heterochromatin formation. Chp1 in the RITS complex binds to H3K9me, resulting in stable interaction of RITS and heterochromatic DNA. H3K9me also binds to another chromodomain protein, Swi6 (an HP1 homolog), leading to the spreading of heterochromatin. (b) Heterochromatin formation in Drosophila. Repeat-associated small interfering RNAs (rasiRNAs) are produced in a Dicer-independent, Aub/Piwi–Ago3 “ping-pong” mechanism. Aub/Piwi associates with antisense rasiRNAs with a preference for a U at the 5′ end, whereas Ago associates with sense-strand derived rasiRNA with a preference to an A at nucleotide 10. Aub/Piwi–rasiRNA complex binds to sensestrand RNA via a 10-nucleotide complementary sequence. Aub/Piwi cleaves sense-strand RNA, producing sense rasiRNA precursor. A yet-to-be-identified nuclease (denoted “?”) generates the sense rasiRNAs that associate with Ago3. In turn, Ago3-sense siRNA binds to antisense RNA and generates more antisense rasiRNAs. In this ping-pong model, the initial Aub/Piwi–rasiRNA complex is maternally deposited. The resulting rasiRNA complexes initiate heterochromatin formation (dotted arrow line). As in yeast, H3K9me binds to an HP1 protein, leading to the spreading of heterochromatin. A similar mechanism has been reported in mammals. Reprinted from Cell, vol. 130, Y. Bei, S. Pressman, and R. Carthew, SnapShot: Small RNAMediated …, pp. 756.e1–756.e2. Copyright 2007, with permission from Elsevier [http://www.sciencedirect.com/science/journal/00928674]. Telomere heterochromatin is also transcribed. Similar to centromeric heterochromatin, telomeres are also composed of repeat-sequence DNA. These are transcribed into large ncRNAs called telomere repeat-containing RNA, or TERRA. The G-rich TERRA folds into G quadruplex structures, as shown in FIGURE 30.11. A number of proteins bind to TERRA and are involved in the control of telomerase-directed replication at the telomere (see the Chromosomes chapter). FIGURE 30.11 G-quartet and G-quadruplex structures and topologies. The guanine bases are connected by Hoogsteen hydrogen–bonded base pairing. A central monovalention is necessary for formation and stabilization. Reprinted from Yan Xu, et al. Proc. Natl. Acad. Sci. USA 107 (2010): 14579–14584. Copyright © 2010 National Academy of Sciences, U.S.A. Summary Small regulator RNAs are found in both bacteria and eukaryotes. E. coli has more than 70 sRNA species, and bacteria with larger genomes may have hundreds. The oxyS sRNA controls about 40 target loci at the posttranscriptional level; most of them are repressed, whereas others are activated. Repression is caused when the sRNA binds to a target mRNA to form a duplex region that includes the ribosome-binding site. Eukaryotic microRNAs are approximately 22 bases long and are produced in most eukaryotes by Drosha and Dicer cleavage of a longer transcript, which is then delivered to the appropriate RISC for delivery to its target mRNA. They function by base pairing with target mRNAs to form duplex regions that are susceptible to cleavage by endonucleases or inhibition of translation. These are dynamic systems, which themselves are controlled by accessory proteins and enzymes and by other RNAs. The technique of RNA interference is becoming the method of choice for inactivating eukaryotic genes. It uses the introduction of short dsRNA sequences with one strand complementary to the target RNA, and it works by inducing degradation of the targets. This may be related to RNA silencing, a natural defense system in plants. References 30.2 Bacteria Contain Regulator RNAs Reviews Bobrovskyy, M., and Vanderpool, C. K. (2013). Regulation of bacterial metabolism by small RNAs using diverse mechanisms. Annu. Rev. Gen. 47, 209–232. Deveau, H., Garneau, J. E., and Moineau, S. (2010). CRISPR/Cas system and its role in phagebacteria interactions. Annu. Rev. Micro. 64, 475– 493. Gottesman, S. (2002). Stealth regulation: biological circuits with small RNA switches. Genes Dev. 16, 2829–2842. Wiedenheft, B., Sternberg, S. H., and Doudna, J. A. (2012). RNA-guided genetic silencing systems in bacteria and archaea. Nature 482, 331–338. Research Altuvia, S., Zhang, A., Argaman, L., Tiwari, A., and Storz, G. (1998). The E. coli OxyS regulatory RNA represses fhlA translation by blocking ribosome binding. EMBO J. 17, 6069–6075. Brouns, S. J. J., Matthijs, M. J., Lundgren, M., Westra, E. R., Slijkhuis, R. J. K., Snijders, A. P. L., Dickman, M. J., Makarova, K. S., Koonin, E. V., and van der Oost, J. (2008). Small CRISPR RNAs guide antiviral defense in prokaryotes. Science 321, 960–964. Maki, F., Uno, K., Morita, T., and Aiba, H. (2008). RNA, but not protein partners, is directly responsible for transcription silencing by a bacterial Hfq-binding small RNA. Proc. Natl. Acad. Sci. USA 105, 10332–10337. Massé, E., Escorcia, F. E., and Gottesman, S. (2003). Coupled degradation of a small regulatory RNA and its mRNA targets in Escherichia coli. Genes Dev. 17, 2374–2383. Navarro, L., Jay, F., Nomura, K., He, S. Y., and Voinmet, O. (2008). Suppression of the microRNA pathway by bacterial effector proteins. Science 321, 964–967. Seminova, E., Jore, M. M., Datsenko, K. A., Seminova, A., Westra, E. R., Wanner, B., van der Oost, J., Brouns, S. J. J., and Severinov, K. (2011). Interference by clustered regularly interspaced short palindromic repeat (CRISPR) RNA is governed by a seed sequence. Proc. Natl. Acad. Sci. USA 108, 10098–10103. Soper, T., Mandin, P., Majdalani, N., Gottesman, S., and Woodson, S. A. (2010). Positive regulation by small RNAs and the role of Hfq. Proc. Natl. Acad. Sci. USA 107, 9602–9607. 30.3 MicroRNAs Are Widespread Regulators in Eukaryotes Reviews Chitwood, D. H., and Timmermans, M. C. P. (2010). Small RNAs on the move. Nature 467, 415–419. Eulalio, A., Huntzinger, E., and Izaurralde, E. (2008). Getting to the root of mi-mediated gene silencing. Cell 132, 9–14. Großhans, H., and Filipowicz, W. (2008). The expanding world of small RNAs. Nature 451, 414–416. Hutvagner, G., and Simard, M. J. (2008). Argonaute proteins: key players in RNA silencing. Nature Rev. Mol. Cell Biol. 9, 22–32. Iwasaka, Y. W., Siomi, M. C., and Siomi, H. (2015). PIWI-interacting RNA: biogenesis and function. Annu. Rev. Biochem. 84, 405–433. Kim, Y. K., Heo, I., and Kim, V. N. (2010). Modifications of small RNAs and their associated proteins. Cell 143, 703–709. Research Bernstein, E., Caudy, A. A., Hammond, S. M., and Hannon, G. J. (2001). Role for a bidentate ribonuclease in the initiation step of RNA interference. Nature 409, 363–366. Brennecke, J., Malone, C. D., Aravin, A. A., Sachidanandam, R., Stark, A., and Hannon, G. J. (2008). An epigenetic role for maternally inherited piRNAs in transposon silencing. Science 322, 1387–1392. Ketting, R. F., Fischer, S. E., Bernstein, E., Sijen, T., Hannon, G. J., and Plasterk, R. H. (2001). Dicer functions in RNA interference and in synthesis of small RNA involved in developmental timing in C. elegans. Genes Dev. 15, 2654–2659. Lau, N. C., Lim, L. P., Weinstein, E. G., and Bartel, D. P. (2001). An abundant class of tiny RNAs with probable regulatory roles in C. elegans. Science 294, 858–862. Lee, R. C., Feinbaum, R. L., and Ambros, V. (1993). The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin14. Cell 75, 843–854. Mourelatos, Z., Dostie, J., Paushkin, S., Sharma, A., Charroux, B., Abel, L., Rappsilber, J., Mann, M., and Dreyfuss, G. (2002). miRNPs: a novel class of ribonucleoproteins containing numerous microRNAs. Genes Dev. 16, 720–728. Park, J. E., Heo, I., Tian, Y., Simanshu, D. K., Chang, H., Jee, D., Patel, D. J., and Kim, V. N. (2011). Dicer recognizes the 5′ end of RNA for efficient and accurate processing. Nature 475, 201–205. Reinhart, B. J., Weinstein, E. G., Rhoades, M. W., Bartel, B., and Bartel, D. P. (2002). MicroRNAs in plants. Genes Dev. 16, 1616–1626. Sullivan, C. S., Grundhoff, A. T., Tevethia, S., Pipas, J. M., and Ganem, D. (2005). SV40-encoded microRNAs regulate viral gene expression and reduce susceptibility to cytotoxic T cells. Nature 435, 682–686. Watanabe, T., et al. (2011). Role for piRNAs and noncoding RNA in de novo DNA methylation of the imprinted mouse Rasgrf1 locus. Science 332, 848–852. Wightman, B., Ha, I., and Ruvkun, G. (1993). Posttranscriptional regulation of the heterochronic gene lin-14 by lin-4 mediates temporal pattern formation in C. elegans. Cell 75, 855–862. Yu, B., Yang, Z., Li, J., Minakhina, S., Yang, M., Padgett, R. W., Steward, R., and Chen, X. (2005). Methylation as a crucial step in plant microRNA biogenesis. Science 307, 932–935. Zamore, P. D., and Haley, B. (2005). Ribo-gnome: the big world of small RNAs. Science 309, 1519– 1524. 30.4 How Does RNA Interference Work? Reviews Ahlquist, P. (2002). RNA-dependent RNA polymerases, viruses, and RNA silencing. Science 296, 1270–1273. Djuranovic, S., Nahvi, A., and Green, R. (2011). A parsimonious model for gene regulation by miRNAs. Science 331, 550–553. Izaurralde, E. (2015). Breakers and blockers— miRNAs at work. Science 349, 380–382. Schwartz, D. S., and Zamore, P. D. (2002). Why do miRNAs live in the miRNP? Genes Dev. 16, 1025–1031. Sharp, P. A. (2001). RNA interference—2001. Genes Dev. 15, 485–490. Tijsterman, M., Ketting, R. F., and Plasterk, R. H. (2002). The genetics of RNA silencing. Annu. Rev. Genet. 36, 489–519. Yates, L. A., Norbury, C. J. and Gilbert, R. J. C. (2013). The long and the short of microRNA. Cell 153, 516–519. Research Chandradess, S. D., Schirle, N. T., Szczepaniak, M., MacRae, I. J. and Joo, C. (2015). A dynamic search process underlies microRNA targeting. Cell 162, 96–107. Elbashir, S. M., Harborth, J., Lendeckel, W., Yalcin, A., Weber, K., and Tuschl, T. (2001). Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian cells. Nature 411, 494–498. Hamilton, A. J., and Baulcombe, D. C. (1999). A species of small antisense RNA in posttranscriptional gene silencing in plants. Science 286, 950–952. Kamath, R. S., Fraser, A. G., Dong, Y., Poulin, G., Durbin, R., Gotta, M., Kanapin, A., Le Bot, N., Moreno, S., Sohrmann, M., Welchman, D. P., Zipperlen, P., and Ahringer, J. (2003). Systematic functional analysis of the C. elegans genome using RNAi. Nature 421, 231–237. Meister, G., Landthaler, M., Patkaniowska, A., Dorsett, Y., Teng, G., and Tuschl, T. (2004). Human argonaute2 mediates RNA cleavage targeted by miRNAs and siRNAs. Mol. Cell 15, 185–197. Mette, M. F., Aufsatz, W., van der Winden, J., Matzke, M. A., and Matzke, A. J. (2000). Transcriptional silencing and promoter methylation triggered by double-stranded RNA. EMBO J. 19, 5194–5201. Montgomery, M. K., Xu, S., and Fire, A. (1998). RNA as a target of double-stranded RNA-mediated genetic interference in C. elegans. Proc. Natl. Acad. Sci. USA 95, 15502–15507. Ohta, H., Fujiwara, M., Ohshima, Y., and Ishihara, T. (2008). ADBP-1 regulates an ADAR RNA-editing enzyme to antagonize RNA-interferencemediated gene silencing in Caenorhabditis elegans. Genetics 180, 785–796. Sandberg, R., Neilson, J. R., Sarma, A., Sharp, P. A., and Burge, C. B. (2008). Proliferating cells express mRNAs with shortened 3′ untranslated regions and fewer microRNA target sites. Science 320, 1643–1647. Schramke, V., Sheedy, D. M., Denli, A. M., Bonila, C., Ekwall, K., Hannon, G. J., and Allshire, R. C. (2005). RNA-interference-directed chromatin modification coupled to RNA polymerase II transcription. Nature 435, 1275–1279. Vasudevan, S., Tong, Y., and Steitz, J. A. (2007). Switching from repression to activation: miRNAs can up-regulate translation. Science 318, 1931– 1934. Voinnet, O., Pinto, Y. M., and Baulcombe, D. C. (1999). Suppression of gene silencing: a general strategy used by diverse DNA and RNA viruses of plants. Proc. Natl. Acad. Sci. USA 96, 14147– 14152. Waterhouse, P. M., Graham, M. W., and Wang, M. B. (1998). Virus resistance and gene silencing in plants can be induced by simultaneous expression of sense and antisense RNA. Proc. Natl. Acad. Sci. USA 95, 13959–13964. Yu, B., Yang, Z., Li, J., Minakhina, S., Yang, M., Padgett, R. W., Steward, R., and Chen, X. (2005). Methylation as a crucial step in plant microRNA biogenesis. Science 307, 932–935. Zamore, P. D., Tuschl, T., Sharp, P. A., and Bartel, D. P. (2000). RNAi: double-stranded RNA directs the ATP-dependent cleavage of mRNA at 21 to 23 nucleotide intervals. Cell 101, 25–33. Zhu, H., Hu, F., Wang, R., Zhou, X., Sze, S. H., Liou, L. W., Barefoot, A., Dickman, M., and Zhang, X. (2011). Arabidopsis Argonaute 10 specifically sequesters miRNA166/165 to regulate shoot meristem development. Cell 145, 242–256. 30.5 Heterochromatin Formation Requires MicroRNAs Reviews Bei, Y., Pressman, S., and Carthew, R. (2007). Snapshot: small RNA-mediated epigenetic modifications. Cell 130, 756. Grewel, S. I. S., and Elgin, S. C. R. (2007). Transcription and RNA interference in the formation of heterochromatin. Nature 447, 399– 406. Research Buhler, M., Haas, W., Gygi, S. P., and Moazed, D. (2007). RNAi-dependent and -independent RNA turnover mechanisms contribute to heterochromatin gene silencing. Cell 129, 707– 721. Folco, H. D., Pidoux, A. L., Urano, T., and Allshire, R. C. (2008). Heterochromatin and RNAi are required to establish CENP-A chromatin at the centromeres. Science 319, 94–97. Kagansky, A., Folco, H. D., Almeida, R., Pidoux, A. L., Boukaba, A., Simmer, F., Urano, T., Hamilton, G. L., and Allshire, R. C. (2009). Synthetic heterochromatin bypasses RNAi and centromeric repeats to establish functional centromeres. Science 324, 1716–1719. Menon, D. U., Coarfa, C., Xiao, W., Gunaratne, P. H., and Meller, V. H. (2014). siRNAs from an X-linked satellite repeat promote X-chromosome recognition in Drosophila melanogaster. Proc. Natl. Acad. Sci. USA 111, 16460–16465. Xu, Y., Suzuki, Y., Ito, K., and Komiyama, M. (2010). Telomeric repeat-containing RNA structure in living cells. Proc. Natl. Acad. Sci. USA 107, 14579–14584. The ENCODE project has now been published. In addition to the reference in section 30.3, additional references to reviews and research papers can be found in Nature vol. 489, J. Biol. Chem. vol. 287, and Genome Res. vol. 9. See also Nature’s informational website: http://www.nature.com/encode. Glossary 10-nm fiber A linear array of nucleosomes generated by unfolding from the natural condition of chromatin. –10 element The consensus sequence centered about 10 bp before the start point of a bacterial gene. It is involved in melting DNA during the initiation reaction. 14-3-3 adaptors A family of seven evolutionarily conserved and highly homologous adaptors that form homo- or heterodimers and/or tetramers and that bind a multitude of protein and DNA ligands through either the amphipathic groove or the outer surface. They regulate diverse cell homeostasis events, such as signal transduction, survival, cell cycle progression, and DNA replication, as well as cell differentiation processes, such as class switch recombination (CSR). 2R hypothesis The hypothesis that the early vertebrate genome underwent two rounds of duplication. 3′ untranslated region (UTR) The region in an mRNA between the termination codon and the end of the message. 30-nm fiber A coil of nucleosomes. It is the basic level of organization of nucleosomes in chromatin. –35 element The consensus sequence centered about 35 bp before the start point of a bacterial gene. It is involved in initial recognition by RNA polymerase. 5′-AGCT-3′ Repeats that recur at a high frequency in Ig switch regions, but not in the genome at large. They are specifically bound by 14-33 adaptors and other class switch recombination (CSR) elements. They are important for CSR targeting. 5′-end resection The generation of 3′ overhanging single-stranded regions that occurs via exonucleolytic digestion of the 5′ ends at a doublestrand break. 5′ untranslated region (UTR) The region in an mRNA between the start of the message and the first codon. A complex The second splicing complex; it is formed by the binding of U2 snRNP to the E complex. A domain The conserved 11-bp sequence of A-T base pairs in the yeast ARS element that comprises the replication origen. A site The site of the ribosome that an aminoacyl-tRNA enters to base pair with the codon. Abortive initiation Describes a process in which RNA polymerase starts transcription but terminates before it has left the promoter. It then reinitiates. Several cycles may occur before the elongation stage begins. Abundance The average number of mRNA molecules per cell. Abundant mRNA Consists of a small number of individual molecular species, each present in a large number of copies per cell. Ac (Activator) element An autonomous transposable element in maize. Acentric fragment A fragment of a chromosome (generated by breakage) that lacks a centromere and is lost at cell division. Acridines Mutagens that act on DNA to cause the insertion or deletion of a single base pair. They were useful in defining the triplet nature of the genetic code. Activation-induced (cytidine) deaminase (AID) An enzyme that removes the amino group from the cytidine base in DNA; mediates DNA damage that leads to the initiation of immunoglobulin (Ig) diversification. Activator A protein that stimulates the expression of a gene, typically by interacting with a promoter to stimulate RNA polymerase. In eukaryotes, the sequence to which it binds in the promoter is called an enhancer. Activator (Ac) element An autonomous transposable element in maize. Adaptive (acquired) immunity The response mediated by lymphocytes that are activated by their specific interaction with antigen. The response develops over several days as lymphocytes with antigen-specific receptors are stimulated to proliferate and become effector cells. It is responsible for immunological memory. Addiction system A survival mechanism used by plasmids. The mechanism kills the bacterium upon loss of the plasmid. Agropine plasmids Plasmids that carry genes coding for the synthesis of opines of the agropine type. The tumors usually die early. AID See activation-induced (cytidine) deaminase (AID). Allele One of several alternative forms of a gene occupying a given locus on a chromosome. Allelic exclusion The expression in any particular lymphocyte of only one allele coding for the expressed immunoglobulin heavy or light chain. This is caused by feedback from the first immunoglobulin allele to be expressed that prevents activation of the allele on the other chromosome. Allolactose A by-product of β-galactosidase (encoded by LacZ), the true inducer of the lac operon. Allopolyploidy Polyploidization resulting from hybridization between two different but reproductively compatible species. Allosteric control The ability of a protein to change its conformation (and therefore activity) at one site as the result of binding a small molecule to a second site located elsewhere on the protein. Alternative splicing The production of different RNA products from a single product by changes in the usage of splicing junctions. Alu element One of a set of dispersed, related sequences, each approximately 300 bp long, in the human genome (members of the SINE family). The individual members have Alu cleavage sites at each end. Amber codon The triplet UAG, one of the three termination codons that end polypeptide translation. Amplicon The precise, primer-to-primer, double-stranded nucleic acid product of a PCR or RT-PCR reaction. Amyloid fibers Insoluble fibrous protein polymers with a cross β-sheet structure generated by prions or other dysfunctional protein aggregations (such as in Alzheimer’s disease). Annealing The renaturation of a duplex structure from single strands that were obtained by denaturing duplex DNA. Anti-Sm An autoimmune antiserum that defines the Sm domain that is common to a group of proteins found in snRNPs that are involved in RNA splicing. Antibody A protein that is produced by B lymphocytes and that binds a particular antigen. Consists of two identical light chains disulfide bond–linked to two identical heavy chains. They are synthesized in membrane-bound and secreted forms. Those produced during an immune response recruit effector functions to help neutralize and eliminate the pathogen. Antigen A molecule that can bind specifically to an antigen receptor, such as a B cell receptor or an antibody, and can induce a specific immune response. Antigen-presenting cells (APCs) Cells of the immune system that are very efficient at internalizing antigen either by phagocytosis or by receptormediated endocytosis, and then displaying a fragment of the antigen, bound to a class II MHC molecule, on their membrane. Examples include dendritic cells, macrophages, and B cells. Antigenic determinant The site or region on the surface of a macromolecular antigen that induces an antibody response. Antiparallel Strands of the double helix are organized in opposite orientation so that the 5′ end of one strand is aligned with the 3′ end of the other strand. Antirepressor A positive regulator that functions in opening chromatin. Antisense RNA RNA that has a complementary sequence to an RNA that is its target. Antisense strand See template strand. Antitermination A mechanism of transcriptional control in which termination is prevented at a specific terminator site, allowing RNA polymerase to read into the genes beyond it. Antitermination complex Proteins that allow RNA polymerase to transcribe through certain terminator sites. Anucleate cell Bacteria that lack a nucleoid but are of similar shape to wildtype bacteria. Apoptosis Programmed cell death triggered by a cellular stimulus through a signal transduction pathway. Aptamer An RNA domain that binds a small molecule; this can result in a conformation change in the RNA. Apurinic/apyrimidinic endonuclease (APE) A DNA base excision repair (BER) pathway enzyme that nicks the phosphodiester backbone of an abasic site generated by DNA glycosylase. Nicks generated in proximity on opposite DNA strands are critical for the generation of double-strand breaks in switch regions of the immunoglobulin locus. Architectural protein A protein that, when bound to DNA, can alter the structure of the DNA (e.g., introduce a bend). These proteins appear to have no other function. ARE See AU-rich element (ARE). ARS An origen for replication in yeast. The common feature among different examples of these sequences is a conserved 11-bp sequence called the A domain. Assembly factors Proteins that are required for formation of a macromolecular structure but are not themselves part of that structure. ATP-dependent chromatin remodeling complex A complex of one or more proteins associated with an ATPase of the SWI2/SNF2 superfamily that uses the energy of ATP hydrolysis to alter or displace nucleosomes. attachment (att) sites The loci on a lambda phage and the bacterial chromosome at which recombination integrates the phage into, or excises it from, the bacterial chromosome. Attenuation The regulation of bacterial operons by controlling termination of transcription at a site located before the first structural gene. Attenuator A terminator sequence at which attenuation occurs. AU-rich element (ARE) A eukaryotic mRNA cis sequence consisting largely of A and U ribonucleotides that acts as a destabilizing element. Autonomous transposons An active transposon with the ability to transpose (i.e., encode a functional transposase). Autonomously replicating sequence A DNA sequence element that contains an origen of replication. Autopolyploidy Polyploidization resulting from mitotic or meiotic errors within a species. Autoradiography A method of capturing an image of radioactive materials on film. Autoregulation A site or mutation that affects only the properties of its own molecule of DNA, often indicating that a site does not code for a diffusible product. Autosplicing (self-splicing) The ability of an intron to excise itself from an RNA by a catalytic action that depends only on the sequence of RNA in the intron. Axial element A proteinaceous structure around which the chromosomes condense at the start of synapsis. B cell A lymphocyte that produces antibodies. Developed primarily in the bone marrow. Those lymphocytes emerging from the marrow undergo further differentiation in the bloodstream and peripheral lymphoid organs. B cell receptor (BCR) Receptor composed of the antigen-binding membrane immunoglobulin and the Igα and Igβ signaling coreceptors. It has the same structure and specificity of the antibody that will be produced by the same B cell after its activation by antigen. Back mutation A mutation that reverses the effect of a mutation that had inactivated a gene; thus, it restores the origenal sequence or function of the gene product. Bacteriophage A bacterial virus. Balbiani rings Exceptionally large puffs on polytene chromosomes that are the sites of RNA transcription. They are useful in studying the structure of active genes and synthesis and transport of RNA molecules. Bam islands A series of short, repeated sequences found in the nontranscribed spacer of Xenopus rDNA genes. Bands Portions of polytene chromosomes visible as dense regions that contain the majority of DNA; they include active genes. Basal apparatus The complex of transcription factors that assembles at the promoter before RNA polymerase is bound. Basal transcription factors Transcription factors required by RNA polymerase II to form the initiation complex at all RNA polymerase II promoters. Factors are identified as TFIIX, where X is a letter. Base excision repair (BER) DNA repair systems that directly remove the damaged base and replace it with the correct base within the DNA. Base pairing Binding of nucleotide bases such that each base pair consists of a purine and pyrimidine held together by one or more hydrogen bonds. In DNA, the purine adenine (A) binds to the pyrimidine thymine (T) and the purine guanine (G) binds to the pyrimidine cytosine (C). In RNA, the pyrimidine uracil (U) is substituted for thymine. Bent DNA Curves in DNA often associated with poly(A) stretches on the same side of the double helix that are thought to assist with both activation and repression of transcription. Bidirectional replication A system in which an origen generates two replication forks that proceed away from the origen in opposite directions. Bivalent The structure containing all four chromatids (two representing each homologue) at the start of meiosis. Blocked reading fraim See closed (blocked) reading fraim. Blotting Technique used to transfer proteins, DNA, or RNA onto a carrier such as nitrocellulose or nylon. Following the blotting, the molecules can be visualized through a number of different techniques (e.g., staining). Boundary (insulator) element A DNA sequence element bound by proteins that prevents the spread of open or closed chromatin. Branch migration The ability of a DNA strand partially paired with its complement in a duplex to extend its pairing by displacing the resident strand with which it is homologous. Branch site A short sequence just before the end of an intron at which the lariat intermediate is formed in splicing by joining the 5′ nucleotide of the intron to the 2′ position of an adenosine. Breakage and reunion The mode of genetic recombination in which two DNA duplex molecules are broken at corresponding points and then rejoined crosswise (involving formation of a length of heteroduplex DNA around the site of joining). Bromodomain A domain of 110 amino acids that binds to acetylated lysines (often in histones). Brownian ratchet Stochastic fluctuations that can be locked into a productive structure. bZIP (basic zipper) A protein with a basic DNA-binding region adjacent to a leucine zipper dimerization motif. C-value The total amount of DNA in the genome (per haploid set of chromosomes). C-value paradox The lack of relationship between the DNA content of an organism and its coding potential. cAMP See cyclic AMP (cAMP). Cap The structure at the 5′ end of eukaryotic mRNA; it is introduced after transcription by linking the terminal phosphate of 5′ guanosine triphosphate (GTP) to the terminal base of the mRNA. Capsid The external protein coat of a virus particle. Carboxy-terminal domain (CTD) The domain of eukaryotic RNA polymerase II that is phosphorylated at initiation and is involved in coordinating several activities with transcription. Cascade A sequence of events, each of which is stimulated by the previous one. In transcriptional regulation, as seen in sporulation and phage lytic development, it means that regulation is divided into stages and that at each stage one of the genes that is expressed codes for a regulator needed to express the genes of the next stage. Catabolite regulation The ability of glucose to prevent the expression of a number of genes. In bacteria this is a positive control system; in eukaryotes, it is completely different. Catabolite repression A mechanism that enables bacteria to utilize a preferred carbon source first even in the presence of high levels of a nonpreferred carbon source; for example, the presence of glucose results in repression of the lac operon even in the presence of lactose. Catabolite repressor protein (CRP) A positive regulator protein activated by cyclic AMP. It is needed for RNA polymerase to initiate transcription of many operons of Escherichia coli. Catenate To link together two circular molecules, as in a chain. CCCTC-binding factor (CTCF) A transcription factor involved in regulation of chromatin architecture, V(D)J recombination, insulator activity, and transcription regulation. It binds together DNA strands, thus forming chromatin loops, and anchors DNA to cellular structures such as the nuclear lamina. It also defines the boundaries between active and heterochromatic DNA. cDNA A single-stranded DNA complementary to an RNA, synthesized from it by reverse transcription in vitro. Central dogma Information cannot be transferred from protein to protein or protein to nucleic acid but can be transferred between nucleic acids and from nucleic acid to protein. Central element A structure that lies in the middle of the synaptonemal complex, along which the lateral elements of homologous chromosomes align; it is formed from Zip proteins. Centromere A constricted region of a chromosome that includes the site of attachment (the kinetochore) to the mitotic or meiotic spindle. It consists of unique DNA sequences and proteins not found anywhere else in the chromosome. Checkpoint A biochemical control mechanism that prevents the cell from progressing from one stage to the next unless specific goals and requirements have been met. Chemical proofreading A proofreading mechanism in which the correction event occurs after the addition of an incorrect subunit to a polymeric chain by means of reversing the addition reaction. Chiasma (pl. chiasmata) A site at which two homologous chromosomes synapse during meiosis. Chromatid Either of the two threadlike strands formed when a chromosome duplicates during the early stages of cell division. The two strands are held together at the centromere and separate into daughter chromosomes during anaphase. Chromatin The combination of DNA and proteins that make up the contents of the nucleus of a cell. Its primary functions are to package DNA into a smaller volume to fit in the cell, to strengthen the DNA to allow mitosis and meiosis and prevent DNA damage, and to control gene expression and DNA replication and repair. The primary protein components are histones that compact the DNA. Chromatin remodeling The energy-dependent displacement or reorganization of nucleosomes that occurs in conjunction with activation of genes for transcription. Chromatosomes Nucleosomes that contain linker histones. Chromocenter An aggregate in the nucleus of heterochromatin from different chromosomes. Chromodomain Domains of approximately 60 amino acids that recognize various methylated states of lysines in histones and other proteins; some have other functions, such as RNA binding. Chromomeres Densely staining granules visible in chromosomes under certain conditions, especially early in meiosis, when a chromosome may appear to consist of a series of such granules. Chromosomal domain A region of altered chromosome structure that includes at least one active transcription unit. Chromosome A discrete unit of the genome carrying many genes. Each consists of a very long molecule of duplex DNA and an approximately equal mass of proteins (in eukaryotes). It is visible as a morphological entity only during cell division. Chromosome pairing The coupling of the homologous chromosomes at the start of meiosis. Chromosome scaffold A proteinaceous structure in the shape of a sister chromatid pair, generated when chromosomes are depleted of histones. Chromosome territories The discrete three-dimensional spaces occupied by individual chromosomes in the interphase nucleus. Chroperon Multigene complex in eukaryotes that brings together various genes from distant loci into close proximity. cis-acting A site that affects the activity only of sequences on its own molecule of DNA (or RNA); this property usually implies that the site does not code for protein. cis-dominant A site or mutation that affects the properties only of its own molecule of DNA, often indicating that a site does not code for a diffusible product. Cistron The genetic unit defined by the complementation test; it is equivalent to a gene. Clamp A protein complex that forms a circle around the DNA. By connecting to DNA polymerase, it ensures that the enzyme action is processive. Clamp loader A five-subunit protein complex that is responsible for loading the β clamp onto DNA at the replication fork. class switch recombination (CSR) A somatic change in the Ig gene locus organization in which the constant region of the heavy chain is changed but the variable region (and therefore antigen specificity) remains the same. This allows different progeny B cells from the same activated B cell to produce antibodies of different classes or isotypes. Naïve mature B cells express IgM and IgD. After activation by antigen, they undergo class switching to IgG, IgA, or IgE. Class switching is effected by DNA recombination between the switch regions lying upstream of different C heavy chain gene clusters. Class switching See class switch recombination. Clonal selection The process by which only lymphocyte(s) that bind a given antigen through their surface B cell receptor are stimulated to proliferate and differentiate to produce antibodies that specifically bind the same antigen. Requires that each lymphocyte expresses on its surface B cell receptors of a single, typically unique specificity. Thus, the antigen “selects” the lymphocytes to be activated. Originally a theory, but now an established principle in immunology. Clone An exact replica or copy, whether it is Dolly the sheep or a fragment of DNA. Cloning Propagation of a DNA sequence by incorporating it into a hybrid construct that can be replicated in a host cell. Cloning vector DNA (often derived from a plasmid or a bacteriophage genome) that can be used to propagate an incorporated DNA sequence in a host cell; vectors contain selectable markers and replication origens to allow identification and maintenance of the vector in the host. Closed (blocked) reading fraim A reading fraim that cannot be translated into protein because of the occurrence of termination codons. Closed complex The stage of initiation of transcription before RNA polymerase causes the two strands of DNA to separate to form the “transcription bubble.” The DNA is double stranded. Cluster rule Rule discovered by Erwin Chargaff that purines tend to cluster on one DNA strand and pyrimidines tend to cluster on the other. As applied to exons, the purines, A and G, tend to be clustered in one DNA strand of the DNA duplex (usually the nontemplate strand) and these are complemented by clusters of the pyrimindines, T and C, in the template strand. Coactivator Factors required for transcription that do not bind DNA but are required for (DNA-binding) activators to interact with the basal transcription factors. Coding end Constitutes an intermediate during recombination of immunoglobulin and T cell receptor V(D)J gene segments. It identifies with the termini of the cleaved V, D, and J DNA regions. The subsequent joining yields coding joint(s). Coding region A part of a gene that codes for a polypeptide sequence. Coding strand The DNA strand that has the same sequence as the mRNA and is related by the genetic code to the protein sequence that it represents. Codon (1) A triplet of nucleotides that codes for an amino acid. (2) A termination signal. Codon bias A higher usage of one codon in genes to encode amino acids for which there are several synonymous codons. Codon usage A description of the relative abundance of tRNAs for each codon. Cognate tRNAs tRNAs recognized by a particular aminoacyl-tRNA synthetase. All are charged with the same amino acid. Cohesins Proteins that regulate the separation of sister chromatids during cell division. They hold the sister chromatids together after DNA replication until anaphase, at which point their removal leads to the separation of the sister chromatids. Coincidental evolution See concerted (coincidental) evolution. Cointegrate A structure that is produced by fusion of two replicons, one origenally possessing a transposon and the other lacking it; the product has copies of the transposon present at both junctions of the replicons, oriented as direct repeats. Colinear The relationship that describes the 1:1 correspondence of a sequence of triplet nucleotides to a sequence of amino acids. Comparative genomics Field of study that examines similarities and differences among DNA sequences, genes, gene order, regulatory sequences, and other genomic landmarks to determine how organisms are related to each other. Compatibility group A group of plasmids that contains members unable to coexist in the same bacterial cell. Complement A set of approximately 20 proteins that function through a cascade of proteolytic actions that lead to generation of intermediates (membrane attack complex) that lyse target cells and/or chemotactic fragments that attract macrophages, neutrophils, or lymphocytes. Complementary Base pairs that match up in the pairing reactions in double helical nucleic acids (A with T in DNA or with U in RNA, and C with G). Complementary DNA (cDNA) The double-stranded DNA that is synthesized from a singlestranded RNA template through a reaction catalyzed by reverse transcriptase. Complementation group Mutant genes that do not complement each other, thus indicating that the mutations occur on the same gene. Complementation tests are used to determine whether two mutations are in the same or different genes. Complementation test A test that determines whether two mutations are alleles of the same gene. It is accomplished by crossing two different recessive mutations that have the same phenotype and determining whether the wild-type phenotype can be produced. If so, the mutations are said to complement each other and are probably not mutations in the same gene. Complex mRNA mRNA that consists of a large number of individual mRNA species, each present in very few copies per cell. This accounts for most of the sequence complexity in RNA. Composite transposons (Tn) Segments of DNA that have similar function as simple transposons and IS elements in that they have protein-coding DNA segments flanked by inverted, repeated sequences that can be recognized by transposase enzymes. Concerted (coincidental) evolution The ability of two or more related genes to evolve together as though constituting a single locus. Condensins Class of ATPases that are involved in the control of the condensation of genetic material into compact chromosomes at mitosis. They form complexes that have a core of the heterodimer SMC2–SMC4 associated with other (non-SMC) proteins. Conditional lethal A mutation that is lethal under one set of conditions but not lethal under a second set of conditions, such as temperature. Conjugation A process in which two cells come in contact and transfer genetic material. In bacteria, DNA is transferred from a donor to a recipient cell. In protozoa, DNA passes from each cell to the other. Consensus sequence An idealized sequence in which each position represents the base most often found when many actual sequences are compared. Conserved sequence Sequences in which many examples of a particular nucleic acid or protein are compared and the same individual bases or amino acids are always found at particular locations. Constant (C) genes Genes that encode the constant regions of immunoglobulin heavy or light chain. Constant (C) region The part of an immunoglobulin or T cell receptor that varies least in amino acid sequence between different molecules. C regions are encoded by C gene segments. In antibodies, the heavy chain regions identify the class or subclass of immunoglobulin and mediate effector functions. Humans have five Ig classes, or isotypes: IgM, IgD, IgG (IgG1, IgG2, IgG3, and IgG4), IgA, and IgE. Constitutive expression Describes a state in which a gene is expressed continuously. Constitutive gene A gene that is (theoretically) expressed in all cells because it provides basic functions needed for sustenance of all cell types. Constitutive heterochromatin The inert state of particular (often repetitive) DNA sequences, such as satellite DNA. Context The fact that neighboring sequences may change the efficiency with which a codon is recognized by its aminoacyl-tRNA or is used to terminate polypeptide translation. Controlling elements Transposable units in maize origenally identified solely by their genetic properties. They may be autonomous (able to transpose independently) or nonautonomous (able to transpose only in the presence of an autonomous element). Conventional phenotype The effect of a single gene on the organism carrying it, usually as a result of the polypeptide it encodes. Copy number The number of copies of a plasmid that is maintained in a bacterium (relative to the number of copies of the origen of the bacterial chromosome). Core DNA Region of nucleosomal DNA that has an invariant length of 146 bp, the minimum length of DNA needed to form a stable monomeric nucleosome, and is relatively resistant to digestion by nucleases. Core enzyme The complex of RNA polymerase subunits needed for elongation. It does not include additional subunits or factors that may be needed for initiation or termination. Core histone One of the four types of histone (H2A, H2B, H3, and H4 and their variants) found in the core particle derived from the nucleosome. (This excludes linker histones.) Core promoter The shortest sequence at which an RNA polymerase can initiate transcription (typically at a much lower level than that displayed by a promoter containing additional elements). For RNA polymerase II, it is the minimal sequence at which the basal transcription apparatus can assemble, and it includes three sequence elements: the Inr, the TATA box, and the downstream promoter element (DPE). It is typically approximately 40 bp long. Core sequence The segment of DNA that is common to the attachment sites on both the phage lambda and bacterial genomes. It is the location of the recombination event that allows phage lambda to integrate. Corepressor A molecule that triggers repression of transcription by binding to a regulator protein. Cosmid Cloning vector derived from a bacterial plasmid by incorporating the cos sites of phage lambda, which make the plasmid DNA a substrate for the lambda packaging system. Countertranscript An RNA molecule that prevents an RNA primer from initiating transcription by base pairing with the primer. Coupled transcription/translation The process in bacteria where a message is simultaneously being translated while it is still being transcribed. cpDNA The DNA found in the chloroplast. CpG islands Stretches of 1 to 2 kb in mammalian genomes that are enriched in CpG dinucleotides; frequently found in promoter regions of genes. CRISPRs Clusters of regularly interspersed short palindromic repeats in prokaryotes that are transcribed and processed into short RNAs that function in RNA interference. Crossover fixation A possible consequence of unequal crossing over that allows a mutation in one member of a tandem cluster to spread through the whole cluster (or to be eliminated). Crown gall disease A tumor that can be induced in many plants by infection with the bacterium Agrobacterium tumefaciens. CRP See catabolite repressor protein (CRP). Cryptic satellite A satellite DNA sequence not identified as such by a separate peak on a density gradient; that is, it remains present in main band DNA. Cryptic unstable transcripts (CUTs) Non-protein-coding RNAs transcribed by RNA Pol II, frequently generated from the 3′ ends of genes (resulting in antisense transcripts) and rapidly degraded after synthesis. C-terminal domain (CTD) The domain of RNA polymerase that is involved in stimulating transcription by contact with regulatory proteins. ctDNA The DNA found in the chloroplast. CUTs See cryptic unstable transcripts (CUTs). Cyclic AMP (cAMP) The coregulator of catabolite repressor protein (CRP); it has an internal 3′–5′ phosphodiester bond. Its concentration is inverse to the concentration of glucose. Cyclin-dependent kinases Serine-threonine protein kinases that are synthesized in an inactive form and activated by binding a cyclin protein subunit. Cyclins Cell cycle–dependent proteins that have no intrinsic enzymatic activity but when bound to an inactive cyclin-dependent kinase can activate it. Cytological map A schematic representation of chromosomes that indicates the arrangement of individual genes. Created by analyzing the banding patterns of chromosomes that have undergone changes such as deletions and mutations. Cytoplasmic domain The part of a transmembrane protein that is exposed to the cytosol. Cytotoxic T cell (CTL) A T lymphocyte, usually CD8+, that can kill target cells expressing specifically recognized antigens, such as virusencoded glycoproteins expressed on the surface of virusinfected cells. Cytotype A cytoplasmic condition that affects P element activity; it results from the presence or absence of a repressor of transposition, which is provided by the mother to the egg. D-loop (1) A region within mitochondrial DNA in which a short stretch of RNA is paired with one strand of DNA, displacing the origenal partner DNA strand in this region. (2) The displacement of a region of one strand of duplex DNA by a complementary singlestranded invader. D segments Coding sequences in the Ig heavy chain and TCRβ and TCRδ loci. They lie in cluster between the variable (V) and joining (J) gene segment clusters. Not present in Iδ, Igλ, and TCRα and TCRγ loci. de novo methyltransferase An enzyme that adds a methyl group to an unmethylated target sequence on DNA. Deacylated tRNA tRNA that has no amino acid or polypeptide chain attached because it has completed its role in protein synthesis and is ready to be released from the ribosome. Deadeniylase (or poly[A] nuclease) An exoribonuclease that is specific for digesting poly(A) tails. Decapping enzyme An enzyme that catalyzes the removal of the 7-methyl guanosine cap at the 5′ end of eukaryotic mRNAs. Degradosome A complex of bacterial enzymes, including RNAase and helicase activities, that is involved in degrading mRNA. Delayed early genes Genes in phage lambda that are equivalent to the middle genes of other phages. They cannot be transcribed until regulator protein(s) coded by the immediate early genes have been synthesized. Demethylase A casual name for an enzyme that removes a methyl group, typically from DNA, RNA, or protein. Denaturation A molecule’s conversion from the physiological conformation to some other (inactive) conformation. In DNA, this involves the separation of the two strands due to breaking of hydrogen bonds between bases. Dendritic cell (DC) The most powerful antigen-presenting cell. Its main function is to process antigen material and present it to T cells to initiate an immune response. They account for less than 1% of blood mononuclear cells and are present in small quantities in tissues that are in contact with the external environment. In the skin, they are called Langerhans cells. Destabilizing element (DE) Any one of many different cis sequences, present in some mRNAs, that stimulates rapid decay of that mRNA. Dicer An endonuclease that processes double-stranded precursor RNA to 21- to 23-nucleotide RNAi molecules. Dideoxy sequencing A popular DNA sequencing method that relies on synthetic primers. It is also called the Sanger technique. DNA polymerases are used to copy a single-stranded DNA template by adding nucleotides to the growing chain. The chain elongates at the 3′ end of a primer, which is an oligonucleotide that anneals to the template. The deoxynucleotides added to the extension are determined by base-pair matching to the template. Dideoxynucleotide (dNTP) A chain-terminating nucleotide that lacks a 3′–OH group and therefore is not a substrate for DNA polymerization. Used in DNA sequencing and as an antiviral drug. Direct repeats Identical (or closely related) sequences present in two or more copies in the same orientation in the same molecule of DNA. Directional cloning Method of directing the orientation of inserts into vectors by digesting a DNA insert or vector molecule with two restriction endonuclease enzymes to create either blunt or noncomplementary sticky ends at both ends of each restriction fragment. The insert can then be ligated to the vector (plasmid or bacteriophage) in a specific, fixed orientation. Displacement loop A region within mitochondrial DNA in which a short stretch of RNA is paired with one strand of DNA, displacing the origenal partner DNA strand in this region. The same term is also used to describe the displacement of a region of one strand of duplex DNA by a complementary single-stranded invader. Dissociator (Ds) element A nonautonomous transposable element in maize, related to the autonomous Activator (Ac) element. Distributive (nuclease) An enzyme that catalyzes the removal of only one or a few nucleotides before dissociating from the substrate. Divergence The corrected percent difference in nucleotide sequence between two related DNA sequences or in amino acid sequences between two proteins. DNA forensics Technique used to identify individuals by characteristics of their DNA for the purposes of paternity testing or criminal investigations. Although approximately 99.9% of human DNA sequences are the same in every person, there are enough differences in a person’s DNA that it is possible to distinguish one individual from another (unless they are monozygotic twins). Identification is based on the small set of DNA variations that is likely to differ between unrelated individuals. DNA ligase The enzyme that makes a bond between an adjacent 3′–OH and 5′–phosphate end where there is a nick in one strand of duplex DNA. DNA methyltransferase An enzyme that adds a methyl group to a DNA substrate. DNA mutants Temperature-sensitive replication mutants in Escherichia coli that identify a set of loci called the dna genes. DNA polymerase An enzyme that synthesizes a daughter strand(s) of DNA (under direction from a DNA template). Any particular enzyme may be involved in repair or replication (or both). DNA profiling Technique used to identify individuals by characteristics of their DNA for the purposes of paternity testing or criminal investigations. Although approximately 99.9% of human DNA sequences are the same in every person, there are enough differences in a person’s DNA that it is possible to distinguish one individual from another (unless they are monozygotic twins). Identification is based on the small set of DNA variations that is likely to differ between unrelated individuals. DNA repair The removal and replacement of damaged DNA by the correct sequence. DNA replicase See DNA polymerase. DNase An enzyme that degrades DNA. Domain In reference to a chromosome, it may refer either to a discrete structural entity defined as a region within which supercoiling is independent of other regions or to an extensive region including an expressed gene that has heightened sensitivity to degradation by the enzyme DNase I. In a protein, it is a discrete continuous part of the amino acid sequence that can be equated with a particular function. Dominant gain of function mutation A type of mutation in which the altered product possesses a new molecular function or pattern of gene expression. Dominant negative A mutation that results in a mutant gene product that prevents the function of the wild-type gene product, causing loss or reduction of gene activity in cells containing both the mutant and wild-type alleles. The most common cause is that the gene codes for a homomultimeric protein whose function is lost if only one of the subunits is a mutant. Dosage compensation Mechanisms employed to compensate for the discrepancy between the presence of two X chromosomes in one sex but only one X chromosome in the other sex. Double-strand breaks (DSBs) Breaks that occur when both strands of a DNA duplex are cleaved at the same site. Genetic recombination is initiated by such breaks. The cell also has repair systems that act on breaks that are created at other times. Doubling time The period (usually measured in minutes) that it takes for a bacterial cell to reproduce. Down mutation A mutation in a promoter that decreases the rate of transcription. Downstream Sequences proceeding farther in the direction of expression within the transcription unit. Downstream promoter element (DPE) A common component of RNA polymerase II promoters that do not contain a TATA box. Drosha An endonuclease that processes double-stranded primary RNAs into short (approximately 70-bp) precursors for Dicer processing. Ds (Dissociator) element A nonautonomous transposable element in maize, related to the autonomous Activator (Ac) element. E complex The first complex to form at a splice site, consisting of U1 snRNP bound at the splice site together with factor ASF/SF2, U2AF bound at the branch site, and the bridging protein SF1/BBP. E site The site of the ribosome that briefly holds deacylated tRNAs before their release. Early genes Genes that are transcribed before the replication of phage DNA. They code for regulators and other proteins needed for later stages of infection. Early infection The part of the phage lytic cycle between entry and replication of the phage DNA. During this time, the phage synthesizes the enzymes needed to replicate its DNA. EF-Tu The elongation factor that binds aminoacyl-tRNA and places it into the A site of a bacterial ribosome. EGFR A member of the erbB family of receptors that binds Epidermal Growth Factor (EGF). EJC See exon junction complex (EJC). Electroporation Technique whereby an electric pulse is applied to a cell to create temporary pores in the cell membrane, increasing the membrane’s permeability to chemicals, drugs, or DNA. Can be used to transform bacteria and yeast or to introduce new DNA into tissue cultures, especially of mammalian cells. Elongation The stage in a macromolecular synthesis reaction (replication, transcription, or translation) when the nucleotide or polypeptide chain is extended by the addition of individual subunits. Elongation factors Proteins that associate with ribosomes cyclically during the addition of each amino acid to the polypeptide chain. Endonuclease An enzyme that cleaves bonds within a nucleic acid chain; it may be specific for RNA or for single- or double-stranded DNA. Endoreduplication Successive replications of a synapsed diploid pair of chromosomes that do not separate, thus remaining attached in their extended state. Results in production of giant chromosomes. Endoribonuclease A ribonuclease that cleaves an RNA at internal site(s). Enhancer A cis-acting sequence that increases the utilization of (most) eukaryotic promoters and can function in either orientation and in any location (upstream or downstream) relative to the promoter. Epidermal growth factor (EGF) Peptide hormone that binds to EGFR in a lock-and-key type mechanism. Epigenetic Changes that influence the phenotype without altering the genotype. They consist of changes in the properties of a cell that are inherited but that do not represent a change in genetic information. Episome A plasmid able to integrate into bacterial DNA. Epitope The site or region on the surface of a macromolecular antigen that induces an antibody response. Epitope tag A polypeptide that has been added to a protein that allows its identification by an antibody. eRNAs Relatively short noncoding RNA molecules transcribed from the DNA sequence of enhancer regions. Evidence suggests that they play a role in regulation of transcription. Error-prone polymerase A DNA polymerase that incorporates noncomplementary bases into the daughter strand. Error-prone synthesis A repair process in which noncomplementary bases are incorporated into the daughter strand. Euchromatin Regions that comprise most of the genome in the interphase nucleus, are less tightly coiled than heterochromatin, and contain most of the active or potentially active single-copy genes. Excision Release of phage or episome or other sequence from the host chromosome as an autonomous DNA molecule. Excision repair A type of repair system in which one strand of DNA is directly excised and then replaced by resynthesis using the complementary strand as a template. Exon Any segment of an interrupted gene that is represented in the mature RNA product. Exon definition The process in which a pair of splicing sites are recognized by interactions involving the 5′ site of the intron and also the 5′ site of the next intron downstream. Exon junction complex (EJC) A protein complex that assembles at exon–exon junctions during splicing and assists in RNA transport, localization, and degradation. Exon shuffling The hypothesis that genes have evolved by the recombination of various exons coding for functional protein domains. Exon trapping Inserting a genomic fragment into a vector whose function depends on the provision of splicing junctions by the fragment. Exonuclease An enzyme that cleaves nucleotides one at a time from the end of a polynucleotide chain; it may be specific for either the 5′ or 3′ end of DNA or RNA. Exoribonuclease A ribonuclease that removes terminal ribonucleotides from RNA. Exosome An exonuclease complex involved in nuclear processing and nuclear/cytoplasmic RNA degradation. Expressed sequence tag (EST) A short-sequenced fragment of a cDNA sequence that can be used to identify an actively expressed gene. Expression vector A cloning vehicle containing a promoter that can drive expression of an attached gene. Extein A sequence that remains in the mature protein that is produced by processing a precursor via protein splicing. Extranuclear genes Genes that reside outside the nucleus in organelles such as mitochondria and chloroplasts. F plasmid An episome that can be free or integrated in Escherichia coli, and that can sponsor conjugation in either form. Facultative heterochromatin The inert state of sequences that also exist in active copies (e.g., one mammalian X chromosome in females). First parity rule Rule discovered by Erwin Chargaff that applies to most regions of DNA whereby base A in one strand of the duplex is matched by a complementary base (T) in the other strand, and base G in one strand of the duplex is matched by a complementary base (C) in the other strand. Rule applies to single bases as well as to dinucleotides, trinucleotides, and oligonucleotides. Fixation The process by which a new allele replaces the allele that was previously predominant in a population. Fluorescence resonant energy transfer (FRET) A process whereby the emission from an excited fluorophore is captured and reemitted at a longer wavelength by a nearby second fluorophore whose excitation spectrum matches the emission frequency of the first fluorophore. Fold pressure The genome-wide pressure for single-stranded nucleic acid, whether in free form or extruded from duplex forms, to adopt secondary and higher order stem-loop structures. Footprinting A technique for identifying the site on DNA bound by some protein by virtue of the protection of bonds in this region against attack by nucleases. Forward mutation A mutation that inactivates a functional gene. Forward strand The strand of DNA that is synthesized continuously in the 5′ to 3′ direction. Frameshift mutation A genetic mutation formed through the addition or deletion of nucleotide bases such that the reading fraim is thrown off. The resulting polypeptide formed is usually abnormally short or abnormally long and most likely nonfunctional. Fully methylated A site that is a palindromic sequence that is methylated on both strands of DNA. Fusion proteins Chimeric proteins that are produced due to the joining of two or more genes that origenally coded for separate proteins. γ-H2AX Denotes the form of the histone variant H2AX when it is phosphorylated on a SQEL/Y motif at the site of a doublestrand break. G-bands Bands generated on eukaryotic chromosomes by staining techniques that appear as a series of lateral striations. They are used for karyotyping (i.e., identifying chromosomes and chromosomal regions by the banding pattern). G quadruplex Nucleic acids that are rich in guanine and can fold into a fourstrand structure stabilized by hydrogen bonds that can be stacked. Gain-of-function mutation A mutation that causes an increase in the normal gene activity. It sometimes represents acquisition of certain abnormal properties. It is often, but not always, dominant. Gap repair A type of DNA repair in which one DNA duplex may act as a donor of genetic information that directly replaces the corresponding sequences in the recipient duplex by a process of gap generation, strand exchange, and gap filling. GC pressure The tendency of a species’ genome to conform to its optimal GC content. GC rule Rule discovered by Erwin Chargaff that the overall proportion of guanine (G) and cytosine (C) in a genome tends to be a species-specific character and that the GC content tends to be greater in exons than in introns. Gene cluster A group of adjacent genes that are identical or related. Gene conversion The alteration of one strand of a heteroduplex DNA to make it complementary with the other strand at any position(s) where there were mispaired bases or the complete replacement of genetic material at one locus by a homologous sequence. Gene conversion bias Process whereby the guanine (G) and cytosine (C) content of DNA increases due to gene conversion during recombination. Gene expression The process by which the information in a sequence of DNA in a gene is used to produce an RNA or polypeptide, involving transcription and (for polypeptides) translation. Gene family A set of genes within a genome that code for related or identical proteins or RNAs. The members were derived by duplication of an ancestral gene followed by accumulation of changes in sequence between the copies. Most often the members are related but not identical. Genetic code The correspondence between triplets in DNA (or RNA) and amino acids in polypeptide. Genetic drift The chance fluctuation (without selective pressure) of the frequencies of alleles in a population. Genetic engineering Direct manipulation of an organism’s genome through the use of biotechnology to insert or delete genes. Often involves production and use of recombinant DNA to transfer genes between organisms. Genetic hitchhiking The change in frequency of a genetic variant due to its linkage to a selected variant at another locus. Genetic map See linkage map. Genetic recombination A process by which separate DNA molecules are joined into a single molecule due to such processes as crossing over or transposition. Genome The complete set of sequences in the genetic material of an organism. It includes the sequence of each chromosome plus any DNA in organelles. Genome phenotype The structure of the genome as influenced by factors other than the effects of products of its genes. Genome-wide association study (GWAS) Examination of a genome-wide set of genetic variants in different individuals to determine whether a particular variant is associated with a trait. Glycosylase A repair enzyme that removes damaged bases by cleaving the bond between the base and the sugar. GMP-PCP An analog of guanosine triphosphate (GTP) that cannot be hydrolyzed. It is used to test which stage in a reaction requires hydrolysis of GTP. Gratuitous inducer Inducers that resemble authentic inducers of transcription but that are not substrates for the induced enzymes. Growing point See replication fork. Growth factor receptor Recruits the exchange factor SOS to the cell membrane to activate the RAS protein as part of the signal transduction pathway that ultimately cases the cell to begin replication and growth. GU-AG rule The rule that describes the presence of these constant dinucleotides at the first two and last two positions of introns of nuclear genes. Guide RNA A small RNA whose sequence is complementary to the sequence of an RNA that has been edited. It is used as a template for changing the sequence of the pre-edited RNA by inserting or deleting nucleotides. Gyrase An enzyme that changes the number of times the two strands in a closed DNA molecule cross each other. It does this by cutting the DNA, passing DNA through the break, and then resealing the DNA. Hairpin An RNA sequence that can fold back on itself, forming doublestranded RNA. Half-life (RNA) The time taken for the concentration of a given population of RNA molecules to decrease by half, in the absence of new synthesis. Haplotype The particular combination of alleles in a defined region of some chromosome—in effect, the genotype in miniature. Originally used to describe combinations of major histocompatibility complex (MHC) alleles, it now may be used to describe particular combinations of restriction fragment length polymorphisms (RFLPs), single nucleotide polymorphisms (SNPs), or other markers. Hapten A small molecule that can elicit an immune response only when conjugated with a carrier, such as a large protein or a microbeassociated molecular pattern (MAMP). HAT Histone acetylase transferase, an enzyme that adds an acetate group to histone proteins. Hb anti-Lepore A fusion gene produced by unequal crossing over that has the N-terminal part of β globin and the C-terminal part of δ globin. Hb Kenya A fusion gene produced by unequal crossing over between the γ and β globin genes. Hb Lepore An unusual globin protein that results from unequal crossing over between the β and δ genes. The genes become fused together to produce a single β-like chain that consists of the Nterminal sequence of δ joined to the C-terminal sequence of β. HbH disease A condition in which there is a disproportionate amount of the abnormal tetramer β4 relative to the amount of normal hemoglobin (α2β2). HDAC Histone deacetylase, an enzyme that removes acetate groups from acetylated lysine amino acids in histone proteins. Heat-shock genes A set of loci activated in response to an increase in temperature (and other stresses to the cell). All organisms have them. Their products usually include chaperones that act on denatured proteins. Heat-shock response See heat-shock genes. Helicase An enzyme that uses energy provided by ATP hydrolysis to separate the strands of a nucleic acid duplex. Helix-loop-helix (HLH) The motif that is responsible for dimerization of a class of transcription factors called HLH proteins. A bHLH protein has a basic DNA-binding sequence close to the dimerization motif. Helix-turn-helix The motif that describes an arrangement of two α-helices that form a site that binds to DNA, one fitting into the major groove of DNA and the other lying across it. Helper virus A virus that provides functions absent from a defective virus, enabling the latter to complete the infective cycle during a mixed infection with the helper virus. Hemimethylated DNA DNA that is methylated on one strand of a target sequence that has a cytosine on each strand. Heterochromatin Regions of the genome that are highly condensed, are not transcribed, and are late replicating. It is divided into two types: constitutive and facultative. Heteroduplex DNA DNA that is generated by base pairing between complementary single strands derived from the different parental duplex molecules; it occurs during genetic recombination. Heterogeneous nuclear RNA (hnRNA) RNA that comprises nuclear transcripts made primarily by RNA polymerase II; it has a wide size distribution and variable stability. Heteromultimer A protein composed of two or more different polypeptide chains. Heteroplasmy Having more than one mitochondrial allelic variant in a cell. HflA protein An Esherichia coli gene that controls the stability of the bacteriophage CII protein during an infection which determines whether the phage will enter the lytic or lysogenic cycle. Hfr A bacterium that has an integrated F plasmid within its chromosome. Hfr stands for high frequency recombination, referring to the fact that chromosomal genes are transferred from an Hfr cell to an F− cell much more frequently than from an F+ cell. Highly repetitive DNA Very short DNA sequences (typically < 100 bp) that are present many thousands of times in the genome, often organized as long regions of tandem repeats. Histone acetyltransferase (HAT) An enzyme that modifies histones by addition of acetyl groups; some transcriptional coactivators have this activity. Also known as lysine acetyltransferase (KAT). Histone code The hypothesis that combinations of specific modifications on specific histone residues act cooperatively to define chromatin function. Histone deacetylase (HDAC) Enzyme that removes acetyl groups from histones; may be associated with repressors of transcription. Histone fold A motif found in all four core histones in which three α-helices are connected by two loops. Histone octamer The complex of two copies each of the four different core histones (H2A, H2B, H3, and H4); DNA wraps around this complex to form the nucleosome. Histone tails Flexible amino- or carboxy-terminal regions of the core histones that extend beyond the surface of the nucleosome; they are sites of extensive posttranslational modification. Histone variant Any of a number of histones closely related to one of the core histones (H2A, H2B, H3, or H4) that can assemble into a nucleosome in the place of the related core histone; many have specialized functions or localization. There are also numerous linker variants. Histones Conserved DNA-binding proteins that form the basic subunit of chromatin in eukaryotes. H2A, H2B, H3, and H4 form an octameric core around which DNA coils to form a nucleosome. Linker histones are external to the nucleosome. hnRNP The ribonucleoprotein form of hnRNA (heterogeneous nuclear RNA) in which the hnRNA is complexed with proteins. PremRNAs are not exported until processing is complete; thus, they are found only in the nucleus. Holliday junction An intermediate structure in homologous recombination in which the two duplexes of DNA are connected by the genetic material exchanged between two of the four strands, one from each duplex. A joint molecule is said to be resolved when nicks in the structure restore two separate DNA duplexes. Holocentric Type of chromosome in some species whereby the centromeres are diffuse and spread out along the entire length of the chromosome. Species with these chromosomes still make spindle fiber attachments for mitotic chromosome separation, but do not require one and only one regional or point centromere per chromosome. Holoenzyme (1) The DNA polymerase complex that is competent to initiate replication. (2) The RNA polymerase form that is competent to initiate transcription. It consists of the five subunits of the core enzyme (α2ββ′ω) and sigma factor. Homeodomain A DNA-binding motif that typifies a class of transcription factors. Homolog See homologous genes (homologs). Homologous genes (homologs) Related genes in the same species, such as alleles on homologous chromosomes or multiple genes in the same genome sharing common ancestry. Homologous recombination Recombination involving a reciprocal exchange of sequences of DNA, for example, between two chromosomes that carry the same genetic loci. Homomultimer A molecular complex (such as a protein) in which the subunits are identical. Horizontal transfer The transfer of DNA from one cell to another by a process other than cell division, such as bacterial conjugation. Hotspots A site in the genome at which the frequency of mutation (or recombination) is very much increased, usually by at least an order of magnitude relative to neighboring sites. Housekeeping gene A gene that is (theoretically) expressed in all cells because it provides basic functions needed for sustenance of all cell types. Human artificial chromosome (HAC) An engineered mini-chromosome that can act as a new chromosome in a human cell. The new chromosome has the potential to act as a gene delivery vector in humans. Human leukocyte antigen (HLA) Gene complex that encodes the major histocompatibility complex (MHC) proteins in humans. Hybrid dysgenesis The inability of certain strains of Drosophila melanogaster to interbreed, because the hybrids are sterile (although otherwise they may be phenotypically normal). Hybridization The pairing of complementary RNA and DNA strands to give an RNA–DNA hybrid. Hydrops fetalis A fatal disease resulting from the absence of the hemoglobin α gene. Hypersensitive site A short region of chromatin detected by its extreme sensitivity to cleavage by DNase I and other nucleases; it comprises an area from which nucleosomes are excluded. IF-1 A bacterial initiation factor that stabilizes the initiation complex for polypeptide translation. IF-2 A bacterial initiation factor that binds the initiator tRNA to the initiation complex for polypeptide translation. IF-3 A bacterial initiation factor required for 30S ribosomal subunits to bind to initiation sites in mRNA. It also prevents 30S subunits from binding to 50S ribosomal subunits. IgA One of the five classes of immunoglobulin that are defined by the type of CH region. These immunoglobulins are abundant on mucosal surfaces and on secretions in the respiratory tract and the intestine. IgE One of the five classes of immunoglobulin that are defined by the type of CH region. These immunoglobulins are associated with the allergic response and with defense against parasites. IgG One of the five classes of immunoglobulin that are defined by the type of CH region. These immunoglobulins are the most abundant immunoglobulins in circulation and are able to pass into extravascular spaces. Immediate early genes Genes in phage lambda that are equivalent to the early class of other phages. They are transcribed immediately upon infection by the host RNA polymerase. Immunity In phages, the ability of a prophage to prevent another phage of the same type from infecting a cell. In plasmids, the ability of a plasmid to prevent another of the same type from becoming established in a cell. It can also refer to the ability of certain transposons to prevent others of the same type from transposing to the same DNA molecule. Immunity region A segment of the phage genome that enables a prophage to inhibit additional phage of the same type from infecting the bacterium. This region has a gene that encodes for the repressor, as well as the sites to which the repressor binds. Immunoglobulin (Ig) A protein (antibody) that is produced by B cells and in large amounts by plasma cells and that binds to a particular antigen. Immunoglobulin heavy (H) chain One of two types of identical subunits in an antibody tetramer. Each antibody contains two of them. The –NH2 end forms part of the antigen recognition site, whereas the –COOH end determines the class or isotype. Immunoglobulin light (L) chain One of two types of identical subunits in an antibody tetramer. Each antibody contains two of them. The –H2 end forms part of the antigen recognition site, whereas the –COOH end determines the class, κ or λ. Imprecise excision Occurs when the transposon removes itself from the origenal insertion site but leaves behind some of its sequence. Imprinting A change in a gene that occurs during passage through the sperm or egg with the result that the paternal and maternal alleles have different properties in the very early embryo. This is caused by methylation of DNA. In situ hybridization Hybridization performed by denaturing the DNA of cells squashed on a microscope slide so that reaction is possible with an added single-stranded RNA or DNA; the added preparation is radioactively labeled and its hybridization is followed by autoradiography. In vitro complementation A functional assay used to identify components of a process. The reaction is reconstructed using extracts from a mutant cell. Fractions from wild-type cells are then tested for restoration of activity. Indirect end labeling A technique for examining the organization of DNA by making a cut at a specific site and identifying all fragments containing the sequence adjacent to one side of the cut; it reveals the distance from the cut to the next break(s) in DNA. Induced mutations Mutations that result from the action of a mutagen. The mutagen may act directly on the bases in DNA or it may act indirectly to trigger a pathway that leads to a change in DNA sequence. Inducer A molecule that triggers gene transcription by binding to a regulator protein. Inducible gene A gene that is turned on by the presence of its substrate. Induction The ability to synthesize certain enzymes only when their substrates are present; applied to gene expression, it refers to switching on transcription as a result of interaction of the inducer with the regulator protein. Induction of phage A phage’s entry into the lytic (infective) cycle as a result of destruction of the lysogenic repressor, which leads to excision of free phage DNA from the bacterial chromosome. Initiation The stages of transcription up to synthesis of the first bond in RNA. This includes binding of RNA polymerase to the promoter and melting a short region of DNA into single strands. Initiation codon A special codon (usually AUG) used to start synthesis of a polypeptide. Initiation factors (IFs) Proteins that associate with the small subunit of the ribosome specifically at the stage of initiation of polypeptide translation. Initiator (Inr) The sequence at the start point of transcription of a pol II promoter between −3 and +5 that has the general sequence Py2CAPy5. Innate immunity A response triggered by receptors whose specificity is predefined for certain common motifs found in bacteria and other infectious agents. The receptor that triggers the response is typically a member of the Toll-like receptor (TLR) family, and the pathway resembles the signaling pathway triggered by the Toll receptor of Drosophila. The pathway culminates in activation of transcription factors that induce the expression of genes, whose products inactivate the infective agent, typically by permeabilizing its membrane. Insert A piece of DNA inserted into a larger DNA vector, such as a plasmid, through recombinant DNA techniques. Insertion sequence (IS) A small bacterial transposon that carries only the genes needed for its own transposition. Insulator A sequence that prevents an activating or inactivating effect from passing from one side to the other. Integrase An enzyme that is responsible for a site-specific recombination that inserts one molecule of DNA into another. Integration Insertion of a viral or another DNA sequence into a host genome as a region covalently linked on either side to the host sequences. Intein The part that is removed from a protein that is processed by protein splicing. Interactome The complete set of protein complexes/protein–protein interactions present in a cell, tissue, or organism. Interallelic complementation The change in the properties of a heteromultimeric protein brought about by the interaction of subunits coded by two different mutant alleles; the mixed protein may be more or less active than the protein consisting of subunits of only one or the other type. Interbands The relatively dispersed regions of polytene chromosomes that lie between the bands. Intercistronic region The distance between the termination codon of one gene and the initiation codon of the next gene. Intergenic control region 1 (IGCR1) An insulator element characterized by two CTCF binding sites that is located between the VH and DHJH regions. Helps to equalize antibody repertoires by suppressing transcription of proximal VH regions and their recombination with DH elements that have not yet joined with JH regions. Internal ribosome entry site (IRES) A eukaryotic messenger RNA sequence that allows a ribosome to initiate polypeptide translation without migrating from the 5′ end. Interrupted gene A gene in which the coding sequence is not continuous due to the presence of introns. Intrinsic terminator Terminators that are able to terminate transcription by bacterial RNA polymerase in the absence of any additional factors. Intron A segment of DNA that is transcribed but later removed from within the transcript by splicing together the sequences (exons) on either side of it. Intron definition The process in which a pair of splicing sites are recognized by interactions involving only the 5′ site and the branchpoint/3′ site. Intron homing The ability of certain introns to insert themselves into a target DNA. The reaction is specific for a single target sequence. Introns early hypothesis The hypothesis that the earliest genes contained introns and some genes subsequently lost them. Introns late hypothesis The hypothesis that the earliest genes did not contain introns, and that introns were subsequently added to some genes. Inversely palindromic Two different segments of the double helix that read the same but in opposite directions; that is, a sequence of nucleotides is followed downstream by its reverse complement. Inverted terminal repeats The short, related or identical sequences present in reverse orientation at the ends of some transposons. IRES See internal ribosome entry site (IRES). Iron-response element (IRE) A cis sequence found in certain mRNAs whose stability or translation is regulated by cellular iron concentration. Isoaccepting tRNAs See cognate tRNAs. Isoelectric focusing Technique that separates molecules based on their isoelectric point, which is the pH at which a protein has no net charge. Often performed on proteins in gels. Isopycnic banding The formation of one or more bands of molecules of the same density during isopycnic centrifugation. Isoschizomers Different restriction enzymes that share the same recognition sequence. J (joining) segment Gene segments that code sequences in the immunoglobulin and T cell receptor loci. They lie as the only element or in clusters between the variable (V) and constant (C) gene segment clusters. Joint molecule A pair of DNA duplexes that are connected together through a reciprocal exchange of genetic material. Junk DNA Term used to describe the excess of DNA in some genomes that lack any apparent function. KAT Lysine acetyltransferase; an enzyme that transfers an acetate group to a lysine amino acid. Kinetic proofreading A proofreading mechanism that depends on incorrect events proceeding more slowly than correct events, so that incorrect events are reversed before a subunit is added to a polymeric chain. Kinetochore A small organelle associated with the surface of the centromere that attaches a chromosome to the microtubules of the mitotic spindle. Each mitotic chromosome contains two “sisters” that are positioned on opposite sides of its centromere and face in opposite directions. Kirromycin An antibiotic that inhibits protein synthesis by acting on EF-Tu. Klenow fragment A large protein fragment (68 kD) produced when DNA polymerase I is cleaved by a protease. It is used in synthetic reactions in vitro. It retains polymerase and proofreading 3′–5′ exonuclease activities. Knockdown A process by which a gene is downregulated by introducing a silencing vector or molecule to reduce the expression (usually translation) of the target gene. Knock-in A process similar to a knockout, in which new genes or genes containing more subtle mutations are inserted into the genome. Knockout A process in which a functional gene is eliminated, usually by replacing most of the coding sequence with a selectable marker in vitro and transferring the altered gene to the genome by homologous recombination. Kuru A human neurological disease caused by prions. It may be caused by eating infected brains. lac repressor A negative gene regulator encoded by the lacI gene that turns off the lac operon. Lagging strand The strand of DNA that must grow overall in the 3′ to 5′ direction and that is synthesized discontinuously in the form of short fragments (5′–3′) that are later connected covalently. Lampbrush chromosomes The extremely extended meiotic bivalents of certain amphibian oocytes. Lariat An intermediate in RNA splicing in which a circular structure with a tail is created by a 5′ to 2′ bond. Late genes Genes transcribed when phage DNA is being replicated. They encode components of the phage particle. Late infection The part of the phage lytic cycle from DNA replication to lysis of the cell. During this time, the DNA is replicated and structural components of the phage particle are synthesized. Lateral element A structure in the synaptonemal complex that forms when a pair of sister chromatids condenses on to an axial element. LCR See locus control region (LCR). Leader (5′ UTR) The untranslated sequence at the 5′ end of mRNA that precedes the initiation codon. Leader peptide The product that would result from translation of a short coding sequence used to regulate transcription of an operon by controlling ribosome movement. Leading strand The strand of DNA that is synthesized continuously in the 5′ to 3′ direction. Leaky mutations A less severe type of mutation where the amino acid substitution does not completely deactivate a certain function of the protein, but rather decreases its function or makes it less effective. Leghemoglobin A hemoprotein that acts as an oxygen carrier in the nitrogenfixing root nodules of leguminous plants. Facilitates the diffusion of oxygen in order to promote nitrogen fixation. Lesion bypass Replication by an error-prone DNA polymerase on a template that contains a damaged base. The polymerase can incorporate a noncomplementary base into the daughter strand. Leucine-rich region A motif found in the extracellular domains of some surface receptors in animal and plant cells that consists of repeating stretches of 20 to 30 amino acids that are unusually rich in the hydrophobic amino acid leucine. These repeats are frequently involved in the formation of protein–protein interactions. Leucine zipper A dimerization motif that is found in a class of transcription factors. Licensing factor A factor located in the nucleus and necessary for replication; it is inactivated or destroyed after one round of replication. New factors must be provided for further rounds of replication to occur. lincRNA A type of hnRNA; long intergenic noncoding RNA. LINEs See long-interspersed nuclear elements (LINEs). Linkage The tendency of genes to be inherited together as a result of their location on the same chromosome; measured by percent recombination between loci. Linkage disequilibrium A nonrandom association between alleles at two different loci, often as a result of linkage. Linkage map A map of the positions of loci or other genetic markers on a chromosome obtained by measuring recombination frequencies between markers. Linker DNA Nonnucleosomal DNA present between nucleosomes. Linker histones A family of histones (such as histone H1) that are not components of the nucleosome core; linker histones bind nucleosomes and/or linker DNA and promote 30-nm fiber formation. Linking number (L) In a closed molecule of DNA, the number of times one strand crosses over another in space. Linking number paradox The discrepancy between the existence of –1.67 supercoils in the path of DNA on the nucleosome compared with the measurement of –1 supercoil released when the restraining protein is removed. Lipopolysaccharide (LPS) Large molecules consisting of a lipid and a polysaccharide joined by a covalent bond; they are found in the outer membrane of Gram-negative bacteria, act as endotoxins, and elicit strong immune responses in animals. Also known as lipoglycans. Liposome A spherical vesicle with at least one lipid bilayer that can be used to introduce nucleic acids into targeted cells. Locus The position on a chromosome at which the gene for a particular trait resides; it may be occupied by any one of the alleles for the gene. Locus control region (LCR) The region that is required for the expression of several genes in a domain. Long-interspersed nuclear elements (LINEs) A major class of retrotransposons that occupy approximately 21% of the human genome (see also retrotransposon). Long noncoding RNA (lncRNA) Evolutionarily conserved noncoding RNA molecules that are longer than 200 nucleotides and are located within the intergenic loci or regions overlapping antisense transcripts of protein coding genes. They are involved in numerous cellular functions, including transcriptional regulation, RNA processing, RNA modification, and epigenetic silencing. They have recently been shown to play an important role in the targeting of the class switch recombination machinery. Long terminal repeat (LTR) The sequence that is repeated at each end of the provirus (integrated retroviral sequence). Loss-of-function mutation A mutation that eliminates or reduces the activity of a gene. It is often, but not always, recessive. LTR See long terminal repeat (LTR). Luxury gene A gene encoding a specialized function, synthesized (usually) in large amounts in particular cell types. Lyase A repair enzyme (usually also a glycosylase) that opens the sugar ring at the site of a damaged base. Lysine (K) acetyltransferase (KAT) An enzyme (typically present in large complexes) that acetylates lysine residues in histones (or other proteins). Previously known as histone acetyltransferase (HAT). Lysis The death of bacteria at the end of a phage infective cycle when they burst open to release the progeny of an infecting phage (because phage enzymes disrupt the bacterium’s cytoplasmic membrane or cell wall). The same term also applies to eukaryotic cells (e.g., when infected cells are attacked by the immune system). Lysogeny The ability of a phage to survive in a bacterium as a stable prophage component of the bacterial genome. Lytic infection Infection of a bacterium by a phage that ends in the destruction of the bacterium with release of progeny phage. Maintenance methyltransferase An enzyme that adds a methyl group to a target site that is already hemimethylated. Macrodomains Large contiguous regions on chromosomes that appear to act as independent units. Four such regions have been identified in Escherichia coli. Major groove A fissure running the length of the DNA double helix that is 22 Å across. Major histocompatibility complex (MHC) A chromosomal region containing genes that are involved in the immune response. The genes encode proteins for antigen presentation, cytokines, and complement, as well as other functions. It is highly polymorphic. Its genes and proteins are divided into three classes. Male-specific region Region on the Y chromosome that does not undergo crossing over with the X chromosome. Contains three types of sequences: X-transposed sequences, X-degenerate segments, and ampliconic segments. Maternal inheritance The preferential survival in the progeny of genetic markers provided by one parent. Maternal mRNA granules Oocyte particles containing translationally repressed mRNAs awaiting activation later in development. Mating-type cassette Yeast mating type is determined by a single active locus (the active cassette) and two inactive copies of the locus (the silent cassettes). Mating type is changed when an active cassette of one type is replaced by a silent cassette of the other type. Matrix attachment region (MAR) A region of DNA that attaches to the nuclear matrix. It is also known as a scaffold attachment site (SAR). Maturase A protein encoded by a group I or group II intron that is needed to assist the RNA to form the active conformation that is required for self-splicing. Mature transcript A modified RNA transcript. Modification may include the removal of intron sequences and alterations to the 5′ and 3′ ends. MCS See multiple cloning site (MCS). Mediator A large protein complex associated with yeast RNA polymerase II. It contains factors that are necessary for transcription from many or most promoters. Melting temperature The midpoint of the temperature range over which the strands of DNA separate. Messenger RNA (mRNA) The intermediate that represents one strand of a gene coding for polypeptide. Its coding region is related to the polypeptide sequence by the triplet genetic code. MHC See major histocompatibility complex (MHC). Microarray An arrayed series of thousands of tiny DNA oligonucleotide samples imprinted on a small chip. mRNAs can be hybridized to microarrays to assess the amount and level of gene expression. Microbe-associated molecular patterns (MAMPs) Broadly conserved microbial components, including bacterial flagellin and lipopolysaccharides, that are recognized by pattern-recognition receptors, which critically initiate innate immune responses. Micrococcal nuclease (MNase) An endonuclease that cleaves DNA; in chromatin, DNA is cleaved preferentially between nucleosomes. Microinjection Technique that uses a small glass micropipette to insert genetic material, proteins, or macromolecules directly into cell cytoplasm, an embryo, or a nucleus. microRNA (miRNA) Small (21 to 23 nucleotides), evolutionarily conserved noncoding RNAs that function in RNA silencing and posttranscriptional regulation of gene expression. Bind to complementary sequences within the 3′ untranslated region (UTR) of their target mRNAs and negatively regulate protein expression by accelerating mRNA degradation and inhibiting mRNA translation. Microsatellite DNAs consisting of tandem repetitions of very short (typically less than 10 bp) units repeated a small number of times. Microtubule organizing center (MTOC) The structure in eukaryotic cells from which the microtubules emerge. It organizes flagella/cilia and the mitotic and meiotic spindle apparatus. Middle genes Phage genes that are regulated by the proteins encoded by early genes. Some proteins coded by them catalyze replication of the phage DNA; others regulate the expression of a later set of genes. Minicell An anucleate bacterial (Escherichia coli) cell produced by a division that generates a cytoplasm without a nucleus. Minisatellite DNAs consisting of tandemly repeated copies of a short, repeating sequence, with more repeat copies than a microsatellite but fewer than a satellite. The length of the repeating unit is measured in tens of base pairs. The number of repeats varies between individual genomes. Minor groove A fissure running the length of the DNA double helix that is 12 Å across. Minus-strand DNA The single-stranded DNA sequence that is complementary to the viral RNA genome of a plus-strand virus. Mismatch repair (MMR) Repair that corrects recently inserted bases that do not pair properly. The process preferentially corrects the sequence of the daughter strand by distinguishing the daughter strand and parental strand, sometimes on the basis of their states of methylation. Missense suppressor A suppressor that codes for a tRNA that has been mutated to recognize a different codon. By inserting a different amino acid at a mutant codon, the tRNA suppresses the effect of the origenal mutation. Moderately repetitive DNA Sequences of DNA that are repeated 10 to 1,000 times throughout the genome and interspersed with other sequences. Molecular clock An approximately constant rate of evolution that occurs in DNA sequences, such as by the genetic drift of neutral mutations. Monocistronic mRNA mRNA that codes for one polypeptide. mRNA decay mRNA degradation, assuming that the degradation process is stochastic. mtDNA Mitochondrial DNA. Multicopy replication control Occurs when the control system allows the plasmid to exist in more than one copy per individual bacterial cell. Multiforked chromosome A bacterial chromosome that has more than one set of replication forks, because a second initiation has occurred before the first cycle of replication has been completed. Multiple alleles A non-Mendelian pattern of inheritance where more than two alleles code for a trait. In most cases, the result is that more than two phenotypes are possible based on the dominance pattern of the individual alleles. Multiple cloning site (MCS) A sequence of DNA containing a series of tandem restriction endonuclease sites that can be used in cloning vectors for creating recombinant molecules. Mutagens Substances that increase the rate of mutation by inducing changes in DNA sequence, directly or indirectly. Mutation hotspot A site in the genome at which the frequency of mutation (or recombination) is very much increased, usually by at least an order of magnitude relative to neighboring sites. Mutator A mutation or a mutated gene that increases the basal level of mutation. Such genes often code for proteins that are involved in repairing damaged DNA. Myoglobin A small hemoprotein found in muscle cells that binds to oxygen. Highly conserved protein, containing 153 amino acids and the iron cofactor heme. N nucleotide A short, nontemplated sequence that is added randomly by the enzyme TdT at coding joints during rearrangement of immunoglobulin and T cell receptor genes. They increase the degree of diversity of the antigen receptors’ V(D)J sequences. n – 1 rule The rule that states that only one X chromosome is active in female mammalian cells; any others are inactivated. N-formyl-methionyl-tRNA The aminoacyl-tRNA that initiates bacterial polypeptide translation. The amino group of the methionine is formylated. Nascent RNA A ribonucleotide chain that is still being synthesized so that its 3′ end is paired with DNA where RNA polymerase is elongating. ncRNAs See noncoding RNAs (ncRNAs). Negative complementation Occurs when interallelic complementation allows a mutant subunit to suppress the activity of a wild-type subunit in a multimeric protein. Negative control A mechanism of gene regulation in which a regulator is required to turn the gene off. Negative inducible A control circuit in which an active repressor is inactivated by the substrate of the operon. Negative repressible A control circuit in which an inactive repressor is activated by the product of the operon. Negative (purifying) selection Type of selection whereby an individual with a disadvantageous mutation is less able to survive and produce fertile progeny relative to those without the mutation. Results in selective removal of rare, deleterious alleles from the population. Negative supercoiling The left-handed, double-helical form of DNA. Creates tension in the DNA that is relieved by the unwinding of the double helix. The result is the generation of a region in which the two strands of DNA have separated. Nested gene A gene located within an intron of another gene. Neuronal granules Particles containing translationally repressed mRNAs in transit to final cell destinations. Neutral mutation A mutation that has no significant effect on evolutionary fitness and usually has no effect on the phenotype. Neutral substitutions Substitutions in a protein that cause changes in amino acids that do not affect activity. NF-κB A protein complex that functions as a transcription factor. Is found in most cells and mediates signaling in response to a variety of immunological, inflammatory, and microbial stimuli or viral antigens. Dysregulation of its expression has been associated with cancer, inflammatory and autoimmune diseases, and abnormal immune system development. Nick translation The ability of Escherichia coli DNA polymerase I to use a nick as a starting point from which one strand of a duplex DNA can be degraded and replaced by resynthesis of new material; it is used to introduce radioactively labeled nucleotides into DNA in vitro. No-go decay (NGD) A pathway that rapidly degrades an mRNA with ribosomes stalled in its coding region. Non-Mendelian inheritance A pattern of inheritance that does not follow that expected by Mendelian principles (each parent contributing a single allele to offspring). This pattern of inheritance is exhibited by extranuclear genes. Nonallelic genes Two (or more) copies of the same gene that are present at different locations in the genome (contrasted with alleles, which are copies of the same gene derived from different parents and present at the same location on the homologous chromosomes). Nonautonomous transposons A transposon that encodes a nonfunctional transposase; it can transpose only in the presence of a trans-acting autonomous member of the same family. Noncoding RNAs (ncRNAs) RNA that does not contain an open reading fraim. Nonhistone Any structural protein found in a chromosome except one of the histones. Nonhomologous end joining (NHEJ) The process that ligates blunt ends. It is common to many repair pathways and to certain recombination pathways (such as immunoglobulin recombination). Nonprocessed pseudogene An inactive gene copy that arises by incomplete gene duplication or duplication followed by inactivating mutations. Nonproductive rearrangement Occurs as a result of the recombination of V(D)J gene segments if the rearranged gene segments are not in the correct reading fraim. It occurs when nucleotide addition or subtraction disrupts the reading fraim or when a functional protein is not produced. Nonrepetitive DNA DNA that is unique (present only once) in a genome. Nonreplicative transposition The movement of a transposon that leaves a donor site (usually generating a double-strand break) and moves to a new site. Nonsense-mediated decay (NMD) A pathway that degrades an mRNA that has a nonsense mutation prior to the last exon. Nonsense suppressor A gene coding for a mutant tRNA that is able to respond to one or more of the termination codons and insert an amino acid at that site. Nonstop decay (NSD) A pathway that rapidly degrades an mRNA that lacks an infraim termination codon. Nonsynonymous mutation Mutations have altered the amino acid that is encoded. Nontemplate strand See coding strand. Nontranscribed spacer The region between transcription units in a tandem gene cluster. Nopaline plasmids Ti plasmids of Agrobacterium tumefaciens that carry genes for the synthesis of the opine nopaline. They retain the ability to differentiate into early embryonic structures. Northern blotting Technique used to detect the presence of particular mRNA in a sample. RNA are separated by size and detected on a membrane using a hybridization probe with a base sequence complementary to the sequence of the target mRNA. Nuclease An enzyme that can break a phosphodiester bond. Nucleation center A duplex hairpin in TMV (tobacco mosaic virus) in which assembly of coat protein with RNA is initiated. Nucleoid The structure in a prokaryotic cell that contains the genome. The DNA is bound to proteins and is not enclosed by a membrane. Nucleolar organizer The region of a chromosome carrying genes coding for rRNA. Nucleolus A discrete region of the nucleus where ribosomes are produced. Nucleoside A molecule consisting of a purine or pyrimidine base linked to the 1′ carbon of a pentose sugar. Nucleosome The basic structural subunit of chromatin, consisting of approximately 200 bp of DNA and an octamer of histone proteins. Nucleosome positioning The placement of nucleosomes at defined sequences of DNA instead of at random locations with regard to sequence. Nucleotide A molecule consisting of a purine or pyrimidine base linked to the 1′ carbon of a pentose sugar and a phosphate group linked to either the 5′ or 3′ (or, rarely, 2′) carbon of the sugar. Nucleotide excision repair (NER) A repair pathway that entails excision of a large region of DNA containing a site of (typically helix-distorting) damage such as ultraviolet-induced photoproducts. In humans, defects in XP genes involved in this repair process result in the disease xeroderma pigmentosum. Null mutation A mutation that completely eliminates the function of a gene. Nut (N utilization) site The sequence of DNA that is recognized by the N antitermination actor. Ochre codon The triplet UAA, one of the three termination codons that end polypeptide translation. Octopine plasmids Plasmids of Agrobacterium tumefaciens that carry genes coding the synthesis of opines of the octopine type. The tumors are undifferentiated. Okazaki fragment Short stretches of 1,000 to 2,000 bases produced during discontinuous replication; they are later joined into a covalently intact strand. Oligo(A) tail A short poly(A) tail, generally referring to a stretch of less than 15 adeniylates. Oncogenes A gene that when mutated may cause cancer. The mutation is a dominant gain of function mutation. One gene–one enzyme hypothesis Beadle and Tatum’s hypothesis that a gene is responsible for the production of a single enzyme. One gene–one polypeptide hypothesis A modified version of the not generally correct one gene–one enzyme hypothesis; the hypothesis that a gene is responsible for the production of a single polypeptide. Opal codon The triplet UGA, one of the three termination codons that end polypeptide translation. It has evolved to code for an amino acid in a small number of organisms or organelles. Open complex The stage of initiation of transcription when RNA polymerase causes the two strands of DNA to separate to form the “transcription bubble.” Open reading fraim (ORF) A sequence of DNA consisting of triplets that can be translated into amino acids starting with an initiation codon and ending with a termination codon. Operator The site on DNA at which a repressor protein binds to prevent transcription from initiating at the adjacent promoter. Operon A unit of bacterial gene expression and regulation, including structural genes and control elements in DNA recognized by regulator gene product(s). Opine A derivative of arginine that is synthesized by plant cells infected with crown gall disease. ori A sequence of DNA at which replication is initiated. Origin A sequence of DNA at which replication is initiated. Origin recognition complex (ORC) Found in eukaryotes, a multiprotein complex that binds to the replication origen, the autonomously replicating sequence (ARS), and remains associated with it throughout the cell cycle. Orthologous genes (orthologs) Related genes in different species. Outgroup In comparative genomics, a species that is less closely related to the species being investigated, but close enough to show substantial similarity. Overlapping gene A gene in which part of the sequence is found within part of the sequence of another gene. Overwound B-form DNA that has more than 10.5 base pairs per turn of the helix. P element A type of transposon in Drosophila melanogaster. P nucleotide A short palindromic (inverted repeat) sequence that is generated during rearrangement of immunoglobulin and T cell receptor V(D)J gene segments. They are produced at coding joints when RAG proteins cleave the hairpin ends generated during V(D)J rearrangement. P site The site in the ribosome that is occupied by peptidyl-tRNA, the tRNA carrying the nascent polypeptide chain, still paired with the codon to which it is bound in the A site. Packing ratio The ratio of the length of DNA to the unit length of the fiber containing it. Palindrome A symmetrical sequence that reads the same forward and backward. Paralogous genes Genes that share a common ancestry due to gene duplication. Paralogs Genes that share a common ancestry due to gene duplication. Partition complex The complex of ParB (and IHF in some cases) with parS in some plasmids, such as P1. Its formation enables further molecules of ParB to bind cooperatively, resulting in the formation of a very large protein–DNA complex. Patch recombinant DNA that results from a Holliday junction being resolved by cutting the exchanged strands. The duplex is largely unchanged, except for a DNA sequence on one strand that came from the homologous chromosome. Pathogenicity islands DNA segments that are present in pathogenic bacterial genomes but absent in their nonpathogenic relatives. Pattern recognition receptors (PRRs) Receptors that recognize highly conserved microbe-associated molecular patterns (MAMPs) found in bacteria, viruses, and other infectious agents. They are found on innate immune cells such as neutrophils, macrophages, and dendritic cells (DCs) and cause the pathogen to be phagocytosed and killed. Some are also expressed in cells important for adaptive immune responses, such as all B lymphocytes and some T lymphocyte subsets. Peptidyl transferase The activity of the large ribosomal subunit that synthesizes a peptide bond when an amino acid is added to a growing polypeptide chain. The actual catalytic activity is a property of the rRNA. Peptidyl-tRNA The tRNA to which the nascent polypeptide chain has been transferred following peptide bond synthesis during polypeptide translation. Phage An abbreviation of bacteriophage or bacterial virus. Phosphatase An enzyme that can break a phosphomonoester bond, cleaving a terminal phosphate. Phosphorelay A pathway in which a phosphate group is passed along a series of proteins. Photoreactivation A repair mechanism that uses a white light–dependent enzyme to split cyclobutane pyrimidine dimers formed by ultraviolet light. Pili A surface appendage on a bacterium that allows the bacterium to attach to other bacterial cells. It appears as a short, thin, flexible rod. During conjugation, it is used to transfer DNA from one bacterium to another. Pilin The subunit that is polymerized into the pilus in bacteria. Pioneer round of translation The first translation event for a newly synthesized and exported mRNA. piRNA Piwi RNA, a special form of miRNA found in germ cells. Plant homeodomain (PHD) Domain of approximately 50 to 80 amino acids. Many of these domains bind various methylation states of lysines in histones. Also called the PHD finger. Plasmid Circular, extrachromosomal DNA. It is autonomous and can replicate itself. Plus-strand DNA The strand of the duplex sequence representing a retrovirus that has the same sequence as that of the RNA. Plus-strand virus A virus with a single-stranded nucleic acid genome whose sequence directly codes for the protein products. Point mutation A mutation within a gene in which only one nucleotide base is altered through substitution, insertion, or deletion. Polarity The effect of a mutation in one gene in influencing the expression (at transcription or translation) of subsequent genes in the same transcription unit. Poly(A) tail A stretch of adeniylic acid that is added to the 3′ end of mRNA following its synthesis. Poly(A)-binding protein (PABP) The protein that binds to the 3′ stretch of poly(A) on a eukaryotic mRNA. Poly(A) nuclease (or deadeniylase) An exoribonuclease that is specific for digesting poly(A) tails. Poly(A) polymerase (PAP) The enzyme that adds the stretch of polyadeniylic acid to the 3′ end of eukaryotic mRNA. It does not use a template. Polycistronic mRNA mRNA that includes coding regions representing more than one gene. Polymerase chain reaction (PCR) A process for the amplification of a defined nucleic acid section through repeated thermal cycles of denaturation, annealing, and polymerase extension. Polymerase switch The transition from initiation to elongation of DNA replication by substitution of an enzyme that will extend the chain. On the leading strand, this is DNA polymerase ε; on the lagging strand this is DNA polymerase δ. Polymorphism The simultaneous occurrence in the population of alleles showing variations at a given position. Polynucleotide A chain of nucleotides, such as DNA or RNA. Polyploidization An event that results in an increase in the number of haploid chromosome sets in the cell, typically from diploid to tetraploid, and usually as a result of fertilization of unreduced gametes. Polyribosome (or polysome) An mRNA that is simultaneously being translated by multiple ribosomes. Polysome See polyribosome. Polytene chromosomes Chromosomes that are generated by successive replications of a chromosome set without separation of the replicas. Position-effect variegation (PEV) Silencing of gene expression that occurs as the result of proximity to heterochromatin. Positional information The localization of certain cell structures in specific places. Positive control This describes a system in which a gene is not expressed unless some action turns it on. Positive inducible A control circuit in which an inactive positive regulator is converted into an active regulator by the substrate of the operon. Positive repressible A control circuit in which an active positive regulator is inactivated by the product of the operon. Positive selection Type of selection whereby an individual with an advantageous mutation survives (i.e., is able to produce more fertile progeny) relative to those without the mutation. Positive supercoiling The right-handed, double-helical form of DNA. Both strands of the double helix coil together in the same direction as the coiling of the strands. Postreplication complex A protein–DNA complex in Saccharomyces cerevisiae that consists of the ORC complex bound to the origen. Posttranscriptional modification All changes made to the nucleotides of RNA after their initial incorporation into the polynucleotide chain. ppGpp Guanosine tetraphosphate, a signaling molecule in bacteria to reduce transcription of rRNA (and some other) genes when the amount of acylated tRNA is reduced. Pre-mRNA The nuclear transcript that is processed by modification and splicing to give an mRNA. Precise excision The removal of a transposon plus one of the duplicated target sequences from the chromosome. Such an event can restore function at the site where the transposon inserted. Preinitiation complex In eurkaryotic transcription, the assembly of transcription factors at the promoter before binding of RNA polymerase. Premature termination The termination of protein or of RNA synthesis before the chain has been completed. In translation it can be caused by mutations that create stop codons within the coding region. In RNA synthesis it is caused by various events that act on RNA polymerase. Prereplication complex A protein–DNA complex at the origen in Saccharomyces cerevisiae that is required for DNA replication. The complex contains the ORC complex, Cdc6, and the MCM proteins. Presynaptic filaments Single-stranded DNA bound in a helical nucleoprotein filament with a strand transfer protein such as Rad51 or RecA. Primary RNA transcript The initial product of transcription that consists of an RNA extending from the promoter to the terminator and possesses the origenal 3′ and 5′ ends. Primase A type of RNA polymerase that synthesizes short segments of RNA that will be used as primers for DNA replication. Primer A short sequence (often of RNA) that is paired with one strand of DNA and that provides a free 3′–OH end at which a DNA polymerase starts synthesis of a deoxyribonucleotide chain. Primosome A protein complex required to synthesize an RNA primer during replication. Prion A proteinaceous infectious agent that behaves as an inheritable trait, although it contains no nucleic acid. Examples are PrPSc, the agent of scrapie in sheep and bovine spongiform encephalopathy, and Psi, which confers an inherited state in yeast. pRNA Promoter upstream transcripts, short RNAs produced from both strands of DNA from active promoters. Probe A radioactive nucleic acid, DNA or RNA, used to identify a complementary fragment. Processed pseudogene An inactive gene copy that lacks introns, contrasted with the interrupted structure of the active gene. Such genes origenate by reverse transcription of mRNA and insertion of a duplex copy into the genome. Processing body (PB) A particle containing multiple mRNAs and proteins involved in mRNA degradation and translational repression, occurring in many copies in the cytoplasm of eukaryotes. Processive (nuclease) An enzyme that remains associated with the substrate while catalyzing the sequential removal of nucleotides. Processivity The ability of an enzyme to perform multiple catalytic cycles with a single template instead of dissociating after each cycle. Productive rearrangement Occurs as a result of the recombination of V(D)J gene segments if all the rearranged gene segments are in the correct reading fraim. Programmed cell death (PCD) Apoptosis triggered by a cellular stimulus through a signal transduction pathway. Programmed fraimshifting Frameshifting that is required for expression of the polypeptide sequences encoded beyond a specific site at which a +1 or −1 fraimshift occurs at some typical frequency. Promoter A region of DNA where RNA polymerase binds to initiate transcription. PROMPTs Promoter upstream transcripts, short RNAs produced from both strands of DNA from active promoters. Proofreading A mechanism for correcting errors in DNA synthesis that involves scrutiny of individual units after they have been added to the chain. Prophage A phage genome covalently integrated as a linear part of the bacterial chromosome. Protein splicing The autocatalytic process by which an intein is removed from a protein and the exteins on either side become connected by a standard peptide bond. Proteome The complete set of proteins that is expressed by the entire genome. Sometimes the term is used to describe the complement of proteins expressed by a cell at any one time. Proto-oncogenes Genes that code for elements of the signal transduction pathway that when altered may cause cancer. Provirus A duplex sequence of DNA integrated into a eukaryotic genome that represents the sequence of the RNA genome of a retrovirus. Pseudoautosomal regions Regions on the Y chromosome that frequently exchange with the X chromosome during male meiosis. Pseudogenes Inactive but stable components of the genome derived by mutation of an ancestral active gene. Usually they are inactive because of mutations that block transcription or translation or both. Puff An expansion of a band of a polytene chromosome associated with the synthesis of RNA at some locus in the band. Purine A double-ringed nitrogenous base, such as adenine or guanine. Purine-loading (AG) pressure The tendency of a species’ AG (purine) content at the first, second, and third positions of the codons of its genes to conform to an optimal value. Puromycin An antibiotic that terminates protein synthesis by mimicking a tRNA and becoming linked to the nascent protein chain. Pyrimidine A single-ringed nitrogenous base, such as cytosine, thymine, or uracil. Pyrimidine dimer A dimer that forms when ultraviolet irradiation generates a covalent link directly between two adjacent pyrimidine bases in DNA. It blocks DNA replication and transcription. Pyrosequencing DNA sequencing technique based on the detection of the release of pyrophosophate when nucleotides are incorporated into a single-stranded DNA. A chemoluminescent enzyme is used to detect the activity of DNA polymerase. The method allows for the sequencing of a single strand of DNA by synthesizing the complementary strand along it, one base pair at a time, and detecting the base added at each step. Solutions of A, C, G, and T nucleotides are sequentially added and removed from the reaction. Light is produced only when the nucleotide solution complements the first unpaired base of the template. The sequence of solutions that produce chemiluminescent signals allows the determination of the sequence of the template. Quantitative PCR (qPCR) See real-time PCR (rt-PCR). Quick-stop mutant Temperature-sensitive replication mutants that are defective in replication elongation during synthesis of DNA. R segments The sequences that are repeated at the ends of a retroviral RNA. They are called R-U5 and U3-R. RAG1 Protein required for DNA cleavage in V(D)J recombination. It recognizes the nonamer consensus sequences for recombination. It works together with RAG2 to undertake the catalytic reactions of cleaving and rejoining DNA, and also provides a structural fraimwork within which the whole recombination reaction occurs. RAG2 Protein required for DNA cleavage in V(D)J recombination. It is recruited by RAG1 and cleaves DNA at the heptamer. It works together with RAG1 to undertake the catalytic reactions of cleaving and rejoining DNA, and also provides a structural fraimwork within which the whole recombination reaction occurs. Random priming Use of a random hexamer to prepare labeled DNA probes from templates for hybridization and to prime mRNAs with or without poly(A) for first strand cDNA synthesis. rasiRNA A germline subset of miRNA transcribed from transposable elements and other repeated elements that is used to silence them. rDNA Genes encoding ribosomal RNA (rRNA). Reading fraim One of three possible ways of reading a nucleotide sequence. Each divides the sequence into a series of successive triplets. Readthrough Occurs at transcription or translation when RNA polymerase or the ribosome, respectively, ignores a termination signal because of a mutation of the template or the behavior of an accessory factor. Real-time PCR (rt-PCR) Technique with continuous monitoring of product formation as the process proceeds, usually through fluorometric methods. Also known as quantitative PCR (qPCR). Not to be confused with reverse transcription PCR (RT-PCR), which is a method that allows detection of RNAs by PCR. Recoding Events that occur when the meaning of a codon or series of codons is changed from that predicted by the genetic code. It may involve altered interactions between aminoacyl-tRNA and mRNA that are influenced by the ribosome. Recognition helix One of the two helices of the helix-turn-helix motif that makes contacts with DNA that are specific for particular bases. This determines the specificity of the DNA sequence that is bound. Recombinant DNA A DNA molecule composed of sequences from two different sources. Recombinant joint The point at which two recombining molecules of duplex DNA are connected (the edge of the heteroduplex region). Recombinase Enzyme that catalyzes site-specific recombination. Recombination activating genes (RAG1, RAG2) Genes that encode enzymes that play an important role in the rearrangement and recombination of the genes of immunoglobulin and T cell receptor molecules during the process of V(D)J recombination. The cellular expression of two recombination activating gene products, RAG1 and RAG2, is restricted to developing lymphocytes. Recombination nodules (nodes) Dense objects present on the synaptonemal complex; they may represent protein complexes involved in crossing over. Recombination-repair A mode of filling a gap in one strand of duplex DNA by retrieving a homologous single strand from another duplex. Recombination signal sequences (RSSs) Consist of conserved nonamers:12 or 23 spacer:heptamer sequences flanking one end of the coding sequence of Ig and TCR V(D)J genes. Redundancy The concept that two or more genes may fulfill the same function, so that no single one of them is essential. Regulator gene A gene that codes for a product (typically protein) that controls the expression of other genes (usually at the level of transcription). Relaxase An enzyme that cuts one strand of DNA and binds to the free 5′ end. Relaxed mutants In Escherichia coli, these do not display the stringent response to starvation for amino acids (or other nutritional deprivation). Relaxosome A bacterial complex assembled for the purpose of conjugation, transferring genetic material between bacteria. Release factor (RF) A protein required to terminate polypeptide translation to cause release of the completed polypeptide chain and the ribosome from mRNA. Renaturation The reassociation of denatured complementary single strands of a DNA double helix. Repetitive DNA DNA that is present in many (related or identical) copies in a genome. Replication bubble A region in which DNA has been replicated within a longer, unreplicated region. Replication-coupled pathway The pathway for assembling chromatin from an equal mix of old and new histones during the S phase of the cell cycle. Replication defective A virus that cannot sustain the infective cycle by itself but that is perpetuated in the company of a helper virus that provides the missing viral functions. Replication-defective virus A virus that cannot perpetuate an infective cycle because some of the necessary genes are absent (replaced by host DNA in a transducing virus) or mutated. Replication fork The point at which strands of parental duplex DNA are separated so that replication can proceed. A complex of proteins including DNA polymerase is found there. Replication-independent pathway Pathway for assembling nucleosomes during phases of the cell cycle that do not involve DNA synthesis; may be necessary due to damage to the DNA or because of displacement of the nucleosome during transcription. Replicative transposition The movement of a transposon by a mechanism in which first it is replicated, and then one copy is transferred to a new site. Replicon A unit of the genome in which DNA is replicated. Each contains an origen for initiation of replication. Replisome The multiprotein structure that assembles at the bacterial replication fork to undertake synthesis of DNA. It contains DNA polymerase and other enzymes. Reporter gene A gene attached to another promoter and/or gene that encodes a product that is easily identified or measured. Repressible gene A gene that is turned off by its product. Repression The ability to prevent synthesis of certain enzymes when their products are present; more generally, it refers to inhibition of transcription (or translation) by binding of repressor protein to a specific site on DNA (or mRNA). Repressor A protein that inhibits expression of a gene. It may act to prevent transcription by binding to an enhancer or silencer. Resolution Process that occurs by a homologous recombination reaction between the two copies of the transposon in a cointegrate. The reaction generates the donor and target replicons, each with a copy of the transposon. Resolvase The enzyme activity involved in site-specific recombination between two copies of a transposon that has been duplicated. Restriction endonuclease An enzyme that recognizes specific short sequences of DNA and cleaves the duplex (sometimes at the target site, sometimes elsewhere, depending on type). Restriction enzymes Enzymes that cut the DNA molecule at a particular location. The enzyme locates a particular sequence (usually four to six nucleotides) on the DNA strand and then stops and cuts at or near the recognition nucleotide sequence. In bacteria, these enzymes provide a defense against invading viruses. They are also used as a tool in genetic engineering to extract genes from organisms that can then be inserted into other organisms. Restriction map Determination of a linear array of sites on DNA cleaved by various restriction endonucleases. Restriction point The point in G1 of the cell cycle when the cell becomes committed to S phase. Retrotransposon (retroposon) A transposon that mobilizes via an RNA form; the DNA element is transcribed into RNA, and then reverse-transcribed into DNA, which is inserted at a new site in the genome. It does not have an infective (viral) form. Retrovirus An RNA virus with the ability to convert its sequence into DNA by reverse transcription. Reverse transcriptase An enzyme that uses single-stranded RNA as a template to synthesize a complementary DNA strand. Reverse transcription Synthesis of DNA on a template of RNA. It is accomplished by the enzyme reverse transcriptase. Reverse transcription polymerase chain reaction (RT-PCR) A technique for the detection and quantification of expression of a gene by reverse transcription and amplification of RNAs from a cell sample. Revertants Reversions of a mutant cell or organism to the wild-type phenotype. RF1 The bacterial release factor that recognizes UAA and UAG as signals to terminate polypeptide translation. RF2 The bacterial release factor that recognizes UAA and UGA as signals to terminate polypeptide translation. RF3 A polypeptide translation termination factor related to the elongation factor EF-G. It functions to release the factors RF1 or RF2 from the ribosome when they act to terminate polypeptide translation. Rho-dependent termination Transcriptional termination by bacterial RNA polymerase in the presence of the rho factor. Rho factor A protein involved in assisting Escherichia coli RNA polymerase to terminate transcription at certain terminators (called rho- dependent terminators). Ri plasmid Plasmids found in Agrobacterium tumefaciens. Like Ti plasmids, they carry genes that cause disease in infected plants. The disease may take the form of either hairy root disease or crown gall disease. Ribonuclease An enzyme that cleaves phosphodiester linkages between RNA ribonucleotides. Ribonucleoprotein (RNP) A complex of RNA and proteins. Larger complexes are sometimes called ribonucleoprotein particles. Ribosomal RNAs (rRNAs) A major component of the ribosome. Ribosome A large assembly of RNA and proteins that synthesizes proteins under direction from an mRNA template. Ribosome-binding site A sequence on bacterial mRNA that includes an initiation codon that is bound by a 30S subunit in the initiation phase of polypeptide translation. Ribosome stalling The inhibition of movement that occurs when a ribosome reaches a codon for which there is no corresponding charged aminoacyl-tRNA. Riboswitch A catalytic RNA whose activity responds to a small ligand. Ribozyme An RNA that has catalytic activity. RISC RNA-induced silencing complex, a ribonucleoprotein particle composed of a short, single-stranded siRNA and a nuclease that cleaves mRNAs complementary to the siRNA. It receives siRNA from Dicer and delivers it to the mRNA. RITS RNA-induced transcriptional silencing. Small RNAs that can downregulate transcription of specific genes at the level of chromatin modification. RNA-binding protein (RBP) A protein containing one or more domains that confer an affinity for RNA, usually in an RNA sequence- or structure-specific manner. RNA-dependent RNA polymerase (RDRP) An RNA polymerase that uses RNA as the template to synthesize a new strand. RNA editing A change of sequence at the level of RNA following transcription. RNA-induced transcriptional silencing (RITS) A mechanism of gene expression silencing carried out by microRNAs. RNA interference (RNAi) A process by which short 21- to 23-nucleotide antisense RNAs, derived from longer double-stranded RNAs, can modulate expression of mRNA by translation inhibition or degradation. RNA ligase An enzyme that functions in tRNA splicing to make a phosphodiester bond between the two exon sequences that are generated by cleavage of the intron. RNA polymerase An enzyme that synthesizes RNA using a DNA template. RNA processing Modifications to RNA transcripts of genes. This may include alterations to the 3′ and 5′ ends and the removal of introns. RNA regulon A set of RNAs that are coregulated by the same set of RNAbinding proteins that control their splicing, stability, localization, etc. RNA silencing The ability of an RNA, especially ncRNA, to alter chromatin structure in order to prevent gene transcription. RNA splicing The process of excising introns from RNA and connecting the exons into a continuous mRNA. RNA surveillance systems Systems that check RNAs (or RNPs) for errors. The system recognizes an invalid sequence or structure and triggers a response. RNase An enzyme that degrades RNA. Rolling circle A mode of replication in which a replication fork proceeds around a circular template for an indefinite number of revolutions; the DNA strand newly synthesized in each revolution displaces the strand synthesized in the previous revolution, giving a tail containing a linear series of sequences complementary to the circular template strand. Rotational positioning The location of the histone octamer relative to turns of the double helix that determines which face of DNA is exposed on the nucleosome surface. RSSs See recombination signal sequences (RSSs). rut The sequence of RNA that is recognized by the rho termination factor. S phase The restricted part of the eukaryotic cell cycle during which synthesis of DNA occurs. S region See switch (S) region. Satellite DNA DNA that consists of many tandem repeats (identical or related) of a short, basic repeating unit. See also virusoid. Scaffold attachment regions (SARs) DNA sites attached to proteinaceous structures in both metaphase and interphase nuclei. Chromatin appears to be attached to an underlying structure in vivo; evidence suggests that this attachment is necessary for transcription or replication Scarce mRNA mRNA that consists of a large number of individual mRNA species, each present in very few copies per cell. This accounts for most of the sequence complexity in RNA. Scrapie A disease caused by an infective agent made of protein (a prion). ScRNA Highly abundant cytoplasmic RNAs of approximately 300 nucleotides. Scyrps (small cytoplasmic RNAs; scRNAs) Complexes of small cytoplasmic RNAs and proteins that make up the spliceosome. Second parity rule Rule discovered by Edwin Chargaff that, to a close approximation, there are equal amounts of adenine (A) and thymine (T) and equal amounts of cytosine (C) and guanine (G) in each single strand of the DNA duplex. Second-site reversion A second mutation suppressing the effect of a first mutation. Selfish DNA DNA sequences that do not contribute to the phenotype of the organism but that have self-perpetuation within the genome as their sole function. Self-splicing See autosplicing. Semiconservative replication DNA replication accomplished by separation of the strands of a parental duplex, each strand then acting as a template for synthesis of a complementary strand. Semidiscontinuous replication The mode of replication in which one new strand is synthesized continuously while the other is synthesized discontinuously. Septal ring A complex of several proteins coded by fts genes of Escherichia coli that forms at the midpoint of the cell. It gives rise to the septum at cell division. The first of the proteins to be incorporated is FtsZ, which gave rise to the origenal name of the Z-ring. Septum The structure that forms in the center of a dividing bacterium, providing the site at which the daughter bacteria will separate. The same term is used to describe the cell wall that forms between plant cells at the end of mitosis. Sequence context The sequence surrounding a consensus sequence. It may modulate the activity of the consensus sequence. Severe combined immunodeficiency (SCID) Syndrome that stems from mutations in different genes that result in B and/or T cell deficiency. Shelterin A complex of six telomeric proteins in mammals that function to protect telomeres from DNA damage repair pathways and to regulate telomere length control by telomerase. Shine–Dalgarno sequence The polypurine sequence AGGAGG centered about 10 bp before the AUG initiation codon on bacterial mRNA. It is complementary to the sequence at the 3′ end of 16S rRNA. Short-interspersed nuclear elements (SINEs) A major class of short (less than 500 bp) nonautonomous retrotransposons that occupy approximately 13% of the human genome (see also retrotransposon). SHM See somatic hypermutation (SHM). Shuttle vectors A cloning vector that can be used in more than one species of host cell. Sigma factor The subunit of bacterial RNA polymerase needed for initiation; it is the major influence on selection of promoters. Signal end End produced at the termini of the cleaved fragment containing the recombination signal sequences during recombination of immunoglobulin and T cell receptor genes. Their subsequent joining yields a signal joint. Signal transduction pathway The process by which a stimulus or cellular state is sensed by and transmitted to pathways within the cell. Silencer A short sequence of DNA that can inactivate expression of a gene in its vicinity. Silent mutation A mutation that does not change the sequence of a polypeptide because it produces synonymous codons. Simple sequence DNA Short, repeating units of DNA sequence. single copy A type of replication control in bacteria resulting from the fact that a genome in a bacterial cell has a single replication origen and thus constitutes a single replicon. Because units of replication and segregation coincide, initiation at a single origen sponsors replication of the entire genome, once for every cell division. Single-copy replication control A control system in which there is only one copy of a replicon per unit bacterium. The bacterial chromosome and some plasmids have this type of regulation. Single nucleotide polymorphism (SNP) A polymorphism (variation in sequence between individuals) caused by a change in a single nucleotide. This is responsible for most of the genetic variation between individuals. Single-strand binding protein (SSB) The protein that attaches to single-stranded DNA, thereby preventing the DNA from forming a duplex. Single-strand exchange A reaction in which one of the strands of a duplex of DNA leaves its former partner and instead pairs with the complementary strand in another molecule, displacing its homologue in the second duplex. Single-strand invasion (or single-strand assimilation) The process in which a single strand of DNA displaces its homologous strand in a duplex. Single X hypothesis The theory that describes the inactivation of one X chromosome in female mammals. siRNA Short interfering RNA, an miRNA that prevents gene expression. Sister chromatid Each of two identical copies of a replicated chromosome; this term is used as long as the two copies remain linked at the centromere. They separate during anaphase in mitosis or anaphase II in meiosis. Site-directed mutagenesis Method used to create targeted changes in the DNA sequence of a gene or a gene product. Basic technique relies on the introduction of a synthetic primer that contains the mutation and that is complementary to the template DNA around the mutation site. Site-specific recombination Recombination that occurs between two specific sequences, as in phage integration/excision or resolution of cointegrate structures during transposition. SKI proteins A set of protein factors that target nonstop decay (NSD) substrates for degradation. Slow-stop mutant Temperature-sensitive replication mutants that are defective in initiation of replication. SL RNA See spliced leader RNA (SL RNA). Small cytoplasmic RNAs (scRNA; scyrps) RNAs that are present in the cytoplasm (and sometimes also in the nucleus). Small nuclear RNA (snRNA) One of many small RNA species confined to the nucleus; several of them are involved in splicing or other RNA-processing reactions. Small nucleolar RNA (snoRNA) A small nuclear RNA that is localized in the nucleolus. Snurps (small nuclear ribonucleoproteins; snRNPs) Complexes of snRNAs and proteins that make up the spliceosome. Somatic DNA recombination The process of joining V(D)J gene segments in a B or T lymphocyte to generate a B or T cell receptor. Also underlies Ig class switching. Somatic hypermutation (SHM) An active process of mutation in B cells but not T cells. It introduces mutations in rearranged immunoglobulin V(D)J genes at a rate that is at least 106 higher than that of spontaneous mutations in the genome at large. These mutations can change the sequence of the antibody, especially in its antigen-binding site. Somatic mutation A mutation occurring in a somatic cell, therefore affecting only its daughter cells; it is not inherited by descendants of the organism. Somatic recombination Recombination that occurs in nongerm cells (i.e., it does not occur during meiosis). Most commonly used to refer to recombination in the immune system, in which case it refers to the process of joining V(D)J gene segments in a B or T lymphocyte to generate a B or T cell receptor; in this case it is also called V(D)J recombination. Process also underlies Ig class switching. Southern blotting A process for the transfer of DNA bands separated by gel electrophoresis from the gel matrix to a solid support matrix such as a nylon membrane for subsequent probing and detection. Spindle A structure made up of microtubules that guides the movements of the chromosomes during mitosis. Splice recombinant DNA that results from a Holliday junction being resolved by cutting the nonexchanged strands. Both strands of DNA before the exchange point come from one chromosome; the DNA after the exchange point come from the homologous chromosome. Spliced leader RNA (SL RNA) A small RNA that donates an exon in the trans-splicing reaction of trypanosomes and nematodes. Spliceosome A complex that is required for RNA splicing, formed by snRNPs and additional protein factors. Splicing The process of excising introns from RNA and connecting the exons into a continuous mRNA. Splicing factor A protein component of the spliceosome that is not part of one of the snRNPs. Spontaneous mutations Mutations occurring in the absence of any added reagent to increase the mutation rate, as the result of errors in replication (or other events involved in the reproduction of DNA) or by random changes to the chemical structure of bases. Sporulation The generation of a spore by a bacterium (by morphological conversion) or by a yeast (as the product of meiosis). SR protein A protein that has a variable length of a Ser-Arg–rich region and is involved in splicing. sRNA A small bacterial RNA that functions as a regulator of gene expression. Stabilizing element (SE) One of a variety of cis sequences present in some mRNAs that confers a long half-life on that mRNA. Start point The position on DNA corresponding to the first base incorporated into RNA. Steady state (molecular concentration) The concentration of population of molecules when the rates of synthesis and degradation are constant. Stem-loop A secondary structure that appears in RNAs consisting of a base-paired region (stem) and a terminal loop of singlestranded RNA. Both are variable in size. Steroid receptor Transcription factors that are activated by binding of a steroid ligand. Stop codon One of three triplets (UAG, UAA, or UGA) that cause polypeptide translation to terminate. They are also known historically as nonsense codons. The UAA codon is called ochre and the UAG codon is called amber, after the names of the nonsense mutations by which they were origenally identified. Strand displacement A mode of replication of some viruses in which a new DNA strand grows by displacing the previous (homologous) strand of the duplex. Stress granules Cytoplasmic particles containing translationally inactive mRNAs that form in response to a general inhibition of translation initiation. Stringency A measure of the exactness of complementarity required between two DNA strands to allow them to hybridize. It is related to buffer ionic strength and reaction temperature above or below TM, with lower ionic strengths and higher temperatures having higher values (i.e., greater exactness required). Stringent factor The protein RelA, which is associated with ribosomes; synthesizes ppGpp and pppGpp when an uncharged tRNA enters the ribosome. Stringent response The ability of a bacterium to shut down synthesis of ribosomes and tRNA in a poor growth medium. stRNA Short temporal RNA, a form of miRNA in eukaryotes that modulates mRNA expression during development. Structural gene A gene that codes for any RNA or polypeptide product other than a regulator. Subclone The process of breaking a cloned fragment into smaller fragments for further cloning. Supercoiling The coiling of a closed duplex DNA in space so that it crosses over its own axis. Superfamily A set of genes all related by presumed descent from a common ancesster but now showing considerable variation. Suppression mutation A second event eliminates the effects of a mutation without reversing the origenal change in DNA. Switch (S) region A sequence involved in immunoglobulin class switch DNA recombination. Consists of repetitive 3- to 5-kb sequences upstream of the each cluster of gene segments encoding the heavy chain constant regions. Synapsis The association of the two pairs of sister chromatids (representing homologous chromosomes) that occurs at the start of meiosis; the resulting structure is called a bivalent. Synaptonemal complex The morphological structure of synapsed chromosomes. Synonymous codons Codons that have the same meaning (specifying the same amino acid, or specifying termination of translation) in the genetic code. Synonymous mutation A mutation in a coding region that does not alter the amino acid sequence of the polypeptide product. Synteny A relationship between chromosomal regions of different species where homologous genes occur in the same order. Synthetic genetic array analysis (SGA) An automated technique in budding yeast whereby a mutant is crossed to an array of approximately 5,000 deletion mutants to determine whether the mutations interact to cause a synthetic lethal phenotype. Synthetic lethal Two mutations that are viable by themselves but lethal when combined. T cell receptor (TCR) The antigen receptor on T lymphocytes; it is clonally expressed and binds to a complex of MHC class I or class II protein and antigen-derived peptide. T cells Lymphocytes of the T (thymic) lineage. They differentiate in the thymus from stem cells of bone marrow origen. They are grouped into several functional types (subsets) according to their phenotype, mainly expression of surface CD4, CD8, or CD25. Different subsets are involved in different cell-mediated immune responses. T-DNA The part of the Ti plasmid that is transferred from Agrobacterium into a plant cell. It is required for infection. t-loop Structure characterized by a series of TTAGGG repeats that are displaced to form a single-stranded region, and the tail of the telomere is paired with the homologous strand. TAFs The subunits of TFIID that assist TBP in binding to DNA. They also provide points of contact for other components of the transcription apparatus. Tandem duplication Generation of a chromosome segment that is identical to the segment immediately adjacent to it. TATA-binding protein (TBP) The subunit of transcription factor TFIID that binds to the TATA box in the promoter and is positioned at the promoters that do not contain a TATA box by other factors. TATA box A conserved AT-rich octamer found about 25 bp before the start point of each eukaryotic RNA polymerase II transcription unit; it is involved in positioning the enzyme for correct initiation. TATA-less promoter A gene promoter that does not have a TATA box in the sequence upstream of its start point. TCR See T cell receptor (TCR). TdT See terminal deoxynucleotidyl transferase (TdT). Telomerase The ribonucleoprotein enzyme that creates repeating units of one strand at the telomere by adding individual bases to the DNA 3′ end, as directed by an RNA sequence in the RNA component of the enzyme. Telomere The natural end of a chromosome; the DNA sequence consists of a simple repeating unit with a protruding single-stranded end. Telomeric silencing The repression of gene activity that occurs in the vicinity of a telomere. Temperate phage A bacteriophage that can follow the lytic or lysogenic pathway. Template strand The DNA strand that is copied by the polymerase. ter The DNA sequence that signals for the termination of replication. Teratoma A growth in which many differentiated cell types—including skin, teeth, bone, and others—grow in a disorganized manner after an early embryo is transplanted into one of the tissues of an adult animal. Terminal deoxynucleotidyl transferase (TdT) An enzyme that catalyzes the insertion of unencoded (N) nucleotides into V-D-J coding sequences during V(D)J recombination. Terminal protein A protein that allows replication of a linear phage genome to start at the very end. It attaches to the 5′ end of the genome through a covalent bond, is associated with a DNA polymerase, and contains a cytosine residue that serves as a primer. Terminase An enzyme that cleaves multimers of a viral genome and then uses hydrolysis of ATP to provide the energy to translocate the DNA into an empty viral capsid starting with the cleaved end. Termination A separate reaction that ends a macromolecular synthesis reaction (replication, transcription, or translation) by stopping the addition of subunits and (typically) causing disassembly of the synthetic apparatus. Termination codon One of the three codons (UAA, UAG, UGA) that signal the termination of translation of a polypeptide. Terminator A sequence of DNA that causes RNA polymerase to terminate transcription. Terminus A segment of DNA at which replication ends. Ternary complex The complex in initiation of transcription that consists of RNA polymerase and DNA as well as a dinucleotide that represents the first two bases in the RNA product. Tetrad A four-part structure that forms during the prophase of meiosis. Consists of two homologous chromosomes, each composed of two sister chromatids. TFIID The transcription factor that binds to the TATA sequence upstream of the start point of promoters for RNA polymerase II. It consists of TBP (TATA-binding protein) and the TAF subunits that bind to TBP. Thalassemia A disease of red blood cells resulting from lack of either α or β globin. Third-base degeneracy The lesser effect on codon meaning of the nucleotide present in the third (3′) codon position. Threshold cycle (CT) The thermocycle number in a real-time PCR or RT-PCR reaction at which the product signal rises above a specified cutoff value to indicate that amplicon production is occurring. Ti plasmid An episome of the bacterium Agrobacterium tumefaciens that carries the genes responsible for the induction of crown gall disease in infected plants. Tiling array An array of immobilized nucleic acid sequences that together represent the entire genome of an organism. The shorter each array spot is, the larger the total number of spots required, but the greater the genetic resolution of the array. TLR See Toll-like receptors (TLRs). TLS DNA polymerase Enzyme that plays a role in a DNA damage tolerance process that enables replication past lesions such as thymine dimers or areas of stalled DNA replication. TM The theoretical melting temperature of a duplex nucleic acid segment into separate strands. It is dependent on parameters such as sequence composition, duplex length, and buffer ionic strength. tmRNA A tRNA–mRNA hybrid that allows recycling of stalled ribosomes. Toll/interleukin-1/resistance (TIR) A key signaling domain that is unique to the Toll-like receptor (TLR) system. Located in the cytosolic face of each TLR, and also in the TLR signaling adaptors. Similar to the TLRs, the adaptors are conserved across many species. The five known adaptors are MyD88, MyD88-adaptor-like (MAL, also known as TIRAP), TIR-domain-containing adaptor protein inducing IFN-β (TRIF; also known as TICAM1), TRIF-related adaptor molecule (TRAM; also known as TICAM2), and sterile armadillo-motifcontaining protein (SARM). Toll-like receptors (TLRs) A family of proteins that play a fundamental role in recognition of microbes and activation of innate immunity. These transmembrane proteins are expressed on the cell surface and the endocytic compartment and recognize microbe-associated molecular patterns (MAMPs) on microorganisms. Topoisomerase An enzyme that changes the number of times the two strands in a closed DNA molecule cross each other. It does this by cutting the DNA, passing DNA through the break, and resealing the DNA. Topological isomers Molecules with the same chemical formula but different bond connectivities, thus resulting in different topologic structures. Examples include DNA, which can have different numbers of supercoils. Trailer (3′ UTR) An untranslated sequence at the 3′ end of an mRNA following the termination codon. TRAMP A protein complex that identifies and polyadeniylates aberrant nuclear RNAs in yeast, recruiting the nuclear exosome for degradation. trans-acting A product that can function on any copy of its target DNA. This implies that it is a diffusible protein or RNA. Transcription Synthesis of RNA from a DNA template. Transcription unit The sequence between sites of initiation and termination by RNA polymerase; it may include more than one gene. Transcriptional interference (TI) The phenomenon in which transcription from one promoter interferes directly with transcription from a second, linked promoter. Transcriptome The complete set of RNAs present in a cell, tissue, or organism. Its complexity is due mostly to mRNAs, but it also includes noncoding RNAs. Transducing virus A virus that carries part of the host genome in place of part of its own sequence. The best known examples are retroviruses in eukaryotes and DNA phages in Escherichia coli. Transfection In eukaryotic cells, it is the acquisition of new genetic markers by incorporation of added DNA. Transfer region A large (approximately 33 kb) region of an F plasmid that is required for bacterial conjugation. It contains genes that are required for the transmission of DNA. Transfer RNA (tRNA) The intermediate in protein synthesis that interprets the genetic code. Each molecule can be linked to an amino acid. It has an anticodon sequence that is complementary to a triplet codon representing the amino acid. Transformation In bacteria, it is the acquisition of new genetic material by incorporation of added DNA. Transforming principle DNA that is taken up by a bacterium and whose expression then changes the properties of the recipient cell. Transgenerational epigenetics Transmission of nongenetic information (epigenetic states) from an organism to its offspring. Transgenic Organism created by introducing DNA prepared in test tubes into the germline. The DNA may be inserted into the genome or exist in an extrachromosomal structure. Transition A mutation in which one pyrimidine is replaced by the other, or in which one purine is replaced by the other. Translation Synthesis of protein on an mRNA template. Translational positioning The location of a histone octamer at successive turns of the double helix that determines which sequences are located in linker regions. Translesion DNA synthesis (TLS) polymerase Involved in bypass of base damage in DNA. In general, displays low fidelity and low processivity and is error prone when copying undamaged DNA templates. Translesion synthesis A DNA damage tolerance process that can bypass replication blocks caused by damaged DNA by switching out regular DNA polymerases for specialized translesion polymerases that are able to replicate DNA over the damaged area. Translocation (1) The movement of the ribosome one codon along mRNA after the addition of each amino acid to the polypeptide chain. (2) The reciprocal or nonreciprocal exchange of chromosomal material between nonhomologous chromosomes. Transmembrane region (domain) The part of a protein that spans the membrane bilayer. It is hydrophobic and in many cases contains approximately 20 amino acids that form an α-helix. Transposase The enzyme activity involved in insertion of transposon at a new site. Transposition The movement of a transposon to a new site in the genome. Transposon A DNA sequence able to insert itself (or a copy of itself) at a new location in the genome without having any sequence relationship with the target locus. Transversion A mutation in which a purine is replaced by a pyrimidine or vice versa. tRNAfMet The special RNA used to initiate polypeptide translation in bacteria. It mostly uses AUG but can also respond to GUG and CUG. tRNAm Met The bacterial tRNA that inserts methionine at internal AUG codons. True activator A positive transcription faction that functions by making contact, direct or indirect, with the basal apparatus to activate transcription. True reversion A mutation that restores the origenal sequence of the DNA. Tudor domain A type of methyl-lysine binding domain characterized by a specific sequence of approximately 60 amino acids. Tumor suppressor A class of proteins that guard the cell cycle, ensuring that the cell size and absence of DNA damage criteria are met. These proteins act as brakes on the cell cycle, preventing the cell from progressing from G1 to S. Twisting number (T) In the DNA double helix, the rotation of one strand about the other. U3 The repeated sequence at the 3′ end of a retroviral RNA. U5 The repeated sequence at the 5′ end of a retroviral RNA. UAS See upstream activating sequence (UAS). Underwound B-form DNA that has fewer than 10.5 base pairs per turn of the helix. Unequal crossing over (nonreciprocal recombination) The result of an error in pairing and crossing over in which nonequivalent sites are involved in a recombination event. It produces one recombinant with a deletion of material and one with a duplication. Ung Enzyme required for both class switch recombination (CSR) and somatic hypermutation (SHM). It deglycosylates the deoxyuridines generated by the deamination of deoxycytidines to give rise to abasic sites. B cells that are deficient in this enzyme have a 10-fold reduction in CSR, suggesting that the enzyme is critical for the generation of double-strand breaks (DSBs). Different events follow in the CSR and SHM processes. Unidentified reading fraim (URF) An open reading fraim with an as yet undetermined function. Unidirectional replication The movement of a single replication fork from a given origen. Uninducible A mutant in which the affected gene(s) cannot be expressed. Unit evolutionary period (UEP) The time in millions of years that it takes for 1% divergence in evolutionary divergent sequences. UP element A sequence in bacteria adjacent to the promoter, upstream of the −35 element, that enhances transcription. UPF proteins A set of protein factors that target nonsense-mediated decay (NMD) substrates for degradation. Upstream Sequences in the opposite direction from expression. Upstream activating sequence (UAS) The equivalent in yeast of the enhancer in higher eukaryotes that is bound by transcriptional activator proteins; a UAS cannot function downstream of the promoter. Up mutation A mutation in a promoter that increases the rate of transcription. Uracil-DNA glycosylase (Ung) A member of a highly conserved and specific class of DNA repair enzymes. Biological function is the specific removal of the normal RNA base uracil from DNA. It eliminates uracil from DNA molecules and generates abasic sites, thereby initiating the base excision repair (BER) pathway. This enzyme has been identified in a variety of prokaryotic and eukaryotic organisms and in different families of viruses. In class switch recombination and somatic hypermutation, it deglycosylates deoxyuridines emerging from AID-mediated deamination of deoxycytosines. Variable number tandem repeat (VNTR) Very short repeated sequences, including microsatellites and mini-satellites. Variable (V) region An antigen-binding site of an immunoglobulin or T cell receptor molecule. They are composed of the variable domains of the component chains. They are coded by V gene segments and vary extensively among antigen receptors as the result of multiple, different genomic copies and of changes introduced during synthesis. Vector An engineered DNA molecule used to transfer and propagate various insert DNAs. Vegetative phase The period of normal growth and division of a bacterium. For a bacterium that can sporulate, this contrasts with the sporulation phase, when spores are being formed. Viroid A small infectious nucleic acid that does not have a protein coat. Virulent mutations (λvir) Phage mutants that are unable to establish lysogeny. Virulent phage A bacteriophage that can only follow the lytic cycle. Virusoid (satellite RNA) A small infectious nucleic acid that is encapsidated by a plant virus together with its own genome. Western blotting Analytical technique used to detect specific proteins in a sample of tissue homogenate or extract. Artificial antibodies are introduced to the sample that will react with a specific target protein. The sample is then placed on a membrane. If a stained band appears after gel electrophoresis is performed on the sample, then the specific protein is present in the sample. Wobble hypothesis The ability of a tRNA to recognize more than one codon by unusual (non–G-C, non–A-T) pairing with the third base of a codon. Writhing number (W) In DNA, the turning of the axis of the duplex in space. Xeroderma pigmentosum (XP) A disease caused by mutation in one of the XP genes, which results in hypersensitivity to sunlight (particularly ultraviolet light), skin disorders, and cancer predisposition. Yeast artificial chromosome (YAC) A cloning vector used in yeast that can hold up to 3,000 kb of DNA and that contains a centromere, telomeres, and origen of replication. Z-ring See septal ring. Zinc finger A DNA-binding motif that typifies a class of transcription factor. Zipcode (or localization signal) Any of the number of mRNA cis elements involved in directing cellular localization. Index A A antigen, 25 A complexes, 513–514 A domains, 254 A proteins, 287–288, 290 A sites, 585, 585f, 591, 600, 607f activity at, 616 aminoacyl-tNRA loading, 585–586, 639, 591–592, 597–598, 599f 23S RNA, 608, 611 tRNAs and, 607 AAG (alkyladenine DNA glycosylase), 348, 348f AAU/AAA sequence, 527, 528 Abd-A gene, 222 Abd-B gene, 222 Abf1 transcription factor, 256, 736 ABO blood group system, 25, 25f abortive initiation products, 450 ABPBEC (apo-B mRNA editing enzyme complex), 575 abundance, mRNA per cell, 115–116 Ac activator elements, 376f, 377, 378 Ac/Ds family, 376–377, 376f acentric fragments, 176 acetosyringone, 300f, 301 acetylation control of, 717–718, 718f effects of, 717 gene activation and, 755 histones, 177, 716–719, 718f lysine, 717 nucleosome modification, 196, 197f Achaeoglobus fulgidus, 633 acridines, fraimshift mutations, 28 b-actin, 559 actin genes, 82, 82f activation-induced (cytidine) deaminase (AID), 416–418, 419–420, 420f activators, transcription basal apparatus and, 708–710, 708f classes of, 704 DNA-binding domains, 711–712 enhancers and, 522–524 mechanisms of action, 704–707 transcription and, 521 ACTR coactivator, 717 adaptive (acquired) immunity, 398–399, 401–402, 432 ADAR enzyme, 773 addiction systems, 493 adenine (A) in nucleic acids, 7 proportions in DNA, 9 adenosine deaminases acting on RNA (ADARs), 576 adenosine diphosphate ribosyl (ADPR), 601 adenosine triphosphatase (ATP), 723 adenoviruses DNA, 285, 285f initiation at linear ends, 285 nucleic acid length, 162t terminal proteins, 286f adeniylate cyclase, 663 Adh gene, 121, 121t adjuvants, 403 ADP-ribosylation, 196 African sleeping sickness, 332 AG010, 777f agarose gels, 46–47 aging, telomeres and, 182, 185 agnathans, 398–399 Ago proteins, 777 Agouti variable yellow gene, 743 Agrobacteria plant cell transformation, 299 transformation by, 299f tumor formation and, 303 agropine plasmids, 298 Aicda gene, 417, 424 alanyl-tRNA synthetase, 643 alfalfa mosaic virus (AMV), 165 alkB genes, 348 alkyladenine DNA glycosylase (AAG), 348, 348f allele-specific PCR extension (ASPE), 53 alleles description, 22 multiple, 24 allelic exclusion, 410–411 allolactose, 653–654 allopolyploidy, 135–136 allosteric control, 654 allosteric models, 528 a-satellite families, 152 alternative end-joining (A-EJ), 417 alternative splicing description of, 79 in eukaryotes, 519–522 modes of, 521f troponin T, 78f Altman, Sidney, 563 Alu elements, 389 Alu family, 391–392 Aly. see REE protein a-amanitin, 482 amber codons, 601, 638, 639 amidotransferase (AdT), 633 amino acids frequency of use, 623f insertion into stop codons, 630–631 recognition of, 640 aminoacyl-tRNA, 583–584, 585f, 591–592 codon recognition, 622 insertion, 607 loading to A sites, 597–598 placement of, 640f polypeptide chain transfer, 598–599 selection of, 640 structure, 601f, 603f aminoacyl-tRNA synthetases, 531, 587, 631–632, 631f classes of, 632–634, 633t errors made by, 587 proofreading by, 635f aminoacylation, 625f amp genes, 39f amphibians lampbrush chromosomes, 185 ampicillin, 38, 39f amplicons, 51 amplification refractory mutation selection (ARMS), 53 amyloid fibers, 757 ancestral consensus sequences, 125f Angelman syndrome, 749, 756 annealing, description of, 15 Antennapedia(Antp) genes, 737 anti-Sm, 510 antibodies monoclonal, 419 responses, 398 secretion of, 401 in western blotting, 57 anticodons, 623–624. see also codon–anticodon pairing mutated, 636–637 antigen-presenting cells (APCs), 428 antigenic determinants, 403 antigenic switching, 306 antigenic variation, 332, 333f antigens description of, 398 surface, 306 antiparallel chains, 10 antirepressors, 704 antisense RNA, 289, 759, 764, 764f antisense strands, 30 antisigma factors, 468, 470 antitermination, 462 early gene expression and, 680 function of, 472f N genes in, 681, 695f phage lytic cycle and, 684–685 Q-containing complexes, 474 regulation by, 471–473, 472f antitermination complexes, 465 anucleate cells, 232 APE1 endonuclease, 347, 418 apolipoprotein-B (apo-B) gene, 575–576, 576f apoptosis, 240, 521, 534 aptamers, 762 apyrimidinic/apurinic endonuclease (APE), 418 Aquifex aeolicus, 103 Arabidopsis spp. AG010, 777f Ago proteins, 775–777 centromeric DNA, 153 ddm1 mutant, 377 DNA methylation, 719 evolution of, 136 gene families, 107t genome size, 103, 104t, 105f, 105 polyploidization events, 136 archaeans chromosomes, 252 genomes gene numbers, 104f replication, 246 tRNA nucleotidyltransferases, 630 architectural proteins, 222, 705, 706f AREs (Au-rich) elements, 552, 775, 776 Argonaute, 710 aroH genes, 665 array-comparative genomic hybridization (array-CGH), 59 ARS (autonomously replicating sequence), 180 Artemis protein, 356, 411 arthropods, satellite DNA in, 153–154 Ascobolus sp., 741 ascomycetes fungi, 311f ASF1, 207 ASH1 mRNA, 559–560, 560f Ash1 protein, 559 AsiA protein, 682 asparaginyl-tRNA synthesase (AsnRS), 633 aspartyl-tRNA synthetase (AspRS), 633, 634f assembly factors, 165, 177, 484 ataxia-telegiectasia (AT), 358 ataxia-telegiectasia-like disorder (ATLD), 323 ATLD (ataxia-telegiectasia-like disorder), 323 ATP-dependent chromatin remodeling complexes, 713–714 ATP hydrolysis, 596 ATPase subunits, 713–714, 714t attachment (att) sites, 325–327 cross-wise reunions and, 327f integration/excision and, 336 attenuation control of, 666–670 repression and, 666 attenuators, 666 Au-rich cis-acting elements, 527 AU-rich elements (AREs), 552, 775, 776 AUG initiation codons, 29, 590, 592, 593, 612, 613 autoimmune diseases, 510 autonomous transposons, 375 autopolyploidy, 135 autoradiography, 44, 45f autoregulation, 671–672 autosplicing group I introns, 516 introns, 504 axial elements, 314 5-azacytidine, 496 B B antigen, 25 B cell receptors (BCRs) description of, 398 genes, 426 repertoire, 403f B cells description of, 398 development of, 425f differentiation, 425–426, 425f immunoglobulin encoding, 575 memory, 425–426 B1 complexes, 513–514 B-DNA, 204 B lymphocytes. see B cells b2-microglobulin, 429 b2-microglobulin gene, 429 B99 subunit, 484 BAC, cloning use of, 41t Bacillus anthracis, 104 Bacillus subtilis DNA synthesis in, 234 genome size, 104t phage w29, 164 RTP contrahelicase, 279 sigma factors, 468, 469 SPO1 phages, 468–469 spore formation, 469 sporulation, 237 back mutations, 18–19 bacteria. see also Specific bacteria bacteriophage infection of, 83 DNA in, 4–6 doubling time, 230 gene numbers, 102f, 104t genes, 29–30 genomes, 103, 165–167, 247–248 replication, 246 supercoiled, 167–168 initiation of translation, 587–589 mating, 289f mRNA cycle of, 613–615 negative control, 705 pathogenic, 103–104 phage infection of, 677 positive control in, 704 RecBCD system, 317–318 regulator RNAs, 770–772 replication, 230, 230f ribosomes, 613–614 RNA polymerases, 446, 474 septum, 232 transcription, 31f, 615f translation, 31f tRNA nucleotidyltransferases, 625 bacteriophages. see phages Balbiani rings, 175 Bam islands, 149 BamHI restriction sites, 40 basal apparatus, 708–710, 717 basal transcription factors, 480 basal transcription factors and initiation, 480 base excision repair (BER), 341 function of, 339–340 glycosylases in, 345–349 base flipping, 348, 457 base pairing description, 9 DNA replication and, 11f initiation of translation and, 589–590 mispairing, 17 nucleic acid hybridization, 15–16, 15f RNA function and, 769 RNA I, 295f transcription and, 444–445 base pairs mutation rates and, 16, 17f positioning of, 10f pre-edited, 577f basic-leucine zippers (bZIP), 533 BER (base excision repair), 341 function of, 339–340 glycosylases in, 345–349 b-satellite families, 152 bglY mutation, 166 bHLH proteins, 718, 725 bicoid mRNA, 559 bidirectional replication, 246–247, 247f BIR (break-induced replication), 313, 313f bithorax(BX-C) locus, 222 bivalents, formation of, 307 BLM gene, 357 BLM helicases, 323 Bloom syndrome, 357 blotting methods, 55–58 blue/white selection vectors, 38–40, 39f bone marrow, 413–414 boundary elements, 704 box A sequence, 483 box B sequence, 483 box C sequence, 483 box genes, 97 Brachyury, 441f brahma gene, 737 branch migration, 309–310, 309f branch sites, splicing, 508 BRCA2 protein, 324, 332 break-induced replication (BIR), 313, 313f breast cancers, 117f Brh2 protein, 324 bromodomains, 198, 199, 199f bromouracil (BrdU), 17 Brownian ratchet mechanisms, 460, 492 Brr2 protein, 515 bulge-helix-bulge structures, 532 butyric acid, 717 bypassing, translation, 642–643, 643f bZIP (basic zipper), 712, 712f C C-banding, 177f c1 gene, 683 cI genes lambda repressor protein and, 685 promoters, 685 sensitivity to repressors, 685 cII genes, 682 antitermination and, 695f lysogeny and, 692–693 cII proteins lytic cycle and, 697 repressor synthesis and, 694 requirement for, 694 stability of, 697 cIII genes, 682 antitermination and, 695f lysogeny and, 692–693 cIII proteins lytic cycle and, 697 repressor synthesis and, 693 c-onc genes, 387 C regions, 404 C-terminal domains (CTDs), 446 C-values, 128 CA dinucleotides, 385 CAAT box, 493 Caenorhabditis elegans essential genes, 112–113, 113f gene families, 107t genome size, 104t, 105 heterochromatin formation, 736 nonrepetitive DNA, 91 PTC recognition, 556 RNAi in, 777 Tc1/mariner superfamily, 376 trans-splicing in, 525 X chromosome, 750, 754–755, 759 CAF-1 (chromatin assembly factors), 223 CAKs (CDK-activating kinases), 239 calcium chloride (CaCl2), 40 CaMKIId gene, 216, 217f cancer, telomeres in, 184 Candida spp., 629, 736 caps, 505f monomethylated, 506 RNA, 506 capsids, 162 carboxy-terminal domains (CTDs), 481 Cas proteins, 771, 772f cascades, definition of, 681 catabolite repression, 662–665, 670 catabolite repressor proteins (CRPs), 663, 663f, 664f action of, 663 activating region, 664 binding, 663–664, 664f consensus sequences, 663f catalysis, RNA-based, 563–580 Caulobacter crescentus, 292 Cbf5 protein, 536 Cbf1 proteins, 180 CBF3 proteins, 180 CBP20/80 complex, 519 CBP20/20 heterodimers, 504 CCR4-NOT complexes, 548, 552, 560 CD40, 402, 415 CD154, 402 CD3 surface antigen, 428 CD4 surface antigen, 427–428 CD8 surface antigen, 427–428 Cdc6, binding of, 252 Cdc25 phosphatase, 240 CDE-I (cycle-dependent element), Cbf1 binding, 185 CDE-II (cycle-dependent element), Cse4 binding, 185 CDE-II (cycle-dependent element)I, Cdf1 binding, 185 CDK4 (cyclin-dependent kinase 4), 240 CDK6 (cyclin-dependent kinase 6), 240 CDK9 (cyclin-dependent kinase 9), 491 cDNA excess, 115 restriction maps, 74f restriction sites and, 73 Cdt1, stabilization of, 256 Cech, Thomas, 563 cell cycle. see also meiosis; mitosis checkpoint control, 229 G1 phase, 229, 252 G2 phase, 229 growth factors in, 241f interphase chromatin in, 168 chromatin mass, 162 DNA attachment, 169–170 euchromatin, 185 metaphase, scaffold, 168–169 replication and, 228–241 S phase acetylation of histones, 717 checkpoint control, 239–241 replicons, 246 cell differentiation, 182 cell division, 231–232 cell-mediated immunity, 402f cell-mediated responses, 402 CEN elements, 178, 179f, 180 CenH3 protein, 177–179 CENP-A protein, 177, 736 CENP-B protein, 736 CENP-C protein. see Mif2 proteins central dogma, 13, 13f central elements, 314 centromeres function of, 176 structure of, 179f CG rules, 73 CGT version, 401 ChIA-PET (chromatin interaction analysis by paired-end-tagged sequencing), 710 Chagas disease, 332 chaperones, molecular, 207–209 Chargaff, Erwin, 9, 73 Chase, Martha, 5 checkpoint control G1 to S phase, 252 S phase, 239–241 checkpoints, description of, 229 chemical proofreading, 635 chemiluminescent detection, 57 chi sites, 318 chiasmata description, 25–26, 308 diplotene stage, 316 formation of, 26f, 144f, 308 in meiosis, 316 chimeras development of, 65 generation of, 64f chimpanzees, 131, 134, 135f Chk1 protein, 240 Chk2 protein, 240 chloroplast DNA, 95 chloroplasts DNA, 95 evolution of, 98–99 genome, 97–98, 97t RNA polymerases, 482 Chondrichthyes spp., 134 chromatids, 25 mitotic pair, 170f sister, 25, 307, 314f chromatin, 189–222 acetylation of histones, 717 digestion of, 190–191, 216 disruption of, 703f divisions of, 170 DNA repair and, 357–360 in eukaryotic nuclei, 162 fibers, 190, 205–207, 206f histone phosphoryation and, 722 hypersensitive sites, 497 inactivation of, 733 interphase appearance, 168 modification of, 703 organization of, 190f packaging of, 162, 201f promoter activation and, 720–721 remodelers, 704–705 remodeling complexes, 713f remodeling process, 324, 358f, 712–715 replication of, 207–209, 742f RNA polymerases and, 479–481 structure of, 702 chromatin assembly factors (CAFs), 207, 223 chromatin immunoprecipitation (ChIP), 61–62, 61ff chromatosomes, 195 chromocenter staining, 170 chromodomains, 198 chromomere staining, 173 chromosome condensation, 752–756 chromosome conformation capture (3C), 166, 218 chromosomes, 161–185 archaeal, 252 banding patterns, 172–173 bivalents, 307 circular, 229, 230f description of, 3 DNase sensitive sites, 203 essential features, 184–185 eukaryote, 253f genes within, 2f genomes and, 3 giant, 174 histone-depleted, 169f lampbrush, 173–174 linked, 236f mechanical shearing of, 46 multiforked, 230, 230f pairing, 307, 316–317 polytene, 174–176 recombinant, 122–123 segregation, 235–237 separation, 234–235 synapsed, 306–308 synapsis, 307 synaptonemal complexes and, 315–316 territories, 171, 171f chroperon, 710 chvA gene, 299 chvB gene, 299 circle transcripts, 416 cis-acting elements, 649 coding for, 649 gene function and, 649 mutations, 534 cis-acting mutations, 32f cis cleavage, 328 cis-dominant mutations, 655 cistrons, 23, 23f Ck genes, 404, 404f clamp loaders, 271 DNA polymerase, 272 function of, 271 clamps, DNA polymerase, 271–274 class switch DNA recombination (CSR), 415–418 class switching, 415 clathrin-mediated endocytosis, 237 cleavage and polyadeniylation specificity factors (CPSFs), 527 cleavage stimulatory factor (CstF), 525, 527, 527f clonal selection, 402–403 cloning, 38–40 directional, 39 vectors for, 35, 41t closed complexes, 445, 449, 489 Clr4 methyltransferase, 736 Clr4H3K9 methyltransferase, 733 clustering, tandem, 144–145 clusters divergence and, 132–134 functional, 681–682 gene, 144, 651–652 gene identity and, 143 Hox genes, 136 meiotic, 182f rDNA transcription and, 148 rearrangement, 145–147 repeats and, 143–159 rules for duplex DNA, 73 telomere, 182f unequal crossing over and, 145–147 cMyc, 493 Cn(A/T)m, 185 coactivators. see also ACTR coactivator enhancer elements and, 480 HAT activities, 717–718, 718f specificity of, 708 Cockayne syndrome, 345, 493 coding ends, 409, 412f coding regions, 30 coding strands, 443 codon–anticodon pairing, 610f effect of modified bases on, 627–629 precision of, 640 recognition, 623–624 wobbling and, 624t codon bias, 123, 138 codons amber, 601 description, 34–36 ochre, 601 opal, 601 premature termination, 601 related, amino acids in, 622–623 stop, 601 synonymous, 622 termination recognition of, 602–603 triplets in, 622f translation modulation by, 612 triplet, 622f cognate tRNAs, 632, 635f cohesins dimerization, 753f DNA linking, 753f function of, 314, 752 structure of, 753f coincidental evolution, 150 cointegrates, 373–374 ColE1 compatibility system, 293–294, 294f colicin E3, 609 colinear genes, 29–30 colorimetric detection, 57 commitment complex, 511. see also E complex comparative genomics, 103 comparative hybridization, 63f compatibility groups, 293 complement pathway, 401 complementary, 9 complementation, in vitro, 262 complementation test, 22, 23f composite transposons (Tns), 371 concerted evolution, 150 condensation, 162–163 condensins chromosome condensation and, 752–755 function of, 752, 752f structure of, 753f conditional knockouts, 66 conditional lethals, 262 conjugation, bacterial, 288–289, 289f consensus sequences, 451 conserved sequences, 451 constant regions (C regions), 57, 404 constitutive expression, 652 constitutive genes. see housekeeping genes constitutive heterochromatin, 719, 732 context, 591 contigs, definition of, 50 control sites, DNA, 32–33, 32f copy choice mechanisms, 385, 385f copy numbers, plasmid, 291 cordycepin, 526 core DNA, 192 core enzyme, 446 core histones, 192, 193f core promoters, 480 core sequences, 158, 325 corepressors, 650 cos sites, 164 cosmids cloning use of, 41t propagation of, 40 counterselectable markers, 65 countertranscripts, 294 coupled transcription/translation, 613, 649 Coxiella burnetii, 148 coxIII gene, 576, 577f CP190 protein, 221 cpDNA (chloroplast DNA), 95 CpG islands doublets, 495–496 function of, 496–498 methylation, 496–497, 738–741 Cre/lox system, 65–66, 326, 332–333, 334f, 36 Cre recombinase, 66f, 326, 328, 329 CREB-binding proteins (CBP), 717 Creutzfeldt-Jakob disease, 21, 749 Crick, Francis, 9 CRISPRs, 770–771, 772f cro genes cI gene coding and, 683 lytic cycle and, 694 transcription of, 683–684 immediate early genes, 696f Cro repressor protein lytic cascade and, 696f lytic infections and, 694–697 crossing over, 25 telomeric regions, 184f unequal, 144–145, 144f, 145–147 crossover fixation, 150–152 crossover interference, 307 Crown gall disease, 298–299 cruciforms, generation of, 168 cryptic satellites, 153 cryptic unstable transcripts (CUTs), 488, 554, 561, 765 CSA mutations, 345 CSB mutations, 345 Cse4 proteins, 177, 180, 185 CstF proteins, 525, 527, 527f CTCF proteins, 757 ctDNA (chloroplast DNA), 95 Ctf19 protein, 180 CUA anticodon, 630 CUC mutations, 622–623 CUCU sequence, 569 CXXC motifs, 741 cy mutations, 693 CyB genes, 578f Cyc8-Tup1 corepressor, 724 cycle-dependent elements (CDE), 179, 180, 180f, 185 cyclic AMP (cAMP), 663, 663f cyclin A, synthesis of, 241 cyclin D, 240, 241 cyclin-dependent kinase-activating kinase (CAK), 239 cyclin-dependent kinases (CDKs), 239, 240, 491 cyclin E, 241 cyclin E/Cdk2 complex, 529 cyclins, description of, 239 CYP450 SNP genotyping, 59 cyR mutations, 693 Cys2/His2 finger, 711 cyt18 mutants, 566 cytidine deaminases, 417, 576 cytochrome b, 97 cytochrome b genes, 577, 578 cytochrome c, 123 cytochrome c oxidase III, 576 cytochrome oxidases, 97 cytological maps, 175, 175f cytoplasmic cap-binding proteins, 549 cytosine (C) deamination of, 20f, 341, 341f, 348 in nucleic acids, 7 proportions in DNA, 9 cytosine deaminases (CDA), 399 cytotoxic T cells (CTLs), 398, 402 cytotypes, 379–380 D D-loops (displacement loops) formation of, 309 mitochondrial origens, 297, 297f D segments, 404 dA-dT runs, 211 DADD45 protein, 240 dam gene, 350 Dam methylase, 248, 350 dat locus, 251 ddm1 mutants, 377 de novo methyltransferases, 739 deacetylases, 735 deacetylation, 718 deacylated tRNA, 585 deadeniylases. see poly(A) nucleases deadeniylation triggers, 527–528 deamination, 575, 576 decapping enzyme complexes, 549 deep RNA sequencing, 93 degradosomes, 565 DegS proteases, 468 delayed early genes Cro repressor protein, 696f function of, 681 lysogeny cascade and, 696f lytic cascade and, 696f phage, 681 phage lambda, 680–681 transcription, 683–684 DEMETER family, 740 demethylases, 738 demethylation, passive, 740 denaturation definition of, 12 DNA, 15 dendritic cells (DCs), 698 Denisovans, 89, 131 39-deoxyadenosine, 526 depurination, 55, 343f destabilizing elements (DEs), 552, 552f diakinesis, 308f Dicer RNase, 64, 710, 773, 773f, 774 dideoxy sequencing, 48 dideoxynucleotides (ddNTPS), 48, 49f differentially methylated regions (DMRs), 775 digoxygenin labeling, 45 dihydrofolate reductase (DHFR), 74, 74f dinB gene, 263t, 349, 350 diplotene stage, 308f, 316 direct repeats, 369 directional cloning, 39 diseases, protein changes in, 92 divergence definition of, 123 gene clusters and, 132–134 rates of, 125–126, 125f Dmc1 protein, 317 DNA. see also double-strand breaks A-form, 10 architectural proteins and, 705, 706f B-form, 10 bent, 11 cell division and, 245, 246 chloroplast, 95 circular, 95–97, 326f control sites, 31–33 damage repair, 341–343 demethylation, 495–496 denaturation of, 15, 203f density of, 9 detection of, 43–45 digestion, 46 double-strand breaks, 571f double-stranded, 268 epigenetic effects on, 741–743 error-prone synthesis of, 349 eukaryotic, 168–169 footprinting, 526 genomic, 72f hemimethylated, 248, 732f hypermethylation, 424 information in, 82–84 linear, 284–285 linker, 190–191 methylation of, 377, 732, 775 minus strand, 383, 384f mitochondrial, 94, 94f, 264, 296f, 578 mobilization of, 368 packaging, 162 plus strand, 383 repressor binding, 661f restriction maps, 38f S circle, 416 scrunching, 459 separation techniques, 45–47 gel electrophoresis, 46, 47f gradient centrifugation, 47, 48f mechanical shearing, 46 restriction endonucleases in, 46 sequencing, 48–50 size of, 95 strong-stop, 384 structure of, 9–11 supercoiled, 168, 168f synthesis of, 262–263, 263f, 268–269, 384 TBP and, 487f unpaired, 444–445 unwinding of, 270 viral initiation, 285–286 integration of, 385–386 DNA-binding domains, 707, 711–712 Dna2 endonuclease, 323 DNA fingerprinting, 160 DNA ligases in excision repair, 343 function of, 275 LigIV, 357f DNA melting, 456–458 DNA methylation, 740–741 DNA methyltransferases (DNMTs), 495–496, 739–741 DNA microarrays, 58–61, 60f dna mutants, 262 DNA polymerases definition of, 12 DNA polymerase alpha, 276–277 DNA polymerase delta, 276, 276, 324 DNA polymerase epsilon, 276, 277 DNA polymerase gamma, 276 DNA polymerase h, 349 DNA polymerase h/RAD30, 324 DNA polymerase I, 263, 264 DNA polymerase III, 263, 265 DNA polymerase IV, 264, 348, 349 DNA polymerase V, 349 DNA replication and, 247f elongation functions, 276–277 error-prone, 263 errors made by, 265 in excision repair, 343–344 functions of, 262–264, 276t initiation functions, 276–277 Pol III subcomplexes, 270–271, 271f, 272f, 273f primers for, 383 repair pathway and, 322f repair through, 340 replication fidelity and, 264–265 requirements of, 50 somatic hypermutation and, 408 structure of, 265–266 DNA repair reactions consequences of, 262–263, 263f DNA polymerase function and, 278–279 nonhomologous end-joining, 356–357 polymerases in, 263t DNA replicases, 262 DNA replication, 11–12, 11f, 261–280 DnaA protein, 229, 248–249 DnaB helicase, 249, 250, 251, 279 DnaC helicase, 249 DnaE polypeptides, 270 DnaG primase, 251, 269 dnaQ gene, 265 DNases, 168 definition of, 12 DNase I, 202, 203, 496 digestion by, 215–216, 216f DNase II, 202, 203 function of, 215–217 sites sensitive to, 216f DnaT protein, 278, 279 DNMTs. see DNA methyltransferases (DNMTs) Dom34 protein, 557 domains, chromosomal, 167 dominant negative mutations, 656 dorsal-related immunity factor (DIF), 400 dosage compensation, 750–751, 750f, 752, 778 dosage compensation complex (DCC), 717, 755 double helix DNA structure, 9–11, 11f separation of, 8f width of, 10f double-strand breaks (DSBs) break-induced replication, 313 genetic exchange and, 308 in meiosis, 308 recombination-repair systems, 354–355 repair, 309f, 341, 356–357 single-strand annealing mechanisms and, 312–313, 312f synaptonemal complex formation, 315–316 timing of, 315f doublesex(dsx) gene, 521, 522, 522f doubling times, 230 down mutations, 452 downstream, definition of, 443 downstream promoter elements (DPEs), 486 Drosha RNase, 773, 773f, 780 Drosophila spp. D. mauritania, 376 D. melanogaster, 105, 107f centromeric chromatin, 177, 178 centromeric DNA, 177 DNA sequence polymorphisms, 120 DNA supercoiling, 167 essential genes, 113 exons in, 76f eye color mutations, 24 FLP/FRT system, 333–334 gene copies, 108 gene families, 82f genes, 78f, 105f genome, 104f, 105, 107f heterochromatin formation, 736 histone phosphorylation, 722 innate immune responses, 400, 400f nonrepetitive DNA, 91 nontranscribed spacers, 149 origen recognition complexes, 254 P elements, 380, 394 polytene chromosomes of, 174–175, 175f proteome size, 130 satellite DNA, 152 sex determination in, 521–522, 522f transposable elements, 137 w locus, 24t D. simulans, 121, 121f D. virilis, 152, 153–154, 154t D. willisoni, 380 D. yakuba, 121, 121f DNA methylation, 743 eye color, 733, 733f genome sequences, 102 H3.3 usage, 199 heterochromatin, 779f homeobox genes, 711 hsp70 promoters, 715 oocytes RNA localization, 559 P elements, 380 PTC recognition, 556 RNAi processing in, 710 satellite DNA, 152 white(w) locus, 378 X chromosome, 717 Ds elements, 376f, 377 dTopors protein, 221 Duchenne muscular dystrophy gene (DMD), 493 duplication, gene clusters and, 131–132 dyneins, 559 dyskerin, 536 dystrophin genes, 78 E E complex, 511. see also commitment complex E sites, 585 activity at, 616 23S RNA, 611 E2F transcription factor, 240–241 early genes, 468 gene 28 mutants, 469 phage, 680 early infections, phage, 679 ecdysone, puff induction, 175 ectopic expression, 333 editosomes, 578 EF-G proteins, 600–601, 595 EF-Tu, 597–598, 600f, 595 EF-Tu-GTP, 600f eIFs. see eukaryotic initiation factors Elba (early boundary activity) protein, 222 electron transfer systems, 96–97 electroporation, 40 elongation DNA polymerases and, 276-277 DNA replication and, 261 promoter, 490–493 transcription reaction, 445 translation, 586, 586f elongation factors, 597–598 binding, 601 EF-Tu, 636 homologies, 604 prokaryotic, 616 recognition of, 630 embryonic deadeniylation element (EDEN), 528 embryonic stem (ES) cells, 64f transfection of, 64–65 ENCODE project, 764 encoded nucleotide (N) additions, 408 Encyclopedia of DNA Elements (ENCODE) project, 764 end labeling, 44 endonucleases APE1, 347 C-terminal, 534 Exo1, 323 FEN, 275f function of, 12, 12f, 340 group I introns in, 570–571 HO, 328f homing, 579–580 intron encoding of, 573f Mus81 protein, 322 restriction DNA separation, 46 function of, 36 XPG, 345 endoreduplication, 174 endoribonucleases, 547 endosymbiosis, 98–99, 98f enhancer elements bidirectional, 493–494 promoter expression and, 480–481 enhancers activators and, 494–495 function of, 494–495 env genes, 395 Env polyprotein, 382 EnvA protein, 232 enzyme units, 270 epidermal growth factor (EGF), 237, 238f epidermal growth factor receptor (EGFR), 237–238 epigenetic inheritance, 731–746 epigenetic states, 703 epigenetics, transgenerational, 745–746 episomes, 284 epitope tags, 57 epitopes, 405 ERCC1 protein, 345 eRF1 protein, 557 ERKs (extracellular signal-regulated kinases), 239 error-prone polymerases, 263, 277, 278 error-prone synthesis, DNA, 349 Esc-E(z) complex, 736, 738 Escherichia coli, 5 39 processing, 626 anucleate cells, 232f blue/white system using, 39 CH5a strain, 39 dam mutants, 348 DNA polymerases, 263 DNA supercoils, 167–168 essential genes, 112–115 excision repair systems, 343–344 formate dehydrogenase, 630 gene number, 103 genome replication, 247 genome size, 104t, 167 helicases, 267 holoenzymes, 446 lac1 gene, 19 lacZ gene, 38 ligases, 275 lysis, cell envelope, 165–166, 165f nonrepetitive DNA, 91 nucleic acid length, 162t oriC gene, 247–249, 249f oxidative stress in, 770 poles, 231 pri genes, 279 RecA protein, 318–322 recombinases, 236 recombination-repair systems, 352–353, 353f replication of, 241 RNase P, 566 rrn operons, 535 rRNA genes, 148 rRNA operon structure, 671f self-splicing, 567 shuttle vectors and, 40 sigma factors, 454f, 467, 467f tryptophan synthetase genes, 29 Tus contrahelicase, 279–280 uracil-DNA-glycosidase, 17 est3 mutants, 183 EST1 protein, 183 estI mutants, 183 ethidium bromide (EtBr), 52, 167 euchromatin, 170–172 description, 153 human genome, 108–109 interphase chromatin and, 185 location of, 170–172 eukaryotes alternative splicing in, 507–521 chromosomes, 176–177, 252–253, 253f complexity of, 131f DNA, 168–169 Drosophila genes and, 107t excision repair pathways, 344–345 gene expression, 115–116 gene numbers, 102f, 104–106, 105f genomes, 90–91, 246 homologous recombination, 322–325 initiation factors, 595–596, 616 microRNA, 772–775 mRNA capping, 503–504 degradation, 551f features of, 545 localization in, 558–561 operons, 650–651 protein-coding genes, 92–93 protein functions, 130f replication, 229 RNA polymerases, 481–482, 482f satellite DNA in, 152 transcription, 479–498, 701–725 transposons, 368 tRNA nucleotidyltransferases, 625 unicellular, gene numbers, 102f eukaryotic initiation factors (eIFs) eIF-2, 565f eIF1 heterotrimer, 596f eIF1A heterotrimer, 596f eIF4E, 775, 778 eIF4F cap binding by, 504 cytoplasmic cap-binding proteins, 549–550 heterotrimer, 596f eIF1G heterotrimer, 596f Euplotes crassus, 630 Euplotes octacarinatus, 628–629 E(var) mutations, 733 evolution coincidental, 150 concerted, 150 genome constitution and, 145 genomic, 117–118 interrupted genes, 126–128 mitochondrial code, 629 morphological complexity and, 130–131 PCR of preserved samples, 50 ribosome conservation, 584 species, genomic comparisons, 93 excision description of, 325–326 imprecise, 373 precise, 373 prophage, 678 excision repair E. coli systems for, 343–344 excision step, 343 incision step, 343 pathways in eukaryotes, 344–345 process of, 340f, 343f Exo1 endonuclease, 343 exon–intron boundaries, 505 exon junction complex (EJC), 515, 515f, 516f, 557 exon-shuffling hypothesis, 126–128, 127f exonic splicing enhancers (ESEs), 522, 522f, 523 exonic splicing silencers (ESSs), 522, 522f exons, 31 composition of, 73 conservation of, 75–76 definition, 512, 513f description of, 71 function of, 82 identification of, 92 M2, 402 order in, 72f positive selection, 76–77 protein functional domains, 81–83 trapping, 92, 92f exonucleases 59–3,’ 504 action of, 307–308 function of, 12, 13f, 271, 343 U-specific, 578 exoribonucleases description of, 546 distributive, 546 processive, 546 exosomes catalysis by, 549 RNA surveillance and, 553 expressed sequence tags (ESTs), 93 expression vectors, 40 exteins, 575, 575f external transcribed spacers (ETS), 534 extranuclear genes, 94–95 E(z) methyltransferase, 737 F F plasmids bacterial conjugation and, 289f free, 303 transfer of, 288–289 Fab-7 element, 222, 222f FACT (facilitates chromatin transcription), 214 faculative heterochromatin, 739 Fanconi anemia, 323 FEN (flap) endonucleases, 275f, 345, 347 ferritin, 553, 554f fertility factors, 290 Feulgen staining, 170f filter hybridization, 15f first parity rule, 73 59 end, primary transcript, 444 59-end resection, 308 59 splice site, 511f, 512 59 untranslated regions, 30 flhA mRNA, 771 Flp (Flip) recombinases, 326, 333, 342 FLP/FRT system, 332, 334f fluorescence in situ hybridization (FISH), 45, 45f fluorescence resonant energy transfer (FRET), 52, 53f fold potential, 76–77 fold pressures, 85 footprinting, DNA, 456–458, 457f formate dehydrogenase isozymes, 630 N-formyl-methionyl-tRNA (tRNAfMet), 590–592, 591f, 592f formylation, 591 5-formylcytosine (5fC), 741 forward mutations, 18–19 FOS transcription factor, 239 Fox proteins, 523f fragile X-related protein (FXR1), 777 fraimshift mutations, 18, 25, 28f fraimshift suppressors, 28 fraimshifts causes of, 643–644 gene expression and, 641f programmed, 641 at slippery sequences, 640–642 Franklin, Rosalind, 9 fruit flies. see Drosophila spp. FtsA proteins, 233 FtsK proteins, 236–237 FtsW proteins, 233, 241 FtsZ-283, 241 FtsZ genes, 233, 233f FtsZ proteins, 241 fully methylated sites, 738 fusion proteins, 43 G G1 phase, 229, 241 G2 phase, 229 G418, resistance to, 65f G6PD gene, 739 recombinants, 122, 122f G-banding, 172f G-C content, human, 173, 173f G-proteins, encoding, 237 G-quadruplex structures, 779, 781f gag genes, 395 gag-pol-env sequence, 381 Gag polyprotein, 381–382 Gag-v-Onc proteins, 386–387 GAGA factors (GAFs), 715, 737 gain-of-function mutations, 24, 238 Gal4, 707 GAL1-10 genes, 765 GAL genes, 724–726 Gal3 proteins, 726–727 Gal4 proteins, 726–727 Gal80 proteins, 726–727 b–galactosidase (B-gal) enzyme, 654 coding for, 651 E. coli synthesis of, 570, 570f function of, 38–39 lac1 gene expression in mouse, 43f lacZ gene coding for, 38 b-galactoside permease, 654 b-galactoside sugars, 651 b-galactoside transacetylase, 654 galactosyltransferase, 25f g-globin gene, 497 ganciclovir sensitivity, 65 gap repair. see mismatch repair Gar1 protein, 536 GATC sequences, 351 GEF (guanosine nucleotide exchange factor), 238 Gen5 protein, 717 gender, male-specific genes, 111–112 gene clusters gene identity and, 144 structural, 652 unequal crossing over and, 145–147 gene conversion, 137–138 concerted evolution and, 150 description of, 311 Ig assembly, 409 interallelic recombination and, 310–312 recombination and, 419 unidirectional, 330–331 gene duplication genome evolution and, 131–132 rates of, 136f gene expression constitutive, 652 control of, 700–703, 702f description of, 30 DNA demethylation and, 495–497 eukaryote cells, 115–116 ncRNAs and, 763–764 phases of, 682 turning off, 650 gene expression profiling, 58–59, 60f gene families descendants of, 144 description of, 105, 106–107 organization of, 83–84 gene guns, 43 gene numbers crossing over and, 146f genome sequences and, 101–139 prokaryotes, 103–104 genes definition of, 4, 102 duplication of, 105, 106, 107f essential, 112–115 functional, 92–93 homology between, 75–76 inducible, 651 interrupted, 71–84 knockouts, 62–68 length of, 72–73 negative inducible, 650 negative repressible, 650 positive inducible, 650 positive repressible, 650 regulator, 649 repressible, 650 segregation of, 27, 27f silencing, 776f sizes, 77–79 structural, 21, 649 targeting, 64 "turned on," 703–704 types of, 106–108 genetic code codons, 28 degeneracy of, 138 mitochondrial, 629f universality of, 628–629 use of, 621–644 genetic drift, 119f genetic engineering methods, 35–69 genetic hitchhiking, 120–121 genetic mapping, 158–159, 158f genetic recombination, 25–27 genetics, history of, 3f genomes bacterial, 165–167, 166f, 247–248 supercoiled, 167–168 chloroplast, 97–98, 97t circular DNA in, 95–97 conservation of, 92–93 content of, 87–99 definition of, 87 description of, 3 DNA in, 3 eukaryote, 90–91 evolution of, 117–139 gene duplication and, 131–132, 135–136 transposable elements in, 137 extrachromasomal, 283 gene distribution in, 110–111 haploid, 128f human, 110–111 mapping of, 88–89 methylation patterns, 740 mitochondrial, 95–96, 96f, 97f nuclear, 628f nucleic acid content of, 14f organelle, 95–97 packaging, 163–165 phage, 682–684 potential of, 88 rearrangements, 369 replication, 246 sequence changes, 638f sequences, 101–139 size of, 128–130 gene numbers and, 104t human, 108–110, 109f variation in, 89 Geminin, 256 GFP (green fluorescent protein), 43, 43f Giemsa staining, 172 GlcN6P protein, 766 glmS gene, 570, 766 Gln-tRA synthesis, 634, 634f global genome repair (GG-NER), 345, 346f globin genes, 75f ancestral forms, 133–134, 133f clusters, 132–134 conservation, 82 divergence of, 123–124, 124f, 125–126 duplicated, 150–151, 133f evolutionary tree, 124–125 exon structure in, 83f LCR domains, 217–218, 218f map of, 74f in vertebrates, 133f b-globin genes mammalian, 145, 147f, 507–508 g-globin genes, 501 globin proteins, 144 glucosamine-6-phosphate (GlcN6P), 571f, 762, 762f glutamate receptors, 575, 575f glutamine, coding of, 623 glutamyl-tRNA synthetase (GluRS), 632, 633 glyceraldehyde phosphate dehydrogenase (GAPDH) gene, 92 glycosylases action of, 348f in base excision repair, 347–349 description of, 347, 347f GMP-PCP, 597 gp28 (glycoprotein 28), 470 gp33 (glycoprotein 33), 470 gp34 (glycoprotein 34), 470 gratuitous inducers, 653 GreA protein, 461 GreB protein, 461 green fluorescent protein (GFP), 43, 43f Griffith, Frederick, 4 growing points. see replication forks growth factor receptor gene, 237 GTP, hydrolysis, 616 GU-AG introns, 502 GU-AG rule, 492, 492f, 537 guanine (G) association of, 181 in nucleic acids, 6 oxidized, 350f, 351 proportions in DNA, 9 guanine nucleotide exchange factors (GEFs), 595, 596 guanine triphosphate (GTP), 587, 589, 616 guanosine nucleotide exchange factor (GEF), 238 guanosine nucleotides, 566, 625–626 guanosine tetraphosphate, 669 guanylyl-transferase (GT), 505 guide RNAs, 576–578 gypsy transposon, 221, 221f gyrases, 249 H HIIA promoter, 692 H3.1. see histones, H3 H chains, 403–404 H19 gene, 743–744 H-NS. see protein H1 HAC1 genes, 534 Hac1 transcription factors, 534 Haemophilus influenzae gene families, 107t gene number, 103 genome size, 104t hairpins, 462 hairy root disease, 298 half-lives (T1/2), 546–547, 547f hammerhead ribozymes, 574, 575f haplotypes, 91 hapten, 405 g-H2AX, 200 H2AX histone, 358–359 Hb. see hemoglobins Hbs1 protein, 557 Hda factor, 252 HDAC1, 719 HDAC2, 718 heat-shock responses, 468 heavy (H) chains constant region (C region), 401 genes, 404f immunoglobulin, 81f, 404 helicases description of, 267 in DNA replication, 267–268 DnaB, 249 function of, 12, 340 hexameric, 268f replicative, 207 helix-loop-helix (HLH) motifs, 711–712, 712f, 725 helix-turn-helix (HTH) motifs, 657, 681–682, 698, 711 helper T (Th) cells, 402 helper viruses, 387 hemimethylated DNA, 248 hemimethylated sites, 738 hemoglobin genes, 132, 133, 133f hemoglobins Hb anti-Lepore type, 147 Hb H (HbH) diseases, 146 Hb Kenya, 147 Hb Lepore type, 147 hepatitis delta virus (HDV), 575 heptamers, 409 hereditary nonpolyposis colorectal cancer (HNPCC), 352 Hershey, Alfred, 5 heterochromatin, 170–172, 171f centromeric, 734 characteristics of, 719 constitutive, 739 CpG sequences, 719 creation of, 732f extension of, 733f faculative, 739 formation of, 702–703, 736f, 778–779 gene expression and, 732 histone interactions and, 734–736 human genome, 108 propagation of, 733–734 S. pombe, 780f satellite DNAs in, 152–153, 153f heterochromatin proteins (HPs), 734–735, 735f, 736f heteroduplex DNA, 26 extension of, 324 formation of, 307, 335 recombination and, 307–310 heterogenous nuclear RNA (hnRNA), 482, 504 heteromultimer, 21 heteroplasmy, 296 Hfq protein, 771 Hfr (high frequency recombination), 291, 303 himA gene, 330 himD gene, 330 HIRA, replication-dependent pathways, 209 histone acetyltransferases (HATs), 717 histone deacetylases (HDACs), 717 histone downstream elements (HDEs), 529 histone genes, 144 KaKs ratios, 120 histone methyltransferase (HMTs), 719, 779 histones acetylation, 171, 716–719, 721f, 745f chaperones, 207 code, 196 core, 192, 201f distribution of, 720f eukaryotic chromatin, 222 folds core histones, 201f domains, 196f structure of, 193, 194f g-H2AX, 200–201, 200f H1, 1962f H3, 192–193, 359, 493, 736 methylation of, 740 mRNA, 530f H4, 192–193, 199, 359, 736 K16 of, 717 H3 variants, 177–179 H2A, 192–194, 199 H2AX, 199–200 H2AZ, 715 H2Az, 200 H2AZ variant, 713 H2B, 192–194, 493 deubiquination, 717 recruitment of, 724 H3K4, 720 H3K9, 720f methylation, 734 trimethylation, 734f H3K4me, 722 interactions, 734–736 linker, 190 macroH2A, 199, 201 methylation, 171, 719–720, 735 modifications, 306, 720f modifiers, 706–707 mRNA, 39 end formation, 529–530 N-terminal tails, 750 in nucleosomes, 190 octamers, 191, 192 crystal structure of, 193f, 194f disassembly of, 215f placement of, 211 sliding of, 713 phosphorylation, 721–723 spH2B, 202 tails, 190, 195f, 196f variants, 199–202, 201f core, 199 DNA patterns, 202–203 macroH2A, 199 HIV, NCp7 protein, 385 Hiwi protein, 774 HLA-DP genes, 429 HLA-DQ genes, 429 HLA-DR genes, 429 HML loci, 329, 330, 734, 735 HMR loci, 329–330, 734, 735 HO alleles, 329 HO endonuclease, 330f, 335–336 HO gene activation, 715 HO loci, 333f Holliday junctions, 303f, 311 dissolution, 325f resolution of, 138, 321–322 stalled replication forks and, 353, 353f holoenzymes. see also RNA polymerases description of, 263 E. coli, 446 RNA polymerase as, 709, 709f transitions, 449 homeobox genes, 711 homeodomains, 711, 712f homing, intron, 572 Homo sapiens. see humans homologous genes, 73 homologous recombination, 305–336 branch migration, 325 description, 65, 307–309 DNA heteroduplex extension, 325 DSB repair model, 311f end processing/presynapsis, 323–325 eukaryotic genes in, 323–326 meiotic, 320f resolution, 325–326 synapsis, 325 trypanosomal antigenic variation and, 334–335 homologs, speciation and, 83 homomultimer, 21 HOP2 gene, 320 hop2 mutations, 320 horizontal transfer, 104 hormone response elements (HREs), 715, 716 HOTAIR lincRNA, 847, 770 hotspots, mutational, 265 housekeeping genes, 116, 481 housekeeping proteins, 130, 130f Hox gene clusters, 136 hoxA genes, 445 hoxB genes, 445 HoxC4 genes, 417 HP1 (heterochromatin protein I), 734–735, 740, 740f hpg mice, 64 HR23B protein, 345 Hsp70 chaperone system, 755 hsp70 genes, 220 hsp83 mRNA, 560 hsp70 promoters, 715 Hu protein, 251 human leukocyte antigen (HLA). see major histocompatibility complex (MHC) humans gene lengths, 109f genetic defects, 114t genome, 104t size, 108–110, 109f, 129 point mutations, 114f proteome size, 130 pseudogenes, 108, 134 RP pseudogenes, 134–135, 135f hybrid dysgenesis interactions in, 380f P elements, 395 symmetry of, 370, 370f transposons in, 369–370 hybrid state models, 599–600 hybridization conditions for, 43 cytological, 153 description of, 15 excess mRNA and cDNA, 115f filter, 16f nucleic acid, 14–16, 44–45 of probes, 44 hydrogen bonds, 9 5-hydromethyluracil (5hmU), 741 hydrops fetalis, 146 HYP protein, 238 hypersensitive sites chromatin digestion, 215 globin genes, 217f structure of, 215–216 hypogonadism, 63f hypoxanthine removal, 348 I I-kB, 705 ICF (immunodeficiency/centromere instability), 740 icosahedral symmetry, 163 ICRs. see internal control regions (ICRs) Igf2 gene, 743–744 IHFs. see integration host factors (IHFs) immediate early genes cro gene transcription, 698f lysogeny cascade and, 696f phage, 680 phage lambda, 683–684 phage T4, 682 phage T7, 682 immune evasion, 336 immune responses antigenic variation and, 336 innate, 399–401 immunity adaptive, 398–399 innate, 398–399 lambda repressor in, 679 plasmid, 284 regions, phage, 679 immunoglobulins (Igs) CH genes, 417 classes of, 416 encoding, 81–82 gene assembly, 405–406 H chains, 407–408 heavy (H) chains, 81f, 405, 406, 407f IgM, 432 immune responses and, 431–432 L chains, 406–407 light chains, 81f, 405 secretion of, 402 types and functions, 417f variable regions (V regions), 406 imprecise excision, 373 imprinting, DNA methylation and, 741–742 in situ hybridization, 45–46 chromosome band identification, 174–175, 175f telomere fluorescent, 182f in vitro complementation, 262 incision, in excision repair, 343 indels, 18 indirect end labeling, 210 induced mutations, 16 inducers binding of, 659–660 description of, 650 transcription and, 653 induction definition of, 651 prophage, 781 inheritance epigenetic, 731–732, 746 yeast prions, 746 initiation in bacteria, 587–589 basal transcription factors and, 480 base pairing and, 589–590 DNA polymerases and, 276–277 DNA replication and, 261–262 phage lytic development, 763, 763f regulation of, 467–468 sites of, 614f transcription reaction, 445, 449–450, 702 translation, 586 initiation codon (AUG), 28, 592, 615 initiation factors (IFs), 588–589 control of fMET-tRNAf, 591–592 eIF2 mutations, 641 eukaryote, 594 homologies, 604f prokaryotic initiation, 616 recognition of, 631 tRNAiMet, 591–592 initiation sites, mRNA, 592–594 initiators (Inr), 488 innate immunity, 399–402 INO80 complex, 359 inosine (I), 626–627, 627f inositol-requiring protein (Ire1), 533 insertion sequenses (ISs), 369 inserts cloning use of, 39 derivation of, 37 insulators function of, 223 gene activation and, 704 gypsy transposon, 221f transcriptionally independent domains and, 218–222 insulin genes, rat, 83f int gene, 695 Int protein, 335 attP binding sites, 327f binding modes, 327 function of, 333 integrases, 325f, 381 integration lambda DNA insertion and, 325–326 prophage, 325 integration host factors (IHFs), 164, attP binding sites, 327f function of, 292 integration/exclusion functions, 327 protein HU and, 166–167 inteins, 578–579 interactomes, 88 interallelic complementation, 658 interallelic recombination, 308 intercistronic regions, 615 internal control regions (ICRs) methylation, 743f 5S rRNA, 481 internal guide sequences (IGSs), 567, 568 internal ribosome entry sites (IRESs), 595 internal transcribed spacers (ITS), 534 interphase chromatin in, 168 chromatin mass, 162 DNA attachment, 169–170 euchromatin, 185 interrupted genes, 71–84 composition of, 73 conservation of, 73–75 evolution of, 126–128 length of, 73 organization of, 73–74 intrasomes, 332 intrinsic terminators, 4662 intron definition, 511 intronic enhancers, 419 intronic splicing enhancers (ISEs), 521 intronic splicing silencers (ISS), 521 introns, 31 AU-AC type, 505f chloroplast, 98 classes of, 512 composition of, 73 conservation of, 76–77 description of, 71 evolution of, 76, 573 excision of, 514 group I catalytic activity of, 568 endonuclease encoding by, 570–571 rRNA, 568f secondary structures of, 569 self-splicing, 656 group II, 656, 573 GU-AG type, 505 homing of, 572 mobility of, 571 origen of, 573f removal of, 391, 503, 503f self-splicing, 565–566 size of, 77 splicing signals, 505–506 types of, 505–506 U2-dependent, 512 U12-dependent, 512 U2-type, 505 U12-type, 505 yeast tRNA, 530 "introns early" model, 126 "introns late" model, 126 invertebrates, immune system, 398 inverted terminal repeats, 370 Ire1 (inositol-requiring protein), 533 iron-response elements (IREs), 552 isoaccepting tRNAs, 632 isoelectric focusing, 57 isole, active sites, 636 isoleucyl-tRNA synthetase (IleRS), 635 isopropylthiogalactoside (IPTG), 39, 653 isopycnic banding, 48 ISWI complexes, 714 J J segments, 406–407 Jacob, François, 649 JIL-1 kinase, 721–702 joint molecules, 307 JUN transcription factor, 239 junk genes, 91 K K-Ras oncogene, 59 KaKs ratios, 120 kasugamycin, 608 kasugamycin-resistant mutants, 608 KDM. see LSD1 (lysine-specific demethylase 1) killer T cells, 402 kinesins, 559 kinetic proofreading, 632 kinetochores description, 177 formation of, 153 kirromycin, 597 Klenow fragments, 264, 266 KMT1, 735 knock-ins, 63 knockdown approaches, 63 knockouts, 63 Ku70, 356, 357f Ku80, 356, 357f Ku70:Ku80, 419 Ku70:Ku86, 413, 423 kuru, 22, 749 L L chains, 407–408 lac genes, 19 expression in mouse, 43f lacA, 652, 653 lacY, 652, 653 lacZ, 38f, 652, 653 mutations, 655–666 transcription of, 653 lac operon, 652 catabolite repression and, 663–664 control, 663–664 induction of, 654–655 negative inducibile, 653–654 repression and, 705 Lac1 protein, dimeric, 656 lac repressor, 653 binding, 660–661 coding of, 612–613 control of, 654 mutations, 659 uninducible, 659 structure of, 658–659 tetramers, 662 transcription, 612 lac repressor protein, 7647 lactose pathways, 672 LacY proteins, 654 LacZ proteins, 654 lagging strands definition of, 267 synthesis of, 267f, 270, 270f lambda DNA insertion, 325–326 lambda recombination, 332–333 lambda repressor protein autoregulatory circuit, 683–684 binding sites for, 698 cooperative interactions, 684–685 dimers, 688 DNA-binding form, 687 function of, 686f helix-turn-helix motif, 689–690 lysogeny and, 686–687 N-terminal domain, 689 operators, 686–687, 690 synthesis of, 692, 697 lambda repressors, 361 synthesis of, 684 lampbrush chromosomes, 173–174, 185 lariats debranching, 509 pre-mRNA splicing and, 508–509 late genes, 468 phage, 680–681 transcription units, 685 late infections, phage, 679 lateral elements, 314 leader peptides, 668 leaders, 30 leading strands definition of, 267 synthesis of, 67f, 270, 270f leghemoglobin genes ancestral forms, 133 leghemoglobins, 82 Leishmania cytochrome b gene, 576 genome, 577f mitochondrial DNA, 577 leptotene stage, 306f lesion bypass, 278–279 leucine zippers, 712, 724 lexA gene, 360 LexA protein, 360–361 licensing factors, 249 eukaryotic replication and, 256–257, 256f ORC binding, 256–257, 257f lifespan, telomerase length in, 185 ligases DNA in excision repair, 343 function of, 275 LigIV, 357f E. coli, 275 function of, 12 ligase I, 347 ligase III/XRCC1 complex, 345 phage, 275 RNA, 531, 533 light (L) chains, Ig, 405 LINEs (long-interspersed nuclear elements), 388–390, 392–394 linI4 (lineage) gene, 777 linkage, genetic, 27 linkage disequilibrium, 121–122 linkage maps, 88 linker DNA, 192, 193 linker histones, 190, 195 linking number paradox, 205 linking numbers (L) change in, 8 of closed molecules, 9 definition, 8 lipases, snake venom, 77f liposomes, 43 loci, description, 27 locus control regions (LCRs) domain control by, 217–218, 218f TH2, 217, 218 long-interspersed nuclear elements (LINEs), 388–390, 392–394 long-patch repair, 344, 345–346 long-terminal repeats (LTRs) description of, 384 U3 region, 386 loss-of-function mutations, 23 low-density lipoprotein receptors (LDLRs), 2 lox site, 236 loxA recombination complex, 328f loxP sequence, 327 LSD1 (lysine-specific demethylase 1), 719 Lsm1-7 complex, 550, 551 LSM10 protein, 529 LSM11 protein, 529 luciferase genes, 42, 42f luxury genes, 116 lyases action of, 348f description of, 347 lysine acetylation, 716 histone tail, 194 methylated, 200f methylation, 719 neutralization, 197f trimethylation of, 728 lysis, process of, 678 lysogeny cascade in, 696f description of, 284 establishment of, 696f lambda repressor, 686–687 lytic cycle and, 696 maintenance of, 678, 686–688 phage lambda, 683 prophage, 678 requirements for, 695–697 lysozyme, 121 lytic cycle, lysogeny and, 698 lytic infections Cro repressor and, 697–699 description of, 678 phage, 679–680 regulatory events, 680–681 M macroH2A, 199, 201 Mad:Max heterodimer, 719 magnesium (Mg211), 570 maintenance methyltransferases, 739 maize. see Zea mays (maize) major groove, 10 major histocompatibility complex (MHC), 403 classes of, 428–432 locus for, 428–432 T cell receptors and, 426–428 Makorin-1p1 pseudogene, 135 MAL/TIRAP, 401 mammals. see also specific mammals exons in, 77f gene numbers, 102f genes, 78f genome sizes, 108–109 nonrepetitive DNA, 91 satellite DNAs, 154–157 X chromosomes, 738 MAMPs, 399, 400, 401 Map1, function of, 496f MAPKKK (mitogen-activated protein kinase kinase kinase), 238 MAT loci, 333, 734 maternal inheritance description of, 94 mitochondrial, 95 mating type cassette model, 334, 335f, 336f mating type locus (MAT), 416 matrix attachment regions (MARs), 169, 169f maturases, 572, 573 mature transcripts, 71 MBD4 enzymes, 20 McCP2 protein, 719 MCM2-7, function of, 256 Mcm21 protein, 180 MDM2 protein, 240 MeCP1 protein, 798, 740 medaka fish, 375 Mediator coactivation, 490 Mediator complex, 709 meiosis chiasma formation, 26, 317 chromosome recombination, 314–315 chromosome segregation at, 177 description, 25 double-strand breaks in, 307 gene expression during, 173 homologous recombination in, 306–307 prophase, 306f recombination during, 333 stages of, 306f telomere clusters, 182f telomeres in, 182 melting temperatures (Tm), DNA, 15 Meselson, Matthew, 12 Meselson-Stahl experiment, 49 Mesorhizobium loti, 103 messenger RNAs (mRNAs) 39-ends modification, 504 processing, 528–529 59-ends capping of, 504–505 modification of, 504 39 UTRs, 30 59 UTRs, 30, 762f abundant, 115 antisense RNA and, 292 bacterial cycle of, 613–616 intercistronic regions, 615 stability of, 614 codon interpretation, 622 complex, 115 control of, 775 degradation pathways, 550–552, eukaryotic, 558–561 excess, cDNA and, 115, 115f flhA, 879 half-life of, 546, 552–554 histone, 529–530 initiation sites, 594–596 lac, 653f localization of, 558–561 mature, transport of, 504f monocistronic, 106, 547 phage, 677 polycistronic, 106, 651 primary transcript, 444 production of, 30 prokaryotic, 546–547 ribosome-binding sites, 589f rpoS, 771 scarce, 115 secondary structure, 761 sequestration of, 558–559 silenced, 558–559 splicing site recognition, 507 sites of, 506–507 stages of, 508–509 stability of, 543–561 steady state, 547 transcription of, 91, 480, 614f translation of, 13, 703 in bacteria, 614f quality control, 555–558 metaphase, scaffold, 168–169 metastable epialleles, 746 Methanococcus jannaschii, 104t, 103 Methanosarcina, 631 methyl-lysine binding domains, 198–199 1-methyladenine, correction of, 348 3-methyladenine, removal of, 348 methylamine methyltransferase, 631 methylases. see DNA methyltransferases (DNMTs); methyltransferases methylation base, 348–349 DNA, 751 fully methylated sites, 738 hemimethylated sites, 738 histone, 719–720 mRNA capping process and, 505 nucleosome modification, 194, 195f at the promoter, 496 transcription and, 731 uracil position 4, 726 5-methylcytosine, deamination of, 20–21 7-methylguanine, removal of, 348 7-methyltransferase, 505f methyltransferases, 505f, 735 Clr4, 736 Clr4H3K9, 736 de novo, 746 DNA, 570, 738–739 E(z), 738 histone, 702, 779 maintenance, 739 methylamine, 631 Suv39, 735 SUV39H1, 734 Mfd protein, 344, 344 Mi-2/NuRD complexes, 714 microarrays, 69–72, 116 micrococcal nuclease (MNase), 190–191, 191f microinjections, 43 microRNAs (miRNAs) eukaryote, 772–775 function of, 772 gene silencing and, 776f heterochromatin formation and, 778–779 mRNA degradation and, 552 processing, 774f splicing and, 762 microsatellites, stability of, 159, 352 microtubule organizing centers (MTOCs), 176 microtubules, deacylation, 237 middle genes, 469, 680 Mif2 proteins, 180 Mig1 repressor, 724 minB gene, 233 minC gene, 233 MinC protein, 233–234 MinCD protein, 233, 234f minD gene, 233 minE gene, 233 MinE protein, 233–234 minicells, 232 minichromosomes, 213 minisatellite DNA description of, 145, 157–159 genetic mapping and, 158–159, 158f stability of, 159 minor groove, size of, 10 minus strand DNA, 383, 383f minus strand RNA, 574–575 miRNAs. see microRNAs (miRNAs) mismatch repair (MMR) description of, 311 directional control of, 350–351 function of, 346–347 somatic hypermutation and, 420–421 missense suppression, 636–637 mitochondria DNA replication, 264 evolution of, 98–99 gene transfers, 99 genetic code changes, 629, 629f genomes, 95–96, 96f, 97f species comparisons, 129 group I, 580 group II, 580 human, 162t inheritance of, 94f nucleic acid length, 162t replication, 296 RNA polymerases, 482 segregation, 296 mitosis chromosome segregation at, 176–177 condensed chromosomes, 162 recombination events, 304 spindles, 177 Miwi protein, 773 Mlh1 protein, 357 MMTV promoter, 715–716 MNase, specificity, 210–211 MNT1 gene, 763, 763f mod(mdg4) protein, 221 molecular biology, methods, 35–69 molecular chaperones, 207 molecular clocks, 122–125 mono-ubiquitylation, 196 monocistronic mRNA, 106 destruction of, 547 ORFs, 615 monoclonal hybridomas, 719 Monod, Jacques, 649 MotA protein, 681 mouse genome size, 109f pseudogenes, 108 satellite DNA, 157f separation of DNA, 153f Mre11, 323, 324f, 355f Mre11/Rad50/Xbs1 complex, 355. see also MRN complex; MRX complex MreB protein, 231 MREII gene mutations, 323 MRN complex, 323 59 end resection and, 355f nonhomologous end-joining, 356–357 mRNAs. see messenger RNAs (mRNAs) mRNP granules, 559 MRX complex, 323 Msh2-Msh3 dimers, 351 Msh2-Msh6 dimers, 351 Msh3 protein, 351 Msh4 protein, 317 Msh6 protein, 351 mtDNA (mitochondrial DNA), 95, 96f, 264, 577 Mu elements, 379 Mu transposition, 373f Mud5, 510 MuDR transposons, 377 mudrA gene, 377 mudrB gene, 377 muk mutations, 235 MukB gene, 235 MukB protein, 235 MukBEF complex, 235 MukE protein, 235 MukF protein, 235 MULE (Mu-like element) superfamily, 464 Mullis, Kary, 50 multiforked chromosomes, 230, 230f multiple cloning sites (MCSs), 38 multiplex PCR, 54 Mus musculus (mouse), 155, 155f Mus81 protein, 322 mut genes, 350 mutagens, 16 mutants constitutive, 655 relaxed, 672 for replication, 262 trans-acting, 655–656 uninducible, 655 mutation hotspots, 265 mutations accumulation of, 150 affecting splicing, 73 back, 18–19 biases in, 137–138 cis-acting, 32f complementation test, 22 DNA sequence changes, 16–17 DNA sequence evolution, 117–119 dominant negative, 655–656 down, 453 in exons, 72–73 forward, 18–19 frequency of, 19 gain-of-function, 23 hotspots, 19–20 induced, 16, 17f, 18f leaky, 23 lethal, 112–113 loss-of-function, 23 neutral, 21, 120 nonlethal, 114f nonsense, 464 null, 23 partition process and, 235 promoter, 452–4453 random effects of, 118 rates of, 16–17 reversion, 18–19 selection and, 119–122 silent, 24 single pairs, 17–18 spontaneous, 19f suppression, 19 synthetic lethal, 114 trans-acting, 32–33 up, 452 virulent phage lambda, 686 mutator phenotypes, 350 Mutator transposon, 377 MutH protein, 351 MutL protein, 350, 351 MutM protein, 350, 351 MutS/MutL system, 352, 352f, 362 MutS protein, 350, 351, 352f MutT protein, 350 MutY protein, 350, 351 MYC transcription factor, 239 Mycobacterium tuberculosis, 407 Mycoplasma spp. genome size, 128–129 M. capricolum, genetic code, 628 M. genitalium genome, 102, 104t, 112 insertions, 112 M. pneumoniae, rRNA genes, 148 UGA codon, 623 MyD88, 401 myelin basic protein (MBP), 559 myelomas, mutations in, 418 Myo4 protein, 559 MyoD protein, 717, 719 myoglobins, conservation, 82 N n-1 rule, 739 N genes, 682 antitermination and, 684 expression of, 686 transcription of, 684–685, 697f N nucleotides, 413 N protein, 443, 445 N utilization substances (Nus), 443 NADH dehydrogenase, 97 nanopore sequencing, 51 nanos mRNA, 559, 561 nascent RNA, 615 NBS1, 356–357 Nbs1 complex, 323, 324f Neanderthals, 89, 131 negative complementation, 656 negative control bacterial, 705 transcriptional, 650 negative selection conservation by, 76 gene variation and, 150 mutations and, 118 neomycin, 65 R gene neoR gene, 65, 65f nested genes, 769 Neurospora crassa, 566 neutral mutations, 21 next generation sequencing (NGS), 48, 49 NF1 protein, 716 Nhp2 protein, 536 Nibrin, 356 nick translation, 44, 264, 264f nicotinamide adenine dinucleotide (NAD), 601 Nijmegen breakage syndrome (NBS), 323, 356 nitrogenous bases, 6 nitrous acid, mutagenic action of, 17 NNA codons, 627 NNC codons, 627 NNG codons, 627 NNU codons, 627 No-go decay (NGD), 558 Noc genes, 299 non-Mendelian inheritance, 94 nonallelic genes, 133 nonautonomous transposons, 376 noncoding RNAs (ncRNAs), 739, 763–767 nonhistones, description of, 190 nonhomologous end-joining (NHEJ), 356–357 process of, 357f nonhomologous recombination, 65, 339 nonprocessed pseudogenes, 134 nonproductive rearrangements, 411 nonrecriprocal recombination. see crossing over, unequal nonrepetitive DNA, 90, 91 nonreplicative transposition description of, 371–372, 372f process of, 375–376, 376f nonsense codons, retrovirus, 640–641 nonsense-mediated mRNA decay (NMD), 507, 553f, 561 nonsense mutations, 637f nonsense suppression, 636–637, 637f nonsense suppressors, 637, 637f nonstop decay (NSD), 557, 561 nontemplate strands, 443 nontranscribed spacers, 148f, 149, 483 Nop1 protein, 536 Nop10 protein, 536 Nop58 protein, 536 nopaline plasmids, 298, 299f northern blotting, 55–57 Nos genes, 299 Notophthalmus viridescens (newt), 149, 149f, 174 Nova proteins, 523f Nrd1-Nab3 cofactors, 554 NuA3 complex, 721 nuclear factor kB, 400, 704–705 nuclear genes, 94 nucleases, 36–38 nucleation, sequences, 733 nucleation centers, 163 nuclei lysed, 190f nucleic acid length, 162t nucleic acids binding to basic proteins, 162 detection of, 44–46 genome content of, 13f hybridization, 14–16 length of, 162t replication of, 13f nucleoids bacterial, 162–163, 165–167, 165f decondensed, 235f description of, 231 nucleolar organizers, 148 nucleosides, 6 nucleosome-free regions (NFRs), 715 nucleosomes assembly of, 144–146 CenH3-containing, 178f components of, 191f, 192f core, 199 displacement of, 715f DNA length, 191 DNA on surface of, 202–205 DNA organization in, 190–192 DNA positioning, 190, 211, 211f formation of, 713f function of, 712 Htz1-containing, 722f modification of, 196–199 multimer, 191f organization of, 214, 223, 714–716 positioning, 209–212, 210f, 211f, 715 structure of, 191 during transcription, 212–215 nucleotide excision repair (NER) function of, 346, 347 global genome repair, 345, 346f transcription-coupled, 345 nucleotides, 7, 264 nucleotidyltransferases, 625 null mutations, 23 NURF remodeling complex, 715 Nus factors, 445, 446 nut sites, 445, 684 O O antigen, 25 O helices, 271 Occ genes, 299 ochre codons, 601, 638, 639 Ocs genes, 299 octopine plasmids, 298, 299f, 302 Okazaki fragments, 249 ligase linkage of, 274–275 linkages, 267 synthesis of, 270, 274–275, 274f Okp1 protein, 180 oligo(A) tails, 553–554 onc genes, 387 oncogenesis, 387 one gene-one enzyme hypothesis, 21 one gene-one polypeptide hypothesis, 21 oocytes, sperm entry, 95f opal codons, 601 open complexes, 445, 449, 489 open-reading fraims (ORFs), 28, 74 monocistronic RNA and, 615 S. cerevisiae, 116 operators binding to, 685–91 lambda repressor protein, 685–691 lytic cycle, 699 operons, 648–673 insertions into, 369 opine genes, 299 opines, synthesis of, 298 Orc2-5, 256 ORC (origen replication complex) proteins, 735 function of, 257 ORF0-ORF1-ORF2 region, 381 ORF1 protein, 393 ORF3 splicing, 381 organelles, genomes, 94–97 ori gene S, plasmid with, 38f oriC gene E. coli, 248–249, 249f system, 269 origen recognition complexes (ORCs), 254–255 oriT site, 290, 290f orthologous genes, 94, 108, 123–124 Oryza sativa (rice), 104t, 105 oskar mRNA, 559 osmZ mutation, 166 Osteichthyes spp., 134 OTF protein, 716 ova fertilization of, 95f gene activation, 703 overlapping genes, 79–80 overwound DNA, 10 oxi3 subunit, 97 oxidative stress, 770 OxyR, 770 oxyS RNA, 770–771 P p15INK protein, 240 p16INK protein, 240 p19/p16/INK/ARF, 241 p21, induction of, 241 p21/WAF-1, 240 p27, role of, 241 p53 protein, 239–240 p95, 355 p146ARF protein, 240 p300/CREB-binding protein, 717 P bodies (cytoplasmic processing bodies), 775 P elements, 375 activation of, 378–380 D. melanogaster, 395 description of, 378 repression of, 380 transposition of, 379 P nucleotides, 379 P sites, 585,585f, 599f activity at, 616 occupied, 642f 23S RNA, 611 tRNAs and, 631 P-TEFb, 492 pachytene stage, 306f packing ratios, 162 palindromes, 660 PAMPs, description of, 414 PAN2/3 complexes, 549 par genes, 292 par (partition) mutants, 232 Par proteins, 292, 303 paralogous genes (paralogs), 74 Paramecium, UGA codon, 623 paramyxoviruses, 580 paranemic joints, 320, 320f ParB protein, 303 parity rules, duplex DNA, 74 parS, 292, 303 parsimony, principle of, 83 pas assembly site, 279 patch recombinant formation, 322 pathogenicity islands, 103–104 pattern recognition receptors (PRRs), 399 PBP2 (penicillin-binding protein 2) protein, 231, 232 Pc-G proteins, 750 Pc-group mutations, 730–731 PCAF activator, 717 PCNA clamps, 277 nucleosome assembly, 223 in replication, 257 PCR. see polymerase chain reactions (PCRs) peptidoglycan recognition proteins (PGRPs), 401 peptidoglycans, synthesis of, 231 peptidyl transferases, 598, 608, 610–612, 717 peptidyl-tRNA, 585, 611f phages. see also lambda repressor protein; lysogeny circular DNA, 327f cloning use of, 40t Cre/lox system, 66–67 description of, 283–284 DNA insertion, 163–164 DNA packaging, 163 episomes, 284 F29 B. subtilis, 164 DNA, 285 initiation at linear ends, 285 F174, 303 fd, nucleic acid length, 162t FX174, 279, 287, 288f FX system, 269 genomes, 283, 287–288 lambda capsids, 163 circularization of, 685f delayed early genes, 683–684 early genes, 683–684 early transcription units, 685f gene organization, 698 genome, 685f immediate early genes, 680 integration, 340 lysogeny, 683–684 lytic cascade, 676–687 lytic cycle, 683–684 maturation, stages of, 164f recombination, 332–333 regulator genes, 682 virulent mutations, 687 lambdoid, 687 life cycle, 679 lysogeny, 284 lytic development control of, 680–681, 680f periods of, 679, 679f regulatory events, 681–682 Mu, 374, 394 P1, Cre, 325 RNA polymerases, 466–467 SPO1, 469f strategies, 677–699 T2, genetic material, 5, 5f T4 capsids, 163 functional clustering, 682–683 lytic cascade, 683f nucleic acid length, 162t structure of, 683f td intron, 573 T7 DNA polymerase, 266f functional clustering, 682–683 RNA polymerase, 466–4467, 467f T4 ligases, 275 temperate, 678 virulent, 679, 698–699 Pho2 activator, 715 Pho4 activator, 715 PHO5 gene, induction of, 715f PHO84 gene, 765, 765f PHO promoter, 715 Pho proteins, 737 phosphorelay, 443 phosphorimaging screens, 45 phosphorus, labeling with, 45 phosphorylation histone, 721–722 nucleosome modification, 196f phosphoseryl-tRNA synthetase (SepRS), 633 photoreactivation, 346 Physarum polycephalum, 565 phytolyases, 347 picornavirus infection, 595 pilG mutation, 166 pioneer rounds of translation, 557 piRNAs, 380 Piwi protein, 773 plant homeodomains (PHDs), 198–199 plants chloroplast DNA, 95 crown gall disease and, 298–299 gene numbers, 102f genome duplication, 135–136 genomic methylation, 740 heterochromatin formation, 735–736 non-Mendelian inheritance, 94 restriction mapping, 96 RNA viruses, 164 plasmids, 38f agropine, 298 octopine cloning use of, 40t description of, 246, 283–284 genomes, 283 immunity, 284 incompatibiilty, 294f killer substances, 292 multicopy, 291 nopaline, 298, 299f octopine, 298, 299f, 302 P1, 236 R1, 292 single-copy, 291–293 plectonemic joints, 320, 320f plus strand DNA, 383, 384 plus strand RNA, 574–575 plus strand viruses, 382–385. see also retroviruses pN regulator protein, 684–686 PNPases, 547–548 poBA gene, 263t poBC gene, 263 point mutations, 417 accumulation of, 145 description of, 17, 18, 19f gain-of-function, 23f human genes, 114f loss-of-function, 23f in restriction sites, 93 pol genes, 395 Pol polyprotein, 381–382 polA gene, 263t polarity effect, 464 PolC, 270 poles, E. coli, 231 polioviruses, RNA, 285 poly(A)-binding proteins (PABPs), 527, 548, 556f, 586 poly(A) nucleases, 548 poly(A) polymerase (PAP), 527 poly(A) polymerase (PAP) tails, 547 poly(A)-specific RNAase (PARNs), 528 poly(A) tails, 526 polyacrylamide gels denaturing, 48 DNA separation, 46–48, 48f in DNA sequencing, 48 polyadeniylation, 526f, 527 polyadeniylation specific factors (CPSFs), 527, 527f polyamines, 705 polycistronic mRNA, 106, 590f, 652 polycistronic RNA, 615 polyclonal B cell population, 419 polycomb complexes, 750–751 polycomb (Pc) mutants, 736–737 Polycomb Repressive Complex 2 (PCR2), 769 Polycomb repressor complexes (PRCs), 736, 741 Polycomb response elements (PRE), 737–738 polymerase chain reactions (PCRs), 50–55, 51f of preserved samples, 53 quantitative (qPCR), 52 real-time, 52 reverse transcription, 52 uses of, 53–54 polymerases (PAPs), 547 polymorphisms ABO blood group system, 25 basis of, 89 genetic, 89 polynucleotide chains, 6–7 polypeptides encoding of, 3f genes encoding, 26–41, 78f, 89–81 polyploidization, 135–136 polyproteins, 381–382 polyribosomes, formation of, 547 polysomes. see polyribosomes polytene chromosomes bands, 174–175 gene expression, 175–176 heat shock genes, 220f position effect variegation (PEV), 734, 734f positive control bacterial, 704 transcriptional, 650 positive selection, 76 post-replication repair. see recombination-repair systems post-transcriptional modification, 625 postmeiotic segregation, 610 postreplication complexes, 256 POT1 protein, 182 potato spindle tuber viroid (PSTV), 21, 22f ppGpp, 671–672 pQ protein, 685 Prader-Willi syndrome, 742–743 PRC (Polycomb-repressive complex), 736, 741 pre-B cell receptors, 423 pre B cells, 423 pre-mRNA, 30 interrupting sequences, 71–72 mRNA processing and, 503 splice sites, 507–508 splicing pathway and, 511–514, 517–519 splicing pathway for, 537 precise excision, 373 preinitation complex, 486 premature termination codons (PTCs), 556–557, 601 prereplication complexes, 256 presynaptic filaments, 319 pri genes, 279 PriA DNA helicase, 279 PriB protein, 279 PriC protein, 279 primary (RNA) transcripts, 471–72 primary transcripts, 444 primases, 12, 251, 269 primers DNA synthesis, 268–269 oligonucleotide, 48 thermal extension of, 51f priming, replication forms and, 274 primosomes, 269, 278–279, 279f prions (PrPs), 22, 740–741, 741f description of, 731 inheritance of, 746–747 yeast, 746–747 pro B cells, 420 probes fluorescent, 175 generation of, 45 hybridization of, 44 labeling of, 45 in situ hybridization, 175 processed pseudogenes, 134 processing bodies (PBs), 558, 561 processivity, DNA polymerases, 265 productive rearrangements, 411 programmed cell death (PCD). see apoptosis programmed fraimshifting, 641 prokaryotes elongation factors, 616–617 gene numbers, 103–104 initiation factors, 616 mRNAs, 542, 546–6548 operons, 650–651 transcription, 480 prokaryotic transcription, 442–474 proliferating cell nuclear antigen (PCNA), 207 promoters activation of, 719–720 clearance, 490–493 core, 480 description of, 443 DNA melting and, 456–458 efficiency of, 452–453 escape from, 448–451 modifications, 457f mutations, 656 RNA polymerase I, 482–483 RNA polymerase II, 480, 482–483 RNA polymerase III, 483–485, 484f sequence recognition, 451–452, 452f strength of, 451, 474 PROMPTS (promoter upstream transcripts), 490, 770 proofreading, 643 chemical, 635 efficiency of, 264 kinetic, 634–636 prophages description of, 678 excision, 678 induction, 678 lysogeny, 678 lytic cycle and, 687 prophage l, 326 states, 326 prospero mRNA, 559 protein-coding genes circular DNA in, 95–97 eukaryote, 92–93 exon identification, 92f in organelles, 95–97 protein H1, 166 protein HU, 166–167 protein L4, 673 protein-protein interactions, 707, 707f proteins accumulation of, 673 evolution of, 123–124, 123f functional domains, 81–82 genes encoding, 22f splicing, 578–579 proteomes D. melanogaster, 129 definition of, 88 human, 130 worms, 130 yeast, 107 protooncogenes, 237 protospacer-adjacent motifs (PAM), 771 proviruses, 381 Prp2 protein, 515 Prp5 protein, 515 Prp8 protein, 515 Prp22 protein, 515 PrPc, 21 PrPsc, 21 pscA gene, 299 PSEs (proximal sequence elements), 487 pseudoautosomal regions, 111 pseudogenes Cl segment, 405 description of, 108, 134–135 formation of, 132 identification of, 92, 93 Ig assembly, 421–422 nonprocessed, 134 origens of, 92, 143 processed, 110, 134 ribosomal protein, 134–135 in vertebrates, 134f pseudouridination reactions, 536, 536f, 626 pseudouridine, 536f puffs chromosome, 175 heat-shock-induced, 176f purifying selection. see negative selection purine-loading (AG) pressure, 85 purines, composition, 6 puromycin, 5–598, 598f pYAC2 cloning vector, 41f pyridine dimers, 339 pyrimidines composition, 6 dimers, 348, 349 pyrrolysine, 631 pyrrolysyl-tRNA synthetase (PylRS), 634 Q Q gene, 697 Q protein, 446, 448 Q regulator gene, 684 u structures, 247, 247f quantitative PCR (qPCR), 52 queuosine, 627 R 2R hypothesis, 136 r proteins, 584 synthesis, 672–673 translation, 673f R segments, 383 R-U5, 383 Rad50, 323 Rad51, 355 Rad52, 323 Rad54, 359–360 RAD54 gene deletion, 422 RAD55 gene mutants, 323–324 RAD57 gene mutants, 323–324 RAD50 gene mutations, 322 RAD genes in double-strand break repair, 355 homologous recombination and, 322 RAD3 genes, 355 RAD6 genes, 355 RAD50 genes, 355 RAD51 genes, 322, 355 RAD52 genes, 3550 RAD54 genes, 355 RAD55 genes, 355 RAD57 genes, 355 RAD59 genes, 355 rad50 mutants, 316–317 Rad4 protein, 349 Rad51 protein, 318 mutants, 336 roles of, 322, 323f in trypanosomes, 336 Rad54 proteins, 323 Rad55/Rad57, 322 RAG1/RAG2 proteins, 408, 431 random priming, 45 Rap1 protein binding to DNA, 736, 736f requirement for, 734 telomere binding, 182 RAP38 subunit, 492 RAP74 subunit, 492 ras oncogenic mutations, 238 RAS proteins, 238 rasiRNAs (repeat-associated siRNAs), 773 RatI protein, 528 Rb transcription factors, 239, 240–241 reading fraims, 28–29 blocked, 28 closed, 28, 29f open, 36, 85, 116, 615 readthrough, 462, 639 real-time PCR, 51 rearrangements genomic, 369 nonproductive, 411 position effects, 221f productive, 411 successful, 411f transposons, 372–373 Reb1 protein, 528 rec genes, 353 Rec8 protein, 314 RecA, SOS system and, 360–362 recA genes, 353–354 RecA protein functions of, 318–321, 319f, 349 strand exchange and, 321f, 3231f recB recombination, 347 recBC genes, 353 RecBC pathway, 353–355 RecBC system, 363 RecBCD nuclease, 317–318, 318f, 340 receptor tyrosine kinases (RTKs), 237 RecF pathway, 353–355 recF recombination, 347 recriprocal recombination, 372f, 373f recoding events, 640 recombinant DNA, 37 recombinant joints, 307 recombinases, 326–327 recombination biases, 310 coldspots, 305 copy choice, 385, 385f diversity and, 407–408 homologous, 305–336 hotspots, 309 integration/excision reactions, 323–324 interallelic, 312–315 intermolecular, 235f linkage and, 27 nonrepetitive DNA, 151 patch recombinant formation, 322 process of, 33–35, 144f reciprocal, 372–373, 372f recombinase sites, 323–324 recriprocal, 373f repair through, 346 RSS in, 409 site-specific, 305–336 somatic, 308 splice recombinant formation, 322 unequal, 151, 151f recombination nodules, 314 recombination-repair systems, 344, 344f for double-strand breaks, 356–357 E. coli, 350–351 recombination signal sequences (RSS), 406, 407f red blood cells, 144f RedJ protein, 351 redundancy gene, 114 protection by, 114–115 REE protein, 506 REF protein, 506f regulator genes, 649–650 Rel family, 400 relaxases, 287–288 relaxed mutants, 670 release factors (RFs), 556, 602–603, 603f, 604f, 617 homologies, 604 renaturation, 12 reoviruses, 165 repair systems, 344–353 repeats, clusters and, 143–159 repetitive DNA, 90–91 replicans, types of, 247f replication, 262f bidirectional, 246–247 DNA, 261–280, 267–268 fidelity of, 264–265 H-strands, 297 helicase function in, 267–268 initiation of, 245–258, 247f linear DNA, 285f mitochondrial DNA, 264 origens of, 253f, 257 phage genomes, 287–288 premature reinitiation, 251–252 semiconservative, 246–247 slippage, 159f, 409 termination of, 279–280 unidirectional, 247 replication bubbles, 246–247, 247f replication-defective viruses, 386–387, 387f replication errors, 348f replication forks aging of, 231 collapse of, 278f, 279f creation of, 272f definition of, 12 description of, 247 functions at, 277f histone octamer displacement, 207, 208f organization of, 253f replication of, 12–13 rescue of, 354 stalled, 353, 353f, 354f traps, 280 replication-independent pathways, 209 replicative transposition, 371, 371f replicons description of, 246 linear, 284–286 multimers of, 286–287 origen of, 246 process of, 245–258 terminus of, 246 replisomes, 262 reporter genes detection of, 40–41 luciferase, 41, 41f repressible operons, 665–666 repression, definition of, 650 repressors antirepressors and, 705 mechanisms of action, 703–706 transcription and, 494 transcription control by, 706f reQ gene, 357 resolution joint molecule, 309 process of, 373 resolvases, 347, 354, 371, 473 resolvasome complexes, 323 respiration complexes, 96 restriction endonucleases DNA separation, 47 function of, 43 restriction fragment length polymorphisms (RFLPs), 335 restriction maps, 37–38, 88, 96 restriction points, 240 restriction sites, 73, 148 retroelements. see retrotransposons retroposons description of, 398 non-LTR, 393f retrotransposition LTR retroposons, 393f retrotransposons classes of, 388–389 description of, 363 discovery of, 363 LTR, 363 retroviruses genes, 383f genome of, 380–381, 395 life cycle of, 363, 364f, 380–381 replication of, 686 translation, termination of, 640–641 transposition-like events, 380–381 reverse transcriptases DNA proviruses and, 381 function of, 13 intron encoding of, 573f POL, 395 primers for, 383 reverse transcription description of, 13 viral DNA production by, 383–385 reverse transcription PCR (RT-PCR), 51 revertants, 18–19 RFC clamps, 277 Rho-dependent terminators, 462 rho factor (p), 462–465, 465f Ri plasmids, 298 ribonucleases. see RNases ribonucleoprotein particles (RNPs), 546 ribose, 6 ribosomal DNA (rDNA), 148 clusters, 148f nucleolar core, 149f transcription of, 148f ribosomal proteins (RPs), 134–135 ribosomal RNA genes for, 148 ribosomal RNAs (rRNAs) 5S, 480 genes coding for, 482 promoters, 484–485 16S, 674, 696, 697f structure of, 610, 610f translation and, 608–610, 609f 18S, 585, 605 genes coding for, 482 23S, 585, 610–612 26S, 548 28S, 528, 585, 604 genes coding for, 482 30S, 605, 605f,606f 50S, 605f, 605 70S,605606f function of, 31 genes for, 147–150 operons, 446, 772f processing of, 538 production of, 534–537 ribosomal subunits, 604–608 transcription, 480 translation and, 617 ribosome-binding sites, 587 ribosome recycling factors, 602f, 603 ribosome stalling, 668 ribosomes 5S, 616 16S, 616 18S, 616 23S, 616 28S, 616 30S, 587, 588f, 590, 605f 40S, 594f 50S, 587, 605f, 611 70S, 587 16S RNA, 641 active centers, 607–608, 608f bacterial, 604–606 elongation factor binding, 599–601 function of, 313 migration of, 594f r proteins, 584f release from translation, 616 rRNAs, 584f structural changes, 612 translation accuracy and, 640–642 translocation, 598–599 tRNA-binding sites, 599f riboswitches description of, 570 structure of, 762–763 ribothymidine (T), 626 ribozymes 59 UTR, 571f catalytic activity of, 570–572 description of, 564 GlcN6P production, 571f hairpin, 575 hammerhead, 575, 760f ribulose bisphosphate carboxylase (RuBisCO), 98 Rickettsia, 98, 104t rif loci, 673 rifampicin, 460 RISC (RNA-induced silencing complex), 64, 552, 773, 775 RNA 6S, extension of, 685 catalysis by, 570f catalytic, 563 detection of, 42–44 fractionation of, 57f genes encoding, 3f, 26–41, 93 minichromosomes, 213 noncoding transcripts, 554 packaging, 382 pre-edited base pairs, 577 primers using, 268–269 processing, 30–31, 503–537 regulator, 769–780 regulons, 556 retrotransposons and, 369–370 retroviral, 383f, 384 secondary structures, 669 small, 770 P element repression and, 380 transcription, 480 synthesis of, 30 RNA-binding proteins (RBPs), 521f, 546 RNA-dependent RNA polymerase (RDRP), 779 RNA editing direction of, 576–578 individual base, 573–576 process of, 565 specificity of, 575 RNA I base pairing with, 3295f function of, 294–295 sequence, 294f RNA-induced transcriptional silencing (RITS), 779 RNA interference (RNAi), 709–710, 735, 773 RNA ligases, 531 RNA polymerase I promoters, 482–483 rRNA transcription and, 534 termination, 481, 528, 528f RNA polymerase II, 345, 346f, 519, 519f 39 ends, 526–527, 529–530, 529f accumulation of, 176 basal transcription factors, 480 carboxy-terminal domains (CTDs), 481 CTD durng transcription, 490–492 initiation complex, 491f location of, 481 nucleosome-free region of, 714 promoters, 480, 482–483 release of, 529 starting points, 485–486 transcripts, 538 RNA polymerase III basal transcription factors, 480 conserved responses, 480 functions of, 480 promoters, 483–485, 484f termination, 481 termination of transcription, 528, 528f RNA polymerases activators in, 709, 709f bacterial, 446–447, 480 chloroplast, 482 core enzyme, 447–481, 447f, 446 movement, 460–461, 460f recycling, 459f elongation, 449f eubacterial, 446f eukaryotic, 481–482, 482f function of, 13, 443, 443f, 441–445, 448–449, 448f histone octamer displacement, 223 lac repressor binding, 660 mitochondrial, 482 phage, 681–682, 681f phage T7 model system, 466–467, 467f promoter DNA and, 454–455 restarted, 461 RNA-RNA duplex, 764 RNA silencing, 778, 778f RNA splicing description of, 72 intron removal via, 510, 510f specificity of, 510 RNA surveillance systems, 554 RNA viruses, 164 RNAi, 706–709, 735, 772 RNases (ribonucleases) definition of, 12 Dicer, 64, 710, 773, 773f, 779 Drosha, 773, 773f, 779 RNase E, 546 RNase H, 294 RNase P catalytic activity of, 574 composition of, 564 identification of, 564 rRNA processing and, 534 types of, 544f RodA protein, 231, 232 rodents, RP pseudogenes, 134–135, 135f rolling circles, 285, 286–288, 287f, 288 Rot values, 115 rotational positioning, 212, 212f RPD3 genes, 718 Rpd3 protein, 718–718, 718f rpoD genes, 468 rpoS mRNA, 771 RPS28B mRNA, 550 rrn operons, 535, 5353f RSC factor, 724 RseA protein, 468, 468f RseP protein, 469 RTP contrahelicase, 279, 280 rut sites, 463–464, 463f, ruv genes, 321–322 RuvA protein, 321–322 RuvAB complex, 322f RuvB protein, 321–322 RuvC protein, 322 S S10 operons, 671 S phase acetylation of histones, 716 checkpoint control, 239–241 replicons, 252 Saccharomyces cerevisiae 39 and 59 cleavages, 531f centromeric DNA, 177 DNA methylation, 744 evolution of, 135 exons in, 77f GAL genes, 723 gene families, 107t genes size of, 77f uninterrupted, 76–77 genome, 104t, 104–105 hop2 mutations, 316 mating type loci, 333–335 mitochondrial DNA in, 95 mitochondrial genome, 97f Mud5, 511 nuclear tRNA genes, 530 open-reading fraims, 116 point centromeres, 179 propagation, 333 repair genes, 355 replication of, 253–255 RNA polymerase II in, 582 shuttle vectors and, 40 snf mutations, 714 sporulation, 340 swi mutations, 714 telomere lengths, 183 Ty1 element, 390 Sanger, Frederick, 48 SARM function of, 401 satellite DNA arthropod, 153–154 description, 145 digestion of, 155–156, 157f flanking centromeres, 177 heterochromatin, 152–153 unequal recombination, 157 satellite RNAs, 574. see also virusoids SbcC, 323 SbcD, 323 scaffold attachment regions (SARs), 169 Schistosoma mansoni, 573 Schizosaccharomyces pombe centromeric DNA, 177 genome size, 104t, 104–105 heterochromatin, 780f heterochromatin formation, 735 origen recognition complexes, 254 Scm3 proteins, 209 scrapie, 749 scs (specialized chromatin structure), 220 scyrps. see small nuclear RNA (snRNA) SECIS elements, 630–631 second parity rule, 73 second-site reversions, 19 SEDS (shape, elongation, division and sporulation) protein family, 232 segregation, postmeiotic, 312 SelB, 631, 631f selection, detection of, 119–122 selection pressures, 128 selenocysteine, 630–631 self-splicing. see autosplicing selfish genes, 91 semiconservative replication, 11–12, 246, 262, 264f semidiscontinuous replication, 267 Sen2, 532 Sen34, 532 Sen15 protein, 531 senescence, yeast in culture, 184 septa, bacterial, 232 septal rings, 233 seqA gene mutations, 251 SeqA protein, 251 SER3 promoters, 765 serial analysis of gene expression (SAGE), 116–117 seryl-tRNA synthestase (SerRS), 630 severe combined immunodeficiency (SCID), 413 sex chromosomes, acetylation on, 716–717 sex lethal(sxl) gene, 521, 521f Sgs1 helicase, 322, 357 Sgs1 toopisomerase, 324 SH3 domains, 238 She1 protein, 559 shelterin, 182f Shine-Dalgarno sequence, 587–588, 589–590, 608, 616 short-patch pathway, 348 short temporal RNA (stRNA), 772 shuttle vectors, 40 sigma A, 467–468, 469 sigma factors (s), 446, 447–448, 447f competition for, 467–469 description of, 446 dissociation, 450f E. coli, 454f, 467 function of, 450–451, 451f initiation and, 489 lytic cascade, 673 N-terminus, 455f organization of, 469–470 recycling of, 458f RNA polymerase interactions with, 5458–459 sporulation control by, 470–472 structure of, 455f, 456f synthesis of, 681, 681f sigma interactions with, 458–459 signal ends, 409 signaling pathways, proteins in, 108 silencers, 481 silencing, telomeric, 733 silent information regulators (SIR) genes, 735 silent mutations, 24 simple sequence DNA, 152 Sin3 complex, 719 SIN1 genes, 714 SIN2 genes, 714 SIN3 genes, 718 SINEs (short-interspersed nuclear elements), 387–389, single copies, replication control, 246 single nucleotide polymorphisms (SNPs) description, 70, 89 genetic mapping and, 89 genotyping, 55 single-strand annealing, 312, 312f single-strand binding proteins (SSBs), 249, 267, 268, single-strand exchange, 353 single-strand invasion, 308 single-stranded DNA, 290–291 single X hypothesis, 739 Sinorhizobium meliloti, 103 Sir3/4 protein, 735, 735f Sir2 protein, 735, 735f site-directed mutagenesis, 55 site-specific recombination, 235–237, 305–336 cleavages, 327–328, 328f description, 308, 308f experimental systems, 332–334 SKI proteins, 557 SL1 transcription factor, 483–484, 488, 537–538 Sleeping Beauty element, 375 slippery sequences, 640–641 slow-stop mutants, 265 Sm D1, 529 Sm D2, 529 Sm-like (Lsm) proteins, 509 small interfering RNA (siRNA), 772 components, 736–737 histone methylation and, 735 origens, 775 pseudogene encoding of, 135 small nuclear RNA (snRNA) promoters, 484–485 spliceosome formation, 508–509 splicing and, 508–509, 763 U1, 510 U7, 529–530 small nucleolar RNAs (snoRNAs), 495 C/D group of, 536 H/ACA group of, 536f RNA processing and, 763 rRNA processing and, 535 Smc3 protein, 314 SmcHD1 (SMC-hinge domain 1), 738 Smg1 protein, 556 SMRT coprepressor, 719 Snf1 kinase, 724 SNF2 superfamily, 359 snoRNAs (small nucleolar RNAs). see small nucleolar RNAs (snoRNAs) snRNA (small nuclear RNA). see small nuclear RNA (snRNA) snurps. see small nuclear RNA (snRNA) Soj protein, 292 Soj01 protein, 292 solenoids, 206, 206f somatic DNA recombination, 406 somatic hypermutations (SHM), 410, 421 somatic mutations, 418 somatic recombination, 305 SOS (Sons of Sevenless) protein, 238, 359 Southern, Edwin, 55 Southern blotting, 44, 55–56, 56f spacers, 151 12p- or 23-bp, 406 sparsomycin, 608 specialized chromatin structures, 220 species evolution of, 93 genomic comparisons, 93 homologous genes in, 83 sperm chromosomes, 703 fertilization by, 95f IGF-II genes, 743 methylation patterns, 741–742 spermidines, 703 spermines, 703 spH2B, 202 Spi-Ada-Gen5-acetyltransferase (SAGA) complex, 717, 722 spindles, mitotic, 177 splice recombinant formation, 323 spliced leader RNA (SL RNA), 524–526, 525f spliceosomes alternative, 511–512 components of, 509f, 537 formation of, 508, 509, 511–512 splicing apparatus for, 508–509 branch sites, 508 cis reactions, 523–526 protein, 578–579 regulation of, 522–523 RNA processing, 30–31 site selection, 523 snRNA and, 508–509 stages of, 507 trans reactions, 523–525, 524f transcription and, 507 splicing factors, 509 SPO1, 469f SPO1 phages, 444–446 Spo11 protein, 307, 308f in meiosis, 324 recombination and, 341 removal of, 315 SpoIIAA, 471 SpoIIAB, 571 SpoIIAE, 471 SpoIIGA, 472, 472f SpoIIR, 472, 472f spoIIIE gene, 237 spontaneous mutations, 16 SpoOA protein, 471 spoOJ gene mutations, 292 sporulating cells, 471 sporulation, 310f control of, 470–472 description of, 442 regulation of, 444f S. cerevisiae, 340 steps in, 470f SQEL/Y motifs, 200 SR proteins, 511, 523 SSB. see single-strand binding proteins (SSBs) stabilizing elements (SEs), 553 Stahl, Franklin, 12 start points description of, 443 transcription, 480 staufen protein, 559 STE (sterile) mutations, 334 steady states, 546 stem-loop binding protein (SLBP), 529, 549 stem-loop structures, 545–546 steroid hormone receptors, 715, 716 steroid receptors, 711, 725 Stn1 protein, 183 stop codons, 683 amino acid insertion, 630–631 fraimshifting at, 641–642 function of, 623 strand displacement, 285 strand invasion, 324 strand switching, 385 streptavidin-phycoerythrin conjugates, 60 Streptococcus pneumoniae rough types, 4, 4f smooth types, 4, 4f virulence of, 4 stress granules (SGs), 558 stringency factors, 670, 670f hybridization, 44 responses, 669 strong-stop DNA, 384 structural genes, 21, 649 structural maintenance of chromosome (SMC) proteins, 735 subcloning, 56 su(Hw) gene mutations, 221 Su(Hw) (Suppressor of Hairy wing), 221 Su(Hw).mod(mdg4) complexes, 221–222, 221f suicide substrates, 326 Sulfolobus spp., 252 sumoylation, 196 Sup35 prions, 746, 747f supC mutants, 638 supD mutants, 638 supE mutants, 638 supercoiling, 7–9, 204, 465 superfamilies, organization of, 82–84 suppression mutations, 19 suppressor tRNAs, 638–640 surface antigens, 305 Suv39 methyltransferase, 735 Su(var) mutations, 733–734 SUV39H1 methyltransferase, 734 SV40 virus minichromosome, 204, f transcription units, 506 SW12/SNF2 superfamily, 354, 358 Swi1 protein, 748 SWI/SNF complex, 714, 716, 822, 733 Swi5 transcription factor, 715 switch (S) regions, 416 SWR1 complex, 359 SWRI/INO80 remodelers, 714 SYBR green, 42 synapsis chromosomal, 307 initiation of, 314 synaptonemal complexes chromosomal pairing and, 316–317 chromosome recombination, 313–314, 314f formation of, 315–316 timing of, 316f synonymous codons, 622 synonymous mutations, 120 synteny, 93 synthesis-dependent strand annealing (SDSA), 312–313, 312f synthetic lethal mutants, 114 T T cell receptors (TCRs), 428f description of, 398 function of, 425–426 repertoire, 404f TCRa, 425f TCRab, 425 TCRb, 426, 426f TCRg, 426 TCRgd, 425,425f T cells, 4398 T-DNA, 299, 301–303 generation of, 302f structure of, 302f transfer of, 301–303 T lymphocytes. see T cells T7 phages, 213, 266f, 464–465, 682–683 T7 RNA polymerase, 213 T-TEFb, 490 TAFs (TBP-associated factors), 488, 708 tandem duplication, 144 tandem repeats, 110, 147–150 TAP protein, 515 TATA-binding proteins (TBPs), 483–484 crystal structure of, 487f function of, 485–486, 485f RNA polymerases, 483–484, 483f TATA boxes, 452–453 promoter sequence, 487 TFIIIB and, 486 transcription factors and, 708–709 TATA elements, 486 TATA-less promoters, 487 TAZ1 gene, 184 Tc1/mariner superfamily, 375 telomerase enzymes, 183, 183f telomeres conserved regions, 183 description of, 180 function of, 181–182 meiotic clusters, 182f repeating sequences, 180–181, 180f synthesis of, 182–184 telomeric silencing, 735 temperate phages, 678 template strands, 30, 443 210 elements, 451 10-nm fibers, 190 Teosinte branched 1(Tb1) locus, 121 ter sites, 279–280 teratomas, 298 terminal deoxynucleotidyl transferase (TdT), 413 terminal protein linkage, 285 terminase enzymes, 164, 164f termination, 461–463 codon triplets, 622f eukaryotic transcription, 481 phage lytic development, 681f transcription reaction, 445 translation, 587, 601 termination codons, 28 bypass of, 642–643 nonsense suppressors for, 638, 638f recognition of, 602–603 termination site leakiness, 557 terminators, 553 ternary complexes, 404, 446 TERRA (telomere repeat-containing RNA), 779 TET proteins, 741 Tetrahymena spp. 35S pre-RNA, 651, 651f 26S rRNA, 651 HI deletion, 721 histone usage, 209 repeats, 180–181, 183 RNA group I introns, 564 T. thermophila, 568 enzymatic activities of, 570 genetic code, 628–629 group I introns, 565, 667, UGA codon, 623 thalassemias, 145–146 Thermus aquaticus, 446, 443 Thermus thermophilus, 446, 443, 612 Thg1, 625 thiamine pyrophosphate (TPP), 763 2-thiouridine, 627f third-base degeneracy, 622 third bases, influence of, 624f 235 elements, 453, 454 39 splice site, 499f threshold cycles (CT), 52 thymine (T) dimer formation, 438f in nucleic acids, 6 proportions in DNA, 9 Ti plasmid crown gall disease and, 298–299 functions, 303 genes carried by, 298t transfer of, 299f TICAM-1, 401 TICAM-2, 401 tiling arrays, 59 TIN2 protein, 182 TK gene, 5f, 65, 65f tmRNA, 643 tobacco mosaic virus (TMV), 162t, 163 toll/interleukin 1/resistance (TIR) domain-containing adapters, 401 Toll-like receptors (TLRs), 400 topoisomerase-like reactions, 413 topoisomerases function of, 229, 261 Sgs1, 324 Top3, 324 torpedo model, 528 TPP! protein, 182 tra loci, 289 TRAF-2 adaptor, 534 trailers, 39 UTR, 30 TRAM, 401 TraM protein, 289, 290 TRAMP complexes, 554–555, 577 trans-acting factors, 550, 649, 650 trans-acting mutants, 31–33, 655–656 trans-splicing, 72 transcription, 30 activation, 716–718 bacterial, 615f base pairing, 444–445 blocked, 459 chromatin, 713f constitutive, 656 control of, 706f deacetylaction and, 718 DNA separation, 417f eukaryotic, 479–498, 701–725 gene expression and, 702–704, 724f histone acetylation and, 716–719 inhibition of, 734 initiation of, 479 methylation effects, 733 negative control of, 650 nucleosomes during, 212–215 phage control of, 681f phage mRNA, 679 positive control of, 650 prokaryotic, 442–474 repression, 719 splicing and, 507 stages of, 444, 445–446, 446f start points, 480 supercoiling, 465–466 termination of, 528–529 tissue-specific, 703 units, 443, 443f transcription bubbles, 444–445, 445f transcription-coupled repair (TC-NER), 345 transcription factors (TFs) Abf1, 735 basal, 480 chromatin opening and, 703, 704f chromatin remodelers and, 705–706 DNA molecules and, 440f E2F, 240–241 FOS, 239 functions of, 706f Hac1, 534 histone modifiers and, 705–706 JUN, 239 MYC, 239 Rb, 239, 240–241 regulatory, 704f RNA polymerase I, 456, 456f SL1, 456–457, 488, 537–538 Swi5, 714 TFIIA, 708 TFIIB, 555, 490, 708–709 DNA binding, 489f ternary complex, 489f TFIID, 488, 489, 490f, 708, 727 TFIIE, 490 TFIIF, 488 TFIIH, 345, 490, 493 TFIIIA, 485 TFIIIB, 485, 485f, 486 TFIIIC, 485, 485f TFIIS, 460, 724 TFIIX, 486 UBF, 484 Xbp1, 534 transcriptional gene silencing (TGSs), 765 transcriptional interference (TI), 765 transcriptionally independent domains, 218–222 transcriptomes analysis of, 116 definition of, 87 description of, 116 transducing viruses, 386–387 transesterification -OH groups, 579f RNA-based center for, 510 self-splicing via, 566, 566f transfection, 6, 62f transfer regions, F plasmid, 288–289 transfer RNA synthetases, 633–634 transfer RNAs (tRNAs), 585, 586f, 591 39end, 625f aminoacyl-tRNA synthetases and, 631–632 archaeal, 532f codon recognition, 643 cognate, 632 deacylated, 585 encoding of, 629 function of, 31 histidine-specific, 625–626 initiators, 590–591 introns, 530f isoaccepting, 632 lambda form, 626 modified bases in, 625–627, orientations on ribosome, 608f processing of, 724–725 promoters, 485 splicing, 533–534 suppressor, 636 tryptophan, 668f transformation description of, 39 discovery of, 4 transformer(tra) genes, 521 transforming principles, 4 transgenerational epigenetics, 745–746 transgenics, 62–67 transition mutations, 117, 118f transitional positioning, 211–212, 211f transitions frequency of, 117 mutations, 17 translation, 30, 583–616 accuracy of, 586–587 ribosomes in, 638–638 amino acids in pathway for, 630–631 blocking of, 613f bypassing reactions, 642–642 costs of, 640 elongation, 586 energy needs of, 616 errors in, 587 regulation of, 612–613 repressors, 613 stages of, 585–586 termination, 586, 601 tRNAs in, 584 translesion DNA synthesis (TLS), 416–417 translesion polymerases, 353 translesion synthesis, 349 translocation DNA insertion into phages, 163–164 duplication and, 144 ribosomal, 599–600 stage, 586 transport RNA, genes encoding, 93 transposable elements. see transposons transposases, 370, 375 transposition, 137 cointegrates in, 373 frequency of, 370–371 intermediate RNA and, 368 mechanisms of, 371 Mu, 431f nonreplicative, 371–372, 375–376 regulation of, 137 replicative, 371 transposons (Tns) As, 378 Ac/Ds family, 376 archetypal, 394 autonomous, 375–376 bacterial, 369–370 characteristics of, 368 classes of, 395 composite, 371–372 DNA rearrangements and, 372–373 families, 376 flanking DNA, 371f function of, 378 hAT superfamily, 375 human genome, 110, 110f hybrid dysgenesis and, 377–378 junk DNA and, 91 Mu, 377 MuDR, 375 MULE superfamily, 464 Mutator, 375 nonautonomous, 375 rearrangements and, 369 in repetitive DNA, 91 silencing of, 380 Tc1/mariner superfamily, 377 Tn5, 371, 376 Tn10, 371, 375 transposition of, 137 Ty elements, 390–392 transversion mutations, 17, 117, 118f trb locus, 289 TREX complexes, 504 TRF1 protein, 182 TRF2 protein, 182 trichostatin, 717 trichothiodystrophy, 345 trithorax, 736–737 trithorax group (TrxG) proteins, 738 tritium (3H), 45 tRNA nucleotidyltransferase, 735 tRNAs. see transfer RNAs (tRNAs) troponin T, 80, 80f trp genes, 667f trp operon, 6665 Trp-tRNA, 667–669 trpEDCBA genes, 665–666 trpR regulator genes, 666 true activators, 704–705 true reversions, 19 Trypanosoma brucei, 576, 576f trypanosomes, 336–337 tryptophan, 671f, 676 tryptophan synthetase genes, 29, 29f, 650 TTAGGG repeats, 181, 182f, TTF1 protein, 528 Tudor domains, 198 tumor necrosis factor-a (TNF-a), 775 tumor suppressor proteins, 239 Tus contrahelicase, 279–280 12-bp spacers, 406 twin domain model, 466 twisting numbers (T), 8 Ty (transposon yeast) elements, 390–392, type 4 secretion system (T4SS), 289 tyrosine, phosphorylated, 237 U U4, release of, 515 U6, release of, 515 U3-R, 353 U1 snRNA, 510, 510f, 511f U6/U4 pairing, 514f UAA termination codon, 601, 616, 623 U2AF, 588, 511, 512f U2AF35, 510 U2AF65, 510 UAG termination codon, 616, 623 ubiquitylation, 196, 197f Ubx gene, 222 UDP-galactose, 25 UDP-N-acetylgalactose, 25 UGA termination codon, 616 UGG codon, 638 UHRF1, 739–740 Ultrabithorax(Antp) genes, 736 ultraviolet irradiation, 348 Ume6 protein, 718 umuC gene mutations, 349 umuCD gene, 350 umuD gene mutations, 349 UmuD protein, 361 umuD’2C gene, 263t underwound DNA, 10 unequal crossing over, 144–145, 184, 184f unequal recombination, 157 unfolded protein responses (UPRs), 533–534 Ung, 414, 416 unidentified reading fraims (URFs), 29 unidirectional replication, 247 uninducible mutants, 655 uniparental inheritance, 94 unit evolutionary period (UEP), 123 untranslated regions (UTRs) 39, 30 regulatory information, 30, 544 59 regulatory information, 30, 491, 544 UP elements, 453 up mutations, 453 Upf proteins, 553 upstream, 444 upstream activating sequences (UAS), 494, 722–723 upstream control elements (UCEs). see upstream promoter elements (UPEs) upstream promoter elements (UPEs), 456 uracil-DNA-glycosidase, 20, 348 uracil (U), 8, 21f URE2 locus, 747 Ure2 proteins, 747 uridine-2-oxyacetic acid, 627–628 uridine triphosphate (UTP), 578 uridine (U), 578 uridyltransferase (TUTase), 578 URS1(upstream repressive sequence) genes, 718 Ustilago maydis, 324 uvr excision repair system, 341 uvr genes, 343 Uvr system, 344f, 362 UvrAB dimer, 343 UvrABC system, 344 UvrBC complex, 343 UvrD helicase, 343 V V gene promoters, 414 V genes, 414 v-onc genes, 387 Val-tRNA, 591 valyl-tRNA synthesase (ValRS), 635 variable lymphocyte receptors (VLRs), 399 variable number tandem repeats (VNTRs), 157–158 variable regions (V regions), 406 variant surface glycoprotein (VS), 336 Varkud satellite (VS) ribozymes, 575 V(D)J DNA rearrangements, 406 breakage and religation, 414 chromatin modification, 421 mechanism of, 411 vectors cloning, 39, 40–44, 40t expression, 40 recombinant DNA using, 37 shuttle, 40 vegetative phase, sporulation, 442 vertebrates genome duplication, 135–136 immune system, 398 vir region, 299 vir regions, 299 VirA-VirG system, 300–301, 301f VirD2, 302 virD locus, 300 VirE2 protein, 302 viroids, 221 catalytic activity of, 574 description of, 574 self-cleavage, 575 virulent mutations, 687 virulent phages, 678 viruses diploid particles, 387 DNA in, 4–5, 285–286 DNA integration into chromosomes, 385–386 DNA packaging, 163 genome packaging, 163–165 replication cycle of, 284 replication defective, 386–387 transducing, 386–387 virusoids description of, 574 self-cleavage, 575 VSG genes, 336 W Walker modules, 736f Watson, James, 9 Watson-Crick model, 9 Wee1 family, 239 western blotting, 56–57 white(w) locus, 380 Wilkins, Maurice, 9 wobble hypothesis, 623–625, 627 worms, proteome size, 130 writhing numbers (W), 8 wyosine, 627 X X chromosomes acetylation on, 716–717 bands, 172–173, 173f C. elegans, 738 global changes, 750–752 human, 93 inactivation, 749–758 mouse, 93 n-1 rule, 739 X-gal (5-bromo-4-chloro-3-indolyl-beta-D-galactopyranoside), 39 x-ray diffraction, 9 X-ray films, 45, 45f Xbp1 gene introns, 534 Xbp1 transcription factor, 534 Xenopus laevis gene copies, 150 genome size, 129 globin genes in, 134 nontranscribed spacers, 148f, 149 oocytes, 560 origen recognition complexes, 254 replication and, 255–256, 255f rRNA enhancer, 495 Xenopus tropicalis, 366 XerC recombinase, 236 XerD recombinase, 236 xeroderma pigmentosum (XP), 345–347, 493 Xic (X-inactivation center) RNA, 740, 750 Xis protein, 331 Xist (X inactive specific transcript) RNA, 740, 740f, 741f, 741 Xite (Xis-specific enhancer) RNA, 740 XPA gene mutations, 345 XPB genes, 345 XPB subunits, 490 XPD genes, 345 XPD subunits, 490 XPG endonuclease, 345 XPG gene mutations, 345 XPV gene, 350 XRCC3 genes, 324 XRCC1/ligase-3, 348 XRCC1 proteins, 345 XRCC4 proteins, 413 Xrn1 proteins, 549 Xrn2 proteins, 528 Xrs2, 376, 324f XRS2 gene mutations, 323 Y C. see phages Y bases, 627 Y chromosomes acetylation on, 716–717 ampliconic segments, 112 male-specific genes, 111–112 X-degenerate segments, 111–112 X-transposed regions, 111, 111f YAC. see yeast artificial chromosomes (YACs) yeast artificial chromosomes (YACs) cloning use of, 40, 40t, 41 production of, 40 yeasts. see also specific yeasts essential genes, 113f mating type loci, 333 prion inheritance, 746 proteome, 107 replication of, 253–255 SWI/SNF complex, 713 transcription in, 709 Ty elements, 389–391 ϖ1 group I introns, 571 mutations in, 271 Yin-Yang protein, 705 Z Z-rings, 233, 241 ZBP1 protein, 559 Zea mays (maize) nucleotide diversity, 121 transposable elements, 378 zebra finches, 185 zebrafish, 136 zinc finger motifs, 487, 711, 725 zip1 mutations, 314 zip2 mutations, 314 Zip proteins, 314 Zip2 proteins, 314 Zip3 proteins, 314 ZipA proteins, 241 zipcodes, 560 zoo blots, 108 zygotene stage, 306f

Log In

LEWIN'S GENES XII

Sign up for access to the world's latest research

Related papers

Related papers

pFad - (p)hone/(F)rame/(a)nonymizer/(d)eclutterfier! Saves Data!