Unit 1


Genome Informatics

UNIT-1

Whole-Genome Sequencing Methods


Sequencing technologies are unable to sequence the entire human genome at once. Thus,
the genome must be broken into smaller chunks of DNA, sequenced and then put back
together in the correct order using bioinformatics approaches. There are several methods
of DNA sequencing, including clone-by-clone and whole-genome shotgun methods.
Clone-by-clone

In this method, smaller sections of the genome are copied and inserted into bacteria.
The bacteria can then be grown to produce identical copies, or “clones,” each containing
approximately 150,000 base pairs of the genome to be sequenced. The inserted DNA in each
clone is then broken down further into smaller, overlapping chunks of about 500 base
pairs. These smaller inserts are sequenced. After sequencing, the overlapping portions
are used to reassemble the clone. This approach was used to sequence the first human
genome using Sanger sequencing. It is time-consuming and costly, but it is reliable.

Whole-genome shotgun

As the name implies, “shotgun” sequencing is a method that breaks DNA into small random
pieces for sequencing and reassembly. The pieces of DNA are also cloned into bacteria for
growth, isolation and subsequent sequencing. Because the pieces are random, there are
overlapping sequences that aid in reassembly into the original DNA order. This approach
was originally used in Sanger sequencing but is now also used in next-generation
sequencing methods providing rapid genome sequencing with lower costs. It is only good
for shorter “reads” (i.e., sequencing of shorter DNA fragments that are then put back together
again). Because it is reassembled based on overlapping regions and has shorter read lengths,
it is best utilized when a reference genome is available, and it requires sophisticated
computational approaches to reassemble the sequence. It also can be challenging for
genomes with many repetitive regions.

Assembly of sequencing reads


Because genomes are sequenced in varying lengths of DNA fragments, the resulting
sequences must be put back together. This is referred to as “assembly,” or “reassembly.”
Two common approaches are de novo assembly and assembly by reference mapping.

De novo assembly is performed by identifying overlapping regions in the DNA sequences,
aligning the sequences and putting them back together to form the genome. This is done
without any reference sequence for comparison. Mapping to a reference genome instead
aligns the new sequencing data to an existing genome that serves as a comparator.
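
The overlap-and-merge idea behind de novo assembly can be illustrated with a minimal greedy sketch. This is a toy under stated assumptions, not how production assemblers work: reads are error-free, come from one strand, and no read is fully contained in another. The function names `overlap` and `greedy_assemble` and the `min_len` threshold are illustrative choices, not part of any particular tool.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that equals a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of sequences with the longest overlap.

    Stops when a single contig remains or no overlap of at least
    `min_len` bases is left. Returns the list of remaining contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
    return contigs


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))
```

Repetitive regions defeat this simple strategy, because several reads can share the same overlap and the greedy choice may join the wrong pair.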

Although de novo assembly can be challenging, this approach is the only one available for
sequencing new organisms. Additionally, de novo assembly produces results with less
bias than mapping to a reference genome. Mapping to a reference genome is easier and
requires fewer contiguous reads, but new or unexpected sequences can be lost. The sequence
results obtained by this method are only as good as the reference genome chosen; however,
it can provide better identification of single nucleotide polymorphisms (SNPs). Multiple
institutions and genomic sequencing companies have invested considerable time and effort
into creating improved reference genomes. Single nucleotide polymorphisms are known to
vary by race and ethnicity; thus, multiple reference genomes have been created for various
races/ethnicities.

Examples of next-generation sequencing platforms

Several companies focus on development and marketing of next-generation sequencing
machines (often referred to as “platforms”) for use in whole-genome (and other)
sequencing. Illumina is considered by many as the leader because of the number of users
that utilize its systems. Illumina has multiple platforms depending on the need. The
Illumina HiSeq is one of the more common sequencers found in laboratories, including
major research institutions, companies providing next-generation sequencing services for
clinics and labs, and pathology laboratories. It has a high throughput, capable of sequencing
many genomes rapidly with reasonable costs. This instrument also can be used to look at
copy number variation, as well as mutations and other alterations, and RNA expression
levels to do transcriptomics. Because of the popularity in the clinic of targeted sequencing
panels, which are much smaller with clinics requiring faster turnaround times for treatment
of patients, Illumina created the MiSeq, which can provide same-day sequencing results for
very small panels. Illumina also produced multiple variations to provide sequencers for
each disease area optimizing output, turnaround time and costs for specific use cases.

Thermo Fisher Scientific’s Ion Torrent or Ion Proton uses a completely different technology
based on detection of pH differences and was once expected to provide better utility for
clinical applications because it was easier to use, cost less and provided faster turnaround
time. However, Illumina countered with new machines to fit these needs. Consequently,
both are found in research and clinical laboratories.
Other technologies developed recently use different novel approaches. A few examples are
provided below.

Oxford Nanopore Technologies introduced the MinION, which enables anyone to sequence
on a desktop computer using a USB device. The DNA is passed through a protein nanopore
membrane for sequencing and detection by creation of an ionic current that varies based on
the nucleotide.

Pacific Biosciences introduced its single molecule, real-time technology with the longest
reads to date, with average read lengths of more than 10,000 base pairs compared with reads
on the order of 150 base pairs from short-read platforms. Single molecule, real-time technology uses a chip with single DNA
molecules attached. Zero-mode waveguide technology enables isolation of a single
nucleotide for the DNA polymerase to add fluorescent labels for detection of each base.
The error rate of this instrument is still higher than some of the prior technologies, but a lot
of interest has been generated, and there is hope that speed and costs can be further
optimized with the new approach.

Coverage breadth and depth

Coverage refers to the number of reads that show a specific nucleotide in the reconstructed
DNA sequence. A read is a string of A, T, C, G bases that correspond to the reference DNA.
There are millions of reads in a sequencing run. Increased coverage depth results in
increased confidence in variant identification.

For the human genome, a 10- to 30-times coverage depth is acceptable for detecting
mutations, SNPs and rearrangements. A next-generation sequencing approach that provides
a coverage depth of 30 times is considered to have high coverage. However, for a fixed
amount of sequencing output, coverage breadth decreases as coverage depth increases (Figure 1).
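
As a rough illustration of the two quantities, the sketch below computes the expected depth from the number of reads, the read length and the genome size (depth = reads × read length / genome size), and then measures mean depth and breadth from a list of aligned read positions. The function names and the `(start, length)` representation of an alignment are assumptions made for this example only.

```python
def expected_depth(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is expected to be sequenced."""
    return num_reads * read_length / genome_size


def depth_and_breadth(alignments, genome_size: int):
    """Compute (mean depth, breadth) from aligned reads.

    `alignments` is a list of (start, length) pairs with 0-based starts.
    Breadth is the fraction of positions covered by at least one read.
    """
    per_base = [0] * genome_size
    for start, length in alignments:
        for pos in range(start, min(start + length, genome_size)):
            per_base[pos] += 1
    mean_depth = sum(per_base) / genome_size
    breadth = sum(1 for d in per_base if d > 0) / genome_size
    return mean_depth, breadth


# 600 million reads of 150 bp over a 3 Gb genome give 30-times depth.
print(expected_depth(600_000_000, 150, 3_000_000_000))

# Toy 10-base "genome" with three short reads.
print(depth_and_breadth([(0, 4), (2, 4), (8, 2)], 10))
```

In the toy case the mean depth is 1.0 but two positions are never covered, which is why depth and breadth are reported separately.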

Next-generation (second-generation) sequencing methods
Principle of Pyrosequencing
Step 1
A DNA segment is amplified and the strand to serve as the Pyrosequencing template is
biotinylated. After denaturation, the biotinylated single-stranded PCR amplicon is
isolated and allowed to hybridize with a sequencing primer (see figure Principle of
Pyrosequencing — steps 1–3).
Step 2
The hybridized primer and single-stranded template are incubated with the enzymes
DNA polymerase, ATP sulfurylase, luciferase, and apyrase, as well as the substrates
adenosine 5' phosphosulfate (APS) and luciferin (see figure Principle of
Pyrosequencing — steps 1–3).
Step 3
The first deoxyribonucleotide triphosphate (dNTP) is added to the reaction. DNA
polymerase catalyzes addition of the dNTP to the sequencing primer, if it is
complementary to the base in the template strand. Each incorporation event is
accompanied by the release of pyrophosphate (PPi) in a quantity equimolar to the
amount of incorporated nucleotide (see figure Principle of Pyrosequencing — steps 1–
3).

Figure: Principle of Pyrosequencing, steps 1–3.
Step 4
ATP sulfurylase converts PPi to ATP in the presence of adenosine 5' phosphosulfate
(APS). This ATP drives the luciferase-mediated conversion of luciferin to oxyluciferin
that generates visible light in amounts that are proportional to the amount of ATP. The
light produced in the luciferase-catalyzed reaction is detected by a charge coupled
device (CCD) camera and seen as a peak in the raw data output (Pyrogram). The height
of each peak (light signal) is proportional to the number of nucleotides incorporated
(see figure Principle of Pyrosequencing — step 4).

Figure: Principle of Pyrosequencing, step 4.
Step 5
Apyrase, a nucleotide-degrading enzyme, continuously degrades unincorporated
nucleotides and ATP. When degradation is complete, another nucleotide is added (see
figure Principle of Pyrosequencing — step 5).
Figure: Principle of Pyrosequencing, step 5.
Step 6
Addition of dNTPs is performed sequentially. It should be noted that deoxyadenosine
alpha-thio triphosphate (dATPαS) is used as a substitute for the natural deoxyadenosine
triphosphate (dATP) since it is efficiently used by the DNA polymerase, but not
recognized by the luciferase. As the process continues, the complementary DNA strand
is built up and the nucleotide sequence is determined from the signal peaks in the
Pyrogram trace (see figure Principle of Pyrosequencing — step 6).

Figure: Principle of Pyrosequencing, step 6.
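
The relationship between sequential nucleotide dispensation and peak height can be sketched with a small simulation. It assumes an idealized, noise-free reaction in which peak height equals exactly the number of nucleotides incorporated; the function names and the cyclic dispensation order "ACGT" are illustrative assumptions, not a description of any instrument's software.

```python
def simulate_pyrogram(synthesized: str, dispensation: str = "ACGT",
                      max_flows: int = 100):
    """Return a list of (nucleotide, peak_height) pairs.

    `synthesized` is the complementary strand being built, 5'->3'.
    Nucleotides are dispensed one at a time in cyclic order; the peak
    height is the number of consecutive bases incorporated in that flow
    (0 when the dispensed nucleotide does not match).
    """
    peaks = []
    pos = 0
    flow = 0
    while pos < len(synthesized) and flow < max_flows:
        nt = dispensation[flow % len(dispensation)]
        run = 0
        while pos < len(synthesized) and synthesized[pos] == nt:
            run += 1
            pos += 1
        peaks.append((nt, run))
        flow += 1
    return peaks


def decode_pyrogram(peaks) -> str:
    """Rebuild the sequence: each peak contributes `height` copies of its base."""
    return "".join(nt * round(height) for nt, height in peaks)


peaks = simulate_pyrogram("GGATC")
print(peaks)
print(decode_pyrogram(peaks))
```

The double-height peak for the two consecutive G bases shows why homopolymer runs are read from peak height rather than from separate peaks.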

Helicos (HeliScope) single-molecule sequencing:
The HeliScope™ Single Molecule Sequencer (Helicos BioSciences Corp., Cambridge, MA)
is truly a “DNA microscope” capable of sequencing billions of single DNA molecules at a
time. As part of the complete Helicos™ Genetic Analysis system, it is designed to deliver
unprecedented sequence data throughput and output, as well as direct analysis of native DNA
samples.

Single-molecule sequencing chemistry

Figure 1 - HeliScope Single Molecule Sequencer. (Reproduced with permission from Helicos.)

Figure 2 - tSMS process. a) Sample preparation: DNA samples are sheared into shorter
fragments, denatured to single strands, and tagged with a 3′ poly(A) tail and a terminal
fluorescent adenosine. DNA strands are then hybridized to the surface of a flow cell via poly(T)
capture sites, which also prime the sequencing reaction. b) Captured templates are imaged to
map their locations, and the fluorescent tags are removed. c) Sequencing-by-synthesis:
fluorescent nucleotides (C, G, T, or A) are added one base per cycle and incorporated into the
complementary strand in a template-dependent manner. Nonincorporated nucleotides are
washed away, and the strands are illuminated and imaged to determine base addition and DNA
sequence. The fluorescent tags are then cleaved, and the next base is added to continue the
cycle. (Reproduced with permission from Helicos.)
The HeliScope Single Molecule Sequencer (Figure 1) harnesses the power of proprietary True
Single Molecule Sequencing (tSMS)™ technology to perform single-molecule sequencing-by-
synthesis. DNA samples are fragmented and captured on a specialized surface within a flow
cell, where they then serve as single-molecule templates for sequencing-by-synthesis reactions
(Figure 2). Fluorescent nucleotides are added one base at a time, and the sequencer records
each incorporation event to determine the sequence of the individual DNA strands (Figure 3).

Figure 3 - Image series illustrating template-specific base incorporation. (Adapted with
permission from Harris, T.D., et al. Science 2008, 320, 106–9, and reproduced with permission
from AAAS.)
To enable this process, the company’s scientists and engineers developed a series of technical
and manufacturing advances. On the chemistry front, a high-fidelity DNA polymerase was
combined with proprietary Virtual Terminator™ fluorescent nucleotides, providing rapid and
accurate base-by-base synthesis. An imaging reagent was developed to enhance emission
intensity and improve fluorophore detection by an order of magnitude. In addition to enhancing
signal-to-noise, it also reduces photobleaching and fluorophore “blinking.” Together, these
specialized reagents ensure accurate and precise single-molecule sequencing chemistry.

Another challenge was to minimize even low-level background fluorescence. Since the
sequencer can detect the presence of single fluorophores, it is vital to reduce all nonspecific
emission sources, which could produce errors in the sequence. Engineering designs avoid the
use of materials that autofluoresce, and strive to minimize the adsorption of stray fluorescent
molecules. Reagents, including all enzymes, nucleotides, and buffers, are formulated to
eliminate impurities that less sensitive molecular biology applications may tolerate. Helicos
worked with its partners to develop quality control assays that ensure reagents meet these
standards of purity. The proper selection of single-molecule-grade reagent and surface
materials, together with diligent handling of flow cells during manufacture and storage, and
optimized surface rinsing protocols, collectively produce images that are nearly free of spurious
incorporation events.

Sequencer fluidics
At each step of the sequencing-by-synthesis process, the HeliScope Sequencer records the
incorporation of fluorescently labeled nucleotides in the growing DNA strands. The instrument
maximizes run efficiency by processing two 25-channel flow cells at once, performing strand
synthesis in one flow cell while simultaneously imaging the other (Figure 4).

Figure 4 - HeliScope Single Molecule Sequencer optics configuration. (Reproduced with
permission from Helicos.)
To ensure the integrity needed to perform single-molecule sequencing, reagents and flow cells
are prepackaged in ready-to-use kits that simplify handling and facilitate correct loading. The
packaging design incorporates seals at both the top and bottom of the bottles to provide reagent
access, protection against contamination, and venting for proper flow. The reagent storage
system on the sequencer has both refrigerated and ambient temperature locations to optimize
on-board stability of the various components. When the bar-coded reagent cartridge or bottle
is scanned, the sequencer’s graphical user interface verifies the appropriateness of the kit and
indicates the proper location for placement.

Reagent delivery is driven by the appropriate application script and supports both the chemistry
and imaging conditions on a flow cell. For each step in the sequencing-by-synthesis reaction,
the required reagents are metered, prepared, and delivered to the flow cell. The fluidics design
provides “just-in-time” mixing to tailor the exact conditions needed for each step and provide
flexibility for future applications. In addition, the sequencer accurately controls the temperature
in the flow cell for optimum reaction conditions. This high level of flexibility in the fluidics
architecture supports continuous improvement in performance and throughput, without the
need for expensive instrument upgrades as existing and new chemistries are introduced and
optimized.

Sequencer imaging
To accurately detect the incorporation of fluorescently labeled nucleotides into single strands
of DNA, two primary attributes are required. First, the imaging system must have adequate
resolution to discriminate individual strands tethered to the surface in nonordered locations.
Second, there must be sufficient signal-to-noise ratio to effectively “see” the fluorescent labels
and discriminate them from the background.

Ion Torrent: Proton / PGM sequencing

Unlike Illumina and 454, Ion Torrent and Ion Proton sequencing do not make use of optical
signals. Instead, they exploit the fact that addition of a dNTP to a DNA polymer releases an
H+ ion.

As in other kinds of NGS, the input DNA or RNA is fragmented, this time into pieces of
~200 bp. Adaptors are added and one molecule is placed onto a bead. The molecules are
amplified on the bead by emulsion PCR. Each bead is placed into a single well of a slide.

Polony sequencing

Polony sequencing is an inexpensive but highly accurate multiplex sequencing technique that
can be used to “read” millions of immobilized DNA sequences in parallel. This technique was
first developed by Dr. George Church's group at Harvard Medical School. Unlike other
sequencing techniques, Polony sequencing technology is an open platform with freely
downloadable, open source software and protocols. Also, the hardware can be easily set up
with a commonly available epifluorescence microscope and a computer-controlled
flowcell/fluidics system. Polony sequencing is generally performed on a paired-end tag
library in which each DNA template molecule is 135 bp in length, with two 17–18 bp paired
genomic tags separated and flanked by common sequences. The current read length of this
technique is 26 bases per amplicon and 13 bases per tag, leaving a gap of 4–5 bases in each
tag.


1. Workflow

An illustrated procedure for Polony sequencing.

https://handwiki.org/wiki/index.php?curid=1428381

The protocol of Polony sequencing can be broken into three main parts, which are the paired
end-tag library construction, template amplification and DNA sequencing.

1.1. Paired End-Tag Library Construction

This protocol begins by randomly shearing the genomic DNA under test into a tight size
distribution. The sheared DNA molecules are then subjected to end repair and A-tailing.
The end repair treatment converts any damaged or incompatible protruding ends of
DNA to 5’-phosphorylated, blunt-ended DNA, enabling immediate blunt-end ligation,
while the A-tailing treatment adds an A to the 3’ end of the sheared DNA. DNA molecules
with a length of 1 kb are selected by running them on a 6% TBE PAGE gel. In the next step, the
DNA molecules are circularized with a T-tailed, 30 bp long synthetic oligonucleotide (T30),
which contains two outward-facing MmeI recognition sites, and the resulting circularized DNA
undergoes rolling circle replication. The amplified circularized DNA molecules are then
digested with MmeI (a type IIs restriction endonuclease), which cuts at a distance from its
recognition site, releasing the T30 fragment flanked by 17–18 bp tags (≈70 bp in length). The
paired-tag molecules need to be end-repaired prior to the ligation of ePCR (emulsion PCR)
primer oligonucleotides (FDV2 and RDV2) to both ends. The resulting 135 bp library
molecules are size-selected and nick translated. Lastly, the 135 bp paired end-tag
library molecules are amplified by PCR to increase the amount of library material and
eliminate extraneous ligation products in a single step. The resulting DNA template consists
of a 44 bp FDV sequence, a 17–18 bp proximal tag, the T30 sequence, a 17–18 bp distal tag,
and a 25 bp RDV sequence.
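
As a quick consistency check on the segment lengths quoted above, the following sketch adds them up. The segment table and the helper name `template_length_range` are illustrative; the lengths themselves are the ones stated in the text, with each tag taken as 17 or 18 bp.

```python
# (segment name, minimum length in bp, maximum length in bp), as quoted in the text
SEGMENTS = [
    ("FDV", 44, 44),
    ("proximal tag", 17, 18),
    ("T30", 30, 30),
    ("distal tag", 17, 18),
    ("RDV", 25, 25),
]


def template_length_range(segments):
    """Return (shortest, longest) possible template length in bp."""
    shortest = sum(lo for _name, lo, _hi in segments)
    longest = sum(hi for _name, _lo, hi in segments)
    return shortest, longest


print(template_length_range(SEGMENTS))
```

With 18 bp tags the total is 135 bp, matching the library molecule size given in the text; with 17 bp tags it is 133 bp.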

1.2. Template Amplification

Emulsion PCR

The mono-sized, paramagnetic streptavidin-coated beads are pre-loaded with dual-biotin
forward primer. Streptavidin has a very strong affinity for biotin, so the forward primer
binds firmly to the surface of the beads. Next, an aqueous phase is prepared with the pre-loaded
beads, PCR mixture, forward and reverse primers, and the paired end-tag library. This is mixed
and vortexed with an oil phase to create the emulsion. Ideally, each droplet of water in the oil
emulsion contains one bead and one molecule of template DNA, permitting millions of
non-interacting amplifications within a milliliter-scale volume when PCR is performed.

Emulsion breaking

After amplification, the emulsion from the preceding step is broken using isopropanol and
detergent buffer (10 mM Tris pH 7.5, 1 mM EDTA pH 8.0, 100 mM NaCl, 1% (v/v) Triton X‐
100, 1% (w/v) SDS), followed by a series of vortexing, centrifuging, and magnetic
separation steps. The resulting solution is a suspension of empty, clonal and non-clonal
beads, which arise from emulsion droplets that initially contained zero, one or multiple DNA
template molecules, respectively. The amplified beads can be enriched in the following step.
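
The mix of empty, clonal and non-clonal beads can be estimated with a simple model. The sketch below assumes, as is common for this kind of random partitioning but not stated in the source, that the number of template molecules per bead-containing droplet follows a Poisson distribution with mean `mean_templates`; the function name and the example means are illustrative.

```python
import math


def droplet_fractions(mean_templates: float) -> dict:
    """Expected fractions of bead-containing droplets by template count.

    Assumes Poisson-distributed template molecules per droplet:
      empty      -> 0 templates
      clonal     -> exactly 1 template
      non_clonal -> 2 or more templates
    """
    p_empty = math.exp(-mean_templates)
    p_clonal = mean_templates * p_empty
    p_non_clonal = 1.0 - p_empty - p_clonal
    return {"empty": p_empty, "clonal": p_clonal, "non_clonal": p_non_clonal}


for lam in (0.1, 0.5, 1.0):
    f = droplet_fractions(lam)
    print(lam, {k: round(v, 3) for k, v in f.items()})
```

Under this assumption, diluting the library reduces non-clonal beads but leaves most beads empty, which is one way to see why an enrichment step for amplified beads is useful.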

Bead enrichment

The enrichment of amplified beads is achieved through hybridization to larger, low-density,
non-magnetic polystyrene beads that are pre-loaded with biotinylated capture oligonucleotides
(a DNA sequence that is complementary to the ePCR amplicon sequence). The mixture is then
centrifuged to separate the complexes of amplified and capture beads from the unamplified
beads. The amplified bead/capture bead complex has a lower density and thus remains in the
supernatant, while the unamplified beads form a pellet. The supernatant is recovered and treated
with NaOH, which breaks the complex. The paramagnetic amplified beads are separated
from the non-magnetic capture beads by magnetic separation. This enrichment protocol
can enrich amplified beads about fivefold.

Bead capping
The purpose of bead capping is to attach a “capping” oligonucleotide to the 3’ end of both
unextended forward ePCR primers and the RDV segment of template DNA. The cap used
is an amino group that prevents fluorescent probes from ligating to these ends and, at the
same time, helps the subsequent coupling of template DNA to the aminosilanated flow cell
coverslip.

Coverslip arraying

First, the coverslips are washed and aminosilane-treated, which enables the subsequent covalent
coupling of template DNA to them and eliminates any fluorescent contamination. The amplified,
enriched beads are mixed with acrylamide and poured into a shallow mold formed by a
Teflon-masked microscope slide. The aminosilane-treated coverslip is immediately placed on top
of the acrylamide gel, which is allowed to polymerize for 45 minutes. Next, the slide/coverslip
stack is inverted and the microscope slide is removed from the gel. The silane-treated coverslip
binds covalently to the gel, while the Teflon on the surface of the microscope slide makes it
easier to remove the slide from the acrylamide gel. The coverslip is then bonded to the flow
cell body, and any unattached beads are removed.

1.3. DNA Sequencing

The biochemistry of Polony sequencing mainly relies on the discriminatory capacities of
ligases and polymerases. First, a series of anchor primers are flowed through the cells and
hybridize to the synthetic oligonucleotide sequences at the immediate 3’ or 5’ end of the 17–18
bp proximal or distal genomic DNA tags. Next, an enzymatic ligation reaction is performed
that joins the anchor primer to a population of degenerate nonamers labeled with
fluorescent dyes.
Differentially labeled nonamers:

5' Cy5‐NNNNNNNNT
5' Cy3‐NNNNNNNNA
5' TexasRed‐NNNNNNNNC
5' 6FAM‐NNNNNNNNG

The fluorophore-tagged nonamers anneal with differential success to the tag sequences
according to a strategy similar to that of degenerate primers, but instead of being extended by
polymerases, the nonamers are selectively ligated onto the adjoining DNA, the anchor primer. The
fixation of the fluorophore molecule provides a fluorescent signal that indicates whether there
is an A, C, G, or T at the query position on the genomic DNA tag. After four-colour imaging,
the anchor primer/nonamer complexes are stripped off and a new cycle is begun by replacing
the anchor primer. A new mixture of the fluorescently tagged nonamers is introduced, for
which the query position is shifted one base further into the genomic DNA tag.

5' Cy5‐NNNNNNNTN
5' Cy3‐NNNNNNNAN
5' TexasRed‐NNNNNNNCN
5' 6FAM‐NNNNNNNGN

Seven bases from the 5’ to 3’ direction and six bases from the 3’ end could be queried in this
fashion. The ultimate result is a read length of 26 bases per run (13 bases from each of the
paired tags) with a 4-base to 5-base gap in the middle of each tag.
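
The colour-to-base logic and the read-length arithmetic above can be sketched as follows. The dye assignments come from the nonamer lists in the text; the function names are illustrative. Note one assumption: the code reports the base carried by the ligated nonamer at the query position, and the tag base it pairs with is its complement. Which of the two a real pipeline reports depends on the strand convention, which the text does not specify.

```python
# Dye -> base present at the nonamer's query position (from the lists in the text)
DYE_TO_NONAMER_BASE = {"Cy5": "T", "Cy3": "A", "TexasRed": "C", "6FAM": "G"}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def decode_cycles(dye_calls):
    """Translate one dye call per ligation cycle into nonamer bases.

    `dye_calls` must already be ordered by query position.
    """
    return "".join(DYE_TO_NONAMER_BASE[dye] for dye in dye_calls)


def complement(seq: str) -> str:
    """Base-by-base complement (no reversal)."""
    return "".join(COMPLEMENT[b] for b in seq)


# Read-length bookkeeping from the text: 7 + 6 bases per tag, two tags per amplicon.
BASES_PER_TAG = 7 + 6
BASES_PER_AMPLICON = 2 * BASES_PER_TAG
GAP_RANGE = (17 - BASES_PER_TAG, 18 - BASES_PER_TAG)

calls = ["Cy5", "Cy3", "TexasRed", "6FAM"]
print(decode_cycles(calls), complement(decode_cycles(calls)))
print(BASES_PER_TAG, BASES_PER_AMPLICON, GAP_RANGE)
```

The gap of 4 to 5 bases is simply the 17–18 bp tag length minus the 13 bases that can be queried from its two ends.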
2. Analysis and Software

Polony sequencing generates millions of 26-base reads per run, and this information needs to
be normalized and converted to sequence. This can be done with software developed by the
Church Lab. All of the software is free and can be downloaded from the
website.[1]

3. Instruments

The sequencing instrument used in this technique can be set up from a commonly available
fluorescence microscope and a computer-controlled flowcell. The required instruments cost
around US$130,000 in 2005. A dedicated polony sequencing machine, the Polonator, was
developed in 2009 and sold at US$170,000 by Dover.[2][3] It featured open source software,
reagents, and protocols and was intended for use in the Personal Genome Project. [4]

4. Strength and Weaknesses

Polony sequencing allows for high throughput and high consensus accuracy in DNA
sequencing using commonly available, inexpensive instruments. It is also a very flexible
technique with a variety of applications, including BAC (bacterial artificial chromosome)
and bacterial genome resequencing, as well as SAGE (serial analysis of gene expression) tag
and barcode sequencing. Furthermore, polony sequencing is promoted as an
open system in which everything is shared, including the software that has been developed,
the protocols and the reagents.

However, although raw data acquisition can reach 786 gigabits, only about 1 bit of
information out of every 10,000 bits collected is useful. Another challenge of this technique
is the uniformity of the relative amplification of individual targets. Non-uniform
amplification can lower the efficiency of sequencing and is regarded as the biggest obstacle
for this technique.

Third-generation sequencing

PacBio's SMRT (single molecule real time) sequencing is one of the most commonly used
third-generation sequencing technologies. Compared with the previous two
generations, PacBio long-read sequencing enabled by SMRT Sequencing technology requires
no PCR amplification and the read length is 100 times longer than that of NGS.
PacBio SMRT sequencing applications
PacBio SMRT sequencing can be used for genomic de novo sequencing to get high quality
genome sequences, obtaining full transcriptome information and detecting alternative splicing
isoforms, diverse mutations in target regions, and epigenetic modifications and more.
The principle of PacBio SMRT sequencing
Zero-mode waveguides (ZMWs), subwavelength optical nanostructures fabricated in a thin
metallic film, are powerful analytical tools that are capable of confining an excitation volume
to the range of attoliters, which allows individual molecules to be isolated for optical analysis
at physiologically relevant concentrations of fluorescently labeled biomolecules. Arrays of
such nanostructures can also be engineered into systems for real-time analysis of a mass of
single-molecule reactions or binding events, which is the principle of PacBio SMRT
sequencing.

Figure 2. A single SMRT Cell. Each SMRT Cell contains 150,000 ZMWs. Approximately
35,000-75,000 of these wells produce a read in a run lasting 0.5-4 h, resulting in 0.5-1 Gb
of sequence.
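
The figures in this caption imply a typical read length and a fraction of productive wells, which the short calculation below makes explicit. Pairing the low yield with the low read count (and high with high) is an assumption made here for illustration; the caption only gives the two ranges. The function names are hypothetical.

```python
ZMWS_PER_CELL = 150_000  # from the caption


def productive_fraction(productive_zmws: int, total_zmws: int = ZMWS_PER_CELL) -> float:
    """Fraction of ZMWs that yield a read."""
    return productive_zmws / total_zmws


def implied_mean_read_length(yield_bp: float, productive_zmws: int) -> float:
    """Mean read length in bp implied by a total yield and a read count."""
    return yield_bp / productive_zmws


# Assumed pairing: (0.5 Gb, 35,000 reads) and (1 Gb, 75,000 reads)
for yield_bp, reads in ((0.5e9, 35_000), (1.0e9, 75_000)):
    print(reads,
          round(productive_fraction(reads), 3),
          round(implied_mean_read_length(yield_bp, reads)))
```

Under that pairing the implied mean read length is on the order of 13,000 to 14,000 bp, in line with the read lengths of more than 10,000 base pairs quoted earlier in this unit.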
PacBio SMRT Sequencing uses the innovation of ZMW to distinguish the ideal fluorescent
signal from the strong fluorescent backgrounds caused by unincorporated free-floating
nucleotides. A complex of DNA polymerase bound to the template DNA strand is anchored to
the bottom glass surface of a ZMW. Laser light enters through the bottom surface of a ZMW
but does not fully penetrate it, since the ZMW dimensions are smaller than the wavelength
of the light. This allows selective excitation and detection of light emitted from
nucleotides recruited for base elongation.
Library construction
The workflow for library construction involves the following steps:
• Determine the quality of genomic DNA (gDNA)
• Shear gDNA using a g-TUBE (Covaris)
• Select size and adjust concentration
• Repair DNA damage and ends of fragmented DNA
• Conduct DNA purification
• Blunt-end ligation using blunt adapters
• Purify template for submission to a sequencer
The template, called a SMRTbell, is a closed single-stranded circular DNA, which is created
by ligating hairpin adapters to both ends of target double-stranded DNA (dsDNA) molecules.
Figure 1. Template Preparation Workflow for PacBio RS II system.
Sequencing
As shown in Figure 3, a SMRTbell (grey) diffuses into a ZMW, and the adaptor binds to a
polymerase immobilized at the bottom. Each of the four nucleotide types is labeled with a
different fluorescent dye (indicated in red, yellow, green, and blue for G, C, T, and A,
respectively) so that they have distinct emission spectra. As a nucleotide is held in the
detection volume by the polymerase, a light pulse that identifies the base is produced.
(1) A fluorescently labeled nucleotide binds to the template in the active site of the
polymerase. (2) The fluorescence output of the color corresponding to the incorporated base
(yellow for base C in the example shown here) is elevated. (3) The dye-linker-pyrophosphate
product is cleaved from the nucleotide and diffuses out of the ZMW, ending the fluorescence
pulse. (4) The polymerase translocates to the next position. (5) The next nucleotide binds to
the template in the active site of the polymerase and initiates the next fluorescence pulse,
which corresponds to base A here.
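
The pulse-to-base step described above can be sketched with a toy trace. The colour assignments are those given in the text; everything else is an assumption for illustration: the trace is idealized as one entry per camera frame, holding either a colour name or `None` for baseline, and each uninterrupted run of one colour counts as a single incorporation.

```python
from itertools import groupby

# Colour -> base, as stated in the text (red G, yellow C, green T, blue A)
COLOR_TO_BASE = {"red": "G", "yellow": "C", "green": "T", "blue": "A"}


def call_bases(frames) -> str:
    """Turn a per-frame colour trace into a base sequence.

    Consecutive frames with the same colour form one pulse (one base).
    Baseline frames (None) separate pulses, so two pulses of the same
    colour with a gap between them are called as two bases.
    """
    bases = []
    for color, _run in groupby(frames):
        if color is not None:
            bases.append(COLOR_TO_BASE[color])
    return "".join(bases)


trace = [None, "yellow", "yellow", None, "blue", "blue", "blue", None, "blue", None]
print(call_bases(trace))
```

Real traces are noisy intensity measurements, so actual pulse detection also has to decide where a pulse starts and ends; this sketch skips that step entirely.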
Figure 3. Sequencing via light pulses.
Bioinformatics Analysis
Supported bioinformatics analyses include de novo assembly, reference genome mapping, genome
annotation (pathogenic and susceptibility gene prediction, non-coding RNA prediction,
CRISPR prediction), gene function annotation (COG/GO/KEGG), SNP/InDel
identification, comparative genomics analysis, evolutionary analysis and estimation of
divergence time.
A comparison of RS II and Sequel sequencing platform
Third-generation sequencing has been widely used in genome research since the successful
launch of commercial sequencing instrument PacBio RS II in 2013. After continuous
improvement and upgrading, PacBio launched its new and upgraded third-generation
sequencer PacBio Sequel sequencing system in October 2015. A comparison of RS II and
Sequel sequencing platform is outlined below.

Oxford Nanopore Technologies (ONT) developed third-generation sequencers that are portable,
able to sequence DNA in remote locations and capable of producing ultra-long reads.

How does ONT work?


• ONT is based on a completely different principle than PacBio and Illumina technologies as it
doesn’t use any DNA polymerases in its instruments.
• ONT uses another type of naturally occurring protein, the pore-forming protein α-hemolysin.
These barrel-shaped proteins are typically found embedded within cell membranes and
regulate which molecules can enter or leave the cell.
• α-hemolysin happens to have an inner diameter of 1 nm, just large enough to allow a single
strand of DNA through.

Nanopore proteins are embedded into an artificial membrane inside the sequencing flow cell.
A voltage is applied across the membrane, driving the negatively charged DNA strands
through the nanopores. The obstruction of a nanopore by a DNA fragment leads to a
change in the ionic current, which is measured continuously by an electronics chip integrated
within the flow cell. A base-calling algorithm converts the current variation back into the
original DNA sequence.
Image credit: Genome Research Limited.

How is ONT sequencing controlled?

• ONT flow cells are controlled by an electronic chip that sits just under an artificial
membrane containing hundreds or thousands of these protein nanopores.
• A voltage is set across the membrane, pulling negatively charged DNA molecules through
the nanopores, where they partially obstruct the current across the membrane.
• Because the 4 bases of DNA have different shapes and sizes, they lead to different current
variations when moving through a nanopore.
• These current variations make up ONT’s raw data.
• A base-calling algorithm then converts the current variation traces into a sequence, each
intensity step corresponding to the multiple bases obstructing the pore at a certain time.
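
The idea that each current step reflects the several bases sitting in the pore can be sketched with a toy model. Everything in this example is an assumption made for illustration: the pore is pretended to sense exactly 2 bases at a time, the current table LEVELS is invented, and the signal is assumed to be already segmented into one value per step. Production base-callers are far more sophisticated and do not work from a simple lookup table like this.

```python
from itertools import product

K = 2  # assumed number of bases sensed by the pore at once (toy value)
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
# Hypothetical current level for each k-mer; real values are not like this.
LEVELS = {kmer: 50.0 + 5.0 * i for i, kmer in enumerate(KMERS)}


def simulate_steps(sequence):
    """Return one idealised current step per k-mer as the strand advances."""
    return [LEVELS[sequence[i:i + K]] for i in range(len(sequence) - K + 1)]


def call_bases(steps):
    """Map each step to the nearest k-mer level and stitch the k-mers together."""
    if not steps:
        return ""
    kmers = [min(LEVELS, key=lambda k: abs(LEVELS[k] - step)) for step in steps]
    sequence = kmers[0]
    for kmer in kmers[1:]:
        # Consecutive k-mers overlap by K - 1 bases, so only the last base is new.
        sequence += kmer[-1]
    return sequence


if __name__ == "__main__":
    noisy = [level + 1.0 for level in simulate_steps("GATTACA")]
    print(call_bases(noisy))
```

The stitching step relies on the assumption that the strand moves exactly one base per step; in real data the strand speed varies, which is one reason base-calling is hard.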
Why use ONT?

Despite having a lower accuracy than some of the other technologies, ONT offers several
unique advantages.

• First of all, as long as a DNA sample is handled carefully, reads can reach several millions of
base pairs. These ultra-long reads significantly decrease the number of jigsaw puzzle pieces
when assembling the sequencing data into a genome.
• Depending on the model, the size of an ONT instrument ranges between that of a mobile
phone and that of a microwave. For comparison, Illumina's and PacBio's largest sequencers
are similar in size to large fridge-freezer units. Why is this an advantage?
• Large sequencers offer greater throughput, but a small portable device requiring
nothing more than a powerful laptop can be key to performing experiments outside of
the classical academic laboratories.
• ONT’s smallest sequencer has been used in low-income countries, notably during
the West African Ebola epidemic that began in 2014, as well as in Antarctica and even
on board the International Space Station!
NGS file formats
There are many file formats related to NGS analyses. The most common ones are:
Sequence file formats
The different sequence-related formats hold different information about the sequence. The
most common file formats in the NGS world are fastq and sff.

SFF
The SFF (Standard Flowgram Format) files are the 454 equivalent to the ABI chromatogram
files. They hold information about:

• the flowgram,
• the called sequence,
• the qualities,
• and the recommended quality and adaptor clipping.

These recommended clippings are given by the 454 sequencer. The Roche software takes into
account the quality and the adaptor sequence to recommend a clipping for each sequence.
Like the ABI files, these are binary files that must be opened with specialized programs.
There are several tools to extract the sequences and to convert them to a more usable format.
Roche provides an executable for this with the 454 machine. Alternatively, we can use
the sff_extract tool to obtain a fasta file.
Fasta
The fasta format is a simple text format. Each sequence starts with a “>” followed by the
sequence name, a space and, optionally, the description.

>seq_1 description
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
>seq_2
ATCGTAGTCTAGTCTATGCTAGTGCGATGCTAGTGCTAGTCGTATGCATGGCTATGTGTG
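
A minimal sketch of how such a file could be read in Python, using only the layout described above (a “>” header line holding the name and optional description, followed by one or more sequence lines). The function name read_fasta is hypothetical and not part of any library mentioned in these notes.

```python
import io


def read_fasta(handle):
    """Yield (name, description, sequence) for each record in a fasta file."""
    name, description, chunks = None, "", []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, description, "".join(chunks)
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            # A sequence may be wrapped over several lines.
            chunks.append(line)
    if name is not None:
        yield name, description, "".join(chunks)


if __name__ == "__main__":
    example = io.StringIO(">seq_1 description\nGATTTGGGG\nTTCAAAGC\n>seq_2\nATCGTAGT\n")
    for record in read_fasta(example):
        print(record)
```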

Usually, if we have quality information, a second fasta-like file with the quality scores is
provided. In this case both the sequence file and the quality file should list the sequences in
the same order.

>seq_1 description
54 57 54 57 48 48 48 48 57 57 57 47 47 41 42 41 47 57 57 57 57 47 44 44 44
44 50 50
54 57 57 46 43 37 44 43 57 37 37 37 57 57 57 57 52 52 52 52 57 50 47 47 52
>seq_2
52 47 52 52 50 50 50 50 50 57 57 54 57 57 57 57 57 57 57 46 46 57 57 57 57
57 57 57
57 57 57 57 57 57 57 57 57 57 57 57 29 29
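
As a rough sketch, a quality file laid out like the one above can be read by collecting the whitespace-separated integers that follow each “>” header, however the lines are wrapped. The function name read_qual is hypothetical.

```python
import io


def read_qual(handle):
    """Yield (name, [quality scores]) for each record in a quality file."""
    name, scores = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, scores
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            scores = []
        elif name is not None:
            # Scores may be wrapped over several lines; keep appending.
            scores.extend(int(token) for token in line.split())
    if name is not None:
        yield name, scores


if __name__ == "__main__":
    example = io.StringIO(">seq_1 description\n54 57 54\n57 48\n>seq_2\n52 47\n")
    for name, scores in read_qual(example):
        print(name, scores)
```

Because the two files are matched only by order, a reader combining them should check that the names agree record by record and that each sequence has as many scores as bases.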

Sanger fastq
The fastq format was developed to provide a convenient way of storing the sequence and the
quality scores in the same file. These are text files and they look like:

@seq_1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@seq_2
ATCGTAGTCTAGTCTATGCTAGTGCGATGCTAGTGCTAGTCGTATGCATGGCTATGTGTG
+
208DA8308AD8SF83FH0SD8F08APFIDJFN34JW830UDS8UFDSADPFIJ3N8DAA

In this file every sequence has 4 lines. In the first line we get the sequence name after the
symbol “@” and, optionally, the description. The second line has the sequence, the third line
starts with “+”, and the fourth line has the quality scores encoded as characters, one per base.
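
The four-line layout can be read with a short loop. This is only a sketch that assumes each record occupies exactly four lines, as in the example above; the function name read_fastq is hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (name, description, sequence, quality) from a 4-line-per-record fastq."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # skip blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed fastq record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        parts = header[1:].split(None, 1)
        name = parts[0] if parts else ""
        description = parts[1] if len(parts) > 1 else ""
        yield name, description, sequence, quality


if __name__ == "__main__":
    example = io.StringIO("@seq_1 example\nGATT\n+\n!''*\n")
    print(list(read_fastq(example)))
```

The length check reflects the rule that there is one quality character per base.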

Quality coding (modified from wikipedia).


SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS......................
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                             104                   126

S - Sanger    Phred+33,  raw reads typically (0, 40)
X - Solexa    Solexa+64, raw reads typically (-5, 40)
I - Illumina  Phred+64,  raw reads typically (0, 40)
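
A short sketch of how the offsets in the table translate characters back into scores. It covers only the two Phred-based encodings (offset 33 and offset 64); the Solexa score uses a different formula and is left out. The function names are hypothetical, and the conversion to an error probability uses the standard Phred definition Q = -10 log10(p), which is not spelled out in these notes.

```python
# ASCII offsets taken from the table above.
OFFSETS = {"sanger": 33, "illumina": 64}


def decode_phred(quality, encoding="sanger"):
    """Convert a quality string into a list of integer Phred scores."""
    offset = OFFSETS[encoding]
    return [ord(char) - offset for char in quality]


def error_probability(score):
    """Probability that a base call is wrong, from Q = -10 * log10(p)."""
    return 10 ** (-score / 10)


if __name__ == "__main__":
    scores = decode_phred("!''*55CC")
    print(scores)
    print([round(error_probability(q), 4) for q in scores])
```

Decoding the same string with the wrong offset shifts every score by 31, which is why the encoding has to be known before the qualities are used.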

Illumina fastq
This file is almost identical to a Sanger fastq file, but the encoding for the quality scores is
different. When we deal with a fastq file we have to be sure which kind of file we are
dealing with, an Illumina fastq or a Sanger fastq. Unfortunately they are not easy to
differentiate. Also take into account that Solexa used to have a third fastq format,
the Solexa fastq, although this one is mostly obsolete. Illumina has since decided to
distribute its files as Sanger fastq, so the Illumina fastq format is no longer used for new data.

One of the seq_crumbs utilities, guess_seq_format, is able to differentiate the Sanger from
the Illumina version by looking for quality characters exclusive to the Sanger version.
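
The same idea can be sketched in a few lines. This is not the seq_crumbs implementation, only an illustration of the heuristic with a hypothetical function name: characters below ASCII 59 can only occur in Sanger files, and characters from 59 to 63 can only occur in Sanger or Solexa files.

```python
def guess_quality_encoding(quality_strings):
    """Guess the fastq quality encoding from the lowest character seen."""
    lowest = min((min(q) for q in quality_strings if q), default=None)
    if lowest is None:
        return "unknown"
    if ord(lowest) < 59:
        return "sanger"  # characters this low exist only in Phred+33 files
    if ord(lowest) < 64:
        return "solexa"  # below the Illumina range, so assumed Solexa+64
    # Only a guess: a Sanger file holding nothing but very high scores
    # would look exactly the same.
    return "illumina"


if __name__ == "__main__":
    print(guess_quality_encoding(["!''*((((***+", "55CCF>>>>>>C"]))
    print(guess_quality_encoding(["hhhhhhgffff"]))
```

The more reads are inspected, the more reliable the guess, since a single low-quality base is enough to identify a Sanger file.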

SRA
SRA is the file format in which all NCBI SRA content is provided. SRA files are binary files
and we need specific tools to extract the information. NCBI has developed a toolkit (the SRA
Toolkit) to deal with these binary files.

Compressed files
Sometimes these sequence text files are compressed to save hard drive space. The
most common compression formats are gzip and bgzip. bgzip is a gzip variant commonly
used in genomics because, although its compression ratio is slightly lower, it
allows random access. Most software is becoming compatible with these formats.
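
A minimal sketch of reading such files transparently in Python with the standard gzip module. The helper name open_maybe_gzip is hypothetical. It checks the two-byte gzip signature rather than trusting the file extension; bgzip output is a series of concatenated gzip blocks, which the gzip module reads as one stream, but this plain reader does not use the random-access feature.

```python
import gzip


def open_maybe_gzip(path):
    """Open a text file for reading, decompressing it if it is gzip/bgzip."""
    with open(path, "rb") as raw:
        magic = raw.read(2)
    if magic == b"\x1f\x8b":  # gzip signature bytes
        return gzip.open(path, "rt")
    return open(path, "rt")
```

The returned handle can be iterated line by line in the same way whether or not the file was compressed.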

Paired files
It is common to obtain two reads from a single molecule. Examples of these techniques are
the Illumina paired-end and mate-pair protocols. In these cases each read has a paired read.
One common way to store those paired reads is to create two fastq files, one for the first read of
the pairs and another one for the second. In this case the files should hold the reads in exactly
the same order.

Fastq file 1
@molecule_1 1st_read_from_pair
@molecule_2 1st_read_from_pair
@molecule_3 1st_read_from_pair

Fastq file 2
@molecule_1 2nd_read_from_pair
@molecule_2 2nd_read_from_pair
@molecule_3 2nd_read_from_pair

Another option is to interleave the reads in a single file alternating the first and second read
for each pair.

Interleaved Fastq file


@molecule_1 1st_read_from_pair
@molecule_1 2nd_read_from_pair
@molecule_2 1st_read_from_pair
@molecule_2 2nd_read_from_pair
@molecule_3 1st_read_from_pair
@molecule_3 2nd_read_from_pair
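
A sketch of how two paired fastq files could be merged into the interleaved layout. It assumes four lines per record and, as in the examples above, that both reads of a pair share the same name before the first space; the function names are hypothetical.

```python
import io
from itertools import zip_longest


def fastq_records(handle):
    """Yield each fastq record as a list of its 4 lines (newlines removed)."""
    while True:
        first = handle.readline()
        if not first:
            return
        if not first.strip():
            continue  # skip blank lines between records
        lines = [first] + [handle.readline() for _ in range(3)]
        if not lines[3]:
            raise ValueError("truncated fastq record")
        yield [line.rstrip("\r\n") for line in lines]


def interleave(first_reads, second_reads, out):
    """Write pairs alternately: read 1 of a pair, then read 2 of the same pair."""
    pairs = zip_longest(fastq_records(first_reads), fastq_records(second_reads))
    for read1, read2 in pairs:
        if read1 is None or read2 is None:
            raise ValueError("the two files hold different numbers of reads")
        if read1[0].split()[0] != read2[0].split()[0]:
            raise ValueError("reads are not in the same order: " + read1[0])
        out.write("\n".join(read1) + "\n")
        out.write("\n".join(read2) + "\n")


if __name__ == "__main__":
    file1 = io.StringIO("@molecule_1 1st_read_from_pair\nACGT\n+\nIIII\n")
    file2 = io.StringIO("@molecule_1 2nd_read_from_pair\nTTGA\n+\nIIII\n")
    merged = io.StringIO()
    interleave(file1, file2, merged)
    print(merged.getvalue())
```

The name check is what enforces the requirement that both files hold the reads in the same order.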
