deep learning annotation

Briefings in Bioinformatics, 2024, 25(3), bbae138
https://doi.org/10.1093/bib/bbae138
Review
From tradition to innovation: conventional and deep learning frameworks in genome annotation
Zhaojia Chen, Noor ul Ain, Qian Zhao and Xingtan Zhang
Corresponding author. Xingtan Zhang, Tel.: +8615985733087; Email: zhangxingtan@caas.cn
Abstract
Following the milestone success of the Human Genome Project, the ‘Encyclopedia of DNA Elements (ENCODE)’ initiative was launched in
2003 to unearth information about the numerous functional elements within the genome. This endeavor coincided with the emergence
of numerous novel technologies, accompanied by the provision of vast amounts of whole-genome sequences, high-throughput data
such as ChIP-Seq and RNA-Seq. Extracting biologically meaningful information from this massive dataset has become a critical aspect
of many recent studies, particularly in annotating and predicting the functions of unknown genes. The core idea behind genome
annotation is to identify genes and various functional elements within the genome sequence and infer their biological functions.
Traditional wet-lab experimental methods still rely on extensive efforts for functional verification. Early bioinformatics
algorithms and software, for their part, primarily employed shallow learning techniques; thus, their ability to characterize data and
learn features was limited. With the widespread adoption of RNA-Seq technology, scientists from the biological community began to
harness the potential of machine learning and deep learning approaches for gene structure prediction and functional annotation. In this
context, we reviewed both conventional methods and contemporary deep learning frameworks, and highlighted novel perspectives on the
challenges arising during annotation, underscoring the dynamic nature of this evolving scientific landscape.
Keywords: genome annotation; genome sequence; bioinformatic; RNA-Seq technology; deep learning
INTRODUCTION

The successful completion of the Human Genome Project (HGP) has propelled genome sequencing efforts in various model organisms, including Escherichia coli, yeast, Drosophila and mice. For instance, the complete sequencing data of the E. coli genome was obtained in 1997, and in the year 2000, the first plant genome, Arabidopsis thaliana, was entirely sequenced. Nevertheless, the genome sequences of various organisms obtained through sequencing do not by themselves unravel the complete molecular processes of life. With advancements in the field of sequencing, it is of paramount importance to decipher the hidden information within DNA sequences. As emphasized by Collins, the chief scientist of the HGP, the next phase of genome research involves elucidating the structure and function of genomes [1] and establishing connections between genomics and biology. Therefore, in 2003, scientists collectively launched the Encyclopedia of DNA Elements (ENCODE) project with the aim of mining and deciphering numerous functional elements within the human genome [2].

With the rapid advancement of Next-Generation Sequencing technologies, the ENCODE website has been updated to the fifth edition, providing researchers with functional genomics and omics data including ChIP-Seq and RNA-Seq experiment reads. This update has opened new doors of opportunities for genomics research, potentially providing access to billions of genomic coordinates and other related data. Typically, after sequencing the genome of a biological organism, the first step involves the correction and assembly of sequenced fragments to generate the base sequence of the genome. With the genome sequence in hand, a world of possibilities unfolds, enabling us to systematically analyze the functions of all genes and their interactions, decode the regulation of genes during an organism’s growth and development, and elucidate the evolutionary patterns of genomes.

Genome annotation, accomplished through the application of bioinformatics methods and tools, entails the identification of genes and various functional elements, including coding genes [3], non-coding RNAs [4], repetitive sequences such as transposons, and regulatory elements, within a genome. Traditional approaches for genome annotation, such as hybridization-based techniques [5] or experimental methods [3, 4], heavily rely on human knowledge and expertise, often with limited throughput and high costs. In addition to these, various bioinformatics software tools have been developed for gene identification and annotation, such as Blast2GO [6], InterProScan [7] and GeneMark [8], among others. However, these methods and software tools primarily constitute shallow learning approaches, with limited capabilities to handle high-throughput data, posing challenges in terms of cost and technical accessibility for researchers without backgrounds in biology and medicine. Furthermore, according to a report by McKinsey [9], it is estimated that by 2025, the scientific databanks will accumulate one billion human genomes, presenting a significant challenge in the field of bioinformatics for the analysis of large-scale omics data.
Received: December 1, 2023. Revised: March 8, 2024. Accepted: March 10, 2024
© The Author(s) 2024. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Over the past decade, deep learning has achieved significant success in fields such as computer vision, speech recognition and natural language processing. The development of machine learning and deep learning has opened new avenues for genome analysis. Many researchers use machine learning techniques to identify patterns in genomic datasets, processing and analyzing data from a variety of biomolecular levels through statistical analysis and computational modeling. Deep learning, in particular, employs multi-layered nonlinear functions to abstract data, extract representative features from large-scale datasets and make more accurate predictions regarding DNA fragments, which makes deep learning models more capable and flexible. With appropriate training data and models, deep learning can automatically learn features and rules with minimal human participation. Several deep learning methods have already been developed for the identification of various genomic elements, such as exons [10] and promoters [11]. This article aims to provide a concise overview of the application of deep learning and machine learning in genome annotation research and highlight potential directions for future development.

To frame the topic, we first briefly introduce conventional algorithms in genome annotation. We then review methods and approaches for the identification and functional prediction of DNA segments with biological significance using deep learning. Finally, we summarize the current challenges and potential research directions.

TRADITIONAL APPROACHES TO GENOME ANNOTATION

Genome annotation, often referred to as the high-throughput annotation of the biological functions of all genes in a genome, utilizes rigorous bioinformatics methods and tools to search and define genomic elements using various approaches, including ab initio prediction, homology-based methods and structural definitions [12, 13]. In general, a complete and effective genome annotation process mainly includes repeat sequence annotation, gene structure prediction, gene functional annotation, variant analysis and the identification of regulatory elements. Organisms are divided into prokaryotes and eukaryotes, and the basic ideas and methods of genome annotation are similar for both. However, due to the large genome of eukaryotes and the large number of repeats, the genome annotation process is more complicated, so this section mainly introduces the genome annotation process of eukaryotes.

Repetitive sequences, defined as identical or symmetric DNA sequences occurring in genomes [14], influence processes such as evolution, gene expression, and transcriptional regulation [15–17]. The recognition of repetitive sequences is the fundamental task in genome annotation and is a prerequisite for subsequent gene prediction. Traditional methods for identifying repetitive sequences fall into three categories: homology-based, structure-based and ab initio prediction [18–20]. RepeatMasker software, widely used for repetitive sequence annotation, employs homology-based identification by conducting similarity searches against repetitive sequence libraries (RepBase and Dfam). MASiVE [21], on the other hand, utilizes heuristic algorithms and structural features to analyze species-specific Long Terminal Repeat (LTR) transposons in plants. The ab initio prediction method, which does not rely on repetitive structures or known sequence similarity information, is particularly useful for novel sequences and is often more flexible than the aforementioned two methods [18, 22]. However, because it mainly analyzes the sequence structure and characteristics based on mathematical models, it is sensitive to algorithm selection and parameter settings, has high computational complexity when processing large datasets, and may even produce false positives. Therefore, conventional methods struggle to obtain satisfactory results in terms of accuracy and dimensions for diverse genomes.

Gene structure prediction represents one of the most critical aspects of genome annotation, aiming to identify coding regions, exons, introns and other functional elements within a genome. Traditional prediction methods combine structural features of genes with evidence from gene expression and homologous protein information from closely related species [23]. Taking Maker [23] as an example, the Semi-HMM-based Nucleic Acid Parser model is first trained using protein sequences, Expressed Sequence Tag sequences, and homologous proteins from closely related species. A specialized configuration file is established to control the operation of the process, and Maker is iteratively employed based on the output results to obtain the optimal result. It is worth mentioning that the parameters of the configuration file need to be manually configured, resulting in a low level of automation. Moreover, the reliability of analysis requires a large number of input samples. However, high-dimensional data increase the complexity of analysis, requiring more powerful computing resources and more data preprocessing work. In addition, methods based on structural features such as Genscan [24], GlimmerHMM [25], Augustus [26] and Apollo [27] focus on specific base structures, and the corresponding elements are generally identified by specific nucleotide sequences (e.g. open reading frames, donor sites, TATA boxes). With the continuous development of sequencing technologies and approaches, novel genomic and transcriptomic data are continually emerging, potentially revealing new genes or critical structural information and identifying genes expressed in specific tissues or physiological states [28]. Consequently, mainstream genome annotation tools commonly integrate transcriptomic data to aid in genome annotation [29].

After obtaining gene structure annotation information, the next crucial step is to proceed with functional annotation, including predicted domains, protein functions and biological pathways. Typically, functional annotation of predicted coding genes is achieved through sequence similarity searches, primarily utilizing Basic Local Alignment Search Tool alignment methods, by comparing proteins to various functional databases [NR, Swiss-Prot, Gene Ontology (GO), KOG, Kyoto Encyclopedia of Genes and Genomes (KEGG)] to obtain information regarding gene function. As genomics and bioinformatics advance, an increasing number of species’ genome sequences are being collected, necessitating the continuous exploration of new genome functions and biological processes. For some newly discovered gene sequences, especially those lacking annotation information, experimental methods are usually required to determine gene function definitively.

Furthermore, traditional methods tend to yield high accuracy in predicting genes with highly conserved structures but may fall short in predicting genes with significant structural variations (SVs) [30]. Genome variations between individuals are typically manifested as single nucleotide variations (SNVs), insertions, deletions and SVs. For second-generation sequencing, such as Illumina, with shorter read lengths, predicting complex SVs, particularly those with unknown structures, poses challenges, and conventional methods based on read depth [31], local assembly
Figure 1. Deep learning workflow in genome annotation. The input to the deep learning model is raw sequence data, derived from short
sequencing reads after sequence splicing and assembly. Before entering the model, the data first undergo feature encoding
and are then fed into a deep learning network constructed from multiple hidden layers. The network extracts latent features from the
input data through these hidden layers for subsequent classification and prediction of genomic components.
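The feature-encoding step in this workflow can be illustrated with a minimal sketch of two common schemes, one-hot encoding and k-mer counting. This is an illustration under stated assumptions rather than the encoding used by any specific tool: the function names `one_hot_encode` and `kmer_counts`, the four-letter alphabet, and the choice to map unknown bases to all-zero rows are hypothetical.

```python
from itertools import product

BASES = "ACGT"  # assumed alphabet; anything else (e.g. N) is treated as unknown


def one_hot_encode(seq):
    """Encode a DNA string as a list of 4-element rows, one row per base."""
    index = {base: i for i, base in enumerate(BASES)}
    rows = []
    for base in seq.upper():
        row = [0.0] * len(BASES)
        if base in index:          # unknown bases stay all-zero
            row[index[base]] = 1.0
        rows.append(row)
    return rows


def kmer_counts(seq, k=3):
    """Count overlapping k-mers; entries follow lexicographic order over ACGT."""
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:        # skip windows containing unknown bases
            counts[pos] += 1
    return counts


if __name__ == "__main__":
    print(one_hot_encode("ACGN"))
    print(kmer_counts("ACGTACGT", k=2))
```

One-hot encoding preserves base order and yields a length-by-4 matrix suited to convolutional input, whereas k-mer counting discards position and yields a fixed-length vector of 4^k entries regardless of sequence length.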
[32] or other methods [33] may be insufficient [34]. With the rapid development of long-read sequencing technologies, such as Pacific Biosciences [35] and Oxford Nanopore Technologies [36], information containing complete complex SVs is becoming available. Nevertheless, advancements in bioinformatics and computational methods are still required to handle highly repetitive sequences, high sequencing error rates and long read lengths [37, 38]. Moreover, SVs in the genome often occur in highly repetitive regions, making it challenging to distinguish SV intervals from background noise.

Of greater importance, eukaryotic genome sequences encompass DNA methylation sites, promoters, enhancers and numerous regulatory elements. Traditional methods, which rely on sequence logic based on consensus sequences and position weight matrices, offer limited information and accuracy in predicting the activation levels and behaviors of regulatory elements [39]. Consequently, research into gene regulation mechanisms encounters significant limitations. In summary, genome annotation is an ongoing and evolving process that necessitates the incorporation of the latest technologies and methods to continually enhance annotation accuracy and reliability.

APPLICATION OF DEEP LEARNING IN GENOME ANNOTATION

Deep learning, a branch of machine learning, is considered a significant step towards artificial intelligence. Its essence is to learn the intrinsic laws and latent representation of massive sample data by constructing multiple hidden layers. In the field of bioinformatics, deep learning techniques have gradually found applications in genome annotation research [40, 41]. In this section, we will introduce the technical methods and application cases of deep learning in genome annotation.

The potent tool for genome annotation: deep learning

Deep learning is essentially a Deep Neural Network (DNN) architecture consisting of an input layer, multiple hidden layers and an output layer. Firstly, raw sequencing data are subjected to feature encoding (e.g. one-hot encoding [42], word embeddings [43], k-mer counting [44]) and are mapped into the input representation of the deep learning model. Unlike traditional genome annotation methods, deep learning models can model nonlinear patterns and adaptively learn feature representations from raw data without the need for manual feature design. This is achieved by embedding the computation of features directly into the machine learning model, resulting in an end-to-end model, as depicted in Figure 1.

The most significant capability of deep learning is representation learning, achieved through three key steps: layer-by-layer processing, feature transformation and increasing complexity. The raw data in genome annotation comprise various types and numerous features, originating from different data sources or datasets, hence possessing high dimensionality and heterogeneity. Deep learning extracts latent features from high-dimensional heterogeneous data, facilitating subsequent prediction and classification tasks. In the face of growing data volumes, deep learning models can enhance their expressive power by increasing network depth and width, and they accelerate the training process through parallel computation. With these characteristics, deep learning becomes a powerful tool for genome annotation with broad application prospects.

Deep learning models for sequence processing include Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), of which CNN is the primary algorithm.

CNN is a type of multi-layer perceptron akin to artificial neural networks and comprises convolutional layers, pooling layers
Figure 2. Structure diagram of convolutional neural network. The basic convolutional neural network consists of five parts: input layer, ReLU layer,
pooling layer, fully connected layer and output layer. The input layer is an embedding sequence, and the convolutional layer performs convolution
operations on local data through a filter. The activation layer maps the convolutional calculation results nonlinearly (usually through the activation
function ReLU). The pooling layer is sandwiched between continuous convolutional layers to compress the amount of data and parameters. The fully
connected layer integrates feature representations together, and finally predicts through the output layer.
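The layer roles named in this caption can be sketched in a few lines of plain Python. The example below is a hand-built illustration, not a trained network: the kernel is set by hand to a one-hot copy of the motif TATA with a bias of -3.0 (both assumptions chosen for clarity), whereas in a real CNN the kernel weights are learned during training. The helper names `one_hot`, `conv1d`, `relu` and `max_pool` are hypothetical.

```python
BASES = "ACGT"


def one_hot(seq):
    """One row of 4 channel values per base (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv1d(x, kernel, bias=0.0):
    """Slide one kernel over the sequence (valid positions only).

    The same kernel weights are reused at every position, which is the
    parameter sharing described for convolutional layers.
    """
    width = len(kernel)
    out = []
    for start in range(len(x) - width + 1):
        total = bias
        for offset in range(width):
            for ch in range(len(kernel[offset])):
                total += x[start + offset][ch] * kernel[offset][ch]
        out.append(total)
    return out


def relu(values):
    """Nonlinear activation: negative responses are clipped to zero."""
    return [max(0.0, v) for v in values]


def max_pool(values, size):
    """Non-overlapping max pooling; keeps the strongest response per region."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


if __name__ == "__main__":
    kernel = one_hot("TATA")   # hand-set motif detector (assumption, not learned)
    for seq in ("GGTATAGG", "TATAGGGG"):
        activated = relu(conv1d(one_hot(seq), kernel, bias=-3.0))
        pooled = max_pool(activated, size=len(activated))  # global max pooling
        print(seq, activated, pooled)
```

With this bias only an exact four-base match survives the ReLU, so the two sequences produce their single nonzero activation at different positions, yet the pooled value is 1.0 in both cases. This is the position-insensitive behaviour that pooling is meant to provide.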
and fully connected layers. It was pioneered by renowned computer scientist Yann LeCun [45]. The core of CNN is its convolutional and pooling layers. As illustrated in Figure 2, the convolutional layer detects features at different positions by sliding convolution kernels across the sequence, thereby identifying critical patterns within the sequence [46]. Importantly, the parameters of these convolution kernels are learned automatically during model training. In convolutional operations, parameters are shared among all positions in the feature map or feature vector, substantially reducing the number of parameters that need to be learned within the network. While convolution operations in neural networks essentially represent linear data processing, not all situations can be simplified as linear processing. The genome contains billions of base pairs and a large number of functional elements, and their interactions and regulatory relationships are nonlinear. Especially for the annotation of three-dimensional genomics and SVs, nonlinear modeling is needed to better understand and explain them. This is where activation functions come into play to address the non-linearity in the model [47]. Common activation functions in CNN include ReLU, sigmoid and tanh, which are used for nonlinear transformation to enable the extracted features to represent complex functions.

Pooling layers perform repeated subsampling, sliding over their input much as the convolutional layer does: max-pooling selects the maximum value in each region, while average pooling computes the average value. These layers serve to reduce dimensionality and extract the most critical features.

Following the convolutional and pooling layers, neural networks often include one or more fully connected layers. These layers serve to flatten the highly abstracted features processed through multiple convolutional and pooling steps into one-dimensional vectors, reducing the impact of feature positions on classification. For instance, motifs in sequences may exhibit similar features at different positions, which could lead to varying classification results. The role of fully connected layers, in this context, is akin to aggregating all regions with similar features into the same value, greatly enhancing the model’s robustness. Consequently, CNNs are highly useful when dealing with scenarios that require recognizing or capturing spatially invariant patterns [48].

Another variant of DNNs is RNN [49, 50]. The traditional neural network model is fully connected between the layers, which ignores the context information of the sequence. RNN introduces recurrent connections to save the information of each layer, and synchronously updates the information of the hidden state to the new network layer. This mechanism ensures that the information at any given moment is determined not only by the input layer at that specific time but is also influenced by the preceding process’s output, as illustrated in Figure 3. Since the sequences themselves may carry continuous information, RNN can share weights across time and extract the temporal and semantic information in the data. Therefore, RNN has unique strengths in identifying patterns within long sequential data.

However, traditional RNNs suffer from problems like vanishing gradients and exploding gradients, making it challenging to
capture long-range dependencies [51, 52]. Fortunately, the introduction of models like Long Short-Term Memory (LSTM) [53] and Bidirectional RNNs [54] has addressed these issues. LSTM models incorporate gate units to control information flow, allowing dynamic changes in states across different nodes. The bidirectional RNN considers the context relationship. For each training sequence, it connects two recurrent networks forward and backward, and both networks connect to an output layer. Importantly, there is no interconnection of information flow between the forward and backward networks, ensuring a non-cyclic unfolded graph. Overall, the RNN has significant advantages in processing genomic sequence data: it can better capture the features of the sequence and more accurately predict important biological features such as gene location and functional regions.

Figure 3. Structure diagram of the Recurrent Neural Network. The basic Recurrent Neural Network consists of three parts: input layer, hidden layer and output layer. In the hidden layer, Recurrent Neural Network stores and remembers previous information, and then inputs it into the current calculated hidden layer unit.

In addition to CNNs, RNNs and their corresponding variants, a variety of deep learning models such as attention mechanism-based neural networks (such as Transformer [55]), graph convolutional network [56] and autoencoder [57] have been widely used in genome annotation. These models are not mutually exclusive, and appropriate models can be selected according to different needs and data types to improve the accuracy and interpretability of genome annotation.

The quality and scale of the training set significantly affect the feature learning process in deep learning. Before inputting into the model, data preprocessing is necessary to address issues such as erroneous labels, missing values, and outliers. Furthermore, the training set should encompass samples from all important categories or scenarios to avoid biases in the model.

Preventing overfitting is a crucial concern when training models. Firstly, selecting an appropriate model complexity is essential. If the model is overly complex, it may overfit the training data, leading to poor performance on unseen data. Therefore, striking a balance between model complexity and performance is necessary. Secondly, regularization techniques are commonly employed to prevent overfitting. By adding penalty terms [58] to the loss function, regularization techniques can constrain the size of model parameters, thereby reducing the risk of overfitting. Additionally, cross-validation [59] is an effective method for assessing model performance and selecting optimal hyperparameters. By dividing the dataset into training and validation sets and conducting multiple training and evaluation iterations, a better estimation of the model’s generalization ability can be obtained. Lastly, early stopping [60] is another effective method for preventing overfitting. Monitoring the model’s performance on the validation set during training and stopping the training process once the performance no longer improves helps prevent overfitting.

Applications in genome annotation

In this section, we briefly review some of the genome annotation problems addressed using deep learning methods and discuss how deep learning models extend their applicability in various genomics areas, as shown in Table 1.

Identification and classification of transposable elements

Transposable elements (TEs), also known as transposons, are the most common repetitive sequences [61] and are widely present in eukaryotic genomes. TEs show great diversity across species and individuals, and within genomes. Moreover, owing to the recurrent process of genome evolution, TEs undergo extensive modifications [62]. This ongoing, unceasing evolution complicates the task of annotating TEs, making it a tough nut to crack. Traditional bioinformatics software, relying on de novo, structural, comparative genomics and homology-based methods, often suffers from high false-positive rates [63].

While some studies have employed machine learning methods for TE identification [64, 65], recent advancements suggest that deep learning techniques can yield better results [66, 67]. Yan et al. developed the DeepTE tool [68], which transforms sequence data into two-dimensional vectors based on k-mer counts. DeepTE utilizes CNN to train on datasets from the RepBase and PGSB databases. It employs a stacked architecture with eight models to classify TEs into super families and orders. DeepTE also employs conservative thresholds to correct misclassifications, achieving TE classification into 15–24 super families for plants, metazoans and fungi. However, k-mer-based computational approaches are time-consuming [69] and can negatively impact neural network training. Consequently, Orozco-Arias et al. developed Inpactor2 [70], which can create an LTR-retrotransposon reference library in a short time frame. By combining structural methods with deep learning and leveraging multi-core and GPU architectures, Inpactor2 achieves the best testing accuracy in just five minutes for the rice genome.

In summary, deep learning techniques have made significant breakthroughs in the field of TE annotation. Besides offering higher identification accuracy, they are able to handle diversity across different species and individuals more effectively, addressing the challenges posed by TE evolution and variation. These tools not only enhance our understanding of TEs in the genome but also provide vital resources for further research into genome evolution, function and regulation.

Protein-coding genes

The majority of eukaryotic gene coding regions are discontinuous, with exons and introns alternately connected [5], and without fixed positions. Moreover, eukaryotic genomes exhibit extensive diversity in length, composition and structure, including variations such as repeat sequences and splice isoforms, making it difficult for traditional machine learning methods to fully capture coding region features [71, 72].

Deep learning-based methods map the nucleotide characters of gene sequences into feature spaces, extracting abstract features from numeric sequences to effectively utilize global information. Recently, approaches using DNNs, such as Deep Belief Networks, have been proposed for exon-intron classification [10]. Qingyu et al. first processed the DNA sequences of eukaryotes using short-time Fourier transform, transforming complex DNA
Table 1: Genomic tools/algorithm based on deep learning architecture for Genome Annotation.
Tools ML&DL model Application Input Year References
nLLCPN+LCPNB MLPs The first attempt to propose Hierarchical REPBASE1, PGSB2 2017 54
Classifiers of TEs.
SClassifyTE optimized SVM (1) Higher accuracy; PGSB, REPBASE 2021 55
with RBF kernel (2) lack of training data results in poor
classification performance at super family
level.
DeepTE CNN (1) A tree structured classification process to PGSB, REPBASE 2020 58
classify TEs into super families and orders;
(2) detected domains inside TEs to correct
false classification;
(3) K-mer-based computational approaches
are time-consuming.
Inpactor2 CNN (1) The first NN-based tool to detect REPBASE, PGSB, REPETDB 2023 60
LTR-retrotransposons de novo;
(2) can be run using CPUs + GPUs, speeding up
the execution time up to 7 times.
None DNN (1) Identification of protein coding region by BG570, HMR195, GEN_x005fSCAN65 2020 63
Deep Belief Network model;
(2) overcome the barriers of traditional protein
coding region identification techniques.
SpliceAI CNN (1) Accurately predicts splice junctions from GENCODE-annotated pre-mRNA 2019 64
an arbitrary pre-mRNA transcript sequence transcript sequences
and noncoding genetic variants that cause
cryptic splicing;
(2) bio interpretability is low, and the
understanding of how noncoding genomic
mutations cause human disease is still far
from complete.
Bidirectional RNN Identification and prediction of splice sites of Cryptosporidium parvum 2022 88
LSTM-RNNs eukaryotic DNA.
Gene2vec word2vec Utilizes transcriptome-wide gene Gene expression data, gene type data, 2019 66
(ML) + DNN co-expression to generate a distributed gene–gene interaction data,
representation of genes. tissue-specific gene expression data
GeneWalk DeepWalk (ML) (1) Utilizes network representation learning to Gene expression data, Dynamical 2021 67
embed genes; Reasoning Assembler (INDRA)
(2) introduces bias towards genes that have statements
been studied.
DeepGMAP CNN(DL) (1) Efficiently identifies patterns in regulatory Alignment files, containing data from 2018 68
DNA sequences; chromatin accessibility assays and
(2) can process information from both forward chromatin immunoprecipitation-
and reverse sequences; sequencing experiments from the
(3) comparison with other models and ENCODE project, and regions enriched
addressing runtime memory issues are with reads were determined as peaks
necessary. by MACS2 peak caller
deepNF AE(DL) + SVM(ML) (1) Using autoencoders to construct compact The protein–protein interaction 2018 69
low-dimensional protein feature networks of both human and yeast
representations from various types of contained in the STRING database
networks;
(2) accurate annotation of proteins with
functional labels.
DeepGO CNN (1) Combining two forms of multi-layer neural The amino acid (AA) sequence of a 2018 70
network-based representation learning to protein
acquire useful features for predicting protein
function;
(2) require a large amount of training data and
consumes significant computational
resources.
LncADeep DNN LncADeep demonstrates competitive time RefSeq, GENCODE 2018 71
efficiency compared to state-of-the-art tools
in lncRNA identification, and it stands out as
the fastest tool for large-scale prediction of
lncRNA–protein interactions.
(Continued)
Table 1: Continued
Tools | ML&DL model | Application | Input | Year | References
HMPI | CNN | Simultaneously simulating structural profiles of promoters and the original sequences of promoters for comprehensive promoter identification, and successfully applied to human, plant, and Escherichia coli K-12 strain datasets. | Eukaryotic Promoter Database (EPD), PlantProm DB | 2022 | 7
BERT-DNA | CNN | Using pre-trained BERT model to extract features enables efficient identification of DNA enhancers from sequence information. | The chromatin state information of nine cell lines, including H1ES, K562, GM12878, HepG2, HUVEC, HSMM, NHLF, NHEK and HMEC | 2021 | 75
DeepArk | CNN | (1) Predicting the regulatory activity of cis-elements from the DNA sequences and genomic variants of four widely studied model organisms; (2) affected by data limitations and contextual complexity, optimization of CNN architectures and sequence lengths is required. | Caenorhabditis elegans, D. rerio, D. melanogaster and M. musculus | 2021 | 76
DeepVariant | CNN | (1) Transforming the problem of distinguishing between different mutation types and lengths into an image classification task; (2) efficiently identifying mutation sites using CNN. | Next-generation sequencing (NGS) reads and reference genome | 2018 | 87
Basset | CNN | (1) Predicting pathogenic single nucleotide polymorphisms (SNPs); (2) lack the ability for de novo annotation of large genomes. | DNA sequence information derived from DNase I hypersensitive sites (DHS) | 2016 | 78
MuRaL | CNN | (1) Learning mutation information from proximal and distal sequence environments to generate high-quality mutation rate maps for humans and multiple species; (2) uneven coverage of genomic regions; (3) limited scope for other mutation types. | The DNA sequences of four representative species: H. sapiens, M. mulatta, A. thaliana and D. melanogaster | 2022 | 79
SVision | CNN | Detect and characterize complex structural variants (CSVs) from long-read sequencing data. | Long-read sequencing data | 2022 | 80
ML: machine learning, DL: deep learning, MLP: multi-layer perceptron, SVM: support vector machine, CNN: Convolutional Neural Network, DNN: Deep Neural
Network, AE: autoencoder.
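Many of the sequence-based tools in Table 1 feed raw DNA to a CNN, and DeepGMAP is described as processing both forward and reverse sequences. As a rough illustration of what such input preparation can look like, the sketch below one-hot encodes a DNA string together with its reverse complement. The function names (`one_hot`, `reverse_complement`, `encode_both_strands`) and the all-zero encoding of ambiguous bases are assumptions made for this example; they are not taken from any of the listed tools.

```python
# Illustrative sketch only: not the preprocessing code of any tool in Table 1.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
BASES = "ACGT"  # assumed channel order for the one-hot rows


def reverse_complement(seq):
    """Return the reverse-strand sequence, read 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T).

    Bases outside ACGT (e.g. N) become an all-zero row (an assumption).
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def encode_both_strands(seq):
    """Return one-hot matrices for the forward and reverse strands."""
    return one_hot(seq), one_hot(reverse_complement(seq))


forward, reverse = encode_both_strands("ACGTN")
```

Each matrix has one row per nucleotide and four columns, which is the shape a one-dimensional convolution over sequence positions would typically consume.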
strings into numerical sequences. After feature selection with Random Forests, the extracted feature set is used as the discriminant variable and the known coding status as the discriminant target to construct a deep belief network model. Comparison on three standard test datasets shows that the accuracy and specificity of the deep belief network model are significantly better than those of the logistic regression model and the Bayesian discriminant model.
Splicing is a crucial step in the expression of protein-coding genes, and accurate identification of splice sites is essential for understanding and analyzing protein coding. SpliceAI [73], which does not rely on predefined features, predicts splicing solely from the pre-mRNA transcript sequence. It uses a 32-layer DNN that predicts pre-mRNA splice sites by evaluating 10,000 nucleotides of sequence context, identifying splice sites directly from the primary sequence. However, as a black-box model, deep learning has low biological interpretability, so the understanding of how mutations in non-coding regions cause human disease is far from complete. Singh et al. combined bidirectional LSTM and recurrent neural network models to identify and predict splice sites in eukaryotic DNA, facilitating exon recognition [74].
In conclusion, deep learning has made progress in identifying and analyzing protein-coding regions. With the continuous emergence of genomic data, deep learning will have more opportunities in predicting gene start and stop positions, splice sites, exons, introns and other features.

Functional annotation
Existing gene functional annotations are discrete and primarily generated through manual processes [43]. The widespread application of high-throughput experiments has resulted in extensive molecular interaction networks, providing rich sources of information for gene and protein functional annotation. Once trained, DNNs can learn from various types of biological data and infer interactions between genes and biological functions. Gene2vec [43] applies the Skip-gram algorithm from the Word2Vec neural network model to gene expression data, treating it like text. It generates distributed representations of genes based on the gene co-expression network of the transcriptome, which are then used to predict the functions of unknown genes. This method has shown promising results in gene annotation tasks across multiple species.
The most commonly used resources for gene functional annotation are GO and KEGG. However, these resources do not provide context-specific functional information for genes, meaning that the annotated functions of a gene, such as its biological processes and molecular functions, may not be relevant in the network under study. Thus,
finding the most relevant functions for genes in specific experimental contexts is a challenge in gene functional annotation. The Churchman lab developed the GeneWalk method [75], which extends GO annotation by learning network representations to identify the most important genes and their related functions in gene lists, facilitating the identification of key genes and downstream experimental breakthroughs. Because the study used the INDRA and Pathway Commons knowledge bases to assemble the GeneWalk network, there are certain limitations when dealing with species that are more distantly related to humans.
Deep learning has been applied to predict relationships between genes and phenotypes across multiple species. A typical example is DeepGMAP [76], a deep learning-based genotype–phenotype mapping platform. It trains various neural network architectures on epigenomic data and incorporates ideas from graph embedding and attention mechanisms to predict associations between genes and phenotypes. The model can process both forward- and reverse-strand sequence information, but comparison with other models and improvements in runtime memory usage are still needed.
Moreover, researchers have utilized deep learning methods to predict the functions of non-coding RNAs [77] and protein–protein interactions [78, 79]. Long non-coding RNAs (lncRNAs) share many similarities with mRNAs, such as transcript length and splicing structure, making lncRNA annotation challenging. Yang et al. developed LncADeep [77], the first tool to identify and infer the functions of lncRNAs. The model integrates intrinsic and homologous features of sequences and employs deep belief networks to discriminate transcripts, incorporating annotations from KEGG and Reactome, among others, for functional annotation.

Regulatory elements
In addition to protein-coding regions, the genome contains crucial regulatory elements that determine when, where and how genes are expressed. A deep understanding of regulatory elements is essential for elucidating the mechanisms of life, the causes of human diseases and conservation patterns among species [80, 81]. In recent years, DNNs have demonstrated outstanding performance in the identification of promoters, enhancers and transcription factor binding sites.
Wang et al. proposed a hybrid model, HMPI [11], for promoter recognition, which combines fully connected networks and DenseNet-based Deep Structural Profile Networks, achieving excellent performance on datasets for plants, humans, and E. coli.
Enhancers, which increase the transcriptional activity of promoters, are critical regulatory elements. Researchers have recently improved enhancer prediction by processing DNA sequences with pre-trained Bidirectional Encoder Representations from Transformers (BERT) [82]. This approach combines bidirectional encoding with CNNs to capture interpretable features, resulting in significantly better enhancer prediction performance. Identifying transcription factor binding sites is equally crucial for understanding transcriptional regulation. DeepArk [83], a deep learning model for studying cis-regulatory activity across four extensively researched species, accurately predicts thousands of different regulatory features, including chromatin states, histone marks and transcription factors. Despite this breadth, DeepArk may still fail to model certain regulatory features because of data limitations. For example, it may not cover TF binding in rare cell types. In addition, DeepArk's predictions may vary across different contexts. This variability makes analysis more complex, but it can also generate novel hypotheses for mechanistic experiments. Thus, there is still room for improvement in DeepArk. Novel CNN architectures and optimization of sequence lengths are expected to enhance regulatory activity prediction.
Over time, deep learning methods for identifying genomic functional elements are evolving and proliferating, thus providing robust support for our in-depth understanding of genome function and regulation mechanisms.

Identifying sequence variation
Genomic mutations are the basis of genetic diversity, and various changes in DNA sequences (e.g. SNVs, insertions, repeats) significantly affect an organism's physiology and behavior. Deep learning methods help efficiently extract features of different types of variations and flexibly handle various genomic data types.
In order to learn the internal connections of sequences and efficiently identify mutation sites, Poplin et al. developed DeepVariant [30], which transforms the problem of distinguishing different mutation types and lengths into an image classification problem. The seven channels of the image represent different features of the aligned read data, and an Inception-v3 CNN is used to extract image features. This turns variant identification from an expert-driven, technology-specific statistical modeling process into a more automated process of optimizing a universal model for the data.
Many non-coding variations are associated with disease risk. Zhou et al. utilized whole-genome deep learning analysis to predict the specific regulatory effects and adverse impacts of mutations [84], demonstrating the involvement of non-coding mutations in synaptic transmission and neuron development. Kelley et al. developed Basset [42], which attempted to link neural network components with biological significance and predicted pathogenic single-nucleotide polymorphisms (SNPs) using CNNs. However, it lacks the capability to de novo annotate large genomes. More broadly, by profiling the genome and variations of individual patients, deep learning models help predict the most effective drug and its dosage. As genomic sequences significantly influence mutation rates and are closely related to various functional genomic features, it is speculated that DNN models can learn information related to mutation rates from extensive nearby sequences, thus providing better mutation rate estimates. MuRaL, a recently developed framework [85], was constructed with two modules that learn mutation information from proximal and distal sequence context. It employs different types of DNNs to extract features, generating high-quality mutation rate maps for humans and multiple species. However, the model gives relatively poor mutation rate estimates in certain genomic regions, particularly those overlapping recent segmental duplications on human chromosome 8, indicating potential inaccuracies and limitations in coverage. Moreover, while the model addresses fine-scale germline mutation rates, it may have limited applicability for predicting mutation rates of other types, such as small insertions and deletions.
In contrast to simple variants such as SNPs, complex SVs involve changes in large genomic sequences, often with multiple breakpoints, and are frequently overlooked. Researchers have developed a deep learning-based multi-object recognition framework called SVision [86]. It represents sequence variations

as denoised images depicting differences between the sequence and its corresponding fragment on the reference genome, characterizing complex SVs with different structures from the genome.
Genomic variations are often the result of errors during DNA replication, recombination processes, or external environmental pressures and experimental handling. The application of deep learning in the field of genomic variations not only aids in mutation detection and identification of variation types but also contributes to our understanding of the biological functions and mechanisms related to genomic variation.
In addition to the above aspects, deep learning has been applied to other genome annotation tasks, such as predicting methylation sites [87] and annotating transcription factor binding sites [46], and has been extended to epigenomics, structural genomics, functional genomics and other fields. So far, deep learning models have shown significant promise in genome annotation, bringing renewed vigor to our research efforts.

CONCLUSION
In summary, we have presented an overview of conventional methods for genome annotation and discussed the fundamental frameworks of CNNs and RNNs. With advances in artificial intelligence and deep learning models, bioinformatics has found applications in diverse facets of genomics, e.g. protein-coding regions, regulatory elements, sequence variations, functional annotation and more. It is evident that the capacity of deep learning to handle high-dimensional and heterogeneous data has significantly aided researchers in genome annotation, analysis and classification.
Although deep learning models outperform many traditional methods in genomic annotation, conventional machine learning may perform better for small datasets. Additionally, deep learning requires hardware support and a large amount of parallelism [88], so deep learning algorithms are not always superior to conventional machine learning algorithms. Furthermore, when applying natural language processing models to genomic data, interpretability challenges are inevitably encountered. The ability of deep learning to extract features from high-dimensional and heterogeneous data is a double-edged sword. The highly non-linear structure and over-parameterization of DNNs enable precise problem-solving and predictions. However, the opaque nature of feature extraction and decision-making hinders the biological interpretation of omics data, and the specifics of what the model learns during training remain unknown. At present, the main approach is to infer and evaluate the features and weights of the model by studying the relationship between input and output, and to provide a feature importance score. Additionally, the choice of deep learning model is a challenging step. CNNs excel in capturing crucial attributes of omics data, RNNs leverage contextual information from sequences and autoencoders uncover hidden data distributions. Therefore, users need clear criteria and a sound understanding to select the appropriate model based on the research objectives.
Although deep learning may not have revolutionized genome annotation as dramatically as it has protein structure prediction, we anticipate a brighter future as new models continue to emerge and omics data continue to proliferate. Deep learning holds the potential to bring significant advancements to the realm of genome annotation.

Key Points
• The ability of conventional genome annotation methods to characterize data and learn features was limited.
• Deep learning models have become a powerful tool for genome annotation with broad application prospects.
• The two basic frameworks of deep learning models for sequence processing are Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).
• The applications of deep learning in genome annotation mainly cover five aspects.

FUNDING
This work was supported by the National Key Research and Development Program of China (2021YFF1000900), the Shenzhen Science and Technology Program (Grant No. RCYX20210706092103024) and the National Natural Science Foundation of China (Nos. 32222019 and 32102218).

REFERENCES
1. Collins FS, Green ED, Guttmacher AE, Guyer MS. A vision for the future of genomics research. Nature 2003;422:835–47.
2. An integrated encyclopedia of DNA elements in the human genome. Nature 2012;489:57–74.
3. Harrow J, Nagy A, Reymond A, et al. Identifying protein-coding genes in genomic sequences. Genome Biol 2009;10:201.
4. Hüttenhofer A, Vogel J. Experimental approaches to identify non-coding RNAs. Nucleic Acids Res 2006;34:635–46.
5. Hoheisel JD. Application of hybridization techniques to genome mapping and sequencing. Trends Genet 1994;10:79–83.
6. Conesa A, Götz S, García-Gómez JM, et al. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 2005;21:3674–6.
7. Zdobnov EM, Apweiler R. InterProScan – an integration platform for the signature-recognition methods in InterPro. Bioinformatics 2001;17:847–8.
8. Besemer J, Borodovsky M. GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Acids Res 2005;33:W451–4.
9. Manyika J, Chui M, Bughin J. Disruptive technologies: advances that will transform life, business, and the global economy. 2013.
10. Hu Q, Liu G. Application of deep belief network in recognition of protein coding regions. Comput Eng Appl 2020;56(4):247–55.
11. Wang Y, Peng Q, Mou X, et al. A successful hybrid deep learning model aiming at promoter identification. BMC Bioinform 2022;23:206.
12. Ranganathan S, Gribskov M, Nakai K, et al., (eds). Encyclopedia of bioinformatics and computational biology. Amsterdam, Oxford, Cambridge: Elsevier, 2019, p. 3346.
13. Stein L. Genome annotation: from sequence to biology. Nat Rev Genet 2001;2:493–503.
14. Kazazian HH, Jr. Mobile elements: drivers of genome evolution. Science 2004;303:1626–32.
15. Lu Q, Wallrath LL, Granok H, Elgin SC. (CT)n (GA)n repeats and heat shock elements have distinct roles in chromatin structure
and transcriptional activation of the Drosophila hsp26 gene. Mol Cell Biol 1993;13:2802–14.
16. Kundu TK, Rao MR. CpG islands in chromatin organization and gene expression. J Biochem 1999;125:217–22.
17. Shapiro JA, von Sternberg R. Why repetitive DNA is essential to genome function. Biol Rev Camb Philos Soc 2005;80:227–50.
18. Lerat E. Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs. Heredity 2010;104:520–33.
19. Romero JR, Carballido JA, Garbus I, et al. A bioinformatics approach for detecting repetitive nested motifs using pattern matching. Evol Bioinform 2016;12:247–51.
20. Bergman CM, Quesneville H. Discovering and detecting transposable elements in genome sequences. Brief Bioinform 2007;8:382–92.
21. Darzentas N, Bousios A, Apostolidou V, Tsaftaris AS. MASiVE: mapping and analysis of Sirevirus elements in plant genome sequences. Bioinformatics 2010;26:2452–4.
22. Guo R, Li YR, He S, et al. RepLong: de novo repeat identification using long read sequencing data. Bioinformatics 2018;34:1099–107.
23. Cantarel BL, Korf I, Robb SM, et al. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res 2008;18:188–96.
24. Eyras E, Reymond A, Castelo R, et al. Gene finding in the chicken genome. BMC Bioinform 2005;6:131.
25. Majoros WH, Pertea M, Salzberg SL. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 2004;20:2878–9.
26. Stanke M, Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 2003;19:ii215–25.
27. Dunn N, Unni D, Diesh C, et al. Apollo: democratizing genome annotation. PLoS Comput Biol 2019;15:e1006790.
28. Cerqueira GC, Arnaud MB, Inglis DO, et al. The Aspergillus Genome Database: multispecies curation and incorporation of RNA-Seq data to improve structural gene annotations. Nucleic Acids Res 2014;42:D705–10.
29. Holt C, Yandell M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinform 2011;12:491.
30. Poplin R, Chang P-C, Alexander D, et al. A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol 2018;36:983–7.
31. Yoon S, Xuan Z, Makarov V, et al. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res 2009;19:1586–92.
32. Chen K, Chen L, Fan X, et al. TIGRA: a targeted iterative graph routing assembler for breakpoint assembly. Genome Res 2014;24:310–7.
33. Jiang Y, Wang Y, Brudno M. PRISM: pair-read informed split-read mapping for base-pair level detection of insertion, deletion and structural variants. Bioinformatics 2012;28:2576–83.
34. Gong T, Hayes VM, Chan EKF. Detection of somatic structural variants from short-read next-generation sequencing data. Brief Bioinform 2021;22(3):bbaa056.
35. Sedlazeck FJ, Lee H, Darby CA, Schatz MC. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat Rev Genet 2018;19:329–46.
36. Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet 2016;17:333–51.
37. Roberts RJ, Carneiro MO, Schatz MC. The advantages of SMRT sequencing. Genome Biol 2013;14:405.
38. Jain M, Olsen HE, Paten B, Akeson M. The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biol 2016;17:239.
39. Vaishnav ED, de Boer CG, Molinet J, et al. The evolution, evolvability and engineering of gene regulatory DNA. Nature 2022;603:455–63.
40. Eraslan G, Avsec Ž, Gagneur J, Theis FJ. Deep learning: new computational modelling techniques for genomics. Nat Rev Genet 2019;20:389–403.
41. Libbrecht MW, Noble WS. Machine learning applications in genetics and genomics. Nat Rev Genet 2015;16:321–32.
42. Kelley DR, Snoek J, Rinn JL. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res 2016;26:990–9.
43. Du J, Jia P, Dai Y, et al. Gene2vec: distributed representation of genes based on co-expression. BMC Genom 2019;20:82.
44. Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics 2021;37:2112–20.
45. LeCun Y, Boser BE, Denker JS, et al. Backpropagation applied to handwritten zip code recognition. Neural Comput 1989;1:541–51.
46. Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol 2015;33:831–8.
47. Xu B, Wang N, Chen T, et al. Empirical evaluation of rectified activations in convolutional network. Comput Sci 2015. https://doi.org/10.48550/arXiv.1505.00853.
48. Zou J, Huss M, Abid A, et al. A primer on deep learning in genomics. Nat Genet 2019;51:12–8.
49. Jordan MI. Serial order: a parallel distributed processing approach. Adv Psychol 1997;121:471–95.
50. Elman JL. Finding structure in time. Cognit Sci 1990;14:179–211.
51. Werbos PJ. Backpropagation through time: what it does and how to do it. Proc IEEE 1990;78:1550–60.
52. Pascanu R, Mikolov T, Bengio YO. On the difficulty of training recurrent neural networks. International Conference on Machine Learning. JMLR.org, 2013. https://doi.org/10.1007/s12088-011-0245-8.
53. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9:1735–80.
54. Schuster M, Paliwal KK. Bidirectional Recurrent Neural Networks. IEEE Trans Signal Process 1997;45:2673–81.
55. Zhou J, Chen Q, Braun PR, et al. Deep learning predicts DNA methylation regulatory variants in the human brain and elucidates the genetics of psychiatric disorders. Proc Natl Acad Sci U S A 2022;119:e2206069119.
56. Yuan Y, Bar-Joseph Z. GCNG: graph convolutional networks for inferring gene interaction from spatial transcriptomics data. Genome Biol 2020;21:300.
57. Wang Y, Liu T, Xu D, et al. Predicting DNA methylation state of CpG dinucleotide using genome topological features and deep networks. Sci Rep 2016;6:19598.
58. Wang W, Zhang X, Dai DQ. DeFusion: a denoised network regularization framework for multi-omics integration. Brief Bioinform 2021;22(5).
59. Parvandeh S, Yeh HW, Paulus MP, McKinney BA. Consensus features nested cross-validation. Bioinformatics 2020;36:3093–8.
60. Wang K, Abid MA, Rasheed A, et al. DNNGP, a deep neural network-based method for genomic prediction using multi-omics data in plants. Mol Plant 2023;16:279–93.
61. Muszewska A, Hoffman-Sommer M, Grynberg M. LTR retrotransposons in fungi. PloS One 2011;6:e29425.
62. Morse AM, Peterson DG, Islam-Faridi MN, et al. Evolution of genome size and complexity in Pinus. PloS One 2009;4:e4332.
63. Orozco-Arias S, Isaza G, Guyot R. Retrotransposons in plant genomes: structure, identification, and classification through bioinformatics and machine learning. Int J Mol Sci 2019;20(15).
64. Nakano FK, Pinto WJ, Pappa GL, et al. Top-down strategies for hierarchical classification of transposable elements with neural networks. In: 2017 International Joint Conference on Neural Networks (IJCNN). 2017, p. 2539–46.
65. Panta M, Mishra A, Hoque MT, Atallah J. ClassifyTE: a stacking-based prediction of hierarchical classification of transposable elements. Bioinformatics 2021;37:2529–36.
66. Montesinos-López OA, Montesinos-López A, Pérez-Rodríguez P, et al. A review of deep learning applications for genomic selection. BMC Genom 2021;22:19.
67. da Cruz MHP, Domingues DS, Saito PTM, et al. TERL: classification of transposable elements by convolutional neural networks. Brief Bioinform 2021;22(3):bbaa185. https://doi.org/10.1093/bib/bbaa185.
68. Yan H, Bombarely A, Li S. DeepTE: a computational method for de novo classification of transposons with convolutional neural network. Bioinformatics 2020;36:4269–75.
69. Pandey P, Bender MA, Johnson R, et al. Squeakr: an exact and approximate k-mer counting system. Bioinformatics (Oxford, England) 2018;34(4):568–75.
70. Orozco-Arias S, Humberto Lopez-Murillo L, Candamil-Cortés MS, et al. Inpactor2: a software based on deep learning to identify and classify LTR-retrotransposons in plant genomes. Brief Bioinform 2023;24(1):bbac511.
71. Rajapakse JC, Loi SH. Markov encoding for detecting signals in genomic sequences. IEEE/ACM Trans Comput Biol Bioinform 2005;2:131–42.
72. Yu N, Li Z, Yu Z. Survey on encoding schemes for genomic data representation and feature learning: from signal processing to machine learning. Big Data Mining Anal 2018;1:191–210.
73. Jaganathan K, Kyriazopoulou Panagiotopoulou S, McRae JF, et al. Predicting splicing from primary sequence with deep learning. Cell 2019;176:535–48.e24.
74. Singh N, Nath R, Singh DB. Splice-site identification for exon prediction using bidirectional LSTM-RNN approach. Biochem Biophys Rep 2022;30:101285.
75. Ietswaart R, Gyori BM, Bachman JA, et al. GeneWalk identifies relevant gene functions for a biological context using network representation learning. Genome Biol 2021;22:55.
76. Onimaru K, Nishimura O, Kuraku A. A regulatory-sequence classifier with a neural network for genomic information processing. Cold Spring Harbor Laboratory, 2018. https://doi.org/10.1101/355974.
77. Yang C, Yang L, Zhou M, et al. LncADeep: an ab initio lncRNA identification and functional annotation tool based on deep learning. Bioinformatics 2018;34:3825–34.
78. Gligorijevic V, Barot M, Bonneau R. deepNF: deep network fusion for protein function prediction. Bioinformatics 2018;34:3873–81.
79. Kulmanov M, Khan MA, Hoehndorf R, Wren J. DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics 2018;34:660–8.
80. Pregizer S, Mortlock DP. Control of BMP gene expression by long-range regulatory elements. Cytokine Growth Factor Rev 2009;20:509–15.
81. Wittkopp PJ, Kalay G. Cis-regulatory elements: molecular mechanisms and evolutionary processes underlying divergence. Nat Rev Genet 2011;13:59–69.
82. Le NQK, Ho QT, Nguyen TT, Ou YY. A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information. Brief Bioinform 2021;5:5. https://doi.org/10.1093/bib/bbab005.
83. Cofer EM, Raimundo J, Tadych A, et al. Modeling transcriptional regulation of model species with deep learning. Genome Res 2021;31:1097–105.
84. Zhou J, Park CY, Theesfeld CL, et al. Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk. Nat Genet 2019;51:973–80.
85. Fang Y, Deng S, Li C. A generalizable deep learning framework for inferring fine-scale germline mutation rate maps. Nat Mach Intell 2022;4:1209–23.
86. Lin J, Wang S, Audano PA, et al. SVision: a deep learning approach to resolve complex structural variants. Nat Methods 2022;19:1230–3.
87. Tan F, Tian T, Hou X, et al. Elucidation of DNA methylation on N6-adenine with deep learning. Nat Mach Intell 2020;2:466–75.
88. Lecun Y. 1.1 Deep learning hardware: past, present, and future. In: Proceedings of the 2019 IEEE International Solid-State Circuits Conference (ISSCC), 2019.
