

BioSystems 174 (2018) 37–48

Contents lists available at ScienceDirect

BioSystems
journal homepage: www.elsevier.com/locate/biosystems

Review article

A guide to gene regulatory network inference for obtaining predictive solutions: Underlying assumptions and fundamental biological and data constraints
Sara Barbosa a,*, Bastian Niebel a, Sebastian Wolf a, Klaus Mauch a, Ralf Takors b

a Insilico Biotechnology AG, Meitnerstrasse 9, 70563 Stuttgart, Germany
b Institute of Biochemical Engineering, University of Stuttgart, Allmandring 31, 70569 Stuttgart, Germany

ARTICLE INFO

Keywords:
Gene Regulatory Network Inference
Regression
Information theory
Bayesian networks
Boolean networks
Neural networks

ABSTRACT

The study of biological systems at a system level has become a reality due to increasingly powerful computational approaches able to handle increasingly larger datasets. Uncovering the dynamic nature of gene regulatory networks in order to attain a system level understanding and improve the predictive power of biological models is an important research field in systems biology. The task itself presents several challenges, since the problem is of combinatorial nature and highly depends on several biological constraints and also the intended application. Given the intrinsic interdisciplinary nature of gene regulatory network inference, we present a review on the currently available approaches, their challenges and limitations. We propose guidelines to select the most appropriate method considering the underlying assumptions and fundamental biological and data constraints.

1. Introduction

Since the pioneering days of the human genome project in 2000, sequencing costs have dropped significantly to 0.000000012$ per base pair (Wetterstrand, 2017). Accordingly, a plethora of genomic and transcriptomic datasets has been created, covering thousands of genome sequences or mirroring transcriptional patterns in microbial or mammalian cells, or in human beings. With the increasing amount of available data, the challenge of how to better interpret it has also emerged (Vivek-Ananth and Samal, 2016). There is a need to develop methods that allow full discovery of information that is often hidden in the complex gene–gene or gene–environment interactions. The inference of gene regulatory networks (GRNs) represents a very promising approach to fulfil this challenging task. GRNs aim at unraveling gene expression patterns, localization and temporal organization (Fig. 1). The inference of such networks is based on the assumption that it is possible to model the expression level of one gene as a function of one or more genes (Kabir et al., 2010; Wu et al., 2016). This task entails several challenges, since the problem is of combinatorial nature. Moreover, different biological constraints and also the intended application for the inferred GRN play an important role in the choice of the most appropriate inference method.
The purpose of the present review is to guide and help beginners in this field with the challenging task of selecting the most suitable GRN inference approach. Recent technologies are collected and qualified with respect to underlying assumptions and fundamental biological constraints. As such, the statements may serve as a compendium to decide on the most suitable approach for a given data analysis problem.

2. Biological constraints

Gene expression values can be measured at a single time point or during consecutive time points, after some perturbation is applied to the system (Hecker et al., 2009; Markowetz and Spang, 2007). These are known as steady-state and time-series data, respectively (Wang et al., 2013). Time-series data include useful extra information on how gene expression values change, providing information about the system dynamics (Lingeman and Shasha, 2012). However, due to the added temporal information, they are intrinsically more complex and thus the GRN inference may involve higher computational effort (Lingeman and Shasha, 2012; Zavlanos et al., 2011). Perturbations can be of different nature, comprising environmental changes and manipulations on the metabolic, proteomic, genetic or transcriptomic level (Hecker et al., 2009). The sampling time and time scales are also a crucial point for the collection of relevant data. One should be aware of the time scales of the biological processes and choose the sampling time and time scales accordingly, to guarantee that the collected data correctly represent the biological process in question. The measurement technology also plays


Corresponding author.
E-mail address: sara.barbosa@insilico-biotechnology.com (S. Barbosa).

https://doi.org/10.1016/j.biosystems.2018.10.008
Received 10 August 2018; Received in revised form 5 October 2018; Accepted 8 October 2018
Available online 09 October 2018
0303-2647/ © 2018 Elsevier B.V. All rights reserved.
S. Barbosa et al. BioSystems 174 (2018) 37–48

Fig. 1. Gene regulatory network inference. GRN inference methods are applied to time-series or non-temporal transcript data. They can be divided into gene selection
and gene ranking methods. Examples of gene selection and gene ranking based methods are the LARS (a) and decision trees (b), respectively. From the application of the
inference method to the transcript data, the most relevant predictive genes for a specific target gene are inferred. A GRN can then be constructed from this information.
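
The workflow summarized in Fig. 1 (score candidate regulators for each target gene, then assemble the network from the best-scoring candidates) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the expression values are invented, the ranking score is the absolute Pearson correlation (a deliberately simple stand-in, not LARS or decision trees), and the helper names `pearson`, `rank_regulators` and `build_grn` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    # A constant profile carries no information; report zero correlation.
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def rank_regulators(expr, target):
    """Rank all other genes by |correlation| with the target gene (gene ranking)."""
    scores = {
        gene: abs(pearson(values, expr[target]))
        for gene, values in expr.items()
        if gene != target
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def build_grn(expr, top_k=1):
    """Keep the top_k ranked candidates of every target as (regulator, target, score).

    Correlation is symmetric, so the direction of an edge here only reflects
    which gene was treated as the target, not a causal direction.
    """
    edges = []
    for target in expr:
        for regulator, score in rank_regulators(expr, target)[:top_k]:
            edges.append((regulator, target, round(score, 3)))
    return edges


# Invented expression profiles: gene -> values across five samples.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.1, 9.8],  # roughly twice geneA
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],  # unrelated pattern
}
grn = build_grn(expr, top_k=1)
```

With real data the scoring function would be replaced by one of the inference methods reviewed in Section 3; the surrounding loop over target genes stays the same.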

an important role. Several technologies have been considered, including Chromatin Immunoprecipitation combined with microarray technology (ChIP-on-chip), DNA microarrays, Polymerase Chain Reaction (PCR) and others. DNA microarrays have been one of the most common technologies for GRN inference (Ghazikhani et al., 2011; Haury et al., 2012; Hecker et al., 2009; Kao et al., 2004; Manioudaki and Poirazi, 2013). The major drawback of DNA microarrays is that the data contain a high noise-to-signal ratio. Next generation sequencing (NGS) techniques have recently allowed the development of more advanced gene expression measuring methods (Martins et al., 2016). NGS allows parallel processing of millions of sequence reads and avoids some of the cloning bias issues (Mardis, 2008). As the cost of NGS techniques lowers (cost per genome from around 100M$ in 2001 to 1K$ in 2015, data source: www.genome.gov), it will be possible to perform genome resequencing of several biological systems in order to capture a spectrum of variability and establish a baseline for different studies (Mardis, 2008).
Several constraints can arise from the data used for GRN inference. Is it transcript data? Is it time-series transcript data? How was the data measured? Does it have high levels of noise? Is it functional (perturbation) data? The studied organism can also contribute more biological constraints. If we are looking into the GRN of Escherichia coli (E. coli), Chinese Hamster Ovary (CHO) cells or even human beings, the constraints are different. Not only is the size of the network different, but also the available data and the prior knowledge. For example, the E. coli GRN is smaller when compared to the CHO or human GRNs; there is also more published data available and more comprehensive prior knowledge. All of these constraints play an important role and can sometimes become an impediment in the choice of the inference approach to be considered.
One should also consider the intended application when choosing the most appropriate method. We can aim at learning about the induction of disease phenomena in human cells, discovering disease mechanisms and trying to model them, inferring the causal map of molecular interactions, designing experiments and perturbation experiments or, for example, drug discovery. Fig. 2 summarizes some of the biological constraints and intended applications of GRNs.
We now focus on the example of drug discovery and process development. The motivation behind drug discovery is a disease or condition that does not yet have an available therapeutic solution, or for which the available solutions are not efficient (Hughes et al., 2011). One of the first steps involves the identification of target molecules or pathways (Jiang and Zhou, 2005). Candidate drugs are then chosen taking into account several key factors, for example whether they easily attach themselves to the target and whether a cascade of reactions that results in a therapeutic effect is triggered (Hughes et al., 2011). One of the next steps is drug production, taking place in organisms such as CHO or E. coli. Starting with an initial concept of the production process, the process can be optimized in order to increase different performance indicators. Fig. 3 depicts the main steps from drug discovery to process development. With this simple example, we start to see that different steps may demand different approaches, since they occur in different organisms and therefore have their unique constraints.
When we talk about drug discovery, the focus is on human models, and for process development the focus is usually on CHO and/or E. coli models. Nevertheless, it is necessary to have knowledge of the underlying biological mechanisms in both cases in order to obtain predictive solutions. Metabolism and gene regulation are considered the two primary activities of life (Yeang and Vingron, 2006). The coupling between them is not yet well understood although of great importance, since it is known that metabolic reactions are regulated by enzymes and substrates of metabolic reactions regulate the activity of transcription factors. Predictive power can greatly increase through the understanding of metabolic models, gene regulatory models and the coupling of both. In this review, we specifically focus on one aspect of this problem: the


Fig. 2. Example of biological and data constraints and intended applications of GRNs.
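
The questions raised in Section 2 (type of data, noise, perturbations, organism, prior knowledge) can be written down as a small checklist before any inference method is chosen. The sketch below is only an illustration of that bookkeeping: the class `InferenceProblem`, its field names, the method `constraints` and the example values are hypothetical and are not taken from the review.

```python
from dataclasses import dataclass


@dataclass
class InferenceProblem:
    """Hypothetical record of the data and biological constraints of a study."""

    organism: str
    intended_application: str
    time_series: bool        # time-series or steady-state transcript data?
    perturbation_data: bool  # is functional (perturbation) data available?
    high_noise: bool         # e.g. a technology with a high noise-to-signal ratio
    prior_knowledge: bool    # is curated regulatory knowledge available?

    def constraints(self):
        """List the constraints that restrict the choice of inference method."""
        found = []
        if not self.time_series:
            found.append("steady-state data only: no direct view of dynamics")
        if not self.perturbation_data:
            found.append("no perturbation data")
        if self.high_noise:
            found.append("high noise level")
        if not self.prior_knowledge:
            found.append("little prior knowledge")
        return found


# Invented example: a production cell line study with noisy steady-state data.
problem = InferenceProblem(
    organism="CHO",
    intended_application="process development",
    time_series=False,
    perturbation_data=True,
    high_noise=True,
    prior_knowledge=False,
)
issues = problem.constraints()
```

The returned list is meant as a reminder of what has to be weighed when selecting a method, not as an automatic recommendation.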

inference of gene regulatory networks.
As mentioned in (Emmert-Streib et al., 2014) and (Marbach et al., 2012), it is highly unlikely that there is one single method able to correctly infer different systems. It is critical to understand the biological and data constraints of the problem and the limitations and advantages of the different inference methods, so that effective application of a given inference method can be achieved (Marbach et al., 2012). Differences in the studied organism, conditions, type of data, size of the network, number of samples and amount of noise will have a great impact on the performance of different inference approaches. Several methods, ranging from Bayesian networks, neural networks and Boolean networks to regression-based and information theory approaches, have been considered by several authors to tackle the complex topic of GRN inference.
In the next sections, we review the complex question of reverse engineering GRNs and explore some approaches and the future directions of research in this field. We also focus on the constraints arising from the biological problems and how these are transferred to the problem of GRN inference. We intend to guide future researchers in choosing the most appropriate method according to the biological and data constraints and intended application.

3. Gene regulatory network inference methods

Gene regulatory networks are represented as graphs, G = {V, E}, with V and E representing the vertices and edges of the graph, respectively. The vertices correspond to the genes and the edges to the connections between them. A connection exists when one gene is known to control another gene. The exact nature (inhibition or activation) and the direction of control may or may not be known. If the edges have a direction, the graphs are called directed graphs, whereas if no direction is present, they are called undirected graphs (Roy and Guzzi, 2016). Several factors such as the network size, type of data and availability of prior knowledge have to be taken into account when choosing between different inference methods. To infer small-scale networks, methods such as Bayesian networks and neural networks are an option. For large-scale networks, regression-based methods, information theory and Boolean networks are a good choice. Small-scale networks can in principle be inferred using methods able to deal with large-scale networks. On the other hand, methods primarily designed for small networks can be adjusted in order to deal with large-scale networks. In this section, we present and discuss some of the available approaches in GRN inference, providing information about the intended use of such approaches.

3.1. Bayesian networks

Bayesian networks provide flexible frameworks to combine different data types and prior knowledge, to avoid overfitting and to handle incomplete and noisy data (Hecker et al., 2009). Due to the high number of parameters, this type of approach is usually not appropriate for large-scale GRN inference (Sławek and Arodź, 2013; Wu et al., 2016). They aim at finding the directed acyclic graph that describes the causal interactions among the different gene network components and

Fig. 3. From drug discovery to process development. Drug discovery is motivated by a disease or condition for which there is not a therapeutic solution, or the
solutions are not efficient. Target prediction and selection takes place by evaluating different phenotypes. Candidate drugs are investigated and the ideal drug is
selected based on factors such as safety and efficacious attachment to the target. The production process of such a drug is then evaluated. Improvements to an initial
production process can be predicted in order to find an optimized drug production process.
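
The graph representation G = {V, E} introduced in Section 3 can be made concrete with a small data structure. The sketch below is an illustration under stated assumptions: the class names `Edge` and `GeneNetwork`, the encoding of the interaction sign (+1 for activation, -1 for inhibition, `None` when unknown) and the gene names are all hypothetical choices.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Edge:
    """Directed connection: `regulator` controls `target`."""

    regulator: str
    target: str
    sign: Optional[int] = None  # +1 activation, -1 inhibition, None if unknown


@dataclass
class GeneNetwork:
    """G = {V, E}: genes as vertices, regulatory connections as edges."""

    vertices: set = field(default_factory=set)
    edges: set = field(default_factory=set)

    def add_edge(self, regulator, target, sign=None):
        self.vertices.update((regulator, target))
        self.edges.add(Edge(regulator, target, sign))

    def regulators_of(self, gene):
        """All genes with a directed edge into `gene`."""
        return {e.regulator for e in self.edges if e.target == gene}

    def undirected_edges(self):
        """Drop direction and sign: each connection becomes an unordered pair."""
        return {frozenset((e.regulator, e.target)) for e in self.edges}


net = GeneNetwork()
net.add_edge("tf1", "geneX", sign=+1)  # activation
net.add_edge("tf2", "geneX", sign=-1)  # inhibition
net.add_edge("geneX", "tf1")           # feedback of unknown nature
```

In the undirected view the edges tf1 → geneX and geneX → tf1 collapse into one connection, which is the information lost when a method can only report undirected graphs.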


also the local joint probability distributions that convey these interactions (Albert, 2007). The general rule for Bayesian networks is that the probability of a target gene can be described as a conditional probability given the set of parents of that gene (Albert, 2007; Auliac et al., 2008). Markov chain Monte Carlo (MCMC) simulation has been one of the main heuristic search procedures in the context of Bayesian networks. BUGS (https://www.mrc-bsu.cam.ac.uk/software/bugs/), JAGS (http://mcmc-jags.sourceforge.net/), PyMC3 (https://pymc-devs.github.io/pymc3/index.html) and Stan (http://mc-stan.org/) are a few examples of software and packages that rely on MCMC sampling for Bayesian analysis.
Several other tools intended to deal with Bayesian networks have been proposed in the literature. For example, a C++ toolkit for Bayesian analysis, BCM (Thijssen et al., 2016), provides the implementation of eleven algorithms for sampling from posterior probability distributions and evaluation of marginal likelihoods. The R package bnlearn (Scutari, 2010) allows the inference of Bayesian networks with either discrete or continuous variables, implements constraint-/score-based algorithms and is parallelized. Another example is B-Course (Myllymäki et al., 2002), a web-based application that uses a combination of stochastic and greedy search heuristics, automatically discretizes variables and handles missing data.
In order to compare different methods, Werhli et al. (2006) evaluated three methods: relevance networks (RNs), graphical Gaussian models (GGMs) and Bayesian networks. For the Bayesian networks, MCMC methods were adopted. The authors showed that Bayesian networks outperformed GGMs and RNs when considering interventional data (perturbation on nodes). However, no significant difference was found with observational data (nodes unperturbed).
Standard Bayesian networks cannot describe closed loops and feedback mechanisms, since they rely on acyclic graphs (Hecker et al., 2009). Dynamic Bayesian networks (DBNs) overcome this limitation. When compared to static Bayesian networks, DBNs have the main limitation of increased computational complexity (Lee and Tzou, 2009). Several approaches based on DBNs have also been proposed for GRN inference. For example, Vinh et al. (2011) presented GlobalMIT, a Matlab toolbox for learning optimal DBNs. An information theoretic-based scoring metric, the mutual information test (MIT), is used for learning Bayesian networks. Another example is the ARTIVA (Auto Regressive Time Varying regulatory models) algorithm (Lèbre et al., 2010), which simultaneously infers the GRN topology and temporal evolution. The method uses a combination of DBNs and reversible jump MCMC. The proposed inference approach simultaneously considers all the possible combinations of change points and topologies within different phases. A novel influence score for DBNs was developed by Yu et al. (2004). The score tries to estimate the relative magnitude and the sign (activation or repression) of the interactions and can also prune false positive links.
Due to their high number of parameters, Bayesian networks are normally restricted to small-scale GRN inference. Liu et al. (2016) proposed one possible way to tackle this problem. Their approach uses Conditional Mutual Information (CMI) to construct an initial network, and then a series of local Bayesian networks (LBNs) is generated by selecting the k-nearest neighbors of each gene as the candidate regulators. This significantly reduces the search space from all possible GRNs. The use of CMI to construct the initial network and the creation of a series of LBNs allow dealing with large-scale networks.

3.2. Neural networks

Artificial neural networks (ANNs) are one of the most popular machine learning techniques applied to GRN inference. ANNs are interconnected groups of nodes, where each node represents an artificial neuron and the arrows represent the connection from the output of a neuron to the input of another. The activation function of a node defines its output given a set of inputs (Bishop, 1995).
An ANN where each neuron represents a particular gene and the connections between neurons represent their connectivity was proposed by Vohradský (2001). The author identified the following advantages of the proposed model: it is continuous in time and uses a transfer function that transforms the input into a shape similar to the one observed in natural processes. Nevertheless, several parameters had to be computed from experimental data, and interpolation schemes and minimization procedures had to be applied, which represents the main drawback of the model. Tong et al. (2014) also proposed a network framework to deal with GRN inference. A backpropagation 3-layered multilayer perceptron (MLP) with a sigmoid activation function to model potential gene interactions was presented. The predicted weights and signal directions are used as the strengths and directions of the interactions between genes, respectively. A priori knowledge was added in the framework of Manioudaki and Poirazi (2013). The authors confined an ANN to match the biological network, DNA-protein interaction data and a priori knowledge. The artificial neurons of the input and hidden layers corresponded to the transcription factors (TFs) and the outputs corresponded to the expression of the genes. An increase of the success rate was found when time delays between TFs and genes were incorporated.
Recurrent neural networks (RNNs) have also been considered in the literature for GRN inference. They are able to capture non-linear and dynamic interactions between genes, represent feedback structures and produce oscillatory and periodic activities (Raza and Alam, 2014). RNNs perform the same task for every single element, with the output dependent on previous states (Ghazikhani et al., 2011). In the context of RNNs, Ghazikhani et al. (2011) proposed a multi agent system (MAS) for RNN training, with the agents being separate swarms of particles building up a multi population particle swarm optimization (PSO) algorithm. PSO represents a search method that iteratively tries to improve a candidate solution. It solves the problem by having an initial population of candidate solutions (particles) and moving the particles around a search space. Multi-layer RNNs were combined with genetic algorithms (GAs) by Chiang and Chao (2007) in order to find the feed-forward regulated genes. Genetic algorithms provide the power of global searching for regulated target genes. The authors validated their approach by correctly identifying known oncogenes and their interaction genes.
Several neural network frameworks have been considered and showed great potential for GRN inference. Nevertheless, there is still room for improvement. For example, Tian and Burrage (2003) suggested that stochastic neural network models can lead to a better description of GRNs. A potentially interesting feature of the stochastic models is that they give probabilistic distributions of expression levels; this might be useful to describe the variation of expression products from cell to cell. Another example is proposed by Raza and Alam (2014). A generalized extended Kalman filter (GEKF), representing a nonlinear form of the Kalman filter, was used for weight update during the RNN training. The authors showed that the model was robust even when noise was added. According to the authors, by dividing the estimation of parameters for the complete network into sub-problems, the reconstruction of large-scale networks becomes more feasible.

3.3. Boolean networks

Boolean networks have been intensively investigated since they were proposed by Kauffman (1969), and due to their low computational complexity they are able to deal with large-scale networks. They are characterized by directed graphs with Boolean functions representing the edges, and binary variables are used to define two possible states of gene activity (Karlebach and Shamir, 2012). Although simple and easy to interpret, this approach can be limited, since some research questions may need a more satisfactory description than simple “on” and “off” states of gene expression (Hecker et al., 2009).
Several Boolean network frameworks have been proposed to deal with GRN inference. For example, Albert et al. (2008) presented the


BooleanNet, a software toolbox able to simulate a Boolean model based on simple inputs. Rather than focusing on large-scale network inference, analysis and modelling, it aims at supporting the modelling of dynamic behaviors of well-defined biological subsystems. Asynchronous updates are introduced and replicate simulations starting from the same initial condition were evaluated, counting the reachable attractors and the node states at different time steps. In another example, Karlebach and Shamir (2012) focused on the inconsistencies introduced into the data due to discretization. They phrased the problem of error reduction as a minimum entropy problem and developed a heuristic algorithm to solve it. A Boolean model was constructed from partial prior knowledge and real-valued expression data.
Several interfaces and tools to simulate and visualize Boolean networks are available. One example is ViSiBooL, developed by Schwab et al. (2016). It is a graphical user interface able to model, visualize and simulate Boolean networks. Its main purposes include reducing errors and speeding up and facilitating the network design. It also introduces optional time delays, addressing the temporal dependencies in Boolean networks. BooleSim (Bock et al., 2014) is another Boolean-based tool; it is an open-source in-browser tool that allows the simulation and manipulation of Boolean networks. The state of a node can be easily and intuitively changed, nodes can also be deleted or added, and the changes are translated to a new network layout and considered for the next simulation steps. A Cytoscape (www.cytoscape.com) plugin, GeStoDifferent, that adopts noisy random Boolean networks (NRBNs) with specific focus on their emergent dynamical behavior, was presented by Antoniotti et al. (2013). NRBNs are an extension of random Boolean networks (RBNs). RBNs randomly choose the input connections from the remaining nodes and the Boolean function associated with each node (Shmulevich and Dougherty, 2010; Villani et al., 2011). By adding noise to RBNs, it is possible to switch between attractors. Also, NRBNs are able to model noise-resistant GRNs (Antoniotti et al., 2013). A follow-up to GeStoDifferent, CABeRNET, was presented by Paroni et al. (2016). Some key functionalities were added, including a network augmentation module, simulation of specific regulatory structures and querying of available databases. It generates missing portions of partially characterized GRNs and performs several robustness analyses.
Randomness can be introduced to the models by considering stochastic generalizations of Boolean networks. Probabilistic Boolean networks (PBNs) achieve that with Markov chains, where transition probabilities specify the transitions from one state to another (Shmulevich and Aitchison, 2009; Shmulevich and Dougherty, 2010). Multiple regulatory functions are given to each network node, each with a pre-defined probability, and the function of the node is probabilistically determined (Shmulevich et al., 2003). Because of this stochasticity, any given set of node states can result in different network states. One of the main restrictions of PBNs is their high computational complexity, which makes them difficult to scale to large networks (Lee and Tzou, 2009).

3.4. Regression-based

Regression techniques aim at finding the statistical relationship between two or more variables, where a change in a dependent variable depends on the change of one or more independent variables. Several regression-based approaches have been proposed for GRN inference. One example is Least Angle Regression (LARS) (Haury et al., 2012; Singh and Vidyasagar, 2016), which evaluates the linear combination of a subset of potential variables that determines the target variable. It has a fairly low computational complexity, making it scalable to large networks. After M iterations of LARS, a list of M genes is selected for their ability to predict the target gene of interest. The set of predictors is then used to construct a multilinear model (Efron et al., 2004). More recently, bLARS (Singh and Vidyasagar, 2016), which tackles the susceptibility of LARS to noise, was proposed. It uses bootstrapping to extend LARS, allowing different regulator functions for different candidate regulators. TIGRESS (Trustful Inference of Gene Regulation with Stability Selection) (Haury et al., 2012), which was considered the best linear regression-based approach in the DREAM5 challenge (Marbach et al., 2012), combines LARS with stability selection. Stability selection overcomes the instability of LARS when high correlations between different explanatory features are present.
Several other regression-based frameworks for GRN inference have been considered. Covert et al. (2004) reconstructed the first integrated genome-scale E. coli model of a transcriptional and metabolic network using information derived from literature and databases. By using two-way ANOVA (analysis of variance) to identify which TFs influence differential expression of specific genes, the authors rewrote, relaxed and removed various regulatory rules in the model in order to deal with discrepancies between predictions and experimental differential gene expression. Based on their findings, they suggest that the determination of co-regulated gene sets alone is not enough to identify regulatory networks. Such an approach required several iterations of manual tuning and prior knowledge, which can limit it to small-scale networks and well-described organisms. A non-parametric non-linear correlation coefficient η² derived from ANOVA was proposed by Küffner et al. (2012) for the evaluation of TF-gene interactions. The η² coefficient measures the interaction as a fraction of the total variance explained by the differential expression across different conditions.
Tree-based regression models have been extensively considered for the inference of GRNs. They are known to perform well in the presence of a large number of features, are fast to compute and scalable. They aim at finding the subset of genes for which the function that minimizes the error between predictions and experimental values has an optimum value. The resulting subset of genes is inferred as the true regulators (Sławek and Arodź, 2013; Wu et al., 2016). Ensemble methods are used to average the predictions of several trees. In this context, Huynh-Thu et al. (2010) presented GENIE3 (GEne Network Inference with Ensemble of trees), the best performer in the DREAM4 and DREAM5 challenges, which considers random forests and extra trees as tree-based ensemble methods. Like GENIE3, GENIMS (Wu et al., 2016) decomposes the regression problem into sub-problems. It consists of three levels: (i) a feature selection step based on guided regularized random forests, (ii) normalization of individual feature selection and (iii) a final refinement step, where the results from the guided regularized random forests algorithm are normalized and refined according to the topology of the network. Normalization allows comparing the different sub-problems, a step that is not present in the GENIE3 approach. The final refinement step arises from the assumption that GRNs are sparse. A generalization of GENIE3, the NIMEFI (Network Inference using Multiple Ensemble Feature Importance) algorithm, was presented by Ruyssinck et al. (2014). The other feature importance methods considered included support vector regression, elastic net, symbolic regression and their ensemble variants. The authors showed that higher performance is achieved when considering the average predictions from the ensemble variants.
In the context of non-linear regression methods, Sławek and Arodź (2013) proposed ENNET, an algorithm that combines gradient boosting with regression stumps. The regression stump partitions the expression values of a candidate TF into two disjoint regions. It uses an iterative regression method, choosing one TF per iteration and adding it to a non-linear regression ensemble. Gradient boosting produces a predictive model as an ensemble of several weak prediction models. The inferred network is then refined by assuming that important components regulate several target transcripts and by checking the difference if a TF is knocked out (requiring knock-out data).
Graphical Gaussian models are another option among GRN inference methods. They assume that gene expression levels are jointly Gaussian distributed, being a combination of probability and graph theory (He et al., 2009). GGMs evaluate the network structure by estimating the inverse of the covariance matrix, also called the precision matrix (Yu et al., 2013). Several methods have been proposed to solve the


inverse of the covariance matrix and obtain the sparse solution (Yu et al., 2013). The GeneTS approach (Schäfer and Strimmer, 2005) is an empirical Bayes-based method that relies on three key components: a combination of singular value decomposition (SVD) and bagging to compute improved coefficients, an empirical Bayes approach to detect statistically significant edges, and a heuristic approach to perform approximate network selection. Another example is CLIME (Constrained l1-minimization for inverse matrix estimation) (Cai et al., 2011), a constrained l1 minimization method able to estimate high-dimensional precision matrices. A sparse partial correlation estimation method (space) was proposed by Peng et al. (2009). It selects nonzero partial correlations under the high-dimension, low-sample-size setting. A general framework for combining regularized regression methods with the estimation of graphical Gaussian models was investigated by Krämer et al. (2009). Different regression methods were used, including ridge regression, partial least squares (PLS), least absolute shrinkage and selection operator (LASSO) and two-stage LASSO. Graphical LASSO was proposed by Friedman et al. (2008); it directly maximizes an l1-penalized likelihood function under positive definite constraints to estimate the precision matrix. The authors claim that the algorithm is remarkably fast, solving a 1000 node problem (∼500000 parameters) in at most a minute.
Regression-based approaches can also be used for transcriptional regulatory networks. For example, network component analysis (NCA) is able to determine the TF activity and control strength of each regulatory pair of the network (Kao et al., 2004). Several linear decompositions beyond NCA have been studied in this context, including SVD and independent component analysis (ICA). Kao et al. (2004) point out that the molecular basis of both SVD and ICA is difficult to pinpoint. They used NCA to determine the activities of various TFs during a glucose to acetate transition in E. coli. Provided that the connectivity of TFs to the genes is known and satisfies the NCA criteria, it is possible to determine the contribution of the TFs to the genes. The authors note that the utility of NCA greatly depends on the availability of accurate connectivity information. For well-studied organisms, such as E. coli, there are significant amounts of available data, but this does not hold true for many other organisms including, for example, mammalian and yeast cells. Also, due to its fairly high computational complexity, NCA can be limited to small-scale networks.
Dimension reduction techniques, such as PLS and principal component analysis (PCA), can also be useful in the context of GRN inference. They are able to select a few principal components from multivariate data that capture the maximum covariation. This makes it possible to highlight global patterns in the expression of a large number of genes or other compounds (Albert, 2007). PLS is described by Boulesteix and Strimmer (2005) as an excellent framework for integrative analysis, since it combines dimension reduction with regression and variable selection. It is able to integrate and generalize previous NCA approaches and at the same time to overcome their limitations. The authors draw attention to the fact that PLS, like NCA, relies on simple linear models and that more elaborate regression approaches should be employed to improve the modeling of complex structures which are not simply described by linear relationships.
Based on the Support Vector Machine (SVM), Gillani et al. (2014) proposed CompareSVM for GRN inference. The training data comprised a list of known regulatory interactions between TFs and genes and gene expression data. After training the SVM, a list of new genes can be assigned to a TF if their score is higher than a predefined threshold. The major drawbacks of SVM are its inability to scale to large networks and its supervised nature; it is not able to predict targets of a TF if no prior targets are known.

3.5. Information theory

Information theory was originally proposed by Claude E. Shannon in 1948 (Shannon, 1948). It studies the behaviour of data as it is stored, retrieved and transmitted. Information theory approaches have also been considered for the GRN inference problem. The networks resulting from such approaches are also known as relevance networks. Information theory-based approaches are simple to use and, due to their low computational complexity, able to study the global properties of large-scale regulatory systems (Hecker et al., 2009; Wu et al., 2016). They look for similarities and dissimilarities between pairs of genes (Martins et al., 2016) and are able to deal with a high number of genes even in the presence of limited observations (Zhang et al., 2012).
In the context of GRN inference, if two genes have a correlation coefficient above a pre-defined threshold, these genes are said to interact (Hecker et al., 2009). Correlation can be employed to discover sets of genes with similar expression profiles, but it has been shown that low correlation values can be observed for related genes and that the correlation is generally non-zero for all edges, since GRNs are highly connected networks. This creates the necessity to define thresholds to remove meaningless interactions from the inferred networks (Yu et al., 2013; Margolin et al., 2006b). Mutual Information (MI) overcomes one of the main limitations of correlation by measuring non-linear dependencies, a feature highly common in biological systems (Zhang et al., 2012). It measures how much information is present in one random variable about another.
Several approaches based on MI have been proposed for the inference of GRNs. A novel extension of the relevance networks, the context likelihood of relatedness (CLR), developed by Faith et al. (2007), adjusts random noise in the MI values using a likelihood based on z-scores. It uses transcriptional profiles across different conditions to evaluate the transcriptional regulatory interactions, being able to eliminate correlations and indirect influences. The GTRNetwork (Gene expression and Transcription factor activity based Relevance Network) algorithm (Fu et al., 2011) additionally considers transcription factor activities (TFAs) as a hidden layer. It is possible to predict the activities of TFs using transcriptomic data and a TF-gene network structure. GTRNetwork first estimates the changes of TFAs with known TF-gene interactions using NCA and then uses the CLR algorithm to identify transcriptional regulatory interactions between the estimated TFAs and genes.
Butte and Kohane (2000) evaluated the entropy of gene expression patterns and the MI between RNA expression patterns for each pair of genes. A threshold rule based on the distribution of the permuted MI values was proposed. The authors were able to find biologically relevant clusters by linking all genes through comprehensive pair-wise mutual information and then isolating gene clusters. Clustering was also considered together with an information theory method by Dimitrakopoulos et al. (2014) to increase accuracy and decrease time complexity. The authors first clustered the data into different groups and then the maximal information coefficient (MIC), which relies on mutual information, was used to estimate the similarity of gene expression profiles. The MIC was first proposed by Reshef et al. (2011) and it is based on the idea that if a relationship between two variables exists, then a grid can be drawn on the scatterplot of the two variables that partitions the data to encapsulate that relationship.
An entropy reduction step based on conditional entropies is considered by the MIDER (Mutual Information Distance and Entropy Reduction) algorithm (Villaverde et al., 2014). The entropy reduction step is intended to discriminate between direct and indirect interactions, and transfer entropies are used to assign the direction of the inferred interactions. It aims at minimizing false positives, accurately distinguishing direct and indirect interactions. More recently, a parallel version of MIDER, PREMER (Parallel Reverse Engineering with Mutual information and Entropy Reduction), was implemented (Villaverde et al., 2018). It further allows the incorporation of prior knowledge.
An algorithm specifically designed to deal with the complexity of regulatory networks in mammalian cells, ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks), was proposed by Margolin et al. (2006a,b). ARACNE identifies interactions by estimating
pairwise mutual information and uses the data processing inequality (DPI) to remove indirect interactions. Based on ARACNE, Zoppoli et al. (2010) proposed TimeDelay ARACNE, an approach able to infer GRNs from time-series measurements, relying also on MI and DPI. It tries to extract dependencies between two genes at different time delays. The less informative dependencies are filtered out whereas the most reliable connections are retained. The authors were able to achieve a higher F-score when compared with the standard ARACNE. Conditional mutual information (CMI) is also employed for removing indirect interactions and is sometimes preferable over DPI (Zhang et al., 2012). Zhang et al. (2012) presented a path consistency algorithm based on CMI (PCA-CMI). A complete initial graph is generated, where the theoretical interactions between all the pairs of genes are represented. Some edges are then deleted according to their mutual information value, first order CMI is evaluated and a few more edges are deleted. High order CMI is evaluated until no more adjacent edges are observed. More recently, Xiao et al. (2016) presented RPNI (Regulation Pattern based Network Inference), which is also a CMI-based algorithm. Three patterns were defined (co-regulation, indirect-regulation and mixture-regulation) to guide the selection of candidate genes.
More recently, Chan et al. (2017) proposed PIDC, a fast, efficient algorithm that uses partial information decomposition (PID). It uses multivariate information theory to explore the statistical dependencies between triplets of genes in single-cell gene expression datasets. PIDC identifies putative functional relationships between genes based on the unique contribution to pairwise MI combined with local network context information of each gene. Feature selection approaches are also of great potential in the context of GRN inference. For example, MRNET (minimum redundancy network), presented by Meyer et al. (2007), is based on maximum relevance/minimum redundancy (MRMR), a feature selection approach in supervised learning. MRMR aims at selecting the set of variables with the highest MI with the target (maximum relevance) that are at the same time mutually maximally independent (minimum redundancy).
More recently, new methods focused on the ensemble of different approaches have been proposed. For example, Chan et al. (2018) proposed a method that brings together empirical Bayes and work on theoretical null distributions for information measures. It performs formal hypothesis tests on putative network edges derived from information theory. Another ensemble approach is proposed by Henriques et al. (2017). SELDOM (ensemble of dynamic logic-based models) (Henriques et al., 2017) combines information theory, ensemble modeling, parametric dynamic model identification, logic-based modeling and model reduction. The proposed method aims not only at inferring network topologies but also at generating high quality predictions about the system behaviour under untested experimental perturbations. The method has been reported to outperform other state of the art methods and to give better predictions than the ones from individual models. Most of the information theory based approaches reviewed here have freely available R, Julia or Matlab packages ready to be used and tested. One example is minet (Meyer et al., 2008), an R/Bioconductor package that implements some of the aforementioned methods: ARACNE, CLR, MRNET and relevance networks, further including several accuracy assessment tools.
Table 1 summarizes the approaches discussed above. In order to compare the different approaches in a more intuitive way, a total of eight features including two quantitative (network size and performance) and six qualitative (linearity assumption, directionality inference, parameter tuning, threshold setting, loop structure handling and knowledge requirement) features were considered.
The network size feature corresponds to the average network size evaluated with the specific method scaled to the average network size of the different methods. The network size and the performance of the different approaches were classified according to Table 2. Eight classification steps with a lower and upper bound were defined. As an example, if an inference method had a precision of 0.7 it would receive a performance classification of ++. The same logic was applied to the network size feature. For example, if the average network size evaluated with a specific method scaled to the average network size of all the considered methods was 0.2, a classification of −−− would be assigned to the network size feature. Approaches where quantitative performance was not provided were classified with the lowest negative score (−−−−). Table 3 describes the qualitative features and their classification. Table 4 presents the classification of the different approaches according to the eight aforementioned features. Some of the approaches present in Table 1 are not present in Table 4 due to their nature (for example R packages, Matlab toolboxes or interfaces), which makes them impossible to classify under the classification scheme proposed here.

4. Discussion

In this paper, we focus on the GRN inference problem, reviewing several approaches within five major GRN inference methods: Bayesian networks, neural networks, Boolean networks, regression and information theory. Generally speaking, Bayesian networks are able to deal with uncertainties and with incomplete and noisy data, are easy to interpret and represent flexible frameworks to combine different data types and prior knowledge. However, they are limited to small-scale networks since they rely on the computation and analysis of different possible structures. Neural networks are able to capture nonlinear and dynamic interactions but, due to their high computational complexity, they have shown some limitations with large-scale network inference. The advance in computational power can be the answer to such limitations. Boolean networks are able to deal with large-scale networks, are parameter-free, easy to interpret, suitable to capture the global behavior of the networks, require little data and have straightforward implementation, simulation and analysis. However, due to their simple “on” or “off” gene expression states, they can be limited to some research questions. Linear regression-based models are simple, practical and do not impose high computational complexity. However, the biological assumption that the regulatory effects of various genes on a given gene can be described by linear functions is mostly not observed in biological systems. Non-linear approaches are sometimes preferred over linear ones, although they require more parameters and therefore are more prone to overfitting. Information theory methods are suitable for large-scale networks, since they do not present high computational complexity. The main obstacle in this type of method arises from the need to select thresholds.
Choosing between different GRN inference approaches will greatly depend on the posed research question and the biological constraints of the data itself. For example, if we focus on the processes from drug discovery to process development, the research question is quite different for the different steps involved in the overall process. For drug discovery, predictive solutions are interesting to identify and select target molecules, whereas for process development the predictive solutions would be needed for the optimization of the drug production process. Different steps in the process may be better tackled with different approaches, and knowing the specific assumptions, the challenges of the system under study and the available data is necessary for a more informed decision on the approach to follow.

4.1. The role of the biological system under study

The biological system under study can be a limiting factor when adopting one of the different available GRN inference approaches. Different organisms may have significant differences in several biological processes, and therefore different constraints. One should become familiar with the biological mechanisms and be aware of the particularities of the studied organism.
When dealing with human models, we are dealing with large-scale networks, whereas for organisms like E. coli and CHO, we are dealing with small-scale networks. Some methods may be better at handling

Table 1
Gene Regulatory Network Inference approaches
Author/Approach name | Used methods | Test data | Availability
BUGS | Bayesian networks | − | R
JAGS | Bayesian networks | − | R
PyMC3 | Bayesian networks | − | Python
Stan | Bayesian networks | − | R, Python, Matlab
BCM (Thijssen et al., 2016) | Bayesian networks | − | C++
bnlearn (Scutari, 2010) | Bayesian networks | − | R
B-course (Myllymäki et al., 2002) | Bayesian networks | − | Web-based
Werhli et al. (2006) | Bayesian networks | Raf signaling | −
GlobalMIT (Vinh et al., 2011) | Bayesian networks and MI | − | Matlab
ARTIVA (Lèbre et al., 2010) | DBNs | D. melanogaster and S. cerevisiae | R
Yu et al. (2004) | DBNs | Simulated | −
Liu et al. (2016) | LBNs and CMI | DREAM and E. coli | −
Vohradský (2001) | ANN | Simulated | −
Tong et al. (2014) | 3-layered MLP | Simulated | −
Manioudaki and Poirazi (2013) | Stochastic ANN | Simulated | −
Ghazikhani et al. (2011) | RNN with MAS training | E. coli | −
Chiang and Chao (2007) | RNN (GA) | Human, yeast cell cycle | −
Tian and Burrage (2003) | ANN | S. cerevisiae | −
Raza and Alam (2014) | RNN with GEKF | DREAM, E. coli, IRMA | −
BooleanNet (Albert et al., 2008) | Boolean networks | − | Python
Karlebach and Shamir (2012) | Boolean networks | Mouse embryonic stem cells | Executable file
ViSiBooL (Schwab et al., 2016) | Boolean networks | − | User interface
BooleSim (Bock et al., 2014) | Boolean networks | − | In-browser tool
GeStoDifferent (Antoniotti et al., 2013) | NRBN | − | Cytoscape plugin
CABeRNET (Paroni et al., 2016) | RBN and NRBN | − | Cytoscape app
bLARS (Singh and Vidyasagar, 2016) | LARS + bootstrapping | DREAM | Matlab
TIGRESS (Haury et al., 2012) | LARS + stability selection | DREAM, E. coli | Matlab
Covert et al. (2004) | Literature based and ANOVA | E. coli | −
Küffner et al. (2012) | Score η2 derived from ANOVA | DREAM, E. coli, S. cerevisiae | −
GENIE3 (Huynh-Thu et al., 2010) | Random forests and extra-trees | DREAM, E. coli | R, Python, Matlab
GENIMS (Wu et al., 2016) | Random forests | DREAM | R
NIMEFI (Ruyssinck et al., 2014) | Ensemble support vector regression, elastic net, random forests, symbolic regression | DREAM | R
ENNET (Sławek and Arodź, 2013) | Gradient boosting with regression stumps | DREAM | R
GeneTS (Schäfer and Strimmer, 2005) | Empirical Bayes-based GGM | Simulated, breast cancer | R
CLIME (Cai et al., 2011) | Constrained l1 minimization-based GGM | Simulated, breast cancer | R
Space (Peng et al., 2009) | Regression-based GGM | Simulated, breast cancer | R
Krämer et al. (2009) | Regression-based GGM | T-Cell activation, simulated breast cancer, E. coli, A. thaliana | R
Friedman et al. (2008) | Penalized likelihood-based GGM | Cell-signaling from proteomics | R
Kao et al. (2004) | NCA | E. coli | Matlab
Boulesteix and Strimmer (2005) | PLS | Yeast and E. coli | R
CompareSVM (Gillani et al., 2014) | SVM | Simulated | Matlab
Faith et al. (2007) | CLR | E. coli | R
GTRNetwork (Fu et al., 2011) | CLR and NCA | E. coli | Matlab
Butte and Kohane (2000) | Entropy and MI | S. cerevisiae | −
Dimitrakopoulos et al. (2014) | MIC and clustering | DREAM, human aging and ovarian cancer | −
MIDER, PREMER (Villaverde et al., 2014, 2018) | Entropy and MI | DREAM, synthetic, glycolytic pathway, IRMA, MAPK cascade | Matlab, Octave and Fortran 90
ARACNE (Margolin et al., 2006a) | MI and DPI | Synthetic, human B cells | R
TimeDelay ARACNE (Zoppoli et al., 2010) | MI and DPI | Synthetic, S. cerevisiae, E. coli, IRMA | R
PCA-CMI (Zhang et al., 2012) | CMI and path consistency | DREAM, E. coli | Matlab
RPNI (Xiao et al., 2016) | CMI and pattern analysis | DREAM, acute myeloid leukemia | −
PIDC (Chan et al., 2017) | PID, Multivariate information theory | Synthetic, Experimental single-cell | Julia package
MRNET (Meyer et al., 2007) | MRMR | Synthetic | R
Chan et al. (2018) | Empirical Bayes and Information Theory | In silico S. cerevisiae, mouse PSCs | Julia package
SELDOM (Henriques et al., 2017) | Ensemble information theory, dynamic model identification, logic-based modeling | MAPK cascade, synthetic, HPN DREAM | R

IRMA, In vivo Reverse-engineering and Modeling Assessment; D. melanogaster, Drosophila melanogaster; S. cerevisiae, Saccharomyces cerevisiae; E. coli, Escherichia coli; A. thaliana, Arabidopsis thaliana; PSCs, Pluripotent stem cells; MAPK, Mitogen-activated Protein Kinase
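
Several of the information theory entries in Table 1 (relevance networks, ARACNE) share a common core: estimate pairwise MI, threshold it, and optionally prune indirect edges with the DPI. The following Python sketch illustrates that core under stated assumptions and is not the implementation of any cited tool: the histogram-based MI estimator, the bin count, the threshold value and all function names are illustrative choices, and the DPI step omits the tolerance parameter that practical implementations typically expose.

```python
import numpy as np


def mutual_information(x, y, bins=8):
    """Histogram-based MI estimate (in nats) between two expression profiles.

    The estimator and the number of bins are assumptions of this sketch.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))


def mi_matrix(expr, bins=8):
    """Pairwise MI for an expression matrix of shape (genes, samples)."""
    n = expr.shape[0]
    mi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mi[i, j] = mi[j, i] = mutual_information(expr[i], expr[j], bins)
    return mi


def relevance_network(mi, threshold):
    """Keep an (undirected) edge wherever MI exceeds the chosen threshold."""
    adj = mi > threshold
    np.fill_diagonal(adj, False)
    return adj


def dpi_prune(mi, adj):
    """In every fully connected gene triplet, drop the edge with the lowest MI.

    Triplets are checked against the unpruned network, so the result does not
    depend on the order in which triplets are visited.
    """
    pruned = adj.copy()
    n = mi.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                if adj[i, j] and adj[j, k] and adj[i, k]:
                    edges = [(mi[i, j], i, j), (mi[j, k], j, k), (mi[i, k], i, k)]
                    _, a, b = min(edges)
                    pruned[a, b] = pruned[b, a] = False
    return pruned
```

With a hypothetical three-gene MI matrix in which the 0-1 and 1-2 values are high and the 0-2 value is low, thresholding keeps all three edges, while the DPI step removes the 0-2 edge as a likely indirect interaction. The threshold remains a free choice, which is the main obstacle the text attributes to this family of methods.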

large or small-scale networks. Methods that scale well computationally are in principle more appropriate for large-scale GRN inference. On the other hand, if we are dealing with small-scale networks, this is not a concern. Bayesian networks and neural networks are examples of methods that have some limitations when it comes to scalability, mainly due to the high number of parameters. Nevertheless, they can be of great interest for small-scale and well-studied organisms, as they allow prior knowledge and different types of data to be incorporated easily. In the end, the most appropriate method for medium to large-scale networks will depend on its ability to scale. It is true that some methods scale better than others, but it is also true that the increasing computational power plays an important role in overcoming the challenges behind the scalability problem. Generally speaking, Bayesian networks and neural networks have been shown to be limited to small-scale networks. However, one must be aware that it is possible to make improvements to these methods in order to deal with large-scale networks. One way is by dividing the problem into sub-problems and then combining the information obtained from the different sub-problems. One crucial aspect of such an approach is how the division is made, since it can influence the inferred information.
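
The effect of the division on the inferred information can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the per-block inference is a plain absolute-correlation threshold rather than a Bayesian or neural network, and the function names, the threshold and the data are hypothetical.

```python
import numpy as np


def infer_block(expr, genes, threshold=0.8):
    """Toy per-block inference: connect genes whose |Pearson r| exceeds threshold."""
    corr = np.corrcoef(expr[genes])
    edges = set()
    for a in range(len(genes)):
        for b in range(a + 1, len(genes)):
            if abs(corr[a, b]) > threshold:
                edges.add((genes[a], genes[b]))
    return edges


def infer_by_blocks(expr, blocks, threshold=0.8):
    """Solve each sub-problem separately and combine the results by set union."""
    edges = set()
    for genes in blocks:
        edges |= infer_block(expr, list(genes), threshold)
    return edges


# Hypothetical data: genes 0, 1 and 2 are linearly related, gene 3 is not.
t = np.arange(6.0)
expr = np.vstack([t, 2 * t, -t, [1, -1, 1, -1, 1, -1]])

split_a = infer_by_blocks(expr, [[0, 1], [2, 3]])
split_b = infer_by_blocks(expr, [[0, 1, 2], [3]])
```

Under the first split, gene 2 is never compared with genes 0 and 1, so any edge between them cannot be found; under the second split all three are evaluated together. Edges that cross block boundaries are invisible to this naive scheme, which is why the way the division is made matters.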


Table 2
Classification scheme for the quantitative features (network size and performance).
Lower bound | Upper bound | Classification
0 | 0.125 | −−−−
0.126 | 0.250 | −−−
0.251 | 0.375 | −−
0.376 | 0.500 | −
0.501 | 0.625 | +
0.626 | 0.750 | ++
0.751 | 0.875 | +++
0.875 | > 1 | ++++

Table 3
Description of the qualitative features and their classification.
Feature | Question | Classification (−) | Classification (+)
Linearity assumption | Assumes linear relationships between genes? | Yes | No
Directionality inference | Able to infer direction of connections? | No | Yes
Parameter tuning | Requires manual tuning of parameters? | Yes | No
Threshold setting | Requires setting up thresholds? | Yes | No
Handles loop structures | Able to handle loops or feedback structures? | No | Yes
Requires knowledge | Requires prior knowledge? | Yes | No

Taking a look at the Bayesian network methods presented in Table 4, we can see that positive performances were achieved by three out of four methods. We can also see that these were only tested with small networks. The remaining one (ARTIVA (Lèbre et al., 2010)) was tested with a large network, but quantitative measures of performance were not reported. Looking at the neural network methods that present quantitative performances, we see that they have a positive performance; however, once again they were tested in small networks. It is clear that both Bayesian and neural networks perform quantitatively well with small-scale networks. However, we cannot conclude that they do not perform well with larger networks; evaluation of the methods with large-scale networks would be required to make such a statement.
We now take a look at the Boolean network method presented in Table 4. Although Boolean networks are in theory able to deal with large-scale networks, the method implemented by Karlebach and Shamir (2012) showed a poor performance even in the presence of a small-scale network. We believe that this is due to their binary gene expression states, limiting such a method to some research questions. From the evaluated regression-based methods, seven of them presented a positive quantitative performance with large networks. These methods consist of LARS and stability selection, a score derived from ANOVA, random forests, ensemble of different regression methods and regression-based GGM. Several regression-based methods were able to perform well with large-scale networks and we see that it is possible to make adjustments to simple methods to deal with large networks. From the information theory-based methods, all the ones dealing with large-scale networks and with quantitative performance information have a positive performance. From the several reviewed approaches, we can therefore assume that both regression-based and information theory methods are able to deal with large-scale networks.

4.2. The role of the data and level of knowledge of the system under study

The data itself represents one of the most crucial points in the overall GRN inference problem. The ones collecting the data are not, normally, the ones applying the different methods and algorithms in order to extract some new knowledge and breakthrough information. Researchers in this field should be aware that it is necessary to understand, even before starting to think of the best inference methods, what exactly their data are, how they were measured, and how many experiments and replicates they are dealing with. It is quite straightforward to realize that better inferred networks will be attained when more informative data is available. However, more data does not necessarily mean more information. One problem that arises from the high data availability is that some of these data can be noisy and redundant. Bayesian networks may be the most appropriate method for this case since they are able to deal with noisy data. They are also interesting when only incomplete datasets are available.
Depending on the studied organism, it is possible that more information, besides transcriptomic data, is necessary to understand GRNs. In this case, methods able to easily incorporate different types of data are preferred. Also, the level of knowledge and understanding of different organisms is quite different. If prior knowledge is available, it should be included in order to improve the method performance. It can be used, for example, to choose the most appropriate threshold for information theory methods or the set of parameters for neural networks. Different types of data and also greater amounts of a priori information are in place for organisms like E. coli. These data should definitely be incorporated into the GRN inference method. Some of the reviewed methods rely on a priori knowledge and can be quickly discarded by some researchers due to their dependency on the prior knowledge. However, when dealing with organisms where great knowledge is available, we strongly believe that this information should be added to the models. It will certainly help in setting thresholds and/or tuning parameters, and more accurate information can be extracted from the inferred networks.

4.3. Future directions in GRN inference

It is highly unlikely that there is one best single method able to deal with different research questions and perform equally well under different assumptions. And although the inferred networks will vary between methods, they can be highly complementary. The ensemble of predictions made by different methods has been shown to approximate the true interaction network more closely than the individual methods (De Smet and Marchal, 2010; Emmert-Streib et al., 2014; Marbach et al., 2012; Henriques et al., 2017). Another possible direction to further improve the accuracy of the inference methods is by integrating existing a priori knowledge (Liu, 2015), since expression data is usually limited in terms of number of samples and noise (Martins et al., 2016).
Although state-of-the-art GRN inference methods have become increasingly more accurate over the years, they still present many challenges and limitations. Several methods are based on weak assumptions, require a gold standard for validation, have to deal with overfitting issues and their performance can vary significantly with different input data. From our review, we see that it is not possible to choose a consensus approach for dealing with GRN inference, since different methods are more appropriate for different research questions and for the available data and knowledge of the system under study. Further improvements in the inference of gene regulatory networks are still an important open research topic in systems biology, since the knowledge gained through such insight is extremely useful to increase the predictive power of biological models. The increasing computational power plays an important role in the future trends in this field. Graphics Processing Units (GPUs) are becoming real game changers in the artificial intelligence field. By having thousands of cores, they are capable of performing millions of mathematical operations in parallel. As the computational demand becomes less challenging, some methods will become more and more promising. We strongly believe that neural networks are one of these cases. We observed a great effort from interdisciplinary fields to tackle this question and we expect that over the next years this will continue to be an important topic in this field.
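
One simple way to combine the predictions of several inference methods is to average the rank each method gives to every candidate edge, which avoids comparing scores that live on different scales. The Python sketch below shows this idea only; rank averaging is an assumption chosen for illustration and is not claimed to be the procedure of any of the works cited above, and the function names and scores are hypothetical.

```python
import numpy as np


def rank_edges(scores):
    """Return ranks for edge scores, rank 1 being the highest score.

    Ties are broken by position rather than averaged (a simplification).
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")
    ranks = np.empty(scores.size, dtype=int)
    ranks[order] = np.arange(1, scores.size + 1)
    return ranks


def ensemble_by_average_rank(score_lists):
    """Average the per-method ranks; a lower value means stronger consensus."""
    ranks = np.array([rank_edges(s) for s in score_lists])
    return ranks.mean(axis=0)


# Hypothetical confidence scores of two methods for the same three candidate edges.
method_1 = [0.9, 0.2, 0.5]
method_2 = [0.7, 0.6, 0.1]
consensus = ensemble_by_average_rank([method_1, method_2])
```

Here the first edge is ranked first by both methods and therefore gets the best consensus rank, while the two methods disagree on the remaining edges, which end up tied. Working on ranks means each method contributes equally regardless of how its raw scores are scaled.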


Table 4
Classification of the different approaches according to the proposed classification scheme. Two quantitative (network size and performance) and six qualitative features were considered. The performance measure
considered for the Performance feature is indicated between parentheses.
Author/ Approach name Network size Linearity assumption Directionality inference Parameter tuning Threshold setting Handles loop structures Requires knowledge Performance

Werhli et al. (2006) −− −− + + − − − + ++ ++ (True int.)


ARTIVA (Lèbre et al., 2010)) ++ ++ + + − + + + −− −−
Yu et al. (2004) −− −− + + − + + + ++ ++ (Recall)
Liu et al. (2016) −− −− + + − − − + + (Precision)
Vohradský (2001) −− −− + + − + + + −− −−
Tong et al. (2014) −− −− + + − − + + +++ (Recall)
Manioudaki and Poirazi (2013) ++ ++ + + − + + + −− −−
Ghazikhani et al. (2011) −− −− + + − + + + −− −−
Chiang and Chao (2007) ++ ++ + + − + + + −− −−
Tian and Burrage (2003) −− −− + + − + + + −− −−
Raza and Alam (2014) −− −− + + − − + + + (Precision)
Karlebach and Shamir (2012) −−− − + − − + + − (True int.)
bLARS (Singh and Vidyasagar, 2016) ++ ++ + + − + + + − (Overall Score)
TIGRESS (Haury et al., 2012) ++ ++ − + − − + + ++ (AUROC)
Covert et al. (2004) +++ − − + + − − −− −−
Küffner et al. (2012) ++ ++ + − + + − + ++ (AUROC)
GENIE3 (Huynh-Thu et al., 2010) ++ + + − − + + ++ (AUROC)
GENIMS (Wu et al., 2016) ++ ++ + + − + + + +++ (AUROC)
NIMEFI (Ruyssinck et al., 2014) +++ + + − + − + ++ (AUROC)

46
ENNET (Sławek and Arodź, 2013) ++ + + − + − + +++ (AUROC)
GeneTS (Schäfer and Strimmer, 2005) ++ ++ − − − − + + −− −−
CLIME (Cai et al., 2011) −− −− − − − − + + ++ (Recall)
space (Peng et al., 2009) ++ − − − + + + +++ (Recall)
Krämer et al. (2009) ++ − − − − + + −− −−
Friedman et al. (2008) −− −− − − − − + + −− −−
Kao et al. (2004) ++ − − + + − − −− −−
Boulesteix and Strimmer (2005) ++ ++ − − − + + + −− −−
CompareSVM (Gillani et al., 2014) − + − − − − − −− −−
Faith et al. (2007) ++ ++ + + + − − + + (Precision)
GTRNetwork (Fu et al., 2011) ++ ++ − − + − − − −− −−
Butte and Kohane (2000) ++ ++ + − + − − + −− −−
Dimitrakopoulos et al. (2014) ++ ++ + − − − − + + (AUROC)
MIDER, PREMER (Villaverde et al., 2014, 2018) −− −− + + + + + + ++ (Precision)
ARACNE (Margolin et al., 2006a) ++ ++ + − − − + + ++ (Precision)
TimeDelay ARACNE (Zoppoli et al., 2010) −− −− + + − − + + − (Recall)
PCA-CMI (Zhang et al., 2012) −− −− + − + − − + ++ (Precision)
RPNI (Xiao et al., 2016) −− −− + − + − + + ++ (Precision)
PIDC (Chan et al., 2017) −− −− + − − − − + −− (AUPR)
MRNET (Meyer et al., 2007) − + − + − − + + (F-score)
Chan et al. (2018) −− −− + + − − − − −− (Precision)
SELDOM (Henriques et al., 2017) −− −− + + − + − + + (AUPR)

AUPR: Area Under the Precision-Recall curve; AUROC: Area Under the Receiver Operating Characteristic curve; True int.: True interactions
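
The table above reports performance with metrics such as AUROC and AUPR. As a minimal sketch (not taken from any of the reviewed papers), the two scores can be computed for a list of scored candidate edges against a gold-standard network as shown here. The function names, the toy scores and labels, and the use of average precision as a step-wise estimate of AUPR are assumptions made for illustration. `overall_score` follows the simple description given in this review (the mean of the AUROC and AUPR scores); the scoring procedure actually used in the DREAM challenge may differ in detail.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a true edge is scored above a false edge (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one true and one false edge")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Step-wise AUPR estimate: mean of the precision at the rank of each true edge.

    Assumption: tied scores are ranked in input order (Python's sort is stable).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("need at least one true edge")
    return precision_sum / hits


def overall_score(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean of AUROC and AUPR, as described in this review (illustrative only)."""
    return 0.5 * (auroc(scores, labels) + average_precision(scores, labels))


if __name__ == "__main__":
    # Hypothetical confidence scores for six candidate edges; 1 marks a true edge.
    edge_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
    edge_labels = [1, 0, 1, 1, 0, 0]
    print("AUROC:", auroc(edge_scores, edge_labels))
    print("AUPR (average precision):", average_precision(edge_scores, edge_labels))
    print("Overall:", overall_score(edge_scores, edge_labels))
```

The pairwise AUROC loop is quadratic in the number of edges, which is fine for a toy example but would need a rank-based implementation for genome-scale networks.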

5. Conclusion

Several approaches for the inference of gene regulatory networks have been proposed in the literature. Choosing one or a few of these approaches is a challenge, since they rely on different assumptions and simplifications and no standard validation process is available. Moreover, the choice of the most appropriate method depends strongly on the research question, on the specific assumptions and challenges of the selected biological system, and on data constraints. In this regard, we propose that future methods explicitly state the research question they address and the constraints on the data.

Ensemble inference methods should be investigated further. As shown by the several ensemble approaches reviewed here, the limitations of one approach can be corrected by other approaches. Moreover, we strongly believe that neural networks are among the most promising methods. Neural networks have shown great potential in several fields, including e-commerce, social media marketing and GPS navigation. We expect that, as the computational demand becomes less challenging, neural networks will become increasingly promising in the field of GRN inference.

It is our opinion that the community should come together and define a standard performance metric so that different approaches can be compared more intuitively. We suggest the overall score (obtained as the mean of the AUROC and AUPR scores), used in the DREAM challenge, as the standard performance metric for GRN inference.

Acknowledgements

Funding: This work was supported by the European Union's Horizon 2020 research and innovation program under the grant agreement No 675585 (Marie-Curie ITN SyMBioSys).

References

Albert, I., Thakar, J., Li, S., Zhang, R., Albert, R., 2008. Boolean network simulations for life scientists. Source Code Biol. Med. 3 (16).
Albert, R., 2007. Network inference, analysis, and modeling in systems biology. Plant Cell 19 (11), 3327–3338.
Antoniotti, M., Bader, G.D., Caravagna, G., Crippa, S., Graudenzi, A., Mauri, G., 2013. GeStoDifferent: A Cytoscape plugin for the generation and the identification of gene regulatory networks describing a stochastic cell differentiation process. Bioinformatics 29 (4), 513–514.
Auliac, C., Frouin, V., Gidrol, X., D’Alché-Buc, F., 2008. Evolutionary approaches for the reverse-engineering of gene regulatory networks: a study on a biologically realistic dataset. BMC Bioinformatics 9 (91).
Bishop, C., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Inc, New York, NY, USA.
Bock, M., Scharp, T., Talnikar, C., Klipp, E., 2014. BooleSim: An interactive Boolean network simulator. Bioinformatics 30 (1), 131–132.
Boulesteix, A.L., Strimmer, K., 2005. Predicting transcription factor activities from combined analysis of microarray and ChIP data: a partial least squares approach. Theor. Biol. Med. Model 2 (23).
Butte, A.J., Kohane, I.S., 2000. Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Biocomputing 418–429.
Cai, T., Liu, W., Luo, X., 2011. A Constrained ℓ1 Minimization approach to sparse precision matrix estimation. J. Am. Stat. Assoc. 106 (494), 594–607.
Chan, T.E., Pallaseni, A., Babtie, A.C., McEwen, K., Stumpf, M.P., 2018. Empirical Bayes Meets Information Theoretical Network Reconstruction from Single Cell Data. bioRxiv.
Chan, T.E., Stumpf, M.P., Babtie, A.C., 2017. Gene regulatory network inference from single-cell data using multivariate information measures. Cell Syst. 5 (3), 251–267 e3.
Chiang, J.-H., Chao, S.-Y., 2007. Modeling human cancer-related regulatory modules by GA-RNN hybrid algorithms. BMC Bioinform. 8 (91).
Covert, M.W., Knight, E.M., Reed, J.L., Herrgard, M.J., Palsson, B.O., 2004. Integrating high-throughput and computational data elucidates bacterial networks. Nature 429 (6987), 92–96.
De Smet, R., Marchal, K., 2010. Advantages and limitations of current network inference methods. Nat. Rev. Microbiol. 8, 717–729.
Dimitrakopoulos, G.N., Maraziotis, I.A., Sgarbas, K., Bezerianos, A., 2014. A clustering based method accelerating gene regulatory network reconstruction. Procedia Comput. Sci. 29, 1993–2002.
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., Ishwaran, H., Knight, K., Loubes, J.M., Massart, P., Madigan, D., Ridgeway, G., Rosset, S., Zhu, J.I., Stine, R.A., Turlach, B.A., Weisberg, S., Hastie, T., Johnstone, I., Tibshirani, R., 2004. Least angle regression. Ann. Stat. 32 (2), 407–499.
Emmert-Streib, F., Dehmer, M., Haibe-Kains, B., 2014. Gene regulatory networks and their applications: understanding biological and medical problems in terms of networks. Front. Cell Develop. Biol. 2 (38).
Faith, J.J., Hayete, B., Thaden, J.T., Mogno, I., Wierzbowski, J., Cottarel, G., Kasif, S., Collins, J.J., Gardner, T.S., 2007. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 5 (1), e8.
Friedman, J., Hastie, T., Tibshirani, R., 2008. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 (3), 432–441.
Fu, Y., Jarboe, L.R., Dickerson, J.A., 2011. Reconstructing genome-wide regulatory network of E. coli using transcriptome data and predicted transcription factor activities. BMC Bioinform. 12 (233).
Ghazikhani, A., Akbarzadeh, T.M.R., Monsefi, R., 2011. Genetic regulatory network inference using Recurrent Neural Networks trained by a multi agent system. In: Computer and Knowledge Engineering (ICCKE). 2011 1st International eConference on. pp. 95–99.
Gillani, Z., Akash, M., Rahaman, M., Chen, M., 2014. CompareSVM: supervised, Support Vector Machine (SVM) inference of gene regularity networks. BMC Bioinform. 15 (395).
Haury, A.-C., Mordelet, F., Vera-Licona, P., Vert, J.-P., 2012. TIGRESS: Trustful Inference of Gene REgulation using Stability Selection. BMC Syst. Biol. 6 (145).
He, F., Balling, R., Zeng, A.P., 2009. Reverse engineering and verification of gene networks: Principles, assumptions, and limitations of present methods and future perspectives. J. Biotechnol. 144 (3), 190–203.
Hecker, M., Lambeck, S., Toepfer, S., van Someren, E., Guthke, R., 2009. Gene regulatory network inference: Data integration in dynamic models-A review. BioSystems 96 (1), 86–103.
Henriques, D., Villaverde, A.F., Rocha, M., Saez-Rodriguez, J., Banga, J.R., 2017. Data-driven reverse engineering of signaling pathways using ensembles of dynamic models. PLOS Comput. Biol. 13 (2) e1005379.
Hughes, J.P., Rees, S., Kalindjian, S.B., Philpott, K.L., 2011. Principles of early drug discovery. Br. J. Pharmacol. 162 (6), 1239–1249.
Huynh-Thu, V.A., Irrthum, A., Wehenkel, L., Geurts, P., 2010. Inferring regulatory networks from expression data using tree-based methods. PLoS ONE 5 (9), e12776.
Jiang, Z., Zhou, Y., 2005. Using gene networks to drug target identification. J. Integrative Bioinform. 2 (1), 48–57.
Kabir, M., Noman, N., Iba, H., 2010. Reverse engineering gene regulatory network from microarray data using linear time-variant model. BMC Bioinform. 11 (1), S56.
Kao, K.C., Yang, Y.-L., Boscolo, R., Sabatti, C., Roychowdhury, V., Liao, J.C., 2004. Transcriptome-based determination of multiple transcription regulator activities in Escherichia coli by using network component analysis. Proc. Natl. Acad. Sci. U S A 101 (2), 641–646.
Karlebach, G., Shamir, R., 2012. Constructing Logical Models of Gene Regulatory Networks by Integrating Transcription Factor-DNA Interactions with Expression Data: An Entropy-Based Approach. J. Comput. Biol. 19 (1), 30–41.
Kauffman, S.A., 1969. Metabolic stability and epigenesis in randomly constructed genetic nets. J. Theoretical Biol. 22 (3), 437–467.
Krämer, N., Schäfer, J., Boulesteix, A.-L., 2009. Regularized estimation of large-scale gene association networks using graphical Gaussian models. BMC Bioinformatics 10 (384).
Küffner, R., Petri, T., Tavakkolkhah, P., Windhager, L., Zimmer, R., 2012. Inferring gene regulatory networks by ANOVA. Bioinformatics 28 (10), 1376–1382.
Lèbre, S., Becq, J., Devaux, F., Stumpf, M.P.H., Lelandais, G., 2010. Statistical inference of the time-varying structure of gene-regulation networks. BMC Syst. Biol. 4 (130).
Lee, W.-P., Tzou, W.-S., 2009. Computational methods for discovering gene networks from expression data. Brief. Bioinform. 10 (4), 408–423.
Lingeman, J., Shasha, D., 2012. Network Inference in Molecular Biology: A Hands-on Framework, Vol. 36. SpringerBriefs in Electrical and Computer Engineering.
Liu, F., Zhang, S., Guo, W., Wei, Z., Chen, L., 2016. Inference of Gene Regulatory Network Based on Local Bayesian Networks. PLOS Comput. Biol. 12 (8) e1005024.
Liu, Z.-P., 2015. Reverse Engineering of Genome-wide Gene Regulatory Networks from Gene Expression Data. Curr. Genom. 16 (1), 3–22.
Manioudaki, M.E., Poirazi, P., 2013. Modeling regulatory cascades using artificial neural networks: The case of transcriptional regulatory networks shaped during the yeast stress response. Front. Genet. 4 (110).
Marbach, D., Costello, J.C., Küffner, R., Vega, N.M., Prill, R.J., Camacho, D.M., Allison, K.R., Kellis, M., Collins, J.J., Stolovitzky, G., 2012. Wisdom of crowds for robust gene network inference. Nat. Methods 9 (8), 796–804.
Mardis, E.R., 2008. The impact of next-generation sequencing technology on genetics. Trends in Genetics 24 (3), 133–141.
Margolin, A.A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Favera, R.D., Califano, A., 2006a. ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context. BMC Bioinform. 7 (1), S7.
Margolin, A.A., Wang, K., Lim, W.K., Kustagi, M., Nemenman, I., Califano, A., 2006b. Reverse engineering cellular networks. Nat. Protocols 1 (2), 662–671.
Markowetz, F., Spang, R., 2007. Inferring cellular networks - a review. BMC Bioinform. 8 (6), S5.
Martins, D., Correa Lopes, D., Martins Ray, F., Sankar, S., 2016. Inference of Gene Regulatory Networks by Topological Prior Information and Data Integration.
Meyer, P.E., Kontos, K., Lafitte, F., Bontempi, G., 2007. Information-theoretic inference of large transcriptional regulatory networks. Eurasip Journal on Bioinformatics and Systems Biology, pp. 79879.
Meyer, P.E., Lafitte, F., Bontempi, G., 2008. minet: A R/Bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinform. 9 (461).
Myllymäki, P., Silander, T., Tirri, H., Uronen, P., 2002. B-COURSE: a web-based tool for Bayesian and causal data analysis. Int. J. Artif. Intell. Tools 11 (3), 369–387.


Paroni, A., Graudenzi, A., Caravagna, G., Damiani, C., Mauri, G., Antoniotti, M., 2016. CABeRNET: a Cytoscape app for augmented Boolean models of gene regulatory NETworks. BMC Bioinform. 17 (64).
Peng, J., Wang, P., Zhou, N., Zhu, J., 2009. Partial Correlation Estimation by Joint Sparse Regression Models. J. Am. Stat. Assoc. 104 (486), 735–746.
Raza, K., Alam, M., 2014. Recurrent Neural Network Based Hybrid Model of Gene Regulatory Network. ArXiV (i), 18.
Reshef, D.N., Reshef, Y.A., Finucane, H.K., Grossman, S.R., McVean, G., Turnbaugh, P.J., Lander, E.S., Mitzenmacher, M., Sabeti, P.C., 2011. Detecting novel associations in large data sets. Science 334 (6062), 1518–1524.
Roy, S., Guzzi, P.H., 2016. Biological Network Inference from Microarray Data, Current Solutions, and Assessments. Methods Mol. Biol. 1375, 155–167.
Ruyssinck, J., Huynh-Thu, V.A., Geurts, P., Dhaene, T., Demeester, P., Saeys, Y., 2014. NIMEFI: Gene regulatory network inference using multiple ensemble feature importance algorithms. PLoS ONE 9 (3), e92709.
Schäfer, J., Strimmer, K., 2005. An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics 21 (6), 754–764.
Schwab, J., Burkovski, A., Siegle, L., Müssel, C., Kestler, H., 2016. ViSiBooL - visualization and simulation of Boolean networks with temporal constraints. Bioinformatics 33 (4), 601–604.
Scutari, M., 2010. Learning Bayesian Networks with the bnlearn R Package. J. Stat. Softw. 35 (3), 1–22.
Shannon, C.E., 1948. A mathematical theory of communication. Bell Syst. Technical J. 27 (3), 379–423.
Shmulevich, I., Aitchison, J.D., 2009. Deterministic and Stochastic Models of Genetic Regulatory Networks. Methods Enzymol. 467, 335–356.
Shmulevich, I., Dougherty, E., 2010. Probabilistic Boolean Networks: The Modeling and Control of Gene Regulatory Networks. Society for Industrial and Applied Mathematics, Philadelphia.
Shmulevich, I., Gluhovsky, I., Hashimoto, R.F., Dougherty, E.R., Zhang, W., 2003. Steady-state analysis of genetic regulatory networks modelled by probabilistic Boolean networks. Comp. Funct. Genom. 4 (6), 601–608.
Singh, N., Vidyasagar, M., 2016. bLARS: An Algorithm to Infer Gene Regulatory Networks. IEEE/ACM Trans. Comput. Biol. Bioinform. 13 (2), 301–314.
Sławek, J., Arodź, T., 2013. ENNET: inferring large gene regulatory networks from expression data using gradient boosting. BMC Syst. Biol. 7 (106).
Thijssen, B., Dijkstra, T.M.H., Heskes, T., Wessels, L.F.A., 2016. BCM: toolkit for Bayesian analysis of Computational Models using samplers. BMC Syst. Biol. 10 (100).
Tian, T., Burrage, K., 2003. Stochastic neural network models for gene regulatory networks. The 2003 Congress on Evolutionary Computation 1, 162–169.
Tong, D.L., Boocock, D.J., Dhondalay, G.K.R., Lemetre, C., Ball, G.R., 2014. Artificial Neural Network Inference (ANNI): A study on gene-gene interaction for biomarkers in childhood sarcomas. PLoS ONE 9 (7), e102483.
Villani, M., Barbieri, A., Serra, R., 2011. A dynamical model of genetic networks for cell differentiation. PLoS ONE 6 (3), e17703.
Villaverde, A.F., Becker, K., Banga, J.R., 2018. PREMER: A Tool to Infer Biological Networks. IEEE/ACM Trans. Comput. Biol. Bioinform. 15 (4), 1193–1202.
Villaverde, A.F., Ross, J., Morán, F., Banga, J.R., 2014. MIDER: Network inference with mutual information distance and entropy reduction. PLoS ONE 9 (5), e96732.
Vinh, N.X., Chetty, M., Coppel, R., Wangikar, P.P., 2011. GlobalMIT: Learning globally optimal dynamic bayesian network with the mutual information test criterion. Bioinformatics 27 (19), 2765–2766.
Vivek-Ananth, R.P., Samal, A., 2016. Advances in the integration of transcriptional regulatory information into genome-scale metabolic models. BioSystems 147, 1–10.
Vohradský, J., 2001. Neural network model of gene expression. FASEB J. 15 (3), 846–854.
Wang, Y.K., Hurley, D.G., Schnell, S., Print, C.G., Crampin, E.J., 2013. Integration of Steady-State and Temporal Gene Expression Data for the Inference of Gene Regulatory Networks. PLoS ONE 8 (8), e72103.
Werhli, A.V., Grzegorczyk, M., Husmeier, D., 2006. Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical gaussian models and bayesian networks. Bioinformatics 22 (20), 2523–2531.
Wetterstrand, K.A., 2017. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program (www.genome.gov/sequencingcostsdata) - Accessed.
Wu, J., Zhao, X., Lin, Z., Shao, Z., 2016. Large scale gene regulatory network inference with a multi-level strategy. Mol. BioSyst. 12 (2), 588–597.
Xiao, F., Gao, L., Ye, Y., Hu, Y., He, R., 2016. Inferring Gene Regulatory Networks Using Conditional Regulation Pattern to Guide Candidate Genes. PLoS ONE 11 (5), e0154953.
Yeang, C.-H., Vingron, M., 2006. A joint model of regulatory and metabolic networks. BMC Bioinform. 7 (332).
Yu, D., Kim, M., Xiao, G., Hwang, T.H., 2013. Review of biological network data and its applications. Genom. Informatics 11 (4), 200–210.
Yu, J., Smith, V.A., Wang, P.P., Hartemink, A.J., Jarvis, E.D., 2004. Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics 20 (18), 3594–3603.
Zavlanos, M.M., Julius, A.A., Boyd, S.P., Pappas, G.J., 2011. Inferring stable genetic networks from steady-state data. Automatica 47 (6), 1113–1122.
Zhang, X., Zhao, X.-M., He, K., Lu, L., Cao, Y., Liu, J., Hao, J.-K., Liu, Z.-P., Chen, L., 2012. Inferring gene regulatory networks from gene expression data by path consistency algorithm based on conditional mutual information. Bioinformatics (Oxford, England) 28 (1), 98–104.
Zoppoli, P., Morganella, S., Ceccarelli, M., 2010. Timedelay-aracne: Reverse engineering of gene networks from time-course data by an information theoretic approach. BMC Bioinform. 11 (154).
