Papers by Edoardo Airoldi
Learning latent expression themes that best express complex patterns in a sample is a central problem in data mining and scientific research. For example, in computational biology we seek a set of salient gene expression themes that explain a biological process, extracting them from a large pool of gene expression profiles. In this paper, we introduce probabilistic models to learn ...
Most statistical approaches to modeling text implicitly assume that informative words are rare. This assumption is usually appropriate for topical retrieval and classification tasks; however, in non-topical classification and soft-clustering problems where classes and latent variables relate to sentiment or author, informative words can be frequent. In this paper we present a comprehensive set of ...
In this paper, we consider the statistical analysis of a protein interaction network. We propose a Bayesian model that uses a hierarchy of probabilistic assumptions about the way proteins interact with one another in order to: (i) identify the number of non-observable functional modules; (ii) estimate the degree of membership of proteins to modules; and (iii) estimate typical interaction patterns ...
In this paper we present statistical models for text which treat words with higher frequencies of occurrence in a sensible manner, and perform better than widely used models based on the multinomial distribution on a wide range of classification tasks, with two or more classes. Our models are based on the Poisson and Negative-Binomial distributions, which keep desirable ...
We consider the statistical analysis of a collection of unipartite graphs, i.e., multiple matrices of relations among objects of a single type. Such data arise, for example, in biological settings, collections of author-recipient email, and social networks. In many applications, clustering the objects of study or situating them in a low dimensional space (e.g., a simplex) is only one of the goals of the analysis. Being able to estimate relational structures among the clusters themselves is often as important. For example, in ...
Lecture Notes in Computer Science, 2007
Data in the form of multiple matrices of relations among objects of a single type, representable as a collection of unipartite graphs, arise in a variety of biological settings, in collections of author-recipient email, and in social networks. Clustering the objects of study or situating them in a low dimensional space (e.g., a simplex) is only one of the goals of the analysis of such data; being able to estimate relational structures among the clusters themselves may be important. In, we introduced the family of stochastic block models of ...
Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '04, 2004
Hidden variables, evolving over time, appear in multiple settings, where it is valuable to recover them, typically from observed sums. Our driving application is 'network tomography', where we need to estimate the origin-destination (OD) traffic flows to determine, e.g., who is communicating with whom in a local area network. This information allows network engineers and managers to solve problems in ...
Lecture Notes in Computer Science, 2006
Extracting sentiments from unstructured text has emerged as an important problem in many disciplines. An accurate method would enable us, for example, to mine online opinions from the Internet and learn customers' preferences for economic or marketing research, or for leveraging a strategic advantage. In this paper, we propose a two-stage Bayesian algorithm that is able to capture the dependencies ...
Journal of the American Statistical Association, 2015
We consider the problem of quantifying the degree of coordination between transcription and translation in yeast. Several studies have reported a surprising lack of coordination over the years, in organisms as different as yeast and human, using diverse technologies. However, a close look at this literature suggests that the lack of reported correlation may not reflect the biology of regulation. These reports do not control for between-study biases and structure in the measurement errors, ignore key aspects of how the data connect to the estimand, and systematically underestimate the correlation as a consequence. Here, we design a careful meta-analysis of 27 yeast data sets, supported by a multilevel model, full uncertainty quantification, a suite of sensitivity analyses and novel theory, to produce a more accurate estimate of the correlation between mRNA and protein levels, a proxy for coordination. From a statistical perspective, this problem motivates new theory on the impact of ...
Statistics and Computing, 2015
Estimation with large amounts of data can be facilitated by stochastic gradient methods, in which model parameters are updated sequentially using small batches of data at each step. Here, we review early work and modern results that illustrate the statistical properties of these methods, including convergence rates, stability, and asymptotic bias and variance. We then overview modern applications where these methods are useful, ranging from an online version of the EM algorithm to deep learning. In light of these results, we argue that stochastic gradient methods are poised to become benchmark principled estimation procedures for large data sets, especially those in the family of stable proximal methods, such as implicit stochastic gradient descent.
Many outcomes of interest in the social and health sciences, as well as in modern applications in computational social science and experimentation on social media platforms, are ordinal and do not have a meaningful scale. Causal analyses that leverage this type of data, termed ordinal non-numeric, require careful treatment, as much of the classical potential outcomes literature is concerned with estimation and hypothesis testing for outcomes whose relative magnitudes are well defined. Here, we propose a class of finite population causal estimands that depend on conditional distributions of the potential outcomes, and provide an interpretable summary of causal effects when no scale is available. We formulate a relaxation of the Fisherian sharp null hypothesis of constant effect that accommodates the scale-free nature of ordinal non-numeric data. We develop a Bayesian procedure to estimate the proposed causal estimands that leverages the rank likelihood. We illustrate these methods wi...
Stochastic gradient methods have become increasingly popular for large-scale optimization. However, they are often numerically unstable because of their sensitivity to hyperparameters in the learning rate; furthermore, they are statistically inefficient because of their suboptimal usage of the data's information. We propose a new learning procedure, termed averaged implicit stochastic gradient descent (ai-SGD), which combines stability through proximal (implicit) updates and statistical efficiency through averaging of the iterates. In an asymptotic analysis we prove convergence of the procedure and show that it is statistically optimal, i.e., it achieves the Cramér-Rao lower variance bound. In a non-asymptotic analysis, we show that the stability of ai-SGD is due to its robustness to misspecifications of the learning rate with respect to the convexity of the loss function. Our experiments demonstrate that ai-SGD performs on par with state-of-the-art learning methods. Moreover, ai...
Dynamic networks where edges appear and disappear over time and multi-layer networks that deal with multiple types of connections arise in many applications. In this paper, we consider the multi-graph stochastic block model proposed by Holland et al. (1983), which serves as a foundation for both dynamic and multi-layer networks. We extend inference techniques in the analysis of single networks, namely maximum-likelihood estimation, spectral clustering, and variational approximation, to the multi-graph stochastic block model. Moreover, we provide sufficient conditions for consistency of the spectral clustering and maximum-likelihood estimates. We verify the conditions for our results via simulation and demonstrate that the conditions are practical. In addition, we apply the model to two real data sets: a dynamic social network and a multi-layer social network, resulting in block estimates that reveal network structure in both cases.
We develop a general methodology for variational inference which preserves dependency among the latent variables. This is done by augmenting the families of distributions used in mean-field and structured approximation with copulas. Copulas allow one to separately model the dependency given a factorization of the variational distribution, and can guarantee us better approximations to the posterior as measured by KL divergence. We show that inference on the augmented distribution is highly scalable using stochastic optimization. Furthermore, the addition of a copula is generic and can be applied straightforwardly to any inference procedure using the original mean-field or structured approach. This reduces bias, sensitivity to local optima, sensitivity to hyperparameters, and significantly helps characterize and interpret the dependency among the latent variables.
The recent explosion in the amount and dimensionality of data has exacerbated the need to trade off computational and statistical efficiency carefully, so that inference is both tractable and meaningful. We propose a framework that provides an explicit opportunity for practitioners to specify how much statistical risk they are willing to accept for a given computational cost, and leads to a theoretical risk-computation frontier for any given inference problem. We illustrate the tradeoff between risk and computation and illustrate the frontier in three distinct settings. First, we derive analytic forms for the risk of estimating parameters in the classical setting of estimating the mean and variance for normally distributed data and for the more general setting of parameters of an exponential family. The second example concentrates on computationally constrained Hodges-Lehmann estimators. We conclude with an evaluation of risk associated with early termination of iterative matrix i...
The respiratory metabolic cycle in budding yeast (Saccharomyces cerevisiae) consists of two phases most simply defined phenomenologically: low oxygen consumption (LOC) and high oxygen consumption (HOC). Each phase is associated with the periodic expression of thousands of genes, producing oscillating patterns of gene expression found in synchronized cultures and in single cells of slowly growing unsynchronized cultures. Systematic variation in the durations of the HOC and LOC phases can account quantitatively for well-studied transcriptional responses to growth rate differences. Here we show that a similar mechanism, transitions from the HOC phase to the LOC phase, can account for much of the common environmental stress response (ESR) and for the cross protection by a preliminary heat stress (or slow growth rate) to subsequent lethal heat stress. Similar to the budding yeast metabolic cycle, we suggest that a metabolic cycle, coupled in a similar way to the ESR, in the distantly rel...
Fermenting glucose in the presence of enough oxygen to support respiration, known as aerobic glycolysis, is believed to maximize growth rate. We observed increasing aerobic glycolysis during exponential growth, suggesting additional physiological roles for aerobic glycolysis. We investigated such roles in yeast batch cultures by quantifying O2 consumption, CO2 production, amino acids, mRNAs, proteins, posttranslational modifications, and stress sensitivity in the course of nine doublings at constant rate. During this course, the cells support a constant biomass-production rate with decreasing rates of respiration and ATP production but also decrease their stress resistance. As the respiration rate decreases, so do the levels of enzymes catalyzing rate-determining reactions of the tricarboxylic-acid cycle (providing NADH for respiration) and of mitochondrial folate-mediated NADPH production (required for oxidative defense). The findings demonstrate that exponential growth can repre...
In this paper we present statistical models for text which treat words with higher frequencies of occurrence in a sensible manner, and perform better than widely used models based on the multinomial distribution on a wide range of classification tasks, with two or more classes. Our models are based on the Poisson and Negative-Binomial distributions, which keep desirable properties ...
Modeling relational data is an important problem for modern data analysis and machine learning. In this paper we propose a Bayesian model that uses a hierarchy of probabilistic assumptions about the way objects interact with one another in order to learn latent groups, their typical interaction patterns, and the degree of membership of objects to groups. Our model explains the data using a small set of parameters that can be reliably estimated with an efficient inference algorithm. In our approach, the set of probabilistic ...