Causal Discovery Using Proxy Variables
Observational causal discovery algorithms estimate the cause-effect relation between two random entities X and Y, given n samples from P(X, Y). In this paper, we develop a framework to estimate the cause-effect relation between two static entities x and y: for instance, an art masterpiece x and its fraudulent copy y. To this end, we introduce the notion of proxy variables, which allow the construction of a pair of random entities (A, B) from the pair of static entities (x, y). Then, estimating the cause-effect relation between A and B using an observational causal discovery algorithm leads to an estimation of the cause-effect relation between x and y. For example, our framework detects the causal relation between unprocessed photographs and their modifications, and orders in time a set of shuffled frames from a video. As our main case study, we introduce a human-elicited dataset of 10,000 pairs of causally linked words from natural language. Our methods discover 75% of these causal relations. Finally, we discuss the role of proxy variables in machine learning, as a general tool to incorporate static knowledge into prediction tasks.

1. Introduction

Discovering causal relations is a central task in science (Pearl, 2009; Beebee et al., 2009), and empowers humans to explain their experiences, predict the outcome of their interventions, wonder about what could have happened but never did, or plan which decisions will shape the future to their maximum benefit. Causal discovery is essential to the development of common sense (Kuipers, 1984; Waldrop, 1987). In machine learning, it has been argued that causal [...]

The gold standard to discover causal relations is to perform active interventions (also called experiments) in the system under study (Pearl, 2009). However, interventions are in many situations expensive, unethical, or impossible to realize. In all of these situations, there is a prime need to discover and reason about causality purely from observation. Over the last decade, the state of the art in observational causal discovery has matured into a wide array of algorithms (Shimizu et al., 2006; Hoyer et al., 2009; Daniusis et al., 2012; Peters et al., 2014; Mooij et al., 2016; Lopez-Paz et al., 2015; Lopez-Paz, 2016). All of these algorithms estimate the causal relations between the random variables (X1, ..., Xp) by estimating various asymmetries in P(X1, ..., Xp). In the interest of simplicity, this paper considers the problem of discovering the causal relation between two variables X and Y, given n samples from P(X, Y).

The methods mentioned above estimate the causal relation between two random entities X and Y, but often we are interested instead in two static entities x and y. These are a pair of single objects for which it is not possible to define a probability distribution directly. Examples of such static entities include one art masterpiece and its fraudulent copy, one translated document and its original version, or one pair of causally linked words in natural language, such as “virus” and “death”. Looking into the distant future, an algorithm able to discover the causal structure between static entities in natural language could read through medical journals and discover the causal mechanisms behind a new cure for a specific disease: the very goal of the ongoing $45 million Big Mechanism DARPA initiative (Cohen, 2015). Or, if we were able to establish the causal relation between two arbitrary natural language statements, we could tackle general-AI tasks such as the Winograd schema challenge (Levesque et al., 2012), which are out of reach for current algorithms. These and many more are situations where causal discovery between static entities is in demand.
1 Facebook AI Research, Paris, France. 2 University of Cambridge, Cambridge, UK. 3 MPI for Intelligent Systems, Tübingen, Germany. Correspondence to: Mateo Rojas-Carulla <mrojascarulla@gmail.com>.

Our Contributions

First, we introduce the framework of proxy variables to estimate the causal relation between static entities (Section 3).
Second, [...] the cases, and it can recover the correct ordering of a set of shuffled video frames (Section 4.2).

Third, we apply our framework to discover the cause-effect relation between pairs of words in natural language (Section 5). To this end, we introduce a novel dataset of 10,000 human-elicited pairs of words with known causal relation (Section 5.2). Our methods are able to recover 75% of the cause-effect relations (such as “accident → injury” or “sentence → trial”) in this challenging task (Section 5.4).

Figure 1. Example of an Additive Noise Model (ANM). (a) Y = F(X) + N, with X ⊥ N. (b) X = G(Y) + E, with Y ⊥̸ E.
Fourth, we discuss the role of proxy variables as a tool to incorporate external knowledge, as provided by static entities, into general prediction problems (Section 6).

All our code and data are available at anonymous.

We start the exposition by introducing the basic language of observational causal discovery, as well as motivating its role in machine learning.

2. Causal Discovery in Machine Learning

The goal of observational causal discovery is to reveal the cause-effect relation between two random variables X and Y, given n samples (x1, y1), ..., (xn, yn) from P(X, Y). In particular, we say that “X causes Y” if there exists a mechanism F that transforms the values taken by the cause X into the values taken by the effect Y, up to the effects of some random noise N. Mathematically, we write Y ← F(X, N). This equation highlights an asymmetric assignment rather than a symmetric equality. If we were to intervene and change the value of the cause X, then a change in the value of the effect Y would follow. On the contrary, if we were to manipulate the value of the effect Y, a change in the cause X would not follow.

When two random variables share a causal relation, they often become statistically dependent. However, when two random variables are statistically dependent, they do not necessarily share a causal relation. This is at the origin of the famous warning “dependence does not imply causality”. This relation between dependence and causality was formalized by Reichenbach (1956) into the following principle.

Principle 1 (Principle of common cause). If two random variables X and Y are statistically dependent (X ⊥̸ Y), then one of the following causal explanations must hold:
i) X causes Y (write X → Y), or
ii) Y causes X (write X ← Y), or
iii) there exists a random variable Z that is the common cause of both X and Y (write X ← Z → Y).
In the third case, X and Y are conditionally independent given Z (write X ⊥ Y | Z).

In machine learning, these three types of statistical dependencies are exploited without distinction, as dependence is sufficient to perform optimal predictions about independently and identically distributed (iid) data (Schölkopf et al., 2012). However, we argue that taking the Principle of common cause into account would have far-reaching benefits in non-iid machine learning. For example, assume that we are interested in predicting the values of a target variable Y, given the values taken by two features (X1, X2). Then, understanding the causal structure underlying (X1, X2, Y) brings two benefits.

First, interpretability. Explanatory questions such as “Why does Y = 2 when (X1, X2) = (−1, 3)?”, and counterfactual questions such as “What value would Y have taken, had X2 = −3?”, cannot be answered using statistics alone, since their answers depend on the particular causal structure underlying the data.

Second, robustness. Predictors which estimate the values taken by a target variable Y given only its direct causes are robust with respect to distributional shifts on their inputs. For example, let X1 ∼ P(X1), Y ← F1(X1), and X2 ← F2(X1). Then, the predictor E(Y | X1) is invariant to changes in the joint distribution P(X1, X2) as long as the causal mechanism F1 does not change. However, the predictor E(Y | X1, X2) can vary wildly even if the causal mechanism F1 (the only one involved in computing Y) does not change (Peters et al., 2016; Rojas-Carulla et al., 2015).

The previous two points apply to the common “non-iid” situations where we have access to data drawn from some distribution P, but we are interested in some different but related distribution P̃. One natural way to phrase and leverage the similarities between P and P̃ is in terms of shared causal structures (Peters, 2012; Lopez-Paz, 2016).

While it is indeed an attractive endeavor, discovering the causal relation between two random variables purely from [...]
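The ANM asymmetry illustrated in Figure 1 can be reproduced with a small simulation. The sketch below is illustrative only (it is not the paper's implementation, and it uses a simple polynomial fit rather than RCC): in the causal direction, the regression residuals are independent of the input; in the anticausal direction, they are not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# ANM: Y <- F(X) + N, with cause X independent of noise N.
x = rng.uniform(-1.0, 1.0, n)
y = x ** 3 + 0.1 * rng.standard_normal(n)

def residual_dependence(cause, effect, degree=3):
    """Regress effect on cause with a polynomial fit and return the
    absolute correlation between |residuals| and |cause|: a crude score
    of how much the residuals still depend on the input."""
    coefs = np.polyfit(cause, effect, degree)
    residuals = effect - np.polyval(coefs, cause)
    return abs(np.corrcoef(np.abs(residuals), np.abs(cause))[0, 1])

forward = residual_dependence(x, y)   # causal direction X -> Y
backward = residual_dependence(y, x)  # anticausal direction Y -> X

# Only the anticausal fit leaves residuals that depend on the input.
assert forward + 0.05 < backward
```

The same asymmetry underlies the more sophisticated observational causal discovery algorithms cited above; the polynomial fit and the dependence score here are stand-ins for a flexible regression and a proper independence test such as HSIC.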
[Figure: x = original image; y = stylized image.]

[...] and y, we can apply a regular causal discovery algorithm to (A, B) to estimate the causal relation between x and y.

[...] original image, as described by the ANM: [...]
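The construction of the pair (A, B) from a pair of static entities can be summarized in a few lines. The sketch below is generic and hypothetical: `proxy_samples` and the projection `pi` stand in for whatever proxy variable and projection suit the static entities at hand; the resulting scatterplot (A, B) is then handed to any observational causal discovery algorithm.

```python
import numpy as np

def proxy_scatterplot(x, y, proxy_samples, pi):
    """Turn two static entities (x, y) into n samples of a pair of random
    entities (A, B), via draws w from a proxy variable W and a projection
    pi. Each draw yields one 2-dimensional point (pi(w, x), pi(w, y))."""
    a = np.array([pi(w, x) for w in proxy_samples])
    b = np.array([pi(w, y) for w in proxy_samples])
    return a, b

# Toy example: the static entities are strings, the proxy is a small set
# of words, and the projection counts character overlap (purely
# illustrative; it is not one of the projections used in the paper).
def pi(w, entity):
    return sum(entity.count(c) for c in set(w))

words = ["virus", "death", "cause", "storm"]
a, b = proxy_scatterplot("an original document", "its translated copy", words, pi)
# (a, b) now plays the role of n samples from P(A, B), ready for an
# observational causal discovery algorithm.
```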
[...] mutual information, or PMI (Church & Hanks, 1990).

7) π_prec-pmi(w, x), similar to the one above, but computed only over sentences where w precedes x.

Applying the causal projections to our sample from the proxy W, we construct the n-vector

Π^x_proj = (π_proj(w_1, x), ..., π_proj(w_n, x)),    (1)

and similarly for Π^y_proj, where

proj ∈ {w2vii, w2vio, w2voi, counts, prec-counts, pmi, prec-pmi}.    (2)

In particular, we use the skip-gram model implementation of fastText (Bojanowski et al., 2016) to compute 300-dimensional word2vec representations.

5.2. A Real-World Dataset of Cause-Effect Words

We introduce a human-elicited, human-filtered dataset of 10,000 pairs of words with a known causal relation. This dataset was constructed in two steps:

1) We asked workers from Amazon Mechanical Turk to create pairs of words linked by a causal relation. We provided the turks with examples of words with a clear causal link (such as “sun causes radiation”) and examples of related words not sharing a causal relation (such as “knife” and “fork”). For details, see Appendix A.

2) Each of the pairs collected from the previous step was randomly shuffled and submitted to 20 different turks, none of whom had created any of the word pairs. Each turk was required to classify the pair of words (x, y) as “x causes y”, “y causes x”, or “x and y do not share a causal relation”. For more details, see Appendix B.

This procedure resulted in a dataset of 10,000 causal word pairs (x, y), each accompanied by three numbers: the number of turks that voted “x causes y”, the number of turks that voted “y causes x”, and the number of turks that voted “x and y do not share a causal relation”.

5.3. Causal Relation Discovery in NLP

The NLP community has devoted much attention to the problem of identifying the semantic relation holding between two words, with causality as a special case. Girju et al. (2009) discuss the results of the large shared task on relation classification they organized (their benchmark included only 220 examples of cause-effect). The task required recognizing relations in context, but, as discussed by the authors, most contexts display the default relation we are after here (e.g., “The mutant virus gave him a severe flu” instantiates the default relation in which virus is the cause, flu is the effect). All participating systems used extra resources, such as ontologies and syntactic parsing, on top of corpus data. They are thus outside the scope of the purely corpus-based methods we are considering here.

Most NLP work specifically focusing on the causality relation relies on informative linking patterns co-occurring with the target pairs (such as, most obviously, the conjunction because). These patterns are extracted and processed with sophisticated methods, involving annotation, ontologies, bootstrapping, and/or manual filtering (see, e.g., Blanco et al. 2008, Hashimoto et al. 2012, Radinsky et al. 2012, and references therein). We experimented with extracting linking patterns from our corpus, but, due to the relatively small size of the latter, results were extremely sparse (note that patterns can only be extracted from sentences in which both cause and effect words occur). More recent work started looking at causal chains of events as expressed in text (see Mirza & Tonelli 2016 and references therein). Applying our generic method to this task is a direction for future work.

A semantic relation that received particular attention in NLP is that of entailment between words (dog entails animal). As causality is intuitively related to entailment, we will apply below entailment detection methods to cause/effect classification. Most lexical entailment methods rely on distributional representations of the words in the target pair. Traditionally, entailing pairs have been identified with unsupervised asymmetric similarity measures applied to distributed word representations (Geffet & Dagan, 2005; Kotlerman et al., 2010; Lenci & Benotto, 2012; Weeds et al., 2004). We will test one of these measures, namely Weeds Precision (WS). More recently, Santus et al. 2014 showed that the relative entropy of distributed vectors representing the words in a pair is an effective cue to which word is entailing the other, and we also look at entropy for our task. However, the most effective method to detect entailment is to apply a supervised classifier to the concatenation of the vectors representing the words in a pair (Baroni et al., 2012; Roller et al., 2014; Weeds et al., 2014).

5.4. Experiments

We evaluate a variety of methods to discover the causal relation between two words appearing in a large corpus of natural language. We study methods that fall within three categories: baselines, distribution-based causal discovery methods, and feature-based supervised methods. These three families of methods consider an increasing amount of information about the task at hand, and therefore exhibit an increasing performance, up to 85% classification accuracy.

All our computations are based on the full English Wikipedia, as post-processed by Matt Mahoney (see http://www.mattmahoney.net/dc/textdata.html). We study the N = 1,970 pairs of words out of 10,000 from the dataset described in Section 5.2 that achieved a consensus across at least 18 out of 20 turks. We use RCC to estimate
the causal relation between pairs of random variables.

5.4.1. Baselines

These are a variety of unsupervised, heuristic baselines. Each baseline computes two scores, denoted by S_x→y and S_x←y, predicting x → y if S_x→y > S_x←y, and x ← y if S_x→y < S_x←y. The baselines are:

• frequency: S_x→y is the number of sentences where x appears in the corpus, and S_x←y is the number of sentences where y appears in the corpus.
• precedence: considering only sentences from the corpus where both x and y appear, S_x→y is the number of sentences where x occurs before y, and S_x←y is the number of sentences where y occurs before x.
• counts (entropy): S_x→y is the entropy of Π^x_counts, and S_x←y is the entropy of Π^y_counts, as defined in (1).
• counts (WS): using the WS measure of Weeds & Weir (2003), S_x→y = WS(Π^x_counts, Π^y_counts), and S_x←y = WS(Π^y_counts, Π^x_counts).
• prec-counts (entropy): S_x→y is the entropy of Π^x_prec-counts, and S_x←y is the entropy of Π^y_prec-counts, as defined in (1).
• prec-counts (WS): analogous to the previous.

The baselines pmi (entropy), pmi (WS), prec-pmi (entropy), and prec-pmi (WS) are analogous to the last four, but use (Π^x_(prec-)pmi, Π^y_(prec-)pmi) instead of (Π^x_(prec-)counts, Π^y_(prec-)counts). Figure 5 shows the performance of these baselines in blue.

5.4.2. Distribution-Based Causal Discovery Methods

These methods implement our framework of causal discovery using proxy variables. They classify n samples from a 2-dimensional probability distribution as a whole. Recall that a vocabulary (w_j)_{j=1}^{n} drawn from the proxy is available. Given N word pairs (x_i, y_i), this family of methods constructs a dataset D = ({(a_ij, b_ij)}_{j=1}^{n}, ℓ_i)_{i=1}^{N}, where a_ij = π_proj(w_j, x_i), b_ij = π_proj(w_j, y_i), ℓ_i = +1 if x_i → y_i, and ℓ_i = −1 otherwise. In short, D is a dataset of N “scatterplots” annotated with binary labels. The i-th scatterplot contains n 2-dimensional points, which are obtained by applying the causal projection to both x_i and y_i, against the n vocabulary words drawn from the proxy.

The samples (a_ij, b_ij)_{j=1}^{n} are computed using a deterministic projection of iid draws from the proxy, meaning that {(a_ij, b_ij)}_{j=1}^{n} ∼ P^n(A_i, B_i). Therefore, we could permute the points inside each scatterplot without altering the results of these methods. In principle, we could also remove some of the points in the scatterplot without a significant drop in performance. Therefore, these methods search for causal footprints at the 2-dimensional distribution level, and we term them distribution-based causal discovery methods.

The methods in this family first split the dataset D into a training set D_tr and a test set D_te. Then, the methods train RCC on the training set D_tr, and test its classification accuracy on D_te. This process is repeated ten times, splitting D at random into a training set containing 75% of the pairs and a test set containing 25% of the pairs. Each method builds on top of a causal projection from (2) above. Figure 5 shows the test accuracy of these methods in green.

5.4.3. Feature-Based Supervised Methods

These methods use the same data generated by our causal projections, but treat them as fixed-size vectors fed to a generic classifier, rather than as random samples to be analyzed with an observational causal discovery method. They can be seen as an oracle to upper-bound the amount of causal signals (and signals correlated to causality) contained in our data. Specifically, they use 2n-dimensional vectors given by the concatenation of those in (1). Given N word pairs (x_i, y_i), they build a dataset D = ((Π^{x_i}_proj, Π^{y_i}_proj), ℓ_i)_{i=1}^{N}, where ℓ_i = +1 if x_i → y_i, ℓ_i = −1 if x_i ← y_i, and “proj” is a projection from (2). Next, we split the dataset D into a training set D_tr containing 75% of the pairs, and a disjoint test set D_te containing 25% of the pairs. To evaluate the accuracy of each method in this family, we train a random forest of 500 trees using D_tr, and report its classification accuracy over D_te. This process is repeated ten times, by splitting the dataset D at random. The results are presented as red bars in Figure 5. We also report the classification accuracy of training the random forest on the raw word2vec representations of the pair of words (top three bars).

5.4.4. Discussion of Results

Baseline methods are the lowest performing, reaching up to 59% test accuracy. We believe that the performance of the best baseline, precedence, is due to the fact that most of Wikipedia is written in the active voice, which often aligns with the temporal sequence of events, and thus correlates with causality.

The feature-based methods perform best, achieving up to 85% test classification accuracy. However, feature-based methods enjoy the flexibility of considering each of the n = 10,000 elements in the causal projection as a distinct feature. Therefore, feature-based methods do not focus on patterns to be found at a distributional level (such as causality), and are vulnerable to permutation or removal of features. We believe that feature-based methods may achieve their superior performance by overfitting to biases in our dataset, which are not necessarily related to causality.

Impressively, the best distribution-based causal discovery method achieves 75% test classification accuracy, a significant improvement over the best baseline method. Importantly, our distribution-based methods take a whole [...]
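The counts projection of equation (1), its entropy score, and the precedence baseline of Section 5.4.1 can be illustrated on a toy corpus. The four-sentence corpus and helper names below are illustrative only; the experiments above use the full English Wikipedia.

```python
import math

# Toy corpus standing in for Wikipedia sentences.
corpus = [
    "the accident caused a serious injury",
    "after the accident an injury followed",
    "an injury may follow an accident",
    "the trial ended with a sentence",
]
toks = [s.split() for s in corpus]

# counts projection: number of sentences containing both w and x.
def pi_counts(w, x):
    return sum(w in t and x in t for t in toks)

# Equation (1): project the word x against a vocabulary from the proxy.
def Pi_counts(x, vocabulary):
    return [pi_counts(w, x) for w in vocabulary]

def entropy(vec):
    """Shannon entropy of the normalized projection vector."""
    total = sum(vec)
    ps = [v / total for v in vec if v > 0]
    return -sum(p * math.log2(p) for p in ps)

# precedence baseline: count sentences where x occurs before y.
def s_precedence(x, y):
    return sum(x in t and y in t and t.index(x) < t.index(y) for t in toks)

vocab = ["the", "accident", "injury", "trial"]
Pi_x = Pi_counts("injury", vocab)            # [2, 3, 3, 0]
S_xy = s_precedence("accident", "injury")    # 2: "accident" before "injury"
S_yx = s_precedence("injury", "accident")    # 1
# The precedence baseline therefore predicts "accident -> injury".
```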
Figure 5. Results for all methods on the NLP experiment. Accuracies above 52% are statistically significant with respect to a Binomial test at a significance level α = 0.05.

7. Conclusion

We have introduced the necessary machinery to estimate the causal relation between pairs of static entities x and y: one piece of art and its forgery, one document and its translation, or the concepts underlying a pair of words appearing in a corpus of natural language. We have done so by introducing [...]
To provide us with high-quality word pairs, we ask you to follow these indications:
• All word pairs must have the form “WordA → WordB”. It is essential that the first word (WordA) is the cause, and the
second word (WordB) is the effect.
• WordA and WordB must be one word each (no spaces, and no “recessive gene → red hair”). Avoid compound words
such as “snow-blind”.
• In most situations, you may come up with a word pair that can be justified both as “WordA → WordB” and “WordB →
WordA”. In such situations, prefer the causal direction with the easiest explanation. For example, consider the word
pair “virus → death”. Most people would agree that “virus causes death”. However, “death causes virus” can be true
in some specific scenario (for example, “because of all the deaths in the region, a new family of viruses emerged.”).
Nevertheless, the explanation “virus causes death” is preferred, because it is more general and depends less on the context.
• We do not accept word pairs with an ambiguous causal relation, such as “book - paper”.
• We do not accept simple variations of word pairs. For example, if you wrote down “dog → bark”, we will not credit
you for other pairs such as “dogs → bark” or “dog → barking”.
• Use frequent words (avoid strange words such as “clithridiate”).
• Do not rely on our examples, and use your creativity. We are grateful if you come up with diverse word pairs! Please do
not add any numbers (for example, “1 - dog → bark”). For your guidance, we provide you examples of word pairs that
belong to different categories. Please bear in mind that we will reward your creativity: therefore, focus on providing
new word pairs with an evident causal direction, and do not limit yourself to the categories shown below.
1) Physical phenomenon: there exists a clear physical mechanism that explains why “WordA → WordB”.
• sun → radiation (The sun is a source of radiation. If the sun were not present, then there would be no radiation.)
• altitude → temperature
• winter → cold
• oil → energy
2) Events and consequences: WordA is an action or event, and WordB is a consequence of that action or event.
• crime → punishment
• accident → death
• smoking → cancer
• suicide → death
• call → ring
3) Creator and producer: WordA is a creator or producer, WordB is the creation of the producer.
• fear → scream
• age → salary
Some of the pairs that will be presented are non-causal. This may happen if: [...]
To provide us with high-quality categorization of word pairs, we ask you to follow these indications:
• Prefer the causal direction with the simplest explanation. Most people would agree that “virus causes death”. However,
“death causes virus” can be true in some specific scenario (for example, “because of all the deaths in the region, a
new virus emerged.”). Nevertheless, the explanation “virus causes death” is preferred, because it is true in more general
contexts.
• If no direction is clearer, mark the pair as non-causal. Here, conservative is good!
• Think twice before deciding. We will present the pairs in random order!
Please classify all the presented pairs. If one or more has not been answered, the whole batch will be invalid. PLEASE
DOUBLE CHECK THAT YOU HAVE ANSWERED ALL 40 WORD PAIRS.
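The batches described above yield 20 votes per word pair, which are then aggregated; Section 5.4 retains only pairs with a consensus of at least 18 out of 20 turks. A minimal sketch of that aggregation, on made-up vote counts (not the released dataset):

```python
# Each collected pair comes with 20 votes among three options:
# "x->y", "y->x", "none". Keep pairs where one option reaches a
# consensus of at least 18 out of 20 turks (as in Section 5.4).
votes = {
    ("accident", "injury"): {"x->y": 19, "y->x": 0, "none": 1},
    ("book", "paper"): {"x->y": 8, "y->x": 5, "none": 7},
}

def consensus_label(counts, threshold=18):
    """Return the majority label if it reaches the threshold, else None."""
    label, best = max(counts.items(), key=lambda kv: kv[1])
    return label if best >= threshold else None

kept = {
    pair: consensus_label(counts)
    for pair, counts in votes.items()
    if consensus_label(counts) is not None
}
print(kept)  # {('accident', 'injury'): 'x->y'}
```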
Examples of causal word pairs: