Causal Discovery Using Proxy Variables
Observational causal discovery algorithms estimate the cause-effect relation between two random entities X and Y, given n samples from P(X, Y). In this paper, we develop a framework to estimate the cause-effect relation between two static entities x and y: for instance, an art masterpiece x and its fraudulent copy y. To this end, we introduce the notion of proxy variables, which allow the construction of a pair of random entities (A, B) from the pair of static entities (x, y). Then, estimating the cause-effect relation between A and B using an observational causal discovery algorithm leads to an estimation of the cause-effect relation between x and y. For example, our framework detects the causal relation between unprocessed photographs and their modifications, and orders in time a set of shuffled frames from a video. As our main case study, we introduce a human-elicited dataset of 10,000 pairs of causally linked words from natural language. Our methods discover 75% of these causal relations. Finally, we discuss the role of proxy variables in machine learning, as a general tool to incorporate static knowledge into prediction tasks.

1. Introduction

Discovering causal relations is a central task in science (Pearl, 2009; Beebee et al., 2009), and empowers humans to explain their experiences, predict the outcome of their interventions, wonder about what could have happened but never did, or plan which decisions will shape the future to their maximum benefit. Causal discovery is essential to the development of common sense (Kuipers, 1984; Waldrop, 1987). In machine learning, it has been argued that causal [...]

The gold standard to discover causal relations is to perform active interventions (also called experiments) in the system under study (Pearl, 2009). However, interventions are in many situations expensive, unethical, or impossible to realize. In all of these situations, there is a prime need to discover and reason about causality purely from observation. Over the last decade, the state of the art in observational causal discovery has matured into a wide array of algorithms (Shimizu et al., 2006; Hoyer et al., 2009; Daniusis et al., 2012; Peters et al., 2014; Mooij et al., 2016; Lopez-Paz et al., 2015; Lopez-Paz, 2016). All of these algorithms estimate the causal relations between the random variables (X1, ..., Xp) by estimating various asymmetries in P(X1, ..., Xp). In the interest of simplicity, this paper considers the problem of discovering the causal relation between two variables X and Y, given n samples from P(X, Y).

The methods mentioned above estimate the causal relation between two random entities X and Y, but often we are interested instead in two static entities x and y. These are a pair of single objects for which it is not possible to define a probability distribution directly. Examples of such static entities include one art masterpiece and its fraudulent copy, one translated document and its original version, or one pair of causally linked words in natural language, such as “virus” and “death”. Looking into the distant future, an algorithm able to discover the causal structure between static entities in natural language could read through medical journals and discover the causal mechanisms behind a new cure for a specific disease: the very goal of the ongoing $45 million Big Mechanism DARPA initiative (Cohen, 2015). Or, if we were able to establish the causal relation between two arbitrary natural language statements, we could tackle general-AI tasks such as the Winograd schema challenge (Levesque et al., 2012), which are out of reach for current algorithms. These and many more are situations where causal discovery between static entities is in demand.
1 Facebook AI Research, Paris, France. 2 University of Cambridge, Cambridge, UK. 3 MPI for Intelligent Systems, Tübingen, Germany. Correspondence to: Mateo Rojas-Carulla <mrojascarulla@gmail.com>.

Our Contributions

First, we introduce the framework of proxy variables to estimate the causal relation between static entities (Section 3).
Second, [...] the cases, and it can recover the correct ordering of a set of shuffled video frames (Section 4.2).

Third, we apply our framework to discover the cause-effect relation between pairs of words in natural language (Section 5). To this end, we introduce a novel dataset of 10,000 human-elicited pairs of words with known causal relation (Section 5.2). Our methods are able to recover 75% of the cause-effect relations (such as “accident → injury” or “sentence → trial”) in this challenging task (Section 5.4).

Figure 1. Example of an Additive Noise Model (ANM). (a) Y = F(X) + N, with X ⊥ N. (b) X = G(Y) + E, with Y ⊥̸ E.
Fourth, we discuss the role of proxy variables as a tool to incorporate external knowledge, as provided by static entities, into general prediction problems (Section 6).

All our code and data are available at anonymous.

We start the exposition by introducing the basic language of observational causal discovery, as well as motivating its role in machine learning.

2. Causal Discovery in Machine Learning

The goal of observational causal discovery is to reveal the cause-effect relation between two random variables X and Y, given n samples (x1, y1), ..., (xn, yn) from P(X, Y). In particular, we say that “X causes Y” if there exists a mechanism F that transforms the values taken by the cause X into the values taken by the effect Y, up to the effects of some random noise N. Mathematically, we write Y ← F(X, N). This equation highlights an asymmetric assignment rather than a symmetric equality. If we were to intervene and change the value of the cause X, then a change in the value of the effect Y would follow. On the contrary, if we were to manipulate the value of the effect Y, a change in the cause X would not follow.

When two random variables share a causal relation, they often become statistically dependent. However, when two random variables are statistically dependent, they do not necessarily share a causal relation. This is at the origin of the famous warning “dependence does not imply causality”. This relation between dependence and causality was formalized by Reichenbach (1956) into the following principle.

Principle 1 (Principle of common cause). If two random variables X and Y are statistically dependent (X ⊥̸ Y), then one of the following causal explanations must hold:
i) X causes Y (write X → Y), or
ii) Y causes X (write X ← Y), or
iii) there exists a random variable Z that is the common cause of both X and Y (write X ← Z → Y).
In the third case, X and Y are conditionally independent given Z (write X ⊥ Y | Z).

In machine learning, these three types of statistical dependencies are exploited without distinction, as dependence is sufficient to perform optimal predictions about independently and identically distributed (iid) data (Schölkopf et al., 2012). However, we argue that taking the Principle of common cause into account would have far-reaching benefits in non-iid machine learning. For example, assume that we are interested in predicting the values of a target variable Y, given the values taken by two features (X1, X2). Then, understanding the causal structure underlying (X1, X2, Y) brings two benefits.

First, interpretability. Explanatory questions such as “Why does Y = 2 when (X1, X2) = (−1, 3)?”, and counterfactual questions such as “What value would Y have taken, had X2 = −3?”, cannot be answered using statistics alone, since their answers depend on the particular causal structure underlying the data.

Second, robustness. Predictors which estimate the values taken by a target variable Y given only its direct causes are robust with respect to distributional shifts on their inputs. For example, let X1 ∼ P(X1), Y ← F1(X1), and X2 ← F2(X1). Then, the predictor E(Y | X1) is invariant to changes in the joint distribution P(X1, X2) as long as the causal mechanism F1 does not change. However, the predictor E(Y | X1, X2) can vary wildly even if the causal mechanism F1 (the only one involved in computing Y) does not change (Peters et al., 2016; Rojas-Carulla et al., 2015).

The previous two points apply to the common “non-iid” situations where we have access to data drawn from some distribution P, but we are interested in some different but related distribution P̃. One natural way to phrase and leverage the similarities between P and P̃ is in terms of shared causal structures (Peters, 2012; Lopez-Paz, 2016).

While it is indeed an attractive endeavor, discovering the causal relation between two random variables purely from [...]
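The ANM asymmetry illustrated in Figure 1 can be reproduced with a small simulation. The sketch below is illustrative only (it is not the paper's implementation, and it uses a simple polynomial fit rather than RCC): in the causal direction, the regression residuals are independent of the input; in the anticausal direction, they are not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# ANM: Y <- F(X) + N, with cause X independent of noise N.
x = rng.uniform(-1.0, 1.0, n)
y = x ** 3 + 0.1 * rng.standard_normal(n)

def residual_dependence(cause, effect, degree=3):
    """Regress effect on cause with a polynomial fit and return the
    absolute correlation between |residuals| and |cause|: a crude score
    of how much the residuals still depend on the input."""
    coefs = np.polyfit(cause, effect, degree)
    residuals = effect - np.polyval(coefs, cause)
    return abs(np.corrcoef(np.abs(residuals), np.abs(cause))[0, 1])

forward = residual_dependence(x, y)   # causal direction X -> Y
backward = residual_dependence(y, x)  # anticausal direction Y -> X

# Only the anticausal fit leaves residuals that depend on the input.
assert forward + 0.05 < backward
```

The same asymmetry underlies the more sophisticated observational causal discovery algorithms cited above; the polynomial fit and the dependence score here are stand-ins for a flexible regression and a proper independence test such as HSIC.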
[Figure: x = original image; y = stylized image.]

[...] and y, we can apply a regular causal discovery algorithm to (A, B) to estimate the causal relation between x and y.

[...] original image, as described by the ANM: [...]
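The construction of the pair (A, B) from a pair of static entities can be summarized in a few lines. The sketch below is generic and hypothetical: `proxy_samples` and the projection `pi` stand in for whatever proxy variable and projection suit the static entities at hand; the resulting scatterplot (A, B) is then handed to any observational causal discovery algorithm.

```python
import numpy as np

def proxy_scatterplot(x, y, proxy_samples, pi):
    """Turn two static entities (x, y) into n samples of a pair of random
    entities (A, B), via draws w from a proxy variable W and a projection
    pi. Each draw yields one 2-dimensional point (pi(w, x), pi(w, y))."""
    a = np.array([pi(w, x) for w in proxy_samples])
    b = np.array([pi(w, y) for w in proxy_samples])
    return a, b

# Toy example: the static entities are strings, the proxy is a small set
# of words, and the projection counts character overlap (purely
# illustrative; it is not one of the projections used in the paper).
def pi(w, entity):
    return sum(entity.count(c) for c in set(w))

words = ["virus", "death", "cause", "storm"]
a, b = proxy_scatterplot("an original document", "its translated copy", words, pi)
# (a, b) now plays the role of n samples from P(A, B), ready for an
# observational causal discovery algorithm.
```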
[...] mutual information, or PMI (Church & Hanks, 1990).

7) π_prec-pmi(w, x), similar to the one above, but computed only over sentences where w precedes x.

Applying the causal projections to our sample from the proxy W, we construct the n-vector

Π^x_proj = (π_proj(w_1, x), ..., π_proj(w_n, x)),    (1)

and similarly for Π^y_proj, where

proj ∈ {w2vii, w2vio, w2voi, counts, prec-counts, pmi, prec-pmi}.    (2)

In particular, we use the skip-gram model implementation of fastText (Bojanowski et al., 2016) to compute 300-dimensional word2vec representations.

5.2. A Real-World Dataset of Cause-Effect Words

We introduce a human-elicited, human-filtered dataset of 10,000 pairs of words with a known causal relation. This dataset was constructed in two steps:

1) We asked workers from Amazon Mechanical Turk to create pairs of words linked by a causal relation. We provided the turks with examples of words with a clear causal link (such as “sun causes radiation”) and examples of related words not sharing a causal relation (such as “knife” and “fork”). For details, see Appendix A.

2) Each of the pairs collected from the previous step was randomly shuffled and submitted to 20 different turks, none of whom had created any of the word pairs. Each turk was required to classify the pair of words (x, y) as “x causes y”, “y causes x”, or “x and y do not share a causal relation”. For more details, see Appendix B.

This procedure resulted in a dataset of 10,000 causal word pairs (x, y), each accompanied by three numbers: the number of turks that voted “x causes y”, the number of turks that voted “y causes x”, and the number of turks that voted “x and y do not share a causal relation”.

5.3. Causal Relation Discovery in NLP

The NLP community has devoted much attention to the problem of identifying the semantic relation holding between two words, with causality as a special case. Girju et al. (2009) discuss the results of the large shared task on relation classification they organized (their benchmark included only 220 examples of cause-effect). The task required recognizing relations in context, but, as discussed by the authors, most contexts display the default relation we are after here (e.g., “The mutant virus gave him a severe flu” instantiates the default relation in which virus is the cause, flu is the effect). All participating systems used extra resources, such as ontologies and syntactic parsing, on top of corpus data. They are thus outside the scope of the purely corpus-based methods we are considering here.

Most NLP work specifically focusing on the causality relation relies on informative linking patterns co-occurring with the target pairs (such as, most obviously, the conjunction because). These patterns are extracted and processed with sophisticated methods, involving annotation, ontologies, bootstrapping, and/or manual filtering (see, e.g., Blanco et al. 2008, Hashimoto et al. 2012, Radinsky et al. 2012, and references therein). We experimented with extracting linking patterns from our corpus, but, due to the relatively small size of the latter, results were extremely sparse (note that patterns can only be extracted from sentences in which both cause and effect words occur). More recent work started looking at causal chains of events as expressed in text (see Mirza & Tonelli 2016 and references therein). Applying our generic method to this task is a direction for future work.

A semantic relation that received particular attention in NLP is that of entailment between words (dog entails animal). As causality is intuitively related to entailment, we will apply below entailment detection methods to cause/effect classification. Most lexical entailment methods rely on distributional representations of the words in the target pair. Traditionally, entailing pairs have been identified with unsupervised asymmetric similarity measures applied to distributed word representations (Geffet & Dagan, 2005; Kotlerman et al., 2010; Lenci & Benotto, 2012; Weeds et al., 2004). We will test one of these measures, namely Weeds Precision (WS). More recently, Santus et al. 2014 showed that the relative entropy of distributed vectors representing the words in a pair is an effective cue to which word is entailing the other, and we also look at entropy for our task. However, the most effective method to detect entailment is to apply a supervised classifier to the concatenation of the vectors representing the words in a pair (Baroni et al., 2012; Roller et al., 2014; Weeds et al., 2014).

5.4. Experiments

We evaluate a variety of methods to discover the causal relation between two words appearing in a large corpus of natural language. We study methods that fall within three categories: baselines, distribution-based causal discovery methods, and feature-based supervised methods. These three families of methods consider an increasing amount of information about the task at hand, and therefore exhibit an increasing performance, up to 85% classification accuracy.

All our computations are based on the full English Wikipedia, as post-processed by Matt Mahoney (see http://www.mattmahoney.net/dc/textdata.html). We study the N = 1,970 pairs of words out of 10,000 from the dataset described in Section 5.2 that achieved a consensus across at least 18 out of 20 turks. We use RCC to estimate
the causal relation between pairs of random variables.

5.4.1. Baselines

These are a variety of unsupervised, heuristic baselines. Each baseline computes two scores, denoted by S_x→y and S_x←y, predicting x → y if S_x→y > S_x←y, and x ← y if S_x→y < S_x←y. The baselines are:

• frequency: S_x→y is the number of sentences where x appears in the corpus, and S_x←y is the number of sentences where y appears in the corpus.
• precedence: considering only sentences from the corpus where both x and y appear, S_x→y is the number of sentences where x occurs before y, and S_x←y is the number of sentences where y occurs before x.
• counts (entropy): S_x→y is the entropy of Π^x_counts, and S_x←y is the entropy of Π^y_counts, as defined in (1).
• counts (WS): using the WS measure of Weeds & Weir (2003), S_x→y = WS(Π^x_counts, Π^y_counts), and S_x←y = WS(Π^y_counts, Π^x_counts).
• prec-counts (entropy): S_x→y is the entropy of Π^x_prec-counts, and S_x←y is the entropy of Π^y_prec-counts, as defined in (1).
• prec-counts (WS): analogous to the previous.

The baselines pmi (entropy), pmi (WS), prec-pmi (entropy), and prec-pmi (WS) are analogous to the last four, but use (Π^x_(prec-)pmi, Π^y_(prec-)pmi) instead of (Π^x_(prec-)counts, Π^y_(prec-)counts). Figure 5 shows the performance of these baselines in blue.

5.4.2. Distribution-Based Causal Discovery Methods

These methods implement our framework of causal discovery using proxy variables. They classify n samples from a 2-dimensional probability distribution as a whole. Recall that a vocabulary (w_j)_{j=1}^{n} drawn from the proxy is available. Given N word pairs (x_i, y_i), this family of methods constructs a dataset D = ({(a_ij, b_ij)}_{j=1}^{n}, ℓ_i)_{i=1}^{N}, where a_ij = π_proj(w_j, x_i), b_ij = π_proj(w_j, y_i), ℓ_i = +1 if x_i → y_i, and ℓ_i = −1 otherwise. In short, D is a dataset of N “scatterplots” annotated with binary labels. The i-th scatterplot contains n 2-dimensional points, which are obtained by applying the causal projection to both x_i and y_i, against the n vocabulary words drawn from the proxy.

The samples (a_ij, b_ij)_{j=1}^{n} are computed using a deterministic projection of iid draws from the proxy, meaning that {(a_ij, b_ij)}_{j=1}^{n} ∼ P^n(A_i, B_i). Therefore, we could permute the points inside each scatterplot without altering the results of these methods. In principle, we could also remove some of the points in the scatterplot without a significant drop in performance. Therefore, these methods search for causal footprints at the 2-dimensional distribution level, and we term them distribution-based causal discovery methods.

The methods in this family first split the dataset D into a training set D_tr and a test set D_te. Then, the methods train RCC on the training set D_tr, and test its classification accuracy on D_te. This process is repeated ten times, splitting D at random into a training set containing 75% of the pairs and a test set containing 25% of the pairs. Each method builds on top of a causal projection from (2) above. Figure 5 shows the test accuracy of these methods in green.

5.4.3. Feature-Based Supervised Methods

These methods use the same data generated by our causal projections, but treat them as fixed-size vectors fed to a generic classifier, rather than as random samples to be analyzed with an observational causal discovery method. They can be seen as an oracle to upper-bound the amount of causal signals (and signals correlated to causality) contained in our data. Specifically, they use 2n-dimensional vectors given by the concatenation of those in (1). Given N word pairs (x_i, y_i), they build a dataset D = ((Π^{x_i}_proj, Π^{y_i}_proj), ℓ_i)_{i=1}^{N}, where ℓ_i = +1 if x_i → y_i, ℓ_i = −1 if x_i ← y_i, and “proj” is a projection from (2). Next, we split the dataset D into a training set D_tr containing 75% of the pairs, and a disjoint test set D_te containing 25% of the pairs. To evaluate the accuracy of each method in this family, we train a random forest of 500 trees using D_tr, and report its classification accuracy over D_te. This process is repeated ten times, by splitting the dataset D at random. The results are presented as red bars in Figure 5. We also report the classification accuracy of training the random forest on the raw word2vec representations of the pair of words (top three bars).

5.4.4. Discussion of Results

Baseline methods are the lowest performing, reaching up to 59% test accuracy. We believe that the performance of the best baseline, precedence, is due to the fact that most of Wikipedia is written in the active voice, which often aligns with the temporal sequence of events, and thus correlates with causality.

The feature-based methods perform best, achieving up to 85% test classification accuracy. However, feature-based methods enjoy the flexibility of considering each of the n = 10,000 elements in the causal projection as a distinct feature. Therefore, feature-based methods do not focus on patterns to be found at a distributional level (such as causality), and are vulnerable to permutation or removal of features. We believe that feature-based methods may achieve their superior performance by overfitting to biases in our dataset, which are not necessarily related to causality.

Impressively, the best distribution-based causal discovery method achieves 75% test classification accuracy, a significant improvement over the best baseline method. Importantly, our distribution-based methods take a whole [...]
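The counts projection of equation (1), its entropy score, and the precedence baseline of Section 5.4.1 can be illustrated on a toy corpus. The four-sentence corpus and helper names below are illustrative only; the experiments above use the full English Wikipedia.

```python
import math

# Toy corpus standing in for Wikipedia sentences.
corpus = [
    "the accident caused a serious injury",
    "after the accident an injury followed",
    "an injury may follow an accident",
    "the trial ended with a sentence",
]
toks = [s.split() for s in corpus]

# counts projection: number of sentences containing both w and x.
def pi_counts(w, x):
    return sum(w in t and x in t for t in toks)

# Equation (1): project the word x against a vocabulary from the proxy.
def Pi_counts(x, vocabulary):
    return [pi_counts(w, x) for w in vocabulary]

def entropy(vec):
    """Shannon entropy of the normalized projection vector."""
    total = sum(vec)
    ps = [v / total for v in vec if v > 0]
    return -sum(p * math.log2(p) for p in ps)

# precedence baseline: count sentences where x occurs before y.
def s_precedence(x, y):
    return sum(x in t and y in t and t.index(x) < t.index(y) for t in toks)

vocab = ["the", "accident", "injury", "trial"]
Pi_x = Pi_counts("injury", vocab)            # [2, 3, 3, 0]
S_xy = s_precedence("accident", "injury")    # 2: "accident" before "injury"
S_yx = s_precedence("injury", "accident")    # 1
# The precedence baseline therefore predicts "accident -> injury".
```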
Figure 5. Results for all methods on the NLP experiment. Accuracies above 52% are statistically significant with respect to a Binomial test at a significance level α = 0.05.

7. Conclusion

We have introduced the necessary machinery to estimate the causal relation between pairs of static entities x and y: one piece of art and its forgery, one document and its translation, or the concepts underlying a pair of words appearing in a corpus of natural language. We have done so by introducing [...]
To provide us with high-quality word pairs, we ask you to follow these indications:
• All word pairs must have the form “WordA → WordB”. It is essential that the first word (WordA) is the cause, and the
second word (WordB) is the effect.
• WordA and WordB must be one word each (no spaces, and no “recessive gene → red hair”). Avoid compound words
such as “snow-blind”.
• In most situations, you may come up with a word pair that can be justified both as “WordA → WordB” and “WordB →
WordA”. In such situations, prefer the causal direction with the easiest explanation. For example, consider the word
pair “virus → death”. Most people would agree that “virus causes death”. However, “death causes virus” can be true
in some specific scenario (for example, “because of all the deaths in the region, a new family of viruses emerged.”).
Nevertheless, the explanation “virus causes death” is preferred, because it is more general and depends less on the context.
• We do not accept word pairs with an ambiguous causal relation, such as “book - paper”.
• We do not accept simple variations of word pairs. For example, if you wrote down “dog → bark”, we will not credit
you for other pairs such as “dogs → bark” or “dog → barking”.
• Use frequent words (avoid strange words such as “clithridiate”).
• Do not rely on our examples, and use your creativity. We are grateful if you come up with diverse word pairs! Please do
not add any numbers (for example, “1 - dog → bark”). For your guidance, we provide you examples of word pairs that
belong to different categories. Please bear in mind that we will reward your creativity: therefore, focus on providing
new word pairs with an evident causal direction, and do not limit yourself to the categories shown below.
1) Physical phenomenon: there exists a clear physical mechanism that explains why “WordA → WordB”.
• sun → radiation (The sun is a source of radiation. If the sun were not present, then there would be no radiation.)
• altitude → temperature
• winter → cold
• oil → energy
2) Events and consequences: WordA is an action or event, and WordB is a consequence of that action or event.
• crime → punishment
• accident → death
• smoking → cancer
• suicide → death
• call → ring
3) Creator and producer: WordA is a creator or producer, WordB is the creation of the producer.
• fear → scream
• age → salary
Some of the pairs that will be presented are non-causal. This may happen if: [...]
To provide us with high-quality categorization of word pairs, we ask you to follow these indications:
• Prefer the causal direction with the simplest explanation. Most people would agree that “virus causes death”. However,
“death causes virus” can be true in some specific scenario (for example, “because of all the deaths in the region, a
new virus emerged.”). Nevertheless, the explanation “virus causes death” is preferred, because it is true in more general
contexts.
• If no direction is clearer, mark the pair as non-causal. Here, conservative is good!
• Think twice before deciding. We will present the pairs in random order!
Please classify all the presented pairs. If one or more has not been answered, the whole batch will be invalid. PLEASE
DOUBLE CHECK THAT YOU HAVE ANSWERED ALL 40 WORD PAIRS.
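The batches described above yield 20 votes per word pair, which are then aggregated; Section 5.4 retains only pairs with a consensus of at least 18 out of 20 turks. A minimal sketch of that aggregation, on made-up vote counts (not the released dataset):

```python
# Each collected pair comes with 20 votes among three options:
# "x->y", "y->x", "none". Keep pairs where one option reaches a
# consensus of at least 18 out of 20 turks (as in Section 5.4).
votes = {
    ("accident", "injury"): {"x->y": 19, "y->x": 0, "none": 1},
    ("book", "paper"): {"x->y": 8, "y->x": 5, "none": 7},
}

def consensus_label(counts, threshold=18):
    """Return the majority label if it reaches the threshold, else None."""
    label, best = max(counts.items(), key=lambda kv: kv[1])
    return label if best >= threshold else None

kept = {
    pair: consensus_label(counts)
    for pair, counts in votes.items()
    if consensus_label(counts) is not None
}
print(kept)  # {('accident', 'injury'): 'x->y'}
```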
Examples of causal word pairs: