The
Causal Toolbox: A collection of programs for testing or exploring causal relationshipsBill
Shipley Département
de biologie Université
de Sherbrooke Sherbrooke
(Québec) J1K 2R1 CANADA www.callisto.si.usherb.ca:8080/bshipley This
manual describes how to use a series of DOS programs that have been written to
accompany the book Cause and Correlation in Biology.
A User’s Guide to Path Analysis, Structural Equations and Causal
Inference[1].
If you find any errors in these programs then please contact me.
No other guarantees are given and I accept no responsibility for analyses
done using these programs. The
theory and interpretation of the methods that are implemented in these programs
are described in the book; the purpose of this manual is to describe how to use
the programs. This set of programs is distributed as a selfextracting zip file. Just click here. ______________________________________________________________________ These
programs are free and should be referenced to the book. ______________________________________________________________________ How to run a programAll
of these programs are compiled to run in a DOS environment but can be used from
within a WINDOWS environment. Every
program in this Toolbox requires a DOS extender program called DOS4GW.EXE.
DOS4GW.EXE must be in the same directory as the program that you are
using. The easiest way is to create
a directory (you can call it Toolbox) and copy all of the programs, including
DOS4GW.EXE into this directory. There
are different ways to start a program. You
can simply doubleclick on the program from within the WINDOWS EXPLORER or you
can open a DOS window (START, then PROGRAMS, then choose the MSDOS prompt) and
then type the name of the program at the DOS prompt (usually the > symbol). All
programs are limited to a maximum of 30 variables.
If raw data files are to be used, then some programs also have a limit of
500 lines; in such cases you can enter a covariance matrix based on an unlimited
number of cases.
DGRAPH Reference:
Chapter 3 (dsep tests) Input
data set The
input data set must consist only of numbers.
No variable names, no formatting information and no missing data symbols
(and therefore no missing data!) are allowed.
It must be a simple ASCII file but this type of file can be obtained from
almost any word processor (text file) or a program like Microsoft EXCEL.
The file does not have to be formatted but there must be at least one
blank space separating numbers. We
will use the data set called DGRAPH.DAT. This
data set consists of 100 lines and 4 variables.
The first four lines of this file are: 0.4669 1.1607 0.8334 0.8416 0.6501 0.0008 1.1535 0.3417 0.1789 0.8183 0.188 0.1491 0.3755 2.2529 1.6236 0.7079 Start
DGRAPH.EXE. You will be asked if you
want information on using the program. Type
yes if you do, otherwise no. Type
yes. You will see that you can
either enter a raw data file, such as DGRAPH.EXE or a complete covariance
matrix. In the present case this
would be a 4X4 matrix (at least one blank space separating numbers) since there
are 4 variables. Next,
you are asked to give the name of the output file.
This is the file to which the full results will be written (it will also
be a simple text file and can be imported into any word processor).
The basic results are also echoed to the computer screen but the output
file will hold a permanent copy of the results and contains some additional
information. You must obey the file
naming rules of DOS (maximum of 8 characters beginning with a letter and
(optionally) a dot followed by a three letter extension.
If you want the file to be saved to a directory other than your current
directory then you must also give the directory name as well.
For instance, the following are valid output file names: DGRAPH.OUT Output C:\newdir\myname.r1. Type
dgraph.out; this will be our output file (be careful not to call it dgraph.dat
or you will overwrite the data file!). You
can then type a title (80 characters maximum) or else simply hit ENTER.
Type THIS IS AN EXAMPLE OF DGRAPH. You
must then tell the program if you will use a raw data file (type 1) or a
covariance matrix (type 2); if the latter then you will also be asked to give
the number of cases associated with the covariance matrix.
In this case, type 1 since we will use DGRAPH.DAT. You
must then tell the program how many columns (variables) are in the file.
You don’t have to actually use all of them if you only want to use a
subset. In this case, type 4. Now
you have to give the file name of your input data file including directory
information if your data file is not in your default directory.
Assuming that DGRAPH.DAT is in your default directory, type DGRAPH.DAT. Now
you have to tell the program what your path diagram (directed graph) looks like.
We will use the following path model: X1àX2àX3àX4.
The symbol “>” means “causes”.
Entering “1” <ENTER>, “>” <ENTER>, “2”
<ENTER> means that the variable in column 1 causes the variable in column
2; if there are n variables in the file (maximum of 30) then you can enter any
number between 1 and n. Here is how
you will enter our model: 1
<ENTER> >
<ENTER> 2 <ENTER> 2 <ENTER> > <ENTER> 3 <ENTER> 3 <ENTER> >
<ENTER> 4
<ENTER> 1
<ENTER> The
–1 tells the program that you have finished entering the path model. You
can now exclude some variables if you don’t want to use them in the analysis;
you would type in the number of variables to exclude and then the column numbers
of these variables. In our case we
will use all four variable in DGRAPH.EXE, so type 0. The
program prints out your path model. Check
to see that you have properly entered it, and then type <ENTER>. The
program prints the basic results to the screen.
It then asks if you want to continue. If so, type 1 and you will then be
able to either enter a new model or input a completely new data set.
We will test another (incorrect) model using the same data so type 1 and
then, at the next prompt, 2. After
entering a new title, enter this model: X1àX4àX2àX3
using the instructions already explained. You
should find a X^{2} value of 47.59 (6 df) and a probability less than
0.000005. To
quit, type 0 at the prompt. The
results are now stored in our output file. Here
is an example output file (my comments to the output are written in bold
italic): THIS IS AN EXAMPLE OF DGRAPH
_______________ DGRAPH
PROGRAM BY BILL
SHIPLEY bshipley@courrier.usherb.ca _______________ Directed
graph: 1> 2 2> 3 3> 4 Covariance
matrix based on 100
observations:
1.119 0.527
0.165 0.194
0.527 1.012
0.451 0.270
0.165 0.451
1.047 0.411
0.194 0.270
0.411 0.831 X
Y  Conditioning set(Q) The
following is the basis set Independence
statement 1
Q is the conditioning set X=
1 Y= 3 Q= 2 r(X,YQ)=0.0822
probability= 0.4192 Independence
statement 2 X= 1 Y= 4 Q=
3 r(X,YQ)=
0.1509 probability= 0.1362 Independence
statement 3 X=
2 Y= 4 Q= 1 3 r(X,YQ)=
0.0606 probability= 0.5544
Overall
test of the directed graph assuming the
data were generated by the
specified causal process: Chisquare
value = 6.91 Degrees
of freedom = 6 Probability
= 0.32968 Here is an
incorrect model
_______________ DGRAPH
PROGRAM BY BILL
SHIPLEY bshipley@courrier.usherb.ca _______________ Directed
graph: 1> 4 2> 3 4> 2 Covariance
matrix based on 100
observations:
1.119 0.527
0.165 0.194
0.527 1.012
0.451 0.270
0.165 0.451
1.047 0.411
0.194 0.270
0.411 0.831 X
Y  Conditioning set(Q) Independence
statement 1 X= 1 Y= 2 Q=
4 r(X,YQ)=
0.4659 probability= 0.0000 Independence
statement 2 X=
1 Y= 3 Q= 2 r(X,YQ)=0.0822
probability= 0.4192 Independence
statement 3 X= 3 Y= 4 Q=
1 2 r(X,YQ)=
0.3708 probability= 0.0001 Overall test
of the directed graph assuming the
data were generated by the
specified causal process: Chisquare
value = 47.59 Degrees
of freedom = 6 Probability
= 0.00000 PCOR Reference:
Chapter 3 This
program calculates and tests for correlations of different orders.
If the data are normally distributed and linearly related then the tests
are Pearson (partial) correlations. If
the data have been ranked (see RANKDATA) then the tests are Spearman (partial)
correlations. Input is done in the
same way as in DGRAPH and you can enter either a raw data file or a covariance
matrix. We
will use the raw data file called PCOR.DAT with 100 observations of 3 variables
generated from the following equations: X=N(0,1) Y=0.5*X+N(0,0.75) Z=0.5*X+N(0,0.75) Thus,
X and Z have a population correlation coefficient of 0.25 but are independent
upon conditioning on Z (i.e. the population partial correlation coefficient is
zero). After
giving you the resulting covariance matrix, the program asks you to give the
first variable in the correlation pair. We
will first test for an association between X and Z so we will enter 1 and then
3. When we are asked for the number
of conditioning variables we enter 0 since we want a zeroorder correlation.
The program echoes the partial correlation coefficient (0.3182) and its
null probability (0.00117). Next we
tell the program that we wish to continue and that we want to use the same
covariance matrix (i.e. we don’t want to read in a new data file or a new
covariance matrix). Now, we will
test for a zero partial correlation between X and Z (i.e. variables 1 and 3)
conditioned on one conditioning variable Y (i.e. variable 2).
The partial correlation – r(1,3)/2 – is 0.1457 with a null
probability of 0.15044. Here
is the output file:
PCOR program Covariance matrix, based on 100 observation 0.871
0.312 0.238
0.312 0.640
0.340
0.238 0.340
0.645
r( 1, 3)/ r(X,Y)Q=
0.3182 probability=0.00117
r( 1, 3)/
2 r(X,Y)Q=
0.1457 probability=0.15044 PERMR Reference:
Chapter 3, pp. 7982 This
program will test for independence between two variables using permutation
techniques as opposed to a linear relationship between them.
Either absolute or partial independence can be tested, depending on the
nature of the input variables. If
the original variables are normally distributed then this is equivalent to a
test of a Pearson correlation. If
you separately regress each of X and Y on Z and save the residuals of X and Y to
a file, then you can use these residuals in PERMR to test for a firstorder
partial correlation etc. No
distributional assumptions are needed but the measure of association used
assumes a linear relationship. To
illustrate we will use the data set PERMR.DAT which consists of 100 observations
of 3 variables generated according to: X=X2(2) Y=X2(1) Z=0.5X+N(0,0.75). Here,
X2(2) means a random variable generated from a chisquare distribution with 2
degrees of freedom. Thus, X and Y are independently distributed but X is
correlated with Z. After
giving the name for the input and output files as usual, you then give the
column numbers for the two variables whose independence you want to test.
We will first test for independence between X and Y (i.e. columns 1 and 2
in PERMR.DAT). The measure of
association is simply the sum of the two variables (X*Y) which, in these data,
is 190.24. We now specify how many
permutation iterations we wish to use; we will use 5000 runs.
The more runs you use the more precise the probability estimate but the
longer the analysis will take. On my
portable computer this takes less than 2 seconds.
You then must tell the computer whether you want to conduct a 2tailed
test or a 1tailed test and, if the latter, whether you want the left tail test
(i.e. probability of observing less than that in the data) or a right tail test.
We will do a 2tailed test. The
resulting null probability of independence is 0.437 with a standard error of
0.014; thus the true probability has a 95% chance of being between 0.4644 and
0.40956. To decrease the confidence
interval simply increase the number of iterations. Now
we will test for an association between X and Z (i.e. columns 1 and 3 of
PERMR.DAT) again using 5000 iterations and a 2tailed test.
This time the measure of association is 578.95 and the permutation
probability is less than 0.000005. Here
is the output file: PERMR , by Bill Shipley bshipley@courrier.usherb.ca PERMR.DAT data file with 3 variables Based on a sample size of 100 Probability is based on 5000 permutation runs Twotailed test The probability of independence is 0.43700 95% confidence interval of the estimate is (+/) 0.01375 PERMR  testing for association between X and Z Based on a sample size of 100 Probability is based on 5000 permutation runs Twotailed test The probability of independence is 0.00000 95% confidence interval of the estimate is (+/) 0.00000 RANKDATAReference: This
program inputs your raw data file and outputs either
the rank transformed values or the covariance matrix of the rank
transformed values. We will use the
PERMR.DAT data set whose first two variables follow a very rightskewed
distribution. You can have the
instructions printed to the screen. We
will ranktransform variables 1 and 2 and save these ranked data to a file
called RANKDATA.OUT. Note that
variables not chosen to be ranked are not saved to the output file.
We therefore tell the program that we want to rank 2 variables and enter
the column numbers (1 and 2). These
ranked data can then be used in other programs – for example DGRAPH, EPA2 or
PCOR – which use ranked data to calculate Spearman rank (partial)
correlations. MCX2 Reference:
Chapter 6 pp. 177178 The
probability estimate associated with the maximum likelihood chisquare statistic
that is given in commercial SEM programs is asymptotic and only becomes accurate
at large sizes; the actual size depends on a number of properties of the data.
The MXC2 program is to be used alongside your favourite SEM program.
One you have tested your model and obtained the chisquare value and the
degrees of freedom but have a small sample size, MXC2 allows you to estimate the
probability at this small sample size, assuming multivariate normality using
Monte Carlo techniques. First,
you give the number of degrees of freedom associated (for example 6).
Now you have to choose the number of iterations.
The program runs quite quickly so you can give a fairly high number (say
5000). You now specify the
chisquare value obtained for your model (say 14.32) and your sample size (say
30). The probability and its 95%
confidence interval is output (in this case 0.0344 +/ 0.005052). Note
that the algorithm described on page 178 of the first printing contains errors;
the corrections are posted on the ERRATA page of my web site.
Future printings will have the corrections included. MULTICOV Reference:
Chapter 7 This
program calculates the covariances,means, and scaling factor, appropriate for
multilevel SEM, given a data set that has hierarchical structure. There must be
no missing values in the data file. There
can be no more than 30 variables and no more than 5 levels in the hierarchy.
The form of the data matrix is as follows.
If there are x hierarchical levels, then the first x columns give the
group structure. The following
columns give the variables. For
example, if there where 3 levels, 2 variables and 8 observations, the data set
might look like the following; note that the highest level is column 1, level 2
is column 2 etc. 1
1 1 0.1 0.2 1
1 2 0.4 0.3 1
2 3 0.5 0.1 1
2 4 0.8 0.4 2
3 5 0.5 0.3 2
3 6 0.2 0.2 2
4 7 0.5 0.4 2
4 8 0.4 0.3 In
this data file there are 8 observations. The
first level has two groups (1 and 2), the second level has 4 groups (1,2,3,4)
and the last level has 8 "groups" (individual observations). EACH
GROUP ID NUMBER MUST BE DIFFERENT; ONE CANNOT HAVE TWO *DIFFERENT* GROUPS WITH
THE SAME ID NUMBER. The
following data set would not work properly: 1
1 1 0.1 0.2 1
1 2 0.4 0.3 1
2 1 0.5 0.1 1
2 2 0.8 0.4 2
1 1 0.5 0.3 2
1 2 0.2 0.2 2
2 1 0.5 0.4 2
2 2 0.4 0.3 because
you would have two different level2 groups called group "1". Note
that there must be at least 1 blank between each number in the data file (i.e.
free format). This program was
written to permit large data files without huge memory requirements.
This means that large files may take a few minutes to process.
You can save both the nested covariance matrices and
the group means to external data files. You
simply give the file names for your input data file and your output data file.
To illustrate, we will enter the above (correct) data file (called
MULTICOV.DAT), telling the program that there are 3 levels and 2 variables.
Here is the output file: MULTILEVEL COVARIANCE MATRICES GENERATED FROM THE FILE multicov.dat
COVARIANCE MATRIX FOR LEVEL 3 SCALING FACTOR C FROM THIS LEVEL TO THE NEXT LOWER LEVEL IS 4.000 NUMBER OF GROUPS AT THIS LEVEL: 2 GROUP SIZES ARE EQUAL, C IS EXACT DF ARE: 1. 0.00500 0.00500 0.00500 0.00500 COVARIANCE MATRIX FOR LEVEL 2 SCALING FACTOR C FROM THIS LEVEL TO THE NEXT LOWER LEVEL IS 2.000 NUMBER OF GROUPS AT THIS LEVEL: 4 GROUP SIZES ARE EQUAL, C IS EXACT DF ARE: 2. 0.08500 0.00500 0.00500 0.00500 COVARIANCE MATRIX FOR LEVEL 1 SCALING FACTOR C FROM THIS LEVEL TO THE NEXT LOWER LEVEL IS 1.000 NUMBER OF GROUPS AT THIS LEVEL: 8 GROUP SIZES ARE EQUAL, C IS EXACT DF ARE: 4. 0.03500 0.02000 0.02000 0.01500 EPA2 Reference:
Chapter 8 This
program does not test for a specified causal model; rather, it takes a series of
observations and searches for path models that are consistent with the data at a
specified probability level. It
outputs a partially directed graph[2]
from which you can obtain all those equivalent models that fit the data. The
input data file has exactly the same properties as described in DGRAPH and input
to the program is also exactly the same. In
particular, you can either input a raw data file or a covariance matrix and the
results are written to an output file whose name you choose. We
will use the input file called DGRAPH.DAT and write our results to the file
called EPA2.OUT. There are 4
variables in DGRAPH.DAT and we will first use all of these (the program asks you
how many of the variables you want to use).
We then give names for each variable that is to be used.
You are next asked to give the highest rejection level to be used by the
algorithm. The algorithm starts at a
small value and increases in increments of 0.05, each time searching for models
that fit the data at your chosen significance level.
I therefore usually give a high value (eg. 0.9).
This is the value that I will input. You are next asked to give the
significance level to be used in testing models.
This is the significance level that you would normally use to test
models. For instance, if you choose
a significance value of 0.05 then EPA2 will output all those models that are
consistent with the data at this significance level.
This is the level that we will use. You
are next asked if you want to exclude any two variables from sharing an edge.
If, for instance, you have reason to believe that variables 1 and 3
cannot share an edge (i.e. 1 isn’t a direct cause of 3, 3 isn’t a direct
cause of 1 and neither 1 nor 3 have a direct common unmeasured cause) then you
can instruct EPA2 to take this information into account.
We will exclude variables 1 and 3 from sharing an edge.
Note that after we have entered “1” and then “3” we must enter
“1” to tell EPA2 that we have finished excluding edges. Next,
you must choose which method to use in the search.
You would normally use method 1 but, if you have few variables and few
observations, you should choose method 2; this method will take a very long time
to complete if you have many variables. You
can also choose to use both methods. Let’s
do this. Now
you can force any two variables to share an edge if you have reason to do this
(i.e. if the first is a direct cause of the second, if the second is a direct
cause of the first or if the two share a direct common unmeasured cause).
Let’s force variables 1 and 2 to share an edge; remember to input
“1” when finished. The
program now derives the undirected dependency graph starting at the 0.01 level
and tests to see if there are any graphs that fit the data at your chosen
significance level. When EPA2 finds
such models it prints them to the screen and gives you the choice of stopping or
continuing at a higher rejection level. Here
is the output file EPA2.OUT: Exploratory Path Analysis 2, by Bill Shipley bshipley@courrier.usherb.ca input data is
DGRAPH.DAT
Analysis
based on 100
observations Column number
= variable name 1
= v1 2
= v2 3
= v3 4
= v4 Model based
on method 2 v1
oo v2 v2
oo v3 v3
oo v4 Model based
on method 2 v1
o> v2 v3
o> v2 v3
oo v4 Model based
on method 2 v1
o> v2 v2
<> v3
v4
o> v3 The
undirected dependency graph was obtained at a rejection
level of 0.010 These models
fit at a significance level of 0.050 __________________________________ Model based
on method 2 v1
oo v2 v2
oo v3 v3
oo v4 Model based
on method 2 v1
o> v2 v3
o> v2 v3
oo v4 Model based
on method 2 v1
o> v2 v2
<> v3
v4
o> v3 The
undirected dependency graph was obtained at a rejection
level of 0.050 These models
fit at a significance level of 0.050 __________________________________ The
following results were obtained when using method 1: Exploratory Path Analysis 2, by Bill Shipley bshipley@courrier.usherb.ca dgraph.dat
file using method 1
Analysis
based on 100
observations Column number
= variable name 1
= v1 2
= v2 3
= v3 4
= v4 Model based
on method 1 v1
o> v2 v2
<> v3
v4
o> v3 The
undirected dependency graph was obtained at a rejection
level of 0.010 These models
fit at a significance level of 0.050 __________________________________ Notice
that method 2 found 3 sets of equivalent models that would not have been
rejected had we subjected each to DGRAPH while method 1 found only 1 (which was
the second of the three sets found by method 2).
In fact the true model that generated these data was v1àv2àv3àv4
which is one of the equivalent models from the first of the three patterns found
by method 2. I
should point out a few points about the algorithms used in EPA2.
Method 1 is essentially the same as the SGS algorithm except that the SGS
algorithm doesn’t actually compare the data to the partiallydirected graph
using a statistical test. To do this
EPA2 generates one of the equivalent completely directed graphs and test this
model using dsep tests. Method 2
instead begins with the undirected dependency graph (UDG), identifies all sets
of 3 variables in the UDG that form an unshielded pattern, allows each to be
either an unshielded collider or an unshielded noncollider, and generates all
possible nonequivalent models from these, testing each against the data using
dsep tests. This can take a very
long time when there are more that a few variables. To
final points: First, this is an exploratory method and any model found will have
to be independently tested. Second,
it is possible that there exists no path model that is consistent with the data
at any rejection level. TETRAD Reference:
Chapter 8 This
program is also exploratory in nature. Furthermore,
the statistical test used is asymptotic in nature, meaning that the calculated
probability levels are only exact when the sample size at very large sample
sizes. Little is known about the
actual asymptotic properties of the test except that larger sample sizes are
needed when the correlations between the observed variables involved in the
tetrad equation are weak. See
Table 8.5 (p. 286) for some more details. If
the sample size is not large enough then the computed probability will tend to
be larger than the true probability and this means that some vanishing tetrad
equations that are accepted (i.e. whose null probability is larger than your
significance level) may in fact not exist. Input
to TETRAD is exactly the same as in DGRAPH.
We will use the data set called TETRAD.DAT involving four measured
variables (X1,X2,X3,X4). This data set is generated from the following
structural equations (figure 8.20, page 272): X1
= 0.5F1 + N(0,0.75) X2
= 0.5F1 + N(0,0.75) X3
= 0.5F2 + N(0,0.75) X4
= 0.5F2 + N(0,0.75) F2
= 0.5F1 + N(0,0.75). Thus,
there are two latent variables (F1 and F2).
After reading in the input file and giving the name of the output file,
we are asked to give the column numbers of the four variables involved in the
analysis. Since there are only four
columns in this data set we input 1, 2, 3 and 4.
We then specify our significance level.
Remember that all tetrad equations whose null probability is larger than
this value will be retained. We will
enter a value of 0.05. The results
are echoed to the screen as well as being printed to the output file. We can
then quit, specify another set of 4 variables or read in a new data set. Here
is the output file: TETRAD, by Bill Shipley bshipley@courrier.usherb.ca model page
272
Analysis
based on 500
observations Testing for
zero tetrad equations at a significance of.0500 The following tetrad equation is zero: 1* 3* 2* 4 The null
probability is 0.53637 All treks
going into the first and third variables or into
the second and fourth variables of this tetrad equation pass
through the same choke point. In
fact, given the causal structure of these data, there is only one vanishing
tetrad equation of the three implied by the four variables (see page 272).
We can state that all treks passing into either variables 1 and 2 (the
first and third variables in the equation) and/or variables 3 and 4 (the second
and fourth variables in the equation) pass through a single variable (the choke
point). In this example F1 is the
first choke point (through which all treks between the four variables that are
into variables 1 and 2 pass) and F2 (through which all treks between the four
variables that are into variables 3 and 4 pass) is a second choke point. IDEN Reference:
Chapter 6 pages 169171. This
program tests the identification status of a measurement model using the FC1
rule of Davis. No empirical data are
input to this program. You first
specify the number of observed indicator variables in your measurement model and
the number of latent variables and then answer the questions that are posed.
Consider
the model shown in Figure 6.3 (page 170). There
are 6 indicator variables and 2 latents. There
are no indicator variables that are caused by more than one latent.
You now follow the instructions on the screen and answer the questions
posed and the program tells you if the measurement model is structurally
identified according to the FC1 rule. DSEP Reference:
Chapter 2 This
program is designed to allow you to test your ability to read dseparation
statements from a directed acyclic graph. No
empirical data are read into the program; you simply input the DAG and then pose
questions concerning dseparation. As
an example, we will input the following DAG: v1àv2àv3ßV4.
There are four vertices (v1, v2, v3 and v4) and three direct effects.
We enter the three pairs of vertices involved in the direct effects as
cause <space> effect : 1
2 2
3 4
3 You
now enter dseparation queries of the form:
is X dseparated from Y given conditioning set Z?
For instance, to find out if v1 is dseparated from v3 given the
conditioning set {v2, v4} we would give the number of the first vertex (1), the
number of the second vertex (3), the number of vertices in the conditioning set
(2), and then these two numbers (1 and 4). The
program gives you the answer (yes) and you can then pose further dseparation
queries. [1] Shipley, B. 2000. Cause and correlation in biology. A user’s guide to path analysis, structural equations and causal inference. Cambridge University Press, Cambridge. ISBN: 0521791537. [2] This is called a partially directed inducing path graph
