We have now explored the notion of a predictive model, but the notion of an explanatory model is more complicated philosophically.
Sometimes explanation means causation by physical principles; at other times, explanation is just a description of the (conditional/statistical) relationships between the variables.
Causal conclusions require stronger assumptions than those used for the predictive models that we have already discussed.
Sometimes, we only wish to understand correlations, i.e., variables that (co)-vary together or asymmetrically…
library("faraway")
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, gala)
sumary(lmod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.068221 19.154198 0.3690 0.7153508
Area -0.023938 0.022422 -1.0676 0.2963180
Elevation 0.319465 0.053663 5.9532 3.823e-06
Nearest 0.009144 1.054136 0.0087 0.9931506
Scruz -0.240524 0.215402 -1.1166 0.2752082
Adjacent -0.074805 0.017700 -4.2262 0.0002971
n = 30, p = 6, Residual SE = 60.97519, R-Squared = 0.77
# Simple regression with Elevation as the only predictor
sumary(lm(Species ~ Elevation, gala))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.335113 19.205288 0.5902 0.5598
Elevation 0.200792 0.034646 5.7955 3.177e-06
n = 30, p = 2, Residual SE = 78.66154, R-Squared = 0.55
Under the simple regression model, the fitted coefficient of about 0.20 corresponds to an explained response of roughly 20 additional species per 100 m of additional elevation.
In general, the change in the response variable in terms of an explanatory variable needs to be qualified with respect to all of the variables in the model.
For example, in the simple regression model with one variable explaining the response, a change in elevation is also associated with changes in the other variables, which are not included in this simple model but which in general may co-vary with elevation.
The gala data, however, are observational: we cannot vary the elevation of an island while holding the other variables fixed. The regression coefficients thus give some explanatory power, but it is conditional and lacks the notion of causality.
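To see this concretely, we can compare the implied change in species per 100 m of elevation under the two fitted models; a minimal sketch (lmods below is simply a name we give to the simple model):
# Implied change in species per 100 m of elevation under each model
lmods <- lm(Species ~ Elevation, gala)
100 * coef(lmod)["Elevation"]   # about 32 species, holding the other predictors fixed
100 * coef(lmods)["Elevation"]  # about 20 species, letting omitted variables co-vary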
One common notion of causality is the following:
In this understanding of causality, we prototypically consider an experiment in which a treatment \( T \in \{0, 1\} \) (control or treatment) may be applied to each test subject \( i \), yielding a potential outcome \( \mathbf{Y}^T_i \) under each value of the treatment.
Usually, however, there will be no identical test subjects, and we can only observe one of \( \mathbf{Y}^T_i \) for \( T \in \{0, 1\} \).
The outcome that we cannot see is called the counterfactual.
In designed experiments, we may actually be able to control \( T \), but for example with the Galapagos data, we cannot feasibly test a change in the elevation.
Because many applications in data science and statistics will not (and often cannot) involve the best experimental design, we will focus our discussion on these cases.
# The newhamp data (also from the faraway package) record results of the 2008
# New Hampshire Democratic presidential primary by ward; total the Obama and
# Clinton votes in digitally counted wards
colSums(newhamp[newhamp$votesys =='D',2:3])
Obama Clinton
86353 96890
# Total the Obama and Clinton votes in hand counted wards
colSums(newhamp[newhamp$votesys =='H',2:3])
Obama Clinton
16926 14471
head(newhamp)
votesys Obama Clinton dem povrate pci Dean Kerry white
Alton D 371 362 979 0.0653 25940 0.27820 0.32030 0.98312
Barnstead D 345 333 913 0.0380 19773 0.24398 0.36747 0.97349
Belmont D 375 570 1305 0.0428 19986 0.20096 0.41627 0.96739
CenterHarbor H 92 89 268 0.0669 25627 0.28495 0.33333 0.97892
Gilford D 668 595 1633 0.0332 32667 0.24937 0.37781 0.97986
Gilmanton D 284 273 761 0.0586 23163 0.30503 0.39341 0.98301
absentee population pObama
Alton 0.059857 4693 0.3789581
Barnstead 0.050449 4266 0.3778751
Belmont 0.043649 7006 0.2873563
CenterHarbor 0.107356 1033 0.3432836
Gilford 0.074706 7033 0.4090631
Gilmanton 0.053191 3222 0.3731932
The data include the number of votes for Obama and Clinton, as well as total votes including other candidates.
We have a variable votesys, which is encoded as H or D for hand or digital vote counting respectively.
There are also demographic variables, as well as proportion of votes cast for candidates in the 2004 presidential primary.
We re-encode the votesys predictor as the treatment variable trt:
newhamp$trt <- ifelse(newhamp$votesys == 'H', 1, 0)  # 1 = hand counted, 0 = digital
lmodu <- lm(pObama ~ trt, newhamp)
sumary(lmodu)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3525171 0.0051728 68.1480 < 2.2e-16
trt 0.0424871 0.0085091 4.9932 1.059e-06
n = 276, p = 2, Residual SE = 0.06823, R-Squared = 0.08
The model takes the form:
\[ \begin{align} Y_i = \beta_0 + \beta_1T_i + \epsilon_i \end{align} \]
where \( T_i \) is the treatment indicator (1 for hand counting, 0 for digital counting) and \( Y_i \) is the proportion of votes for Obama in ward \( i \).
We note that with the extremely small p-value for \( \hat{\beta}_1 \), the effect of hand counting is statistically significant.
In this case, we may conclude that Obama did receive significantly more votes in wards using hand counting, but this doesn't imply that the voting method itself is the explanation.
We will suppose that this is the correct model, up to an additional variable we have left out of our analysis:
\[ \begin{align} Y_i = \beta_0 + \beta^\ast_1T_i + \beta^\ast_2 Z_i + \epsilon_i \end{align} \]
We call the above \( Z_i \) the confounding variable.
Suppose, moreover, that the confounder depends linearly on the treatment via the ansatz \( Z_i \triangleq \gamma_0 + \gamma_1 T_i + \epsilon^{\ast}_i \). By substitution, we recover a new model
\[ \begin{align} Y_i &= \beta_0 + \beta^\ast_1 T_i + \beta^\ast_2\left(\gamma_0 + \gamma_1 T_i + \epsilon^{\ast}_i\right) + \epsilon_i \\ &= \left(\beta_0 + \beta_2^\ast \gamma_0 \right) + \left(\beta^\ast_1 + \beta_2^\ast \gamma_1\right)T_i + \epsilon_i + \beta_2^\ast \epsilon^\ast_i \end{align} \]
There are two scenarios where the effect of digital versus hand counted voting agrees between the original model and the model with the confounding variable: either \( \beta^\ast_2 = 0 \), so the confounder has no effect on the response, or \( \gamma_1 = 0 \), so the confounder is unrelated to the treatment.
In any other case, \( \beta^\ast_2 \gamma_1 \neq 0 \) and our original model excluding \( Z \) will be biased in its estimation of the effect of digital versus hand counting.
In a designed experiment, \( \gamma_1=0 \) by randomization of the populations to which the control or the treatment is applied, but in this case we cannot assume this.
We wish thus to determine if there is a possible effect of a confounding variable in the model.
In particular, we will consider the proportion of votes given to Howard Dean, the Democratic presidential candidate in 2004 as an explanatory variable on the response.
We fit the revised model including the confounding variable in the R language,
\[ \begin{align} Y_i = \beta_0 + \beta^\ast_1T_i + \beta^\ast_2 Z_i + \epsilon_i \end{align} \]
lmodz <- lm(pObama ~ trt + Dean , newhamp)
sumary(lmodz)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2211192 0.0112502 19.6547 <2e-16
trt -0.0047538 0.0077608 -0.6125 0.5407
Dean 0.5228967 0.0416500 12.5545 <2e-16
n = 276, p = 3, Residual SE = 0.05443, R-Squared = 0.42
Notice that with the Dean vote share included, the estimated treatment effect is now close to zero and no longer statistically significant. Recall the ansatz,
\[ \begin{align} Z_i \triangleq \gamma_0 + \gamma_1 T_i + \epsilon^{\ast}_i \end{align} \]
If we try to fit such a model, we see that
# Fit the ansatz: regress the confounder (Dean) on the treatment
sumary(lm(Dean ~ trt, newhamp))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2512886 0.0059850 41.9861 < 2.2e-16
trt 0.0903446 0.0098451 9.1766 < 2.2e-16
n = 276, p = 2, Residual SE = 0.07895, R-Squared = 0.24
This says that we have found a statistically significant linear relationship between the Dean variable \( Z_i \) and the treatment (hand or digital counting), i.e., a nonzero adjustment term \( \gamma_1 \).
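In fact, for nested least-squares fits the decomposition \( \hat{\beta}_1 = \hat{\beta}^\ast_1 + \hat{\beta}^\ast_2 \hat{\gamma}_1 \) holds exactly in-sample; a minimal sketch verifying this with the models fitted above (lmodg is simply a name we give to the ansatz fit):
# The unadjusted treatment effect equals the direct effect plus the
# confounding path through Dean
lmodg <- lm(Dean ~ trt, newhamp)
coef(lmodz)["trt"] + coef(lmodz)["Dean"] * coef(lmodg)["trt"]  # about 0.0425
coef(lmodu)["trt"]                                             # the same value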
It appears that there is an active confounding variable, but we will need additional analysis to make conclusions…
Suppose we wish to follow the analogy of the clinical trial again.
For the primary data, we will perform a similar procedure, matching voting districts with similar proportions of votes going to Dean in 2004, in order to evaluate the effect of the treatment, i.e., machine or hand counting of votes.
Because this is observed data, we can't randomly assign the treatment or control to these matched populations, but we will try to emulate the process.
Dean is a continuous variable (the percentage of the district that voted for Dean), so we set a threshold for which pairs of districts are defined as “matches”.
GenMatch is a (stochastic) matching algorithm based on genetic optimization principles: by random initialization and selection, it iterates to find a “best fit” matching.
This continues until a certain number of attempts is reached, or a certain number of matches are created in an iteration.
Because this is a random algorithm, we set a seed value so that we can reproduce the matching result in future attempts.
library("Matching")
set.seed(123)
mm <- GenMatch(newhamp$trt, newhamp$Dean, ties=FALSE, caliper=0.05, pop.size=1000)
head(mm$matches[,1:2])
[,1] [,2]
[1,] 4 213
[2,] 17 148
[3,] 18 6
[4,] 19 83
[5,] 21 246
[6,] 22 185
# Inspect the first matched pair of wards
newhamp[c(4,213),c('Dean','pObama','trt')]
Dean pObama trt
CenterHarbor 0.28495 0.3432836 1
Newfields 0.28457 0.3829322 0
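As a quick check on the matching (a minimal sketch using the mm object computed above), we can confirm that the Dean proportions are nearly balanced across the matched treated and control wards:
# Mean Dean vote share in the matched hand counted vs. digital wards;
# after matching these should nearly coincide
mean(newhamp$Dean[mm$matches[,1]])  # treated (hand counted) wards
mean(newhamp$Dean[mm$matches[,2]])  # matched control (digital) wards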
Courtesy of: Faraway, J. Linear Models with R. 2nd Edition
Let \( \delta \) denote the difference in the proportion voting for Obama between each hand counted ward and its matched digitally counted ward. We will determine the significance of the alternative hypothesis that
\[ \begin{align} H_1: \mathbb{E}[\delta] \neq 0, \end{align} \] given the sample-based mean and variance.
The null hypothesis is,
\[ \begin{align} H_0: \delta \sim N(0, 2\sigma^2). \end{align} \]
We construct the vector of differences and compute the t-test below:
pdiff <- newhamp$pObama[mm$matches[,1]] - newhamp$pObama[mm$matches[,2]]
t.test(pdiff)
One Sample t-test
data: pdiff
t = -1.8798, df = 86, p-value = 0.06352
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.0328560272 0.0009183504
sample estimates:
mean of x
-0.01596884
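The same t statistic can be computed by hand from the sample mean and standard error of the matched differences; a minimal sketch:
# One-sample t statistic for the matched differences: t = mean / SE
n <- length(pdiff)
tstat <- mean(pdiff) / (sd(pdiff) / sqrt(n))
tstat                            # about -1.88, matching t.test
2 * pt(-abs(tstat), df = n - 1)  # two-sided p-value, about 0.064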
We note that we fail to reject the null hypothesis, i.e., it is plausible that the difference between the matched treatment and control wards has mean zero.
While this p-value is close to the threshold \( \alpha=5\% \), we note qualitatively that the \( 95\% \) confidence interval contains zero and lies almost entirely below it: if anything, the hand counted wards gave Obama a slightly smaller share.
Across matched districts, the case for the vote counting method being the deciding factor in the election is losing strength.
Courtesy of: Faraway, J. Linear Models with R. 2nd Edition
In order to account for the confounding variable, we emulate the setting of designed experiments with balanced treatment and control groups.
Using a (random) selection algorithm, we try to enforce populations that are balanced with respect to the confounding variable.
By creating balanced populations with respect to the confounding variable, we can refine our analysis and determine whether the treatment, hand counting of votes, had a statistically significant effect after accounting for the stronger predictor.
We perform a hypothesis test to see if the treatment biases the response (percent vote for Obama in 2008) away from zero.
According to the t-test, we cannot reject the null hypothesis that the difference in the percent vote for Obama between the treatment and control groups is zero on average.
In the following, we will visualize this process along with the difference between the two approaches.
Courtesy of: Faraway, J. Linear Models with R. 2nd Edition
Both techniques of adjusting for possible covariates are useful for analysis. Including the covariate in the regression uses all of the data and makes the assumed form of the relationship explicit; matching, on the other hand, doesn't require us to specify an actual form for the relationship.
Both methods coincide in this case because there is sufficient overlap between the control and treatment groups, relative to the whole population, to make meaningful conclusions.
When there isn't much overlap, controlling for the covariate can still provide useful analysis, but we have to be conscious of extrapolation issues and the associated uncertainty.
In general, there are limits to the conclusions we can draw statistically from observational data.
Sir Bradford Hill, a key figure in establishing the causal link between smoking and lung cancer, made several recommendations for how to study a causal link: