We have now explored the notion of a predictive model, but the notion of an explanatory model is more complicated philosophically.
Sometimes explanation means causation by physical principles; at other times, explanation is just a description of the (conditional/statistical) relationships between the variables.
Causal conclusions require stronger assumptions than those used for the predictive models that we have already discussed.
Sometimes, we only wish to understand correlations, i.e., variables that (co)-vary together or asymmetrically…
library("faraway")
lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, gala)
sumary(lmod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.068221 19.154198 0.3690 0.7153508
Area -0.023938 0.022422 -1.0676 0.2963180
Elevation 0.319465 0.053663 5.9532 3.823e-06
Nearest 0.009144 1.054136 0.0087 0.9931506
Scruz -0.240524 0.215402 -1.1166 0.2752082
Adjacent -0.074805 0.017700 -4.2262 0.0002971
n = 30, p = 6, Residual SE = 60.97519, R-Squared = 0.77
# Simple regression with Elevation as the only predictor
sumary(lm(Species ~ Elevation, gala))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.335113 19.205288 0.5902 0.5598
Elevation 0.200792 0.034646 5.7955 3.177e-06
n = 30, p = 2, Residual SE = 78.66154, R-Squared = 0.55
Under the simple regression model, the fitted coefficient of about 0.20 corresponds to an explained response of roughly 20 additional species per 100 m of additional elevation.
In general, the change in the response variable in terms of an explanatory variable needs to be qualified with respect to all of the variables in the model.
For example, in the simple regression model with one variable explaining the response, a change in elevation is also associated with changes in the other variables, which are not included in this simple model but which in general may co-vary with elevation.
The gala data, however, are observational: we cannot vary the elevation of an island while holding the other variables fixed. The regression coefficients thus give some explanatory power, but it is conditional and lacks the notion of causality.
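To see this concretely, we can compare the implied change in species per 100 m of elevation under the two fitted models; a minimal sketch (lmods below is simply a name we give to the simple model):
# Implied change in species per 100 m of elevation under each model
lmods <- lm(Species ~ Elevation, gala)
100 * coef(lmod)["Elevation"]   # about 32 species, holding the other predictors fixed
100 * coef(lmods)["Elevation"]  # about 20 species, letting omitted variables co-vary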
One common notion of causality is the following:
In this understanding of causality, we prototypically consider an experiment in which a treatment \( T \in \{0, 1\} \) (control or treatment) may be applied to each test subject \( i \), yielding a potential outcome \( \mathbf{Y}^T_i \) under each value of the treatment.
Usually, however, there will be no identical test subjects, and we can only observe one of \( \mathbf{Y}^T_i \) for \( T \in \{0, 1\} \).
The outcome that we cannot see is called the counterfactual.
In designed experiments, we may actually be able to control \( T \), but for example with the Galapagos data, we cannot feasibly test a change in the elevation.
Because many applications in data science and statistics will not (and often cannot) involve the best experimental design, we will focus our discussion on these cases.
# The newhamp data (also from the faraway package) record results of the 2008
# New Hampshire Democratic presidential primary by ward; total the Obama and
# Clinton votes in digitally counted wards
colSums(newhamp[newhamp$votesys =='D',2:3])
Obama Clinton
86353 96890
# Total the Obama and Clinton votes in hand counted wards
colSums(newhamp[newhamp$votesys =='H',2:3])
Obama Clinton
16926 14471
head(newhamp)
votesys Obama Clinton dem povrate pci Dean Kerry white
Alton D 371 362 979 0.0653 25940 0.27820 0.32030 0.98312
Barnstead D 345 333 913 0.0380 19773 0.24398 0.36747 0.97349
Belmont D 375 570 1305 0.0428 19986 0.20096 0.41627 0.96739
CenterHarbor H 92 89 268 0.0669 25627 0.28495 0.33333 0.97892
Gilford D 668 595 1633 0.0332 32667 0.24937 0.37781 0.97986
Gilmanton D 284 273 761 0.0586 23163 0.30503 0.39341 0.98301
absentee population pObama
Alton 0.059857 4693 0.3789581
Barnstead 0.050449 4266 0.3778751
Belmont 0.043649 7006 0.2873563
CenterHarbor 0.107356 1033 0.3432836
Gilford 0.074706 7033 0.4090631
Gilmanton 0.053191 3222 0.3731932
The data include the number of votes for Obama and Clinton, as well as total votes including other candidates.
We have a variable votesys, which is encoded as H or D for hand or digital vote counting respectively.
There are also demographic variables, as well as proportion of votes cast for candidates in the 2004 presidential primary.
We re-encode the votesys predictor as the treatment variable trt:
newhamp$trt <- ifelse(newhamp$votesys == 'H', 1, 0)  # 1 = hand counted, 0 = digital
lmodu <- lm(pObama ~ trt, newhamp)
sumary(lmodu)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3525171 0.0051728 68.1480 < 2.2e-16
trt 0.0424871 0.0085091 4.9932 1.059e-06
n = 276, p = 2, Residual SE = 0.06823, R-Squared = 0.08
The model takes the form:
\[ \begin{align} Y_i = \beta_0 + \beta_1T_i + \epsilon_i \end{align} \]
where \( T_i \) is the treatment indicator (1 for hand counting, 0 for digital counting) and \( Y_i \) is the proportion of votes for Obama in ward \( i \).
We note that with the extremely small p-value for \( \hat{\beta}_1 \), the effect of hand counting is statistically significant.
In this case, we may conclude that Obama did receive significantly more votes in wards using hand counting, but this doesn't imply that the voting method itself is the explanation.
We will suppose that this is the correct model, up to an additional variable we have left out of our analysis:
\[ \begin{align} Y_i = \beta_0 + \beta^\ast_1T_i + \beta^\ast_2 Z_i + \epsilon_i \end{align} \]
We call the above \( Z_i \) the confounding variable.
Suppose, moreover, that the confounder depends linearly on the treatment via the ansatz \( Z_i \triangleq \gamma_0 + \gamma_1 T_i + \epsilon^{\ast}_i \). By substitution, we recover a new model
\[ \begin{align} Y_i &= \beta_0 + \beta^\ast_1 T_i + \beta^\ast_2\left(\gamma_0 + \gamma_1 T_i + \epsilon^{\ast}_i\right) + \epsilon_i \\ &= \left(\beta_0 + \beta_2^\ast \gamma_0 \right) + \left(\beta^\ast_1 + \beta_2^\ast \gamma_1\right)T_i + \epsilon_i + \beta_2^\ast \epsilon^\ast_i \end{align} \]
There are two scenarios where the effect of digital versus hand counted voting agrees between the original model and the model with the confounding variable: either \( \beta^\ast_2 = 0 \), so the confounder has no effect on the response, or \( \gamma_1 = 0 \), so the confounder is unrelated to the treatment.
In any other case, \( \beta^\ast_2 \gamma_1 \neq 0 \) and our original model excluding \( Z \) will be biased in its estimation of the effect of digital versus hand counting.
In a designed experiment, \( \gamma_1=0 \) by randomization of the populations to which the control or the treatment is applied, but in this case we cannot assume this.
We wish thus to determine if there is a possible effect of a confounding variable in the model.
In particular, we will consider the proportion of votes given to Howard Dean, the Democratic presidential candidate in 2004 as an explanatory variable on the response.
We fit the revised model including the confounding variable in the R language,
\[ \begin{align} Y_i = \beta_0 + \beta^\ast_1T_i + \beta^\ast_2 Z_i + \epsilon_i \end{align} \]
lmodz <- lm(pObama ~ trt + Dean , newhamp)
sumary(lmodz)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2211192 0.0112502 19.6547 <2e-16
trt -0.0047538 0.0077608 -0.6125 0.5407
Dean 0.5228967 0.0416500 12.5545 <2e-16
n = 276, p = 3, Residual SE = 0.05443, R-Squared = 0.42
Notice that with the Dean vote share included, the estimated treatment effect is now close to zero and no longer statistically significant. Recall the ansatz,
\[ \begin{align} Z_i \triangleq \gamma_0 + \gamma_1 T_i + \epsilon^{\ast}_i \end{align} \]
If we try to fit such a model, we see that
# Fit the ansatz: regress the confounder (Dean) on the treatment
sumary(lm(Dean ~ trt, newhamp))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2512886 0.0059850 41.9861 < 2.2e-16
trt 0.0903446 0.0098451 9.1766 < 2.2e-16
n = 276, p = 2, Residual SE = 0.07895, R-Squared = 0.24
This says that we have found a statistically significant linear relationship between the Dean variable \( Z_i \) and the treatment (hand or digital counting), i.e., a nonzero adjustment term \( \gamma_1 \).
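In fact, for nested least-squares fits the decomposition \( \hat{\beta}_1 = \hat{\beta}^\ast_1 + \hat{\beta}^\ast_2 \hat{\gamma}_1 \) holds exactly in-sample; a minimal sketch verifying this with the models fitted above (lmodg is simply a name we give to the ansatz fit):
# The unadjusted treatment effect equals the direct effect plus the
# confounding path through Dean
lmodg <- lm(Dean ~ trt, newhamp)
coef(lmodz)["trt"] + coef(lmodz)["Dean"] * coef(lmodg)["trt"]  # about 0.0425
coef(lmodu)["trt"]                                             # the same value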
It appears that there is an active confounding variable, but we will need additional analysis to make conclusions…
Suppose we wish to follow the analogy of the clinical trial again.
For the primary data, we will perform a similar procedure, matching voting districts with similar proportions of votes going to Dean in 2004, in order to evaluate the effect of the treatment, i.e., machine or hand counting of votes.
Because this is observed data, we can't randomly assign the treatment or control to these matched populations, but we will try to emulate the process.
Dean is a continuous variable (the percentage of the district that voted for Dean), so we set a threshold for which pairs of districts are defined as “matches”.
GenMatch is a (stochastic) matching algorithm based on genetic optimization principles: by random initialization and selection, it iterates to find a “best fit” matching.
This continues until a certain number of attempts is reached, or a certain number of matches are created in an iteration.
Because this is a random algorithm, we set a seed value so that we can reproduce the matching result in future attempts.
library("Matching")
set.seed(123)
mm <- GenMatch(newhamp$trt, newhamp$Dean, ties=FALSE, caliper=0.05, pop.size=1000)
head(mm$matches[,1:2])
[,1] [,2]
[1,] 4 213
[2,] 17 148
[3,] 18 6
[4,] 19 83
[5,] 21 246
[6,] 22 185
# Inspect the first matched pair of wards
newhamp[c(4,213),c('Dean','pObama','trt')]
Dean pObama trt
CenterHarbor 0.28495 0.3432836 1
Newfields 0.28457 0.3829322 0
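As a quick check on the matching (a minimal sketch using the mm object computed above), we can confirm that the Dean proportions are nearly balanced across the matched treated and control wards:
# Mean Dean vote share in the matched hand counted vs. digital wards;
# after matching these should nearly coincide
mean(newhamp$Dean[mm$matches[,1]])  # treated (hand counted) wards
mean(newhamp$Dean[mm$matches[,2]])  # matched control (digital) wards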
Courtesy of: Faraway, J. Linear Models with R. 2nd Edition
Let \( \delta \) denote the difference in the proportion voting for Obama between each hand counted ward and its matched digitally counted ward. We will determine the significance of the alternative hypothesis that
\[ \begin{align} H_1: \mathbb{E}[\delta] \neq 0, \end{align} \] given the sample-based mean and variance.
The null hypothesis is,
\[ \begin{align} H_0: \delta \sim N(0, 2\sigma^2). \end{align} \]
We construct the vector of differences and compute the t-test below:
pdiff <- newhamp$pObama[mm$matches[,1]] - newhamp$pObama[mm$matches[,2]]
t.test(pdiff)
One Sample t-test
data: pdiff
t = -1.8798, df = 86, p-value = 0.06352
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.0328560272 0.0009183504
sample estimates:
mean of x
-0.01596884
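The same t statistic can be computed by hand from the sample mean and standard error of the matched differences; a minimal sketch:
# One-sample t statistic for the matched differences: t = mean / SE
n <- length(pdiff)
tstat <- mean(pdiff) / (sd(pdiff) / sqrt(n))
tstat                            # about -1.88, matching t.test
2 * pt(-abs(tstat), df = n - 1)  # two-sided p-value, about 0.064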
We note that we fail to reject the null hypothesis, i.e., it is plausible that the difference between the matched treatment and control wards has mean zero.
While this p-value is close to the threshold \( \alpha=5\% \), we note qualitatively that the \( 95\% \) confidence interval contains zero and lies almost entirely below it: if anything, the hand counted wards gave Obama a slightly smaller share.
Across matched districts, the case for the vote counting method being the deciding factor in the election is losing strength.
Courtesy of: Faraway, J. Linear Models with R. 2nd Edition
In order to account for the confounding variable, we emulate the setting of designed experiments with balanced treatment and control groups.
Using a (random) selection algorithm, we try to enforce populations that are balanced with respect to the confounding variable.
By creating balanced populations with respect to the confounding variable, we can refine our analysis and determine whether the treatment, hand counting of votes, had a statistically significant effect after accounting for the stronger predictor.
We perform a hypothesis test to see if the treatment biases the response (percent vote for Obama in 2008) away from zero.
According to the t-test, we cannot reject the null hypothesis that the difference in the percent vote for Obama between the treatment and control groups is zero on average.
In the following, we will visualize this process along with the difference between the two approaches.
Courtesy of: Faraway, J. Linear Models with R. 2nd Edition
Both techniques of adjusting for possible covariates are useful for analysis. Including the covariate in the regression uses all of the data and makes the assumed form of the relationship explicit; matching, on the other hand, doesn't require us to specify an actual form for the relationship.
Both methods coincide in this case because there is sufficient overlap between the control and treatment groups, relative to the whole population, to make meaningful conclusions.
When there isn't much overlap, controlling for the covariate can still provide useful analysis, but we have to be conscious of extrapolation issues and the associated uncertainty.
In general, there are limits to the conclusions we can draw statistically from observational data.
Sir Bradford Hill, a key figure in establishing the causal link between smoking and lung cancer, made several recommendations for how to study a causal link: