11/18/2020
In the following lecture, we will go through a relatively complete example of the process of data analysis, regression, diagnostics, and remediation.
Particularly, we will discuss a question that is difficult to analyze: systematic bias in insurance practices.
Insurance companies, by the nature of their business, need to price the cost of covering the risk of damages based on the probability of those damages.
However, we are also in a society that has historically enforced segregation, unequal rights and unequal access to services.
We will investigate a complicated question:
The term “insurance redlining” refers literally to drawing a red line on a map that excludes certain neighborhoods from services.
In the late 1970s, the US Commission on Civil Rights examined charges by several Chicago community organizations that insurance companies were redlining their neighborhoods.
Because comprehensive information about individuals being refused homeowners insurance was not available, the number of FAIR plan policies written and renewed in Chicago by zip code for the months of December 1977 through May 1978 was recorded.
The FAIR plan was offered by the city of Chicago as a default policy to homeowners who had been rejected by the voluntary market.
Information on other variables that might affect insurance writing such as fire and theft rates was also collected at the zip code level.
We note here that we do not actually know the ethnicity of the individuals who are denied insurance in this data set.
Rather, we only know the ethnic makeup of the zip code where an individual was denied.
This is an important difficulty that needs to be considered carefully before starting our analysis.
When data are collected at the group level, we may observe a correlation between two variables.
The ecological fallacy is concluding that the same correlation holds at the individual level.
For example, the ecological fallacy was discussed in a court challenge to the Washington gubernatorial election of 2004.
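To make the fallacy concrete, here is a hypothetical simulation (not the eco data): within every group the individual-level slope is negative, yet a regression on the group means shows a positive slope.

```r
# Hypothetical simulation of the ecological fallacy: the within-group
# relationship between x and y is negative, but aggregating to group
# means produces a positive correlation.
set.seed(1)
dat <- do.call(rbind, lapply(1:5, function(g) {
  x <- rnorm(100, mean = g)           # group centers shift upward in x
  y <- 2 * g - x + rnorm(100)         # within-group slope is -1
  data.frame(g = g, x = x, y = y)
}))
# within-group slopes: all near -1
sapply(split(dat, dat$g), function(d) coef(lm(y ~ x, data = d))["x"])
# slope fit to the five group means: near +1
means <- aggregate(cbind(x, y) ~ g, data = dat, FUN = mean)
coef(lm(y ~ x, data = means))["x"]
```

Concluding the negative individual-level trend from the positive aggregate trend (or vice versa) is exactly the error described above.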
library("faraway")
head(eco)
usborn income home pop
Alabama 0.98656 21442 75.9 4040587
Alaska 0.93914 25675 34.0 550043
Arizona 0.90918 23060 34.2 3665228
Arkansas 0.98688 20346 67.1 2350725
California 0.74541 27503 46.4 29760021
Colorado 0.94688 28657 43.3 3294394
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(income ~ usborn, data=eco, xlab="Proportion US born", ylab="Mean Annual Income", col=1, cex=3, cex.lab=3, cex.axis=1.5)
lmod <- lm(income ~ usborn, eco)
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(income ~ usborn, data=eco, xlab="Proportion US born", ylab="Mean Annual Income",xlim=c(0,1),ylim=c(15000,70000),xaxs="i", col=1, cex=3, cex.lab=3, cex.axis=1.5)
abline(coef(lmod))
sumary(lmod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68642.2 8739.0 7.8547 3.188e-10
usborn -46018.6 9279.1 -4.9594 8.891e-06
n = 51, p = 2, Residual SE = 3489.54138, R-Squared = 0.33
Q: can we conclude, therefore, that naturalized citizens earn more than US born citizens in general?
A: no, this is precisely the ecological fallacy.
From the US Census Bureau, we know that generally US born citizens earn slightly more than their naturalized peers.
It is unreasonable to conclude that naturalized citizens have higher income in general just because there is a correlation at the state demographic level between more naturalized citizens and higher per-capita income.
In our case, with the Chicago insurance data, we have to note likewise the danger of the ecological fallacy.
Particularly, we only have aggregate data at the zip code level describing the ethnic composition of different neighborhoods, and the number of FAIR plan policies per neighborhood.
Even if there is a strong correlation between neighborhoods that have a high proportion of minority residents and a higher number of FAIR plans, we cannot conclude that any individual application was denied insurance on the basis of ethnicity.
Dealing with this subtlety will be a challenge of this dataset in terms of drawing conclusions.
We start by performing some exploratory analysis:
head(chredlin, n=4)
race fire theft age involact income side
60626 10.0 6.2 29 60.4 0.0 11.744 n
60640 22.2 9.5 44 76.5 0.1 9.323 n
60613 19.6 10.5 36 73.5 1.2 9.948 n
60657 17.3 7.7 37 66.9 0.5 10.656 n
summary(chredlin)
race fire theft age
Min. : 1.00 Min. : 2.00 Min. : 3.00 Min. : 2.00
1st Qu.: 3.75 1st Qu.: 5.65 1st Qu.: 22.00 1st Qu.:48.60
Median :24.50 Median :10.40 Median : 29.00 Median :65.00
Mean :34.99 Mean :12.28 Mean : 32.36 Mean :60.33
3rd Qu.:57.65 3rd Qu.:16.05 3rd Qu.: 38.00 3rd Qu.:77.30
Max. :99.70 Max. :39.70 Max. :147.00 Max. :90.10
involact income side
Min. :0.0000 Min. : 5.583 n:25
1st Qu.:0.0000 1st Qu.: 8.447 s:22
Median :0.4000 Median :10.694
Mean :0.6149 Mean :10.696
3rd Qu.:0.9000 3rd Qu.:11.989
Max. :2.2000 Max. :21.480
We see that there is a wide range of values for “race” and that the values are skewed towards zero rather than towards 100.
The third quartile is just over \( 57\% \), indicating that about a quarter of the zip codes in Chicago are majority minority neighborhoods.
This is a useful fact for our analysis, because if all neighborhoods were homogeneous (approximately equal percentages for all zip codes) we wouldn't be able to distinguish any differences from our aggregate data.
We see likewise that theft and income are skewed variables.
Also, the number of FAIR plans has many zeros, including the entire first quartile – this actually excludes using Box-Cox as a hypothesis test for a scale transform due to the non-positive values.
Moreover, this is potentially an issue for our linear model assumptions that we have used so far…
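A quick numerical check of this point, as a sketch assuming the faraway package:

```r
# Count the zero responses that rule out a Box-Cox analysis, which
# requires a strictly positive response variable.
library("faraway")
sum(chredlin$involact == 0)    # number of zip codes with zero FAIR plan activity
mean(chredlin$involact == 0)   # proportion of zeros in the response
```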
We will perform some quick exploratory plots to examine relationships among variables.
require(ggplot2)
ggplot(chredlin,aes(race,involact)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
ggplot(chredlin,aes(fire,involact)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
ggplot(chredlin,aes(theft,involact)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
With theft, we appear to have a point with high leverage that is influencing the fitted linear relationship between FAIR coverage and theft.
ggplot(chredlin,aes(age,involact)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
With age, there appears to be a linear relationship between older housing and a higher number of FAIR plans;
ggplot(chredlin,aes(income,involact)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
With income, there appears to be some relationship between higher income and a lower number of FAIR plans, but outliers and nonlinearity appear to affect the relationship.
Particularly, the high-income observation appears to have high leverage and to produce unphysical (negative) fitted values for the relationship.
ggplot(chredlin,aes(side,involact)) + geom_point(position = position_jitter(width = .2,height=0)) + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
sumary(lm(involact ~ race,chredlin))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1292180 0.0966106 1.3375 0.1878
race 0.0138824 0.0020307 6.8361 1.784e-08
n = 47, p = 2, Residual SE = 0.44883, R-Squared = 0.51
There is a non-zero parameter, with significance — there appears to be some statistical relationship between neighborhoods with high minority composition and higher number of FAIR plans.
However, we have to note at this moment: there may be a number of confounding variables that provide a better explanation.
It could be legitimate business practice on the part of insurance companies to deny normal insurance when a neighborhood carries too high a risk;
ggplot(chredlin,aes(race,fire)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
ggplot(chredlin,aes(race,theft)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
When plotting theft versus race, we see a trend of slightly higher theft rates in neighborhoods with higher minority concentration.
We note, also, this plot has an extreme outlier that may be important for the analysis – this might be skewing the trend positively in a way that isn't consistent with the overall behavior of the city;
We move to the question of model selection, to determine the best possible predictors for the number of FAIR plans in a neighborhood.
However, this is fundamentally an explanatory model in which we want to quantify the effect of the ethnic composition of a neighborhood on the number of plans.
With this purpose, we will thus emphasize techniques which help analyze the explanatory power of the model.
This also indicates an important consideration, where we need to control for certain variables in the model.
In this case, we will belabor some of the model selection (and the sensitivity of the model parameters) so that we can make conclusions and explanations with greater confidence.
One change that we will make is to use the log of income, as is often done, due to the skewness of the variable; this is also because money operates in society on what is practically a logarithmic scale.
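A minimal illustration of this choice, using simulated lognormal data rather than the chredlin income values: on the raw scale the sample is strongly right-skewed, while on the log scale it is roughly symmetric.

```r
# Simulated income-like variable: right-skewed raw, symmetric after log.
set.seed(1)
income_sim <- rlnorm(1000, meanlog = 10, sdlog = 0.5)
# simple moment-based skewness
skew <- function(z) mean((z - mean(z))^3) / sd(z)^3
skew(income_sim)       # strongly positive (right-skewed)
skew(log(income_sim))  # near zero (roughly symmetric)
```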
library("leaps")
sum_life <- summary(regsubsets(involact~ race + fire + theft + age + log(income) + side, data=chredlin))
sum_life
Subset selection object
Call: regsubsets.formula(involact ~ race + fire + theft + age + log(income) +
side, data = chredlin)
6 Variables (and intercept)
Forced in Forced out
race FALSE FALSE
fire FALSE FALSE
theft FALSE FALSE
age FALSE FALSE
log(income) FALSE FALSE
sides FALSE FALSE
1 subsets of each size up to 6
Selection Algorithm: exhaustive
race fire theft age log(income) sides
1 ( 1 ) "*" " " " " " " " " " "
2 ( 1 ) "*" "*" " " " " " " " "
3 ( 1 ) "*" "*" "*" " " " " " "
4 ( 1 ) "*" "*" "*" "*" " " " "
5 ( 1 ) "*" "*" "*" "*" "*" " "
6 ( 1 ) "*" "*" "*" "*" "*" "*"
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(c(1:6), sum_life$bic, xlab = "Number explanatory variables", ylab = "Bayesian Information Criterion", col=1, cex=3, cex.lab=3, cex.axis=1.5)
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(c(1:6), sum_life$adjr2, xlab = "Number explanatory variables", ylab = "Adjusted R squared", col=1, cex=3, cex.lab=3, cex.axis=1.5)
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(c(2:7), sum_life$cp, xlab = "Number of parameters", ylab = "Mallows Cp", col=1, cex=3, cex.lab=3, cex.axis=1.5)
abline(0,1)
lmod <- lm(involact ~ race + fire + theft + age + log(income), data=chredlin)
sumary(lmod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.1855396 1.1002549 -1.0775 0.2875500
race 0.0095022 0.0024896 3.8168 0.0004485
fire 0.0398560 0.0087661 4.5466 4.758e-05
theft -0.0102945 0.0028179 -3.6533 0.0007276
age 0.0083356 0.0027440 3.0377 0.0041345
log(income) 0.3457615 0.4001234 0.8641 0.3925401
n = 47, p = 6, Residual SE = 0.33453, R-Squared = 0.75
ggplot(chredlin,aes(race,income)) + geom_point() +stat_smooth(method="lm") + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
cor(chredlin$income,chredlin$race)
[1] -0.7037328
The strong anti-correlation between income and race indicates a strong element of confounding variables in our analysis, which complicates the variable selection.
x <- model.matrix(lmod)
e <- eigen(t(x)%*%x)$values
e[1] / e[length(e)]
[1] 4042968
vif(x)
(Intercept) race fire theft age log(income)
1.985775 2.705490 2.733322 1.621824 1.577322 4.050811
We note the indication of high sensitivity in the parameter estimates, due to the very large condition number and the high VIFs.
The best-case scenario would be to perform the model analysis both with and without log-income and compare the results.
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(lmod,1:1, col=1, cex=3, cex.lab=3, cex.axis=1.5)
lmod$fitted[lmod$fitted < 0 ]
60611 60656 60638 60652 60655
-0.05706070 -0.23408790 -0.05930948 -0.28650119 -0.15827619
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(lmod,1:1, col=1, cex=3, cex.lab=3, cex.axis=1.5)
These small negative numbers are problematic, due in part to the high number of zero values for the FAIR plans;
Unfortunately, a scale change won't fix this realistically — a square root will make nominal difference, but not enough to justify the loss of interpretability of the response.
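As a quick check of this claim, one could refit the model with a square-root response, which keeps the zeros valid; this is a sketch assuming the faraway package, and note that the response scale is no longer directly interpretable.

```r
# Refit with a square-root response: legal for the zero values, but only
# a nominal change to the diagnostics, at the cost of interpretability.
library("faraway")
lmod_sqrt <- lm(sqrt(involact) ~ race + fire + theft + age + log(income),
                data = chredlin)
sumary(lmod_sqrt)
plot(lmod_sqrt, 1)  # residuals versus fitted on the new scale
```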
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(lmod,2, col=1, cex=3, cex.lab=3, cex.axis=1.5)
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(lmod,3, col=1, cex=3, cex.lab=3, cex.axis=1.5)
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(lmod,5, col=1, cex=3, cex.lab=3, cex.axis=1.5)
To verify the outlier analysis numerically, we can perform the Bonferroni-corrected \( 5\% \) significance test on the Studentized residuals.
Remember, we use \( \frac{\alpha}{2n} \) to obtain the corrected critical value from the t distribution with \( n-p-1 \) degrees of freedom:
sumary(lmod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.1855396 1.1002549 -1.0775 0.2875500
race 0.0095022 0.0024896 3.8168 0.0004485
fire 0.0398560 0.0087661 4.5466 4.758e-05
theft -0.0102945 0.0028179 -3.6533 0.0007276
age 0.0083356 0.0027440 3.0377 0.0041345
log(income) 0.3457615 0.4001234 0.8641 0.3925401
n = 47, p = 6, Residual SE = 0.33453, R-Squared = 0.75
stud <- rstudent(lmod)
alpha_crit <- qt(0.05/(2 * 47), 40)
abs(stud) > abs(alpha_crit)
60626 60640 60613 60657 60614 60610 60611 60625 60618 60647 60622 60631 60646
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
60656 60630 60634 60641 60635 60639 60651 60644 60624 60612 60607 60623 60608
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
60616 60632 60609 60653 60615 60638 60629 60636 60621 60637 60652 60620 60619
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
60649 60617 60655 60643 60628 60627 60633 60645
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
We note that there is a structural issue arising due to the high number of zero FAIR plans (the response variable), but that there isn't a realistic choice of scale transformation to fix this.
Normality and variance assumptions of the error are OK, bearing in mind slightly heavy tails in the residuals.
We will probably need to evaluate the effect of points of high leverage in the analysis, which may contribute to the heavy tails in the residuals.
We wish thus to study the sensitivity of the main parameter of interest “race”, with respect to the leverage points and controlling for covariates in the analysis.
We will begin by looking at partial residual plots, which are used for nonlinearity detection when factoring out covariates.
We will also look at the confidence intervals for the parameters.
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
termplot(lmod, partial.resid=TRUE, terms=1, cex=3, cex.lab=3, cex.axis=1.5, col.res="black")
coefficients(lmod)[2]
race
0.009502223
confint(lmod, parm = "race")
2.5 % 97.5 %
race 0.004474458 0.01452999
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
termplot(lmod, partial.resid=TRUE, terms=2, cex=3, cex.lab=3, cex.axis=1.5, col.res="black")
coefficients(lmod)[3]
fire
0.03985604
confint(lmod, parm = "fire")
2.5 % 97.5 %
fire 0.02215246 0.05755963
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
termplot(lmod, partial.resid=TRUE, terms=3, cex=3, cex.lab=3, cex.axis=1.5, col.res="black")
coefficients(lmod)[4]
theft
-0.01029451
confint(lmod, parm = "theft")
2.5 % 97.5 %
theft -0.01598536 -0.004603655
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
termplot(lmod, partial.resid=TRUE, terms=4, cex=3, cex.lab=3, cex.axis=1.5, col.res="black")
coefficients(lmod)[5]
age
0.0083356
confint(lmod, parm = "age")
2.5 % 97.5 %
age 0.002793968 0.01387723
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
termplot(lmod, partial.resid=TRUE, terms=5, cex=3, cex.lab=3, cex.axis=1.5, col.res="black")
coefficients(lmod)[6]
log(income)
0.3457615
confint(lmod, parm = "log(income)")
2.5 % 97.5 %
log(income) -0.4623041 1.153827
For each of the significant parameters, we compute the width of the confidence interval relative to the point estimate:
(confint(lmod, parm = "race")[2] - confint(lmod, parm = "race")[1])/coefficients(lmod)[2]
race
1.058229
(confint(lmod, parm = "fire")[2] - confint(lmod, parm = "fire")[1])/coefficients(lmod)[3]
fire
0.8883765
(confint(lmod, parm = "age")[2] - confint(lmod, parm = "age")[1])/coefficients(lmod)[5]
age
1.32963
As well as,
(confint(lmod, parm = "theft")[2] - confint(lmod, parm = "theft")[1])/coefficients(lmod)[4]
theft
-1.105609
There is some moderate uncertainty, evidenced by the confidence intervals, consistent with the high sensitivity due to collinearity discussed earlier.
Log-income, as noted earlier, is not significant, and the uncertainty of the parameter is evident from its confidence interval:
confint(lmod, parm = "log(income)")
2.5 % 97.5 %
log(income) -0.4623041 1.153827
(confint(lmod, parm = "log(income)")[2] - confint(lmod, parm = "log(income)")[1])/coefficients(lmod)[6]
log(income)
4.674121
With diagnostics that are acceptable in terms of constant variance of the errors, bearing in mind the slightly heavy tails in the residuals, we might make some conclusions with serious qualifications.
Likewise, the effects of confounding variables limit the exact identifiability of effects;
Loosely speaking, it appears that there is a small, but statistically significant, effect in which neighborhoods with higher concentrations of ethnic minorities have a higher number of FAIR plans, when all other variables are held equal.
This effect is not nearly as strong as e.g., the rate of fires with all other variables held equal.
Our analysis is not nearly complete, and we cannot say whether any individual application for insurance was rejected based upon the applicant's ethnic background;
In terms of analyzing the sensitivity of the race parameter with respect to the inclusion or exclusion of other covariates, we can automate this to determine if the effect on the response changes dramatically.
This is similar to the case when we used matching in the voting districts in New Hampshire, where we needed to see if an additional covariate was acting as a confounding variable on our analysis.
By analyzing the inter-model stability of this parameter, we have a better sense of if this explanation is actually robust.
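A sketch of how this comparison can be automated, assuming the faraway package: fit every subset of the four adjusters while always retaining race, and record its coefficient and p-value in each fit.

```r
# Fit all 16 models containing race plus any subset of the adjusters,
# recording the race coefficient and its p-value from each fit.
library("faraway")
adjusters <- c("fire", "theft", "age", "log(income)")
results <- t(sapply(0:15, function(k) {
  keep <- adjusters[as.logical(intToBits(k)[1:4])]  # subset selected by the bits of k
  form <- reformulate(c("race", keep), response = "involact")
  fit <- lm(form, data = chredlin)
  summary(fit)$coef["race", c("Estimate", "Pr(>|t|)")]
}))
colnames(results) <- c("beta", "pvalue")
round(results, 4)
```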
beta pvalue
race 0.0139 0.0000
race + fire 0.0089 0.0002
race + theft 0.0141 0.0000
race + age 0.0123 0.0000
race + log ( income ) 0.0082 0.0087
race + fire + theft 0.0082 0.0002
race + fire + age 0.0089 0.0001
race + fire + log ( income ) 0.0070 0.0160
race + theft + age 0.0128 0.0000
race + theft + log ( income ) 0.0084 0.0083
race + age + log ( income ) 0.0099 0.0017
race + fire + theft + age 0.0081 0.0001
race + fire + theft + log ( income ) 0.0073 0.0078
race + fire + age + log ( income ) 0.0085 0.0041
race + theft + age + log ( income ) 0.0106 0.0010
race + fire + theft + age + log ( income ) 0.0095 0.0004
The above output shows the parameter for race and the associated p-values for all 16 models.
There is some variance in the magnitude of the effect, depending on the other variables we control for (corroborating the earlier sensitivity), but in no case does the p-value rise above \( 5\% \) nor does the parameter change sign.
While this suggests some uncertainty over the magnitude of the effect, we can be sure that the significance of the effect is not sensitive to the choice of adjusters.
If we had instead found models in which the minority composition of the neighborhood was not statistically significant, controlling for other factors would be more difficult.
We would then need to consider more deeply which covariates should be adjusted for and which not.
This level of model analysis and selection is at the heart of many real world problems;
Now, we want to analyze the effect of leverage and influence on the model and our interpretation.
We recall that there are several influential points, and we wish to determine if their inclusion or exclusion would radically change the interpretation for the parameter for race.
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(dfbeta(lmod)[,2],ylab="Change in race coef", cex=3, cex.lab=3, cex.axis=1.5)
abline(h=0)
diags <- data.frame(lm.influence(lmod)$coef)
ggplot(diags,aes(row.names(diags), race)) + geom_linerange(aes(ymax=0, ymin=race)) + theme(axis.text.x=element_text(angle=90), axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold")) + xlab("ZIP") + geom_hline(yintercept = 0)
ggplot(diags,aes(x=fire,y=theft))+geom_text(label=row.names(diags)) +
theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
chredlin[c("60607","60610"),]
race fire theft age involact income side
60607 50.2 39.7 147 83.0 0.9 7.459 n
60610 54.0 34.1 68 52.6 0.3 8.231 n
lmode <- lm(involact ~ race + fire + theft + age + log(income), chredlin,subset=-c(6,24))
sumary(lmode)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.5767365 1.0800462 -0.5340 0.59638
race 0.0070527 0.0026960 2.6160 0.01259
fire 0.0496474 0.0085701 5.7931 1.004e-06
theft -0.0064336 0.0043489 -1.4794 0.14708
age 0.0051709 0.0028947 1.7863 0.08182
log(income) 0.1157028 0.4011132 0.2885 0.77453
n = 45, p = 6, Residual SE = 0.30320, R-Squared = 0.8
We have verified that our conclusions are also robust to the exclusion of one or perhaps two cases from the data.
In any case, it would be very important to disclose this choice in the analysis, and any suppression of the information would be extremely dishonest.
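A fuller case-deletion check can be automated as well; this sketch, assuming the faraway package, refits the model dropping each zip code in turn and tracks the race parameter.

```r
# Leave-one-out sensitivity of the race coefficient: refit the model
# with each single case deleted and record the resulting estimate.
library("faraway")
lmod <- lm(involact ~ race + fire + theft + age + log(income), data = chredlin)
race_loo <- sapply(seq_len(nrow(chredlin)), function(i) {
  coef(update(lmod, subset = -i))["race"]
})
range(race_loo)                                 # spread of the estimates
which.max(abs(race_loo - coef(lmod)["race"]))   # the single most influential case
```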
x <- model.matrix(lmod)
x0 <- apply(x,2,median)
x0 <- data.frame(t(x0))
# rename the model-matrix column "log(income)" back to "income" so that predict() will accept the new data frame...
colnames(x0)[6] <- "income"
predict(lmod,new=x0,interval="prediction")
fit lwr upr
1 0.003348934 -1.398822 1.40552
predict(lmod,new=x0,interval="confidence")
fit lwr upr
1 0.003348934 -1.225333 1.232031
x <- model.matrix(lmod)
x0 <- apply(x,2,quantile, 0.05)
x0 <- data.frame(t(x0))
# rename the model-matrix column "log(income)" back to "income" so that predict() will accept the new data frame...
colnames(x0)[6] <- "income"
predict(lmod,new=x0,interval="prediction")
fit lwr upr
1 -0.8398986 -2.599252 0.919455
predict(lmod,new=x0,interval="confidence")
fit lwr upr
1 -0.8398986 -2.464369 0.7845713
We have shown thus far that, even in the presence of confounding variables and under the exclusion of influential observations, the effect of race remains significant.
This doesn't mean conclusively, however, that there was systematic discrimination.
If we tried another hypothetical model, suppressing theft and the two high-influence observations:
modalt <- lm(involact ~ race+fire+log(income),chredlin, subset=-c(6,24))
sumary(modalt)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.7532557 0.8358798 0.9012 0.37277
race 0.0042061 0.0022757 1.8483 0.07178
fire 0.0510220 0.0084504 6.0378 3.822e-07
log(income) -0.3623825 0.3191620 -1.1354 0.26279
n = 45, p = 4, Residual SE = 0.30919, R-Squared = 0.79
the parameter for race is no longer significant.
This is just a hypothetical example, because our relatively exhaustive analysis has shown that this is a cherry-picked model compared to the other possible models.
Generally, however, this shows some of the subtlety in statistical analyses like this.
Part of robust analysis is thus to perform several model selections and the associated diagnostics to deal with the overall uncertainty in the model selection;
The strength of our conclusions will increase if we can show that several different models all have similar interpretations.
If there are several reasonable models with different interpretations or conclusions, there is a fundamental issue in the data, and we should be cautious and be skeptical if there is really a signal to be found in the data.
We have sketched what a rigorous analysis would look like (pending completion of additional analysis with regards to different scales of the response, and a more robust analysis with heavy tails).
However, there are still several ways we should be cautious about our conclusions:
lmod <- lm(involact ~ race+fire+theft+age+log(income), subset=(side == "s"),chredlin)
sumary(lmod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.9946613 1.6949252 -1.1768 0.256470
race 0.0093509 0.0046055 2.0304 0.059285
fire 0.0510812 0.0170314 2.9992 0.008493
theft -0.0102565 0.0090972 -1.1274 0.276184
age 0.0067778 0.0053061 1.2774 0.219700
log(income) 0.6810543 0.6493336 1.0489 0.309832
n = 22, p = 6, Residual SE = 0.34981, R-Squared = 0.76
lmod <- lm(involact ~ race+fire+theft+age+log(income), subset=(side == "n"),chredlin)
sumary(lmod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.7902939 1.7793768 -0.4441 0.66196
race 0.0133200 0.0053903 2.4711 0.02310
fire 0.0246654 0.0154219 1.5994 0.12623
theft -0.0079383 0.0039815 -1.9938 0.06073
age 0.0090040 0.0046471 1.9375 0.06769
log(income) 0.1666853 0.6233641 0.2674 0.79204
n = 25, p = 6, Residual SE = 0.35115, R-Squared = 0.76
Generally, by sub-dividing data into smaller and smaller groups, we can dilute the significance.
However, as noted before, aggregating data too much causes its own issues – the balance between these is contextual and subjective based on our understanding of the task at hand.
ggplot(chredlin,aes(side,race)) + geom_point(position = position_jitter(width = .2,height=0)) + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
summary(chredlin$race[chredlin$side=="n"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 1.80 10.00 21.95 24.50 94.40
summary(chredlin$race[chredlin$side=="s"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 34.27 48.10 49.80 72.92 99.70
sum_life <- summary(regsubsets(involact~ side * race + fire + theft + age + log(income) + side, data=chredlin))
sum_life
Subset selection object
Call: regsubsets.formula(involact ~ side * race + fire + theft + age +
log(income) + side, data = chredlin)
7 Variables (and intercept)
Forced in Forced out
sides FALSE FALSE
race FALSE FALSE
fire FALSE FALSE
theft FALSE FALSE
age FALSE FALSE
log(income) FALSE FALSE
sides:race FALSE FALSE
1 subsets of each size up to 7
Selection Algorithm: exhaustive
sides race fire theft age log(income) sides:race
1 ( 1 ) " " "*" " " " " " " " " " "
2 ( 1 ) " " "*" "*" " " " " " " " "
3 ( 1 ) " " "*" "*" "*" " " " " " "
4 ( 1 ) " " "*" "*" "*" "*" " " " "
5 ( 1 ) " " "*" "*" "*" "*" "*" " "
6 ( 1 ) " " "*" "*" "*" "*" "*" "*"
7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
We performed initial, exploratory analysis, checking for structure in the data, and in the plots of variables versus the response.
We fit a model based on a mix of criterion methods and some qualitative judgment.
We performed diagnostics to see how our assumptions might be breaking down:
Despite these issues, many of the conclusions on the explanation in the model remain consistent, even when removing one or more unusual cases.
When we evaluated the partial residuals and confidence intervals for each of the parameters in our model, we noted some parameter uncertainty, but generally a good systematic structure.
As this is an explanatory model, we were interested in “predicting” the parameter for “race”;
Performing this after the criterion-based model selection, we limit the inflation of p-values from multiple comparisons to an extent;
By thereafter comparing the effect of race in multiple models, we evaluated an additional kind of model uncertainty;
Just for fun (to demonstrate the skills from this class), we also saw that the model has awful predictive power…
We made a final assessment of the uncertainty that we could not account for with our data.
After all this analysis, everything we say needs to be hedged with “ifs” and “buts” and qualified carefully.
This is the reality of statistical modeling, where causal conclusions are extremely difficult to draw.
To quote Faraway, quoting Winston Churchill:
“Indeed, it has been said that Democracy is the worst form of government, except all those other forms that have been tried from time to time.”
We might say the same about statistics with respect to how it helps us reason in the face of uncertainty.
This applies both to frequentist and Bayesian analysis, in which we must be extremely careful about our approach and conclusions, including our own biases.