Final project

STAT 757 – Section 1001
Instructor: Colin Grudzien
Due: 12/14/2019 – 12:00 PM

Instructions:

You may work with others on this project but you must turn in your own work. You may type your solutions in any way you like (LaTeX, Markdown, Office, etc…) as long as you present your work clearly and in an organized way. Unless otherwise specified, you must hand in a printed copy of your work at the beginning of class. Whenever plotting:

Your plot must be clearly labeled on all axes and legends, and it must include a clear title.

The plot must be sensible and easy to read.

Summary

For the final project, you are to revise the work of your midterm project using the more advanced perspective and tools that we have developed by the end of the course. You will once again write a research report in RMarkdown. The report must be uploaded in PDF format to Webcampus. The report should be around 10-12 pages, including figures but excluding any references or appendices. Include figures and tables for the most important components of your analysis and for explanatory purposes.

The task:

You will be expected to perform the following:

Perform diagnostic analysis of your model. You should check for our usual assumptions:
- evaluate the structure of the model (e.g., nonlinearity in the relationship, residual versus fitted and partial residual plots);
- discuss possible correlations in the errors (it goes beyond the scope of our class to estimate the covariance matrix, so a discussion of possible common correlations is sufficient);
- constant variance of the error (residual versus fitted plots and hypothesis tests);
- Gaussianity of the error (Q-Q plots and hypothesis tests);
- identify outliers and evaluate for influential observations changing the fit of the model (studentized residuals, Cook’s distance).
- Compute the k-fold cross validation RMSE to obtain a more realistic estimate of predictive performance.
- Not all of the above may be appropriate, so you should only include in the write-up your relevant and interesting figures. Other non-interesting results can be summarized in words or tables with the corresponding plots and code in the appendix. If a test is not performed, this choice should be justified in the context of the model with quantitative and qualitative reasoning.
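
As an illustration of two of the diagnostics above, the following is a minimal sketch in plain Python, assuming a model with a single predictor and using synthetic data invented for this example. The function names `fit_line`, `kfold_rmse` and `cooks_distance` are hypothetical helpers, not course code; the report itself should use the equivalent tools in R. The sketch pools the squared out-of-fold errors to get the k-fold cross validation RMSE, and computes Cook's distance from the residuals and leverages of the fit.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def kfold_rmse(xs, ys, k=5, seed=0):
    """RMSE of out-of-fold predictions, pooled over all k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all cases
    sq_err = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on the training cases only, then predict the held-out cases.
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_err += sum((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in held_out)
    return math.sqrt(sq_err / len(xs))


def cooks_distance(xs, ys):
    """Cook's distance of each case for the simple linear fit (p = 2)."""
    n, p = len(xs), 2
    b0, b1 = fit_line(xs, ys)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    mse = sum(e ** 2 for e in resid) / (n - p)
    dists = []
    for x, e in zip(xs, resid):
        h = 1.0 / n + (x - x_bar) ** 2 / sxx  # leverage of this case
        dists.append(e ** 2 * h / (p * mse * (1.0 - h) ** 2))
    return dists


# Synthetic data (made up): a linear trend plus noise, with one gross outlier.
rng = random.Random(1)
xs = [float(i) for i in range(30)]
ys = [2.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]
ys[29] += 15.0  # contaminate the last case, which also has high leverage

cv_rmse = kfold_rmse(xs, ys, k=5)
d = cooks_distance(xs, ys)
most_influential = max(range(len(d)), key=d.__getitem__)
print("5-fold CV RMSE:", round(cv_rmse, 3))
print("case with largest Cook's distance:", most_influential)
```

Comparing `cv_rmse` with the in-sample residual standard error gives a sense of how optimistic the in-sample fit is, and re-running the cross validation with the flagged case excluded shows how much a single influential observation drives the estimate.
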
Revise the model and perform remediation of the issues:

- if reasonable to do so, handle missing data systematically with methods from class;
- compare the model selected in the midterm with a model selected instead by an information criterion method;
- if reasonable to do so, consider transformation of scale of the response with, e.g., Box-Cox, shifted log, logit or Fisher’s z-transformation;
- if reasonable to do so, exclude outliers and re-fit the model;
- if reasonable to do so, perform generalized or weighted least squares (generalized least squares is particularly difficult in most cases, so unless you know the covariance structure of the errors already, don’t worry about estimating it or performing GLS);
- if reasonable to do so, perform polynomial regression with the degree selected systematically;
- finally, use the usual diagnostics to determine if the remediation has had an effect on the issues previously noted with the usual assumptions.
Not all of the above will be reasonable to perform with your data. Some of the above remediation steps (like outlier exclusion) depend on the other remedial steps (like variable transformation) and should be considered simultaneously. You are encouraged to iterate on this several times, but in the report you need only summarize the process by which you arrived at your final version of the model, and how (or if) it has addressed the issues noted in the diagnostics.
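
To make the Box-Cox step above concrete, here is a minimal sketch in plain Python, assuming a strictly positive response and a single predictor, with synthetic data invented for this example. The helper names `boxcox`, `rss_line`, `profile_loglik` and `best_lambda` are hypothetical, and the grid of candidate values is an arbitrary choice. For each candidate \(\lambda\), the response is transformed, the line is re-fit, and the profile log-likelihood \(-\tfrac{n}{2}\log(\mathrm{RSS}_\lambda / n) + (\lambda - 1)\sum_i \log y_i\) is evaluated; the \(\lambda\) that maximizes it is the suggested scale.

```python
import math
import random


def boxcox(y, lam):
    """Box-Cox transform of a positive value y (log transform when lam = 0)."""
    return math.log(y) if abs(lam) < 1e-12 else (y ** lam - 1.0) / lam


def rss_line(xs, zs):
    """Residual sum of squares of the least-squares line z = b0 + b1 * x."""
    n = len(xs)
    x_bar, z_bar = sum(xs) / n, sum(zs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxz = sum((x - x_bar) * (z - z_bar) for x, z in zip(xs, zs))
    b1 = sxz / sxx
    b0 = z_bar - b1 * x_bar
    return sum((z - (b0 + b1 * x)) ** 2 for x, z in zip(xs, zs))


def profile_loglik(xs, ys, lam):
    """Profile log-likelihood of lam, up to an additive constant."""
    n = len(ys)
    zs = [boxcox(y, lam) for y in ys]
    jacobian = (lam - 1.0) * sum(math.log(y) for y in ys)
    return -0.5 * n * math.log(rss_line(xs, zs) / n) + jacobian


def best_lambda(xs, ys, grid):
    """Grid value of lam with the largest profile log-likelihood."""
    return max(grid, key=lambda lam: profile_loglik(xs, ys, lam))


# Synthetic data (made up): the response is linear in x on the log scale.
rng = random.Random(7)
xs = [float(i) for i in range(1, 41)]
ys = [math.exp(0.1 * x + rng.gauss(0.0, 0.05)) for x in xs]

grid = [i / 4 for i in range(-8, 9)]  # candidate values from -2 to 2
lam_hat = best_lambda(xs, ys, grid)
print("lambda maximizing the profile log-likelihood:", lam_hat)
```

A value of \(\lambda\) near 1 suggests no transformation is needed, while a value near 0 suggests a log transform. In practice one usually rounds to an interpretable value inside the confidence interval for \(\lambda\) rather than using the exact maximizer.
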

Compare the goodness of fit, uncertainty, explanatory and predictive power of the new model with the model selected in the midterm. The above model evaluations should follow the same guidelines and expectations of the midterm. If you have arrived at the same model, use advanced tools such as partial residual plots to discuss the structural uncertainty of the model. Expand upon the earlier analysis to include comparisons of the effects of parameters based on, e.g., change of scale or exclusion of influential observations. Discuss the differences in predictive power observed with the confidence intervals and with the cross validation RMSE.
Make conclusions based on the above comparison. Does it appear that there is a reasonable linear signal in the data? Do these methods perform adequately for actionable prediction or explanation purposes regarding the research question? Does it suggest that other (e.g. nonlinear) methods should be used?

What to turn in:

Your corresponding code and work should be included in the final appendix, section 8; I reserve the right to request a copy of the original analysis. If there isn’t sufficient documentation in the appendix and the student cannot provide it on request, the final project will not receive any credit. Cases of plagiarism will be handled in accordance with the policy on academic dishonesty.

Your report should be written clearly and structured as follows:

Section 1: Introduction. Discuss the data set, your opening research question and why this question is meaningful.
Section 2: Summarize the work performed in the midterm, the issues encountered and your plan to address these in this final work. Discuss your (possibly) revised research question and why this is meaningful.
Section 3: Discuss the diagnostics of the model constructed for the midterm, which assumptions may not be satisfied, and which seem to be "OK".
Section 4: Discuss your remediation steps and your process of re-selecting, fitting, and revising the model and its variables.
Section 5: Compare the goodness of fit, uncertainty, predictive and explanatory power of the model when revised according to the diagnostics. How does your view of the predictive and explanatory power change as you revise the model according to the diagnostics and remediation techniques?
Section 6: Discuss what conclusions can be made based on this analysis. Is there a reasonable linear signal in the data? How sure can you be of the effects in this data set? What could be performed as future work?
Section 7: References to data sets, papers, books or other works consulted.
Section 8: An appendix including relevant code and work.

Whenever plotting:

Your plot must be clearly labeled on all axes and legends, and it must include a clear title.
The plot must be sensible and easy to read.

Rubric

The rubric below describes the work required in each category, and the associated points, for full credit on this assignment. Reports that do not address all of these points, or that give inadequate attention to them, will receive partial credit. Adequate attention is contextual and subjective, based on the problem itself and the overall work performed in the report. Students are encouraged to discuss a rough draft of their report with the instructor to get feedback on how to better address these points. Additionally, reports that do not follow the document outline, do not use clear language, have formatting or writing errors, or contain unprofessional figures may be penalized for some of the points below.

Category	Expected results	Total points
Research question	The student effectively discusses their research question. The student clearly describes why this question is relevant and interesting.	10 points
Diagnostics	The student effectively discusses whether the standard regression hypotheses seem to be satisfied for the midterm model based on statistical and visual tests. Hypotheses that do not seem to be fully satisfied are discussed, and the implications of these hypotheses failing are qualified to the reader in the analysis.	12 points
Remediation	The student discusses what remedial measures have been tried to obtain a better model. Statistical tests such as, e.g., Box-Cox should be used to discuss the rationale for transformation of variables or not. The student uses diagnostics to evaluate if remedial measures have improved the suitability of the model.	12 points
Goodness of fit and sources of uncertainty	The student discusses measures of goodness of fit such as adjusted \(R^2\) and how this value is interpreted in this problem. Differences between the model from the midterm and the model selected in the final (if any) are discussed and are connected to the research problem. If the model selected is the same, the student discusses if there appears to be an identifiable linear signal in the data, or if the methods are inconclusive.	12 points
Predictive power	The student discusses differences (if any) between predictions of the model from the midterm and the model in the final project. The student discusses if the confidence intervals are practical / actionable for the research question. The student discusses the cross validation RMSE of the midterm and the final model, and if there are substantial differences. The student discusses if the confidence intervals for predicting new cases provide similar results to the cross validation RMSE.	12 points
Explanatory power	The student effectively interprets the effects of the predictors on the response variable, in terms of the sign of the parameter, the size of the parameter relative to the scales of the variables, and the relative importance of different effects in the model. The student discusses possible confounding variables, effects of correlations between predictors and if the interpretation of parameters is stable across multiple models.	12 points
Discussion and Conclusion	The student effectively summarizes their work in the project and draws final connections between the statistical relationships observed and the research question posed at the beginning. The student discusses how the original research question might be revised and what the research question will be for the final.	10 points
Grand total	All of the above	80/80 points

In addition, reports that fail to follow the instructions of this assignment or the structure for the report, or that fail to meet standards of scientific writing, may be subject to a loss of points.