STAT 757 – Section 1001
Instructor: Colin Grudzien
Due: 11/06/2020, 11:59 PM
You may work with others on this project but you must turn in your own work. You may type your solutions in any way you like (LaTeX, Markdown, Office, etc.) as long as you present your work clearly and in an organized way. The assignment should be uploaded to Webcampus.
Whenever plotting:
- Your plot must be clearly labeled on all axes and in all legends, and it must include a clear title.
- The plot must be sensible and easy to read.
For the midterm project, you are to write a research report in RMarkdown. The report must be uploaded in PDF format to Webcampus. The report should be around 7-10 pages, including figures and written text, but not counting the references or appendix. This is an approximate number, and you will only be graded on the quality of your analysis; if you can do quality analysis in fewer pages, this is encouraged. Concisely address the points below. Include figures and tables for the most important components of your analysis, and for explanation purposes.
You will identify a research question that you would like to answer with linear regression by studying a particular data set. This question should have the following attributes,
For example, you may consider the question “When factoring for behavioral differences and the variability of the health of US adults, does smoking cigarettes cause a practically significant increase in the rate of cancer?” The relationship between smoking and cancer is now widely accepted, but it was established largely by statistical techniques. You should consider how this question satisfies the above criteria when you formulate your own question.
This question will give you context for the methods studied in the course. Over the rest of the semester, the focus of this course is to learn statistical modeling frameworks that can help answer this question.
This midterm assignment is designed to prepare you for your final project. In particular, you must begin your own open-ended investigation into a data set of your choice. Each individual in the class must study a unique data set, different from every other individual’s data set. You will be expected to perform the following:
Your corresponding code and work should be included in the final appendix, section 6; I reserve the right to request a copy of the original analysis. If there isn’t sufficient documentation in the appendix and the student cannot provide it upon request, the midterm will not receive any credit. Cases of plagiarism will furthermore be handled according to the syllabus policy on academic dishonesty.
Upon completing the report, the student will demonstrate:
The rubric below describes the work required in each category, and the associated points, to receive full credit on this assignment. Reports that do not address all of these points, or that give inadequate attention to them, will receive partial credit. Adequate attention is contextual and subjective, based on the problem itself and the overall work performed in the report. Students are encouraged to discuss a rough draft of their report with the instructor to get feedback on how to better address these points. Additionally, reports that do not follow the document outline, do not use clear language, have formatting or writing errors, or include unprofessional figures may be penalized for some of the points below.
|The student effectively discusses their research question, demonstrating the attributes described above. The student clearly describes why this question is relevant and interesting.
|The student effectively discusses connections between their research question and the summary statistics and frequency distributions of the data. The student evaluates the data for the presence of outliers, multimodal and / or skewed distributions. The student makes effective use of plots to demonstrate relationships.
|The student examines more than one model for the possible relationship in the research question. The student effectively uses hypothesis tests and / or confidence intervals to systematically select a model that is a plausible alternative to the null model and to other candidate models treated as the null hypothesis.
|Goodness of fit and sources of uncertainty
|The student discusses measures of goodness of fit such as \(R^2\) and how this value is interpreted in this problem. The student discusses whether there is reason to believe there is plausibly a nonlinear or null relationship between the response and the predictors, leading to structural uncertainty in the model. For parameters that are significant, the width of the confidence intervals is analyzed relative to the scale of the response and the scale of the predictor.
|The student examines the differences between the prediction confidence intervals for the mean response and for a new observation. Predictions are evaluated at the center of the data and at extreme measurements. The student discusses if the confidence intervals are practical / actionable for the research question.
|The student effectively interprets the effects of the predictors on the response variable, in terms of the sign of the parameter, the size of the parameter relative to the scales of the variables, and the relative importance of different effects in the model. The student discusses possible confounding variables, the effects of correlations between predictors, and whether the interpretation of parameters is stable across multiple models.
|Discussion and Conclusion
|The student effectively summarizes their work in the project and draws final connections between the statistical relationships observed and the research question posed at the beginning. The student discusses how the original research question might be revised and what the research question will be for the final.
|All of the above
In addition, reports that fail to follow the instructions of this assignment, the structure for the proposal, or to meet standards of scientific writing may be subject to a loss of points.
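As a reminder for the rubric item on prediction, the two kinds of interval can be written as follows. This is only a sketch in standard notation that does not appear in the assignment itself: it assumes a linear model fit by least squares with \(n\) observations, \(p\) estimated parameters, design matrix \(X\), residual standard error \(\hat{\sigma}\), and a new predictor vector \(x_0\) with fitted value \(\hat{y}_0 = x_0^\top \hat{\beta}\).

\[
\text{mean response:} \quad \hat{y}_0 \pm t_{n-p}^{(\alpha/2)} \, \hat{\sigma} \sqrt{x_0^\top (X^\top X)^{-1} x_0}
\]

\[
\text{new observation:} \quad \hat{y}_0 \pm t_{n-p}^{(\alpha/2)} \, \hat{\sigma} \sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}
\]

The extra 1 under the square root accounts for the noise in a single new observation, so the second interval is always wider than the first. In a model with an intercept, both intervals widen as \(x_0\) moves away from the center of the data, which is relevant when comparing predictions at central and extreme values.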