Midterm 2

STAT 757 – Section 1001
Instructor: Colin Grudzien
Due: 11/06/2020, 11:59 PM

Instructions:

You may work with others on this project, but you must turn in your own work. You may type your solutions in any way you like (LaTeX, Markdown, Office, etc.) as long as you present your work clearly and in an organized way.

Summary

For the midterm project, you are to write a research report in RMarkdown. The report must be uploaded in PDF format to Webcampus. The report should be around 7-10 pages, including figures and written text, but not counting the references or appendix. This is an approximate number, and you will only be graded on the quality of your analysis; if you can do quality analysis in fewer pages, this is encouraged. Concisely address the points below. Include figures and tables for the most important components of your analysis, and for explanation purposes.

Research question

You will identify a research question that you would like to answer with linear regression by studying a particular data set. This question should have the following attributes:

  • the question is clear: it provides enough specifics that one’s audience can easily understand its purpose without needing additional explanation.
  • the question is focused: it is narrow enough that it can be answered concretely.
  • the question is concise: it is expressed in the fewest possible words.
  • the question is complex: it is not answerable with a simple “yes” or “no,” but rather requires synthesis and analysis of ideas and sources prior to composition of an answer.
  • the question is arguable: its potential answers are open to debate rather than accepted facts.

For example, you may consider the question “When factoring for behavioral differences and the variability of the health of US adults, does smoking cigarettes cause a practically significant increase in the rate of cancer?” The relationship between smoking and cancer is now widely accepted, but it was established largely by statistical techniques. You should consider how this question satisfies the above criteria when you formulate your own question.

This question will give you context for the methods studied in the course. Over the rest of the semester, the focus of this course is to learn statistical modeling frameworks that can help answer this question.

The task:

This midterm assignment is designed to prepare you for your final project. In particular, you must begin your own open-ended investigation into a data set of your choice. Each individual in class must study a unique data set, different from every other individual’s data set in this class.

You will be expected to perform the following:
  1. Select a data set of your choice, providing a URL, DOI or other documentation of its authenticity and of its uniqueness with respect to the other projects in class. You may not use data from the Faraway library.

  2. You must submit your data set and tentative research question to the instructor for approval. When the (possibly revised) research question and data set have been approved, these should be registered in the Canvas Midterm 2 signup sheet.

  3. Perform exploratory analysis. You should interrogate the data for patterns, multi-modality, correlation between variables, summary statistics, trends, outliers, power laws (nonlinear scaling) and any points of interest. Plot relationships between variables and see how much they covary. You should only include the interesting parts of your analysis in your writeup, and limit figures to the most relevant ones.

  4. Select at least one variable to study as a response variable and construct a linear model to describe at least one relationship related to your research question.

  5. Test the explanatory variables in the model for correlation and significance. Try to make the model as simple as possible, but without leaving anything important out. Justify your choices using hypothesis testing and confidence intervals for selection of parameters. You should compare multiple model choices in describing this relationship, including the null model.

  6. Evaluate the goodness of fit of the model in terms of \(R^2\) and the standard errors, and the major sources of uncertainty. This includes parameter uncertainty, as well as structural uncertainty in the model.

  7. Evaluate the predictive power of the model. In particular, how effective does the model appear to be at making predictions of future observations or the mean response? How might these predictions be unreliable? What are the limits of the predictive power, and where do we fall into extrapolation?

  8. Evaluate the explanatory power of the model. Are there other plausible explanations for cause and effect in the response? How does this compare to other choices of the model? Can you account for confounding variables?

  9. Evaluate issues that you think you might encounter with the assumptions we have utilized so far. For the final, we will diagnose the issues quantitatively, and take remedial measures to improve the model. It is not necessary to formally test the model, as this will include new material; however, you can discuss generally what you anticipate.
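As a rough illustration of steps 4 through 6, the sketch below fits a simple linear regression by ordinary least squares and reports \(R^2\), the residual standard error, the standard error and t statistic of the slope, and the F statistic against the null (intercept-only) model. It is written in plain Python only to make the arithmetic explicit; the function name `fit_simple_ols` and the toy data are hypothetical and not part of the assignment, and your report itself should be produced in RMarkdown with your own approved data set.

```python
import math


def fit_simple_ols(x, y):
    """Fit y = b0 + b1 * x by ordinary least squares and return summary quantities."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Centered sums of squares and cross products.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual sum of squares of the fitted model, and of the null model
    # (the null model predicts y_bar everywhere, so its RSS is the total SS).
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)

    # Residual standard error on n - 2 degrees of freedom.
    sigma_hat = math.sqrt(rss / (n - 2))
    se_slope = sigma_hat / math.sqrt(sxx)

    return {
        "intercept": intercept,
        "slope": slope,
        "r_squared": 1 - rss / tss,
        "sigma_hat": sigma_hat,
        "se_slope": se_slope,
        "t_slope": slope / se_slope,
        # F statistic comparing the fitted model to the null model.
        "f_vs_null": (tss - rss) / (rss / (n - 2)),
    }


# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

summary = fit_simple_ols(x, y)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

With a single predictor, the F statistic against the null model equals the square of the slope's t statistic, so the two tests agree. With several predictors they answer different questions, which is one reason to compare multiple candidate models rather than relying on a single summary table.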

What to turn in:

Your corresponding code and work should be included in the final appendix, Section 6; I reserve the right to request a copy of the original analysis. If there isn’t sufficient documentation in the appendix and the student cannot provide it upon request, the midterm will not receive any credit. Cases of plagiarism will be handled according to the syllabus policy on academic dishonesty.

Your report should be written clearly and structured as follows:

  • Section 1: Introduction. Discuss the data set, your opening research question and why this question is meaningful.

  • Section 2: Describe your exploratory analysis, including relevant tables and figures.

  • Section 3: Describe your model, how you arrived at it, its goodness of fit, its significance versus other choices of models, and its uncertainty. Describe the predictive power, and the uncertainty. Include relevant tables and figures.

  • Section 4: Describe your proposed research question for the final. How will you revise your original research question? What issues have you encountered so far? What assumptions do you think you need to (re-)evaluate?

  • Section 5: References to data sets, papers, books or other works consulted.

  • Section 6: An appendix including relevant code and work.

Learning outcomes

Upon completing the report, the student will demonstrate:

Rubric

The rubric below describes the work required in each category, and the associated points, for full credit on this assignment. Reports that do not address all of these points, or give inadequate attention to these points, will receive partial credit. Adequate attention is contextual and subjective, based on the problem itself and the overall work performed in the report. Students are encouraged to discuss a rough draft of their report with the instructor to get feedback on how to better address these points. Additionally, reports that do not follow the document outline, do not use clear language, have formatting or writing errors, or include unprofessional figures may be penalized for some of the points below.

| Category | Expected results | Total points |
|---|---|---|
| Research question | The student effectively discusses their research question, demonstrating the attributes described above. The student clearly describes why this question is relevant and interesting. | 10 points |
| Exploratory analysis | The student effectively discusses connections between their research question and the summary statistics and frequency distributions of the data. The student evaluates the data for the presence of outliers, multimodal and / or skewed distributions. The student makes effective use of plots to demonstrate relationships. | 12 points |
| Model selection | The student examines more than one model for the possible relationship for the research question. The student effectively uses hypothesis tests and / or confidence intervals to systematically select a model that is a plausible alternative to the null model, and to other possible models taken as the null hypothesis. | 12 points |
| Goodness of fit and sources of uncertainty | The student discusses measures of goodness of fit such as \(R^2\) and how this value is interpreted in this problem. The student discusses if there is reason to believe there is plausibly a nonlinear or null relationship between the response and the predictors, leading to structural uncertainty in the model. For parameters that have significance, the width of confidence intervals relative to the scale of the response and the scale of the predictor is analyzed. | 12 points |
| Predictive power | The student examines the differences between the confidence intervals for the mean response and the prediction intervals for a new observation. Predictions are evaluated at the center of the data and at extreme measurements. The student discusses if these intervals are practical / actionable for the research question. | 12 points |
| Explanatory power | The student effectively interprets the effects of the predictors on the response variable, in terms of the sign of the parameter, the size of the parameter relative to the scales of the variables, and the relative importance of different effects in the model. The student discusses possible confounding variables, effects of correlations between predictors and if the interpretation of parameters is stable across multiple models. | 12 points |
| Discussion and Conclusion | The student effectively summarizes their work in the project and draws final connections between the statistical relationships observed and the research question posed at the beginning. The student discusses how the original research question might be revised and what the research question will be for the final. | 10 points |
| Grand total | All of the above | 80/80 points |
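To make the "Predictive power" row concrete, the sketch below compares the standard error of the estimated mean response with the standard error for a single new observation, evaluated at the center of the data, at the edge of the data, and at an extrapolated point. Each interval is the fitted value plus or minus a t quantile (on \(n - 2\) degrees of freedom for simple regression) times the corresponding standard error; the quantile is omitted here because it is the same multiplier for both intervals. The function name `prediction_standard_errors` and the toy data are hypothetical, and this is only an illustration under the usual linear model assumptions.

```python
import math


def prediction_standard_errors(x, y, x_new):
    """Return (fitted value, SE of the mean response, SE for a new observation) at x_new."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(rss / (n - 2))

    # This term grows with the squared distance of x_new from the center of the data.
    leverage = 1 / n + (x_new - x_bar) ** 2 / sxx

    se_mean = sigma_hat * math.sqrt(leverage)      # uncertainty in the mean response
    se_new = sigma_hat * math.sqrt(1 + leverage)   # adds the noise of one new observation
    return intercept + slope * x_new, se_mean, se_new


# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

points = {
    "center": sum(x) / len(x),
    "edge": max(x),
    "extrapolation": 2 * max(x),
}
results = {label: prediction_standard_errors(x, y, value) for label, value in points.items()}

for label, (fitted, se_mean, se_new) in results.items():
    print(f"{label}: fitted={fitted:.3f}, se_mean={se_mean:.3f}, se_new={se_new:.3f}")
```

The standard error for a new observation is always larger than that of the mean response, and both grow as the evaluation point moves away from the center of the data. The extrapolated point also depends on the linear form holding outside the observed range, which the standard errors themselves cannot check.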

In addition, reports that fail to follow the instructions of this assignment, the structure for the proposal, or to meet standards of scientific writing may be subject to a loss of points.