Introduction to statistical concepts part II

01/23/2020

FAIR USE ACT DISCLAIMER:
This site is for educational purposes only. This website may contain copyrighted material, the use of which has not been specifically authorized by the copyright holders. The material is made available on this website as a way to advance teaching, and copyright-protected materials are used to the extent necessary to make this class function in a distance learning environment. Under section 107 of the Copyright Act of 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research.

Outline

The following topics will be covered in this lecture:
- Review of key statistical concepts and vocabulary
- Review of the process of a statistical investigation
- More pitfalls in statistical investigations
- What is a “statistic” and what is a “parameter”?
- How do we differentiate different “types” of data?
- What are levels of measurement?

The process of statistical thinking

Statistical thinking or a statistical inquiry has a natural progression.

Statistics always relies on some kind of data, which must be collected somehow.
Polls must be administered, weather balloons need to be released, etc…

The steps of statistical thinking can be loosely summarized as:

Prepare;
Analyze; and
Conclude.

Each of these steps includes several elements which we will discuss in the following.

Prepare

In preparing a statistical study, we should consider the following:

Context

What does the data mean?
What was the purpose of the study / data collection?
Is this data appropriate for our question of interest?

Source of the data

Is the data from a source with a special interest?
Would there be pressure to obtain results that are favorable to the source?

Sampling Method

Was the data collected in a way that was unbiased?
Are there reasons why the sample wouldn't reflect the population?
For example, are the answers all voluntary?
Do all participants self-select or are the participants selected methodically?
Are there reasons why certain segments of the population wouldn't respond to the poll?
In an experiment, was the measurement instrument used appropriate in this context?
Is there missing data or are there errors in the data?

Examples

In each of the following, discuss with a neighbor what we should consider about this data set when we prepare a statistical inquiry:

Example: The Physicians Committee for Responsible Medicine tends to oppose the use of meat and dairy products in our diets, and that organization has received hundreds of thousands of dollars in funding from the Foundation to Support Animal Protection.

Concern: The Committee collecting the data may benefit from certain conclusions over others, compromising the objectivity of the data.
We should consider if this data is actually fit for the purpose of an objective study on diet.

Example: A survey of 721 subjects involved the providing of personal information when using Wi-Fi hotspots. The survey subjects were Internet users who responded to a question that was posted on the electronic edition of USA Today.

Concern: These are voluntary answers and are easily falsified by participants who don’t want to disclose their actual personal data for hotspot access.
We may have reason to believe the responses are unreliable because participants self-reported, and that they may be unrepresentative because the responses are voluntary.

Example: In a survey of beliefs about evolution, Gallup pollsters randomly selected and telephoned 1018 adults in the United States.

Concern: The random selection of a large number of samples is a good data collection strategy, but we should also be aware that fewer individuals are using listed numbers or land-lines currently. This also excludes various marginalized groups who are homeless or who don’t have a fixed residence.
This may mean that the demographics of those surveyed will not be representative of the full US population. We may be able to make conclusions only about a smaller population.

Analyze

After preparing the data, we use mathematical techniques to analyze the data.
We should remember that there are many common errors in the process of data analysis that can dramatically affect the conclusions of a statistical inquiry.
A non-exhaustive list includes:
- Relying on reported results – it is usually preferable to make direct measurements rather than to let participants in a study report their own measurements.
- Small number of samples – if we have too few samples of a population, the sample may be grossly unrepresentative of the population.
- Loaded questions – different phrasings of a survey question can dramatically change the responses of those surveyed. When a survey question is written to elicit a certain response, we call this a loaded question.
- Poorly ordered questions – sometimes the order of questions can lead to a reaction from those responding, intentional or not. If the order of questions isn't chosen carefully, this can affect the results.
- Not considering non-responses – it is possible that those who don't respond to polls do so for specific reasons and not by random chance. The non-responses as a class of data can have an important impact on our analysis of the population.
- Precise numbers – if we see a very specific number in a data summary such as “there are 241,472,385 adults in the United States”, the accuracy of this statement shouldn't be trusted.
- It is very unlikely that even when the data was collected, there were exactly this number of adults, and there should be some estimate of uncertainty provided with this number.

Percentages

Percentages should be treated carefully – we will refresh on the appropriate usage of percentages and their meanings:
Percent – comes from Latin, meaning per 100 or “per centum”.
Percentage of: To find a percentage of an amount, we drop the % symbol and divide the percentage value by 100, then multiply by the amount. For example:

\[ 6\%\text{ of } 1200 \text{ responses } = \frac{6}{100} \times 1200 = 72 \]

Convert from a decimal to a percentage: we multiply by 100%, keeping the units of percent at the end of the operation. This example shows that 0.25 is equivalent to 25%:

\[ 0.25 \times 100\% = 25\% \]

Convert from a fraction to a percentage: we divide the denominator into the numerator to get an equivalent decimal number; we can then follow the same steps as with the decimal (possibly rounding). For example:

\[ \frac{3}{4} = 0.75 \rightarrow 0.75 \times 100\% = 75\% \]

Convert from a percentage to a decimal number: we divide the percent by 100%, cancelling the units of %:

\[ \frac{85\%}{100\%} = 0.85 \]
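The four conversions above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function names are hypothetical and not part of the lecture.

```python
# Minimal sketch of the percentage rules above (function names are illustrative).

def percent_of(percent, amount):
    """Find `percent`% of `amount`: drop the % sign, divide by 100, multiply."""
    return percent * amount / 100

def decimal_to_percent(value):
    """Convert a decimal number to a percentage value (multiply by 100%)."""
    return value * 100

def fraction_to_percent(numerator, denominator):
    """Convert a fraction to a percentage via its decimal equivalent."""
    return decimal_to_percent(numerator / denominator)

def percent_to_decimal(percent):
    """Convert a percentage value to a decimal number (divide by 100%)."""
    return percent / 100

print(percent_of(6, 1200))        # 6% of 1200 responses
print(decimal_to_percent(0.25))   # 0.25 as a percentage
print(fraction_to_percent(3, 4))  # 3/4 as a percentage
print(percent_to_decimal(85))     # 85% as a decimal
```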

Percent change

We must be more careful when we discuss the notion of percent change. Language about percent change is often used in misleading or incorrect ways, and it takes more care to analyze these statements.
Percent change – suppose our starting value is \( X_1 \) and this changes over time to the value \( X_2 \). We assume that \( X_1 \) and \( X_2 \) refer to physical quantities, so that \( X_1 , X_2 \geq 0 \), and \( X = 0 \) refers to none of some physical quantity.
Let's suppose that \( X_2 \geq X_1 > 0 \), so that the change is an increase. In this case, we can compute the percent increase as the difference of \( X_2 \) and \( X_1 \), relative to the original value \( X_1 \), converted to percent units:

\[ \text{percent increase} = \frac{X_2 - X_1}{X_1} \times 100 \% \]
Discuss with a neighbor: if \( X_2 \geq X_1 \), what possible values can a percent increase take? Particularly, what are the smallest and largest values?
- A: at smallest \( X_2 = X_1 \) so that the smallest percent increase is given by
\[ \frac{X_2 - X_1}{X_1} \times 100\% = \frac{ 0 }{X_1} \times 100\% = 0\% \]
- \( X_2 \geq X_1 \) can be arbitrarily large, so we can have arbitrarily large percent increases. E.g., if \( X_1 = 1 \) and \( X_2 = 11 \) then there is an increase of
\[ \frac{11 - 1}{1} \times 100\% = \frac{ 10 }{1} \times 100\% = 1000\%. \]
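As a small sketch of the percent increase formula, assuming a positive starting value so that the division is defined (the function name is hypothetical):

```python
def percent_increase(x1, x2):
    """Percent increase from x1 to x2, relative to the original value x1.

    Assumes 0 < x1 <= x2, matching the setting described above.
    """
    if x1 <= 0:
        raise ValueError("the original value x1 must be positive")
    if x2 < x1:
        raise ValueError("x2 must be at least x1 for an increase")
    return (x2 - x1) / x1 * 100

print(percent_increase(5, 5))   # no change: the smallest possible increase
print(percent_increase(1, 11))  # the example above
```

Because `x2` has no upper limit, the returned value has no upper limit either.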

Percent change continued

Percent change – suppose our starting value is \( X_1 \) and this changes over time to the value \( X_2 \). We assume that \( X_1 \) and \( X_2 \) refer to quantities, so that \( X_1 , X_2 \geq 0 \), and \( X = 0 \) refers to none of some quantity.
However, let's suppose that \( X_2 \leq X_1 \) so that the change is a decrease.
In this case, we can compute the percent decrease as the difference of \( X_1 \) and \( X_2 \), relative to the original value \( X_1 \), converted to percent units:

\[ \text{percent decrease} = \frac{X_1 - X_2}{X_1} \times 100 \% \]
Discuss with a neighbor: if \( X_2 \leq X_1 \), what possible values can a percent decrease take? Particularly, what are the smallest and largest values?
- A: at largest \( X_2 = X_1 \), so that the smallest percent decrease is given by
\[ \frac{X_1 - X_2}{X_1} \times 100\% = \frac{ 0 }{X_1} \times 100\% = 0\% \]
- On the other hand, \( X_2 \leq X_1 \) can only take a value as small as \( 0 \) when the units of measurement refer to some physical quantity.
- Thus for \( X_2 = 0 < X_1 \) then there is a decrease of
\[ \frac{X_1 - 0}{X_1} \times 100\% = \frac{X_1}{X_1} \times 100\% = 100\%. \]
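The bounds on a percent decrease can be checked numerically. The sketch below assumes a positive starting value and a non-negative final value; the function name and the starting value of 50 are only illustrative.

```python
def percent_decrease(x1, x2):
    """Percent decrease from x1 to x2, relative to the original value x1.

    Assumes 0 <= x2 <= x1 and x1 > 0, matching the setting described above.
    """
    if x1 <= 0:
        raise ValueError("the original value x1 must be positive")
    if not 0 <= x2 <= x1:
        raise ValueError("x2 must lie between 0 and x1 for a decrease")
    return (x1 - x2) / x1 * 100

# Sweep x2 from x1 down to 0 and confirm the decrease stays within [0, 100].
x1 = 50
decreases = [percent_decrease(x1, x2) for x2 in range(x1, -1, -1)]
print(min(decreases), max(decreases))
```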

Percentages examples

A Gallup poll of 1018 adults reported that 39% believe in evolution. Using the rules of percentages, we find that accordingly:

\[ 0.39 \times 1018 = 397.02 \]

respondents said they believed in evolution.

Discuss with a neighbor: can this number be correct? Did we make a mistake?
- Most likely, the reported value of \( 39\% \) was rounded and is not exact.
- Suppose that it had been exactly 397 or 398 respondents, then we would have gotten the percent values as
\[ \frac{397}{1018} \approx 0.3899804, \text{ so the true percent is } \approx 0.3899804 \times 100\% \approx 38.99804\%; \]
- Similarly,
\[ \frac{398}{1018} \approx 0.3909627, \text{ so the true percent is } \approx 0.3909627 \times 100\% \approx 39.09627\%; \]
This is a relatively minor way in which stating \( 39\% \) of respondents was misleading – we cannot tell which way the value was rounded, or therefore whether it was exactly 397 or 398 participants.
The difference of 397 or 398 responses isn't really important in this context. However, many times percentages are used in much more misleading ways.
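To see how much a rounded percentage can hide, the sketch below lists every whole number of respondents that would be reported as a given percentage. It assumes the figure was rounded to the nearest whole percent, which the poll summary does not state; the function name is hypothetical.

```python
def consistent_counts(reported_percent, n):
    """Whole-number counts k out of n whose percentage rounds to reported_percent.

    Assumption: rounding to the nearest whole percent, i.e.
    reported_percent - 0.5 <= 100 * k / n < reported_percent + 0.5.
    Integer arithmetic is used to avoid floating-point edge cases.
    """
    lower = (2 * reported_percent - 1) * n   # 200*k must be at least this
    upper = (2 * reported_percent + 1) * n   # 200*k must be below this
    return [k for k in range(n + 1) if lower <= 200 * k < upper]

gallup_counts = consistent_counts(39, 1018)
print(gallup_counts[0], "to", gallup_counts[-1])
```

Under that rounding assumption, both 397 and 398 are in the list, along with several neighboring counts, so the reported percentage alone cannot identify the exact number of respondents.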

Percentages examples

Quite often, percentages over 100% are used incorrectly. Discuss with a neighbor: What is wrong with the following statement? How does this relate to the range of possible values of “size?”

An ad for Big Skinny wallets included the statement that one of their wallets “reduces your filled wallet size by 50%–200%.”

The percent decrease of 100% or greater is physically impossible…
Consider that we can measure a wallet’s size in cubic centimeters, \( cm^3 \).
Suppose our current wallet is approximately \( 50cm^3 \).
If our new wallet is \( 0cm^3 \), then we have a reduction in \( cm^3 \) of \[ \frac{50 - 0 }{50} \times 100\% = \frac{50}{50} \times 100\% = 100\%. \]
Thus if the new wallet takes up zero physical space, it is a \( 100\% \) reduction in wallet size.
If we had a reduction of greater than \( 100\% \), the new wallet would have to take up negative values of space…
This illustrates the importance of the units we are discussing in percent change.
When zero has a well defined meaning as none of the quantity, we cannot have greater than a 100% decrease.
When we see percent decreases greater than 100%, the value must be on a scale that permits negative numbers.
In this case, it does not relate to an actual physical quantity of something.
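The wallet argument can be sketched by solving the percent decrease formula for the new size. The 50 cubic centimeter starting size is the value assumed in the example above; the function name is hypothetical.

```python
def size_after_decrease(original_size, percent_decrease):
    """Size implied by reducing original_size by the given percent decrease.

    Rearranges percent decrease = (X1 - X2) / X1 * 100 to solve for X2.
    """
    return original_size * (1 - percent_decrease / 100)

for claimed in (50, 100, 200):
    new_size = size_after_decrease(50, claimed)
    note = "impossible: negative volume" if new_size < 0 else "non-negative"
    print(f"{claimed}% reduction -> {new_size} cm^3 ({note})")
```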

Percentages examples

Example: The Newport Chronicle ran a survey by asking readers to call in their response to this question:

“Do you support the development of atomic weapons that could kill millions of innocent people?”

It was reported that 20 readers responded and that 87% said “no,” while 13% said “yes.”
Discuss with a neighbor: can you identify four major issues with this survey?
Here are four examples:
- This is a voluntary poll, where most likely the callers will respond because they already have strong feelings on the topic.
- The question is loaded – it is designed in a way to elicit a certain emotional response.
- The percentages are misleading at the very least. If we took these precise numbers literally, it would imply that
\[ 0.87 \times 20 = 17.4 \]

respondents said “no,” while

\[ 0.13 \times 20 = 2.6 \]

respondents said “yes.”
- This is also a very small number of samples, with only 20 responses. The samples for various reasons are unlikely to accurately reflect a population of interest, even just all readers of the Newport Chronicle.
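The problem with the reported percentages can be checked directly: with 20 respondents, every achievable percentage is a multiple of 5. The sketch below assumes rounding to the nearest whole percent, and the function name is hypothetical.

```python
def consistent_counts(reported_percent, n):
    """Whole-number counts k out of n whose percentage rounds to reported_percent.

    Assumption: rounding to the nearest whole percent. Integer arithmetic
    avoids floating-point edge cases.
    """
    lower = (2 * reported_percent - 1) * n
    upper = (2 * reported_percent + 1) * n
    return [k for k in range(n + 1) if lower <= 200 * k < upper]

print(consistent_counts(87, 20))  # counts of "no" answers that give 87%
print(consistent_counts(13, 20))  # counts of "yes" answers that give 13%
print(consistent_counts(85, 20))  # a percentage that 20 respondents can produce
```

Both of the first two lists come back empty, so no whole number of the 20 callers can produce the reported 87% and 13%.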

Statistics versus parameters

We have begun to consider the differences between a sample and population. Discuss with a neighbor: what is the difference between a sample and a population?
- A sample refers to the available observations within a data set, such as the actual responses from a poll of UNR students.
- A population refers to the entire collection of possible-to-measure data points, even if they aren't measured in our data set. This would correspond to all UNR students, not just those who responded to the above poll.
We will introduce two new definitions related to samples and populations:
- A parameter is a numerical measurement describing some characteristic of a population.
- A statistic is a numerical measurement describing some characteristic of a sample.
- The alliteration in “population parameter” and “sample statistic” helps us remember the meaning of these terms.

Statistics versus parameters continued

Discuss with a neighbor: suppose we want to find out the average age of students at UNR.
- Suppose out of a voluntary poll of UNR students, the average age of all respondents was 21. Is this average value of 21 a statistic or a parameter?
- This is a statistic, because this is the value computed from a sample of the full population of UNR students.
- Suppose we look at the enrollment records of all students and compute the average age from this data. Is the average age from enrollment records a statistic or a parameter?
- This is a parameter, because this is a numerical value computed from the entire collection of all UNR students.
- What is a scenario in which the sample statistic equals the population parameter?
- This can occur when we have observations of the entire population in our sample data. For example, if every student responded to the poll above, the statistic and the parameter would agree.
In many cases, such as if we studied the average age of all people living in the USA, we don't know and have no effective way to compute the parameter exactly.
- We cannot possibly observe the exact age of every person living in the USA.
Parameters are usually much more uncertain for this reason, and we generally must estimate parameters using statistical methodology.
If we use good methodology, we can provide good estimates of the population parameters, with estimates of how certain or uncertain we are about the value.
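The distinction can be illustrated with a simulation. The ages below are randomly generated stand-ins, not real enrollment data, and the population and sample sizes are arbitrary assumptions.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: made-up ages for 20,000 students (not real data).
population_ages = [random.randint(17, 40) for _ in range(20000)]

# Parameter: computed from every member of the population.
parameter = mean(population_ages)

# Statistic: computed from a sample of 200 students drawn at random.
sample_ages = random.sample(population_ages, 200)
statistic = mean(sample_ages)

# If the "sample" contains the whole population, the two values agree.
census_ages = random.sample(population_ages, len(population_ages))
census_statistic = mean(census_ages)

print("parameter:", parameter)
print("statistic from 200 students:", statistic)
print("statistic from everyone:", census_statistic)
```

In practice the parameter line is the one we usually cannot compute, which is why the sample statistic is used to estimate it.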

Quantitative versus qualitative data

We will make an important distinction between data which we call quantitative and qualitative data.
Discuss with a neighbor: what is qualitative data? Can you provide an example of a piece of qualitative data? How is this distinguished from quantitative data?
- Qualitative data is a kind of data that is not numeric. For example if we talk about cars, the weight is numeric and thus quantitative but the color of the car is qualitative.
- In many cases, qualitative data can be described by a number of categories.
- The categories for colors of a car can include values like red, blue, green, etc…

Continuous versus discrete data

Quantitative data most often carries some additional descriptors.
For example, quantitative data often carries a unit of measurement, which we should include in our analysis and discussion.
We can also distinguish whether the quantitative data is in continuous or discrete units.
Discuss with a neighbor: can you give an example of a continuous unit of measurement? Can you give an example of a discrete unit of measurement? What is the difference?
- Continuous data – continuous data exists on a scale that we can arbitrarily sub-divide to a finer and finer scale.
- For example, degrees Celsius can be computed to an arbitrary decimal place.
- We might find that the temperature in Reno is 10 degrees Celsius with a basic thermometer, but is 10.23436546323423 degrees with a more accurate instrument.
- We could carry this sub-division of the units to an arbitrary decimal place if we had a precise enough instrument.
- Discrete data – discrete data uses counting units.
- For example, if we are counting the number of students who drive to campus at least once per week, the “student” unit of measurement should not be sub-divided into smaller parts.
- A value of, e.g., 0.25 students doesn't make any sense.

Levels of measurement

A final way of thinking about data types we will discuss is in terms of the level of measurement.
Level of measurement refers to the ways in which data of this type has structure in terms of order or mathematical intercomparison.
For example, in the category of car colors, the observations of red, green and blue have no natural order.
There is no natural mathematical comparison like green \( \leq \) blue.
We will categorize the levels of measurement as follows:

Nominal
Ordinal
Interval
Ratio

These are discussed at times in the book, but the important take-home message is common sense:

Don’t do computations and don’t use statistical methods that are not appropriate for the data.

We must be careful, and think about where mathematical operations make sense – sometimes categories or qualitative data are given numerical names.
For example, it would not make sense to compute an average of Social Security numbers, because those numbers are data that are used for identification, and they don’t represent measurements or counts of anything.

Levels of measurement – Nominal level

Nominal level of measurement – this data consists of names, labels, or categories only. The data cannot be arranged in an ordering scheme (such as low to high) and mathematical operations have no meaning.
Discuss with a neighbor: can you give two examples of nominal leveled data, that is data without a natural order or mathematical operation?
- If the response to a poll question can take the form of “YES”, “NO” or “UNDECIDED”, the observations will be nominal level data.
- There are no mathematical operations that apply to these values and there is no order in terms of YES \( \leq \) NO.
- If students are asked to give the name of their favorite sport “Basketball”, “Baseball”, “Football”, “Soccer”, or “Other”, the data is nominal level.
- There are no mathematical operations that apply to these values and there is no order in terms of Baseball \( \leq \) Other.

Levels of measurement – Ordinal level

Ordinal level of measurement – this data can be arranged in some order, but differences (obtained by subtraction) between data values either cannot be determined or are meaningless.
A simple example of ordinal level data can be seen in food – suppose we have salsa labeled as “MILD”, “MEDIUM” and “SPICY”.
Discuss with a neighbor: using the definition above, why are these labels of ordinal level of measurement?
- There is a natural order given by MILD \( < \) MEDIUM \( < \) SPICY.
- However, the statement MILD \( - \) SPICY is meaningless.
Rankings are ordinal leveled data.
Suppose we are asked to rank our first four favorite musicians – there is a natural order here between 1st, 2nd, 3rd and 4th place.
However, if we subtract \( 4\text{th} - 3\text{rd} \), does this mean we get first place?
If we take the average of the rankings, \[ \frac{1 + 2 + 3 + 4}{4} = 2.5 \] does this average have any meaning?
There aren't meaningful mathematical operations to be performed.
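One way to sketch ordinal data in code is to store the labels in their natural order and compare them only by position. The names below are illustrative, and the positions are used purely for ordering: subtracting or averaging them would not be meaningful.

```python
# Labels listed from least to most spicy; the list order encodes the ranking.
SALSA_ORDER = ["MILD", "MEDIUM", "SPICY"]

def is_milder(a, b):
    """Return True if salsa label a comes before label b in the ordering."""
    return SALSA_ORDER.index(a) < SALSA_ORDER.index(b)

observations = ["SPICY", "MILD", "MEDIUM", "MILD"]

# Sorting is meaningful for ordinal data because only the order is used.
print(sorted(observations, key=SALSA_ORDER.index))
print(is_milder("MILD", "SPICY"))
```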

Levels of measurement – Interval level

Interval level of measurement – this data can be arranged in order, and differences between data values can be found and are meaningful.
- However, data at this level does not have a natural zero starting point at which none of the quantity is present.
Temperature in degrees Celsius is a basic example of an interval measurement.
We can meaningfully order \( 10^\circ > -3^\circ \), and take the difference \( -3^\circ - 10^\circ \).
However, the value \( 0^\circ \) is arbitrary because it doesn't correspond to the physical quantity of heat – that is \( 0^\circ \) doesn't mean the absence of heat in the Celsius scale (as opposed to e.g., Kelvin scale).
For this reason, ratios don't have a consistent meaning in the Celsius (or Fahrenheit) scale.
- The melting point of ice in degrees Celsius is approximately \( 0.001^\circ C \).
- However, an increase of water temperature to \( 0.002^\circ C \) from \( 0.001^\circ C \) doesn't correspond to twice the amount of heat.
- We can convert from degrees Celsius to Kelvin by adding \( 273.150 \) to degrees Celsius.
- In Kelvin (the physical quantity of heat), \( 0.001^\circ C \) equals \( 273.151K \), while \( 0.002^\circ C \) equals \( 273.152K \).
- In Kelvin, the actual increase of heat is given as \[ \frac{273.152 - 273.151}{273.151}\times 100\% = \frac{0.001}{273.151}\times 100\% \approx 3.7\times10^{-6}\times 100\% = 3.7\times 10^{-4}\%. \]
- The inconsistency in Celsius is because the zero value is arbitrary with respect to the physical quantity of heat.
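The contrast between the two scales can be reproduced numerically. This sketch applies the same percent increase formula to the Celsius values and to their Kelvin equivalents; the function and variable names are illustrative.

```python
KELVIN_OFFSET = 273.15  # add this to degrees Celsius to obtain Kelvin

def percent_increase(x1, x2):
    """Percent increase from x1 to x2, relative to the original value x1."""
    return (x2 - x1) / x1 * 100

start_c, end_c = 0.001, 0.002

# On the Celsius scale the zero point is arbitrary, so this ratio is misleading.
celsius_change = percent_increase(start_c, end_c)

# On the Kelvin scale zero means none of the quantity, so the ratio is meaningful.
kelvin_change = percent_increase(start_c + KELVIN_OFFSET, end_c + KELVIN_OFFSET)

print(f"Celsius: {celsius_change:.1f}% increase")
print(f"Kelvin:  {kelvin_change:.6f}% increase")
```

The same physical change looks like a doubling in Celsius but a tiny fraction of a percent in Kelvin.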

Levels of measurement – Ratio level

Ratio level of measurement – this data can be arranged in order, differences can be found and are meaningful, and there is a natural zero starting point (where zero indicates that none of the quantity is present).
The presence of the natural zero corresponding to none of the quantity solves the issues we saw with Celsius – for data at this level, differences and ratios are both meaningful.
From an earlier example, the size of the wallet measured in \( cm^3 \) is data at the ratio level of measurement.
A wallet of size \( 0cm^3 \) corresponded to no wallet, which is why a percent decrease greater than 100% was nonsense.
Discuss with a neighbor: can you identify data that is at the ratio level? Identify what is the natural zero corresponding to none of the quantity.
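The four levels can be summarized as a lookup of which operations are meaningful at each level, following the definitions given above. This is only a study aid; the dictionary layout and function name are assumptions, not a standard API.

```python
# Which operations are meaningful at each level, per the definitions above.
LEVELS = {
    "nominal":  {"order": False, "differences": False, "ratios": False},
    "ordinal":  {"order": True,  "differences": False, "ratios": False},
    "interval": {"order": True,  "differences": True,  "ratios": False},
    "ratio":    {"order": True,  "differences": True,  "ratios": True},
}

def is_meaningful(level, operation):
    """Return True if the operation is meaningful for data at the given level."""
    return LEVELS[level][operation]

# Degrees Celsius are interval level: differences make sense, ratios do not.
print(is_meaningful("interval", "differences"))
print(is_meaningful("interval", "ratios"))
```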