We will focus on the inference about population proportion $p$ based on a single sample in this lecture, and the inference about population mean $\mu$ in the next lecture.
And for today's lecture, we will begin with hypothesis testing!
A famous example described by Ronald Fisher: Lady Tasting Tea.
Who is Ronald Fisher?
A genius who almost single-handedly created the foundations for modern statistical science
YouTubeVideo('lgs7d5saFFc')
In statistics, a hypothesis is a statement about a population, usually claiming that a population parameter takes a particular numerical value or falls in a certain range of values.
Null hypothesis ($H_0$, "H-naught", "H-null", "H-zero" or "H-oh"): a statement that the parameter takes a particular value (or falls in a particular range of values). This is the hypothesis that we wish to reject.
Alternative hypothesis ($H_a$): the opposite of $H_0$. This is the hypothesis that we wish to establish.
A significance test is a method for using data to summarize the evidence about $H_0$ versus $H_a$.
A defendant is considered not guilty as long as his or her guilt is not proven. The prosecutor tries to prove the guilt of the defendant.
Only when there is enough evidence for the prosecution is the defendant convicted.
A test statistic measures how far the sample statistic falls from the null hypothesis value. It is usually in the form of a z-score (in this lecture) or a t-score (in the next lecture).
Assuming $H_0$ is true, the P-value is the probability that the test statistic equals the observed value or a value even more extreme against $H_0$.
Comparing the P-value to a predetermined significance level $\alpha$, we make decisions as follows:
The significance level α is usually chosen to be 0.05, 0.10 or 0.01.
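The comparison between the P-value and $\alpha$ can be sketched in a few lines of Python. This is a minimal illustration that assumes the usual convention (reject $H_0$ when the P-value is at most $\alpha$, otherwise fail to reject); the function name `decide` is hypothetical and not part of the lecture material.

```python
def decide(p_value, alpha=0.05):
    """Return the test decision for a P-value at significance level alpha.

    Assumed convention: reject H0 when p_value <= alpha.
    """
    if not 0 <= p_value <= 1:
        raise ValueError("p_value must be between 0 and 1")
    if p_value <= alpha:
        return "reject H0"
    return "fail to reject H0"


print(decide(0.03))              # compared against alpha = 0.05
print(decide(0.03, alpha=0.01))  # same P-value, stricter alpha
```

The same P-value can lead to different decisions under different significance levels, which is why $\alpha$ should be fixed before looking at the data.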
Here are a couple of remarks about interpreting a hypothesis test.
Too many concepts, and I don't know what the heck they are? 😖
Here is the recipe for hypothesis testing of a population proportion (next slides). 😄
First, set up the null $H_0$ and the alternative $H_a$:
$H_a$ is what we are interested in, and $H_0$ is the opposite of $H_a$;
The hypothesis should have one of the three forms according to Table 1 (Left/Right tail, Two sided).
Third, compute the test statistic $z^*$ through:
$$ z^* = \frac{\hat{p} - p_0}{ \sqrt{ \frac{p_0(1-p_0)}{n} }} $$
Fourth, compute the P-value based on $z^*$ according to Table 2.
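The computation of $z^*$ and the P-value might look as follows in Python. This is only a sketch: Tables 1 and 2 are not reproduced here, so the mapping from the form of $H_a$ to a tail area is the standard one and should be treated as an assumption, and the function name, the `tail` labels and the sample numbers (60 successes out of 100, $p_0 = 0.5$) are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes, n, p0, tail="two-sided"):
    """Return (z_star, p_value) for a one-sample test about a proportion.

    tail: "left"  for Ha: p < p0,
          "right" for Ha: p > p0,
          "two-sided" for Ha: p != p0  (assumed mapping).
    """
    p_hat = successes / n
    se0 = sqrt(p0 * (1 - p0) / n)  # standard error computed under H0
    z_star = (p_hat - p0) / se0
    std_normal = NormalDist()      # standard normal, mean 0 and sd 1
    if tail == "left":
        p_value = std_normal.cdf(z_star)
    elif tail == "right":
        p_value = 1 - std_normal.cdf(z_star)
    elif tail == "two-sided":
        p_value = 2 * (1 - std_normal.cdf(abs(z_star)))
    else:
        raise ValueError("tail must be 'left', 'right' or 'two-sided'")
    return z_star, p_value


# Hypothetical data: 60 successes in n = 100 trials, testing Ha: p > 0.5.
z_star, p_value = proportion_z_test(60, 100, p0=0.5, tail="right")
print(f"z* = {z_star:.3f}, P-value = {p_value:.4f}")
```

Here $\hat{p} = 0.6$ and the standard error under $H_0$ is $0.05$, so $z^*$ is about 2 and the right-tail P-value is about 0.023.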
What is a confidence interval?
We use sample proportion $\hat{p}$ to estimate $p$. $\hat{p}$ is called the point estimate for the population proportion.
An interval estimate is an interval of numbers that is believed to contain the actual value of the parameter.
A confidence interval (CI) is an interval estimate containing the most believable values for a parameter.
It is formed by combining the point estimate and a margin of error (more details on the next slide).
The probability that this method produces an interval that contains the parameter is called the confidence level.
Confidence level is a number less than 1 that we subjectively choose before constructing the interval. Common choices for confidence level are 0.95 (95%), 0.90 (90%) or 0.99 (99%).
Find the correct z-score according to the given confidence level from Table 3, then compute the margin of error (MOE) through $$ MOE = z \times se$$
The lower limit (LL) of the CI is $$ \hat{p} - MOE$$ and the upper limit (UL) of the CI is $$ \hat{p} + MOE$$
Hence the desired CI is given by $(LL, UL)$.
Interpretation: We are ...% confident that the interval (LL, UL) covers the population proportion $p$.
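The CI construction can be sketched in Python as below. Two assumptions are made because the corresponding steps are not shown here: the standard error is taken to be $\sqrt{\hat{p}(1-\hat{p})/n}$, and the z-score is computed from the inverse normal CDF rather than read from Table 3. The function name and the sample numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Return (LL, UL) of a large-sample CI for a population proportion."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)  # assumed standard error formula
    # z-score leaving (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    moe = z * se
    return p_hat - moe, p_hat + moe


# Hypothetical data: 60 successes in n = 100 trials, 95% confidence.
ll, ul = proportion_ci(60, 100, confidence=0.95)
print(f"95% CI: ({ll:.3f}, {ul:.3f})")
```

With these numbers the interval is roughly (0.504, 0.696), so we would say we are 95% confident that it covers the population proportion $p$.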
Recall that last week we covered one-sample inference about the population proportion $p$.
Today we will talk about one-sample inference about the population mean $\mu$, especially in the case where we do not have a large sample ($n < 30$).
Recall from Chapter 7 that, by the Central Limit Theorem (CLT), if we have a sample of size $n \geq 30$ from any population with unknown mean $\mu$ and standard deviation $\sigma$, the sampling distribution of the sample mean $\bar{X}$ is approximately
$$ \bar{X} \sim \mathcal{N}(\mu, \frac{\sigma}{\sqrt{n}})$$
Or equivalently, $$ \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim \mathcal{N}(0, 1) $$
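A quick simulation can make this statement concrete. The sketch below is an illustration under chosen assumptions, not part of the original lecture: it draws samples of size $n = 40$ from an exponential population with $\mu = 1$ and $\sigma = 1$ (a clearly non-normal population), and checks that the sample means have mean close to $\mu$ and standard deviation close to $\sigma / \sqrt{n}$.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)   # fixed seed so the run is reproducible

n = 40           # sample size (at least 30)
reps = 5000      # number of simulated samples
mu, sigma = 1.0, 1.0  # mean and sd of the Exponential(1) population

# One sample mean per simulated sample.
sample_means = [
    mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

print("mean of sample means:", round(mean(sample_means), 3))
print("sd of sample means:  ", round(stdev(sample_means), 3))
print("sigma / sqrt(n):     ", round(sigma / sqrt(n), 3))
```

The last two printed values should be close to each other, in line with the standard deviation $\sigma / \sqrt{n}$ in the formula above.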
Actually, we can use the fact $ \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim \mathcal{N}(0, 1) $ to make statistical inference, but
For a normally distributed population with population mean $\mu$, and a random sample of size $n$ from this population with sample mean $\bar{X}$ and sample standard deviation $s$, the t-score $$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$$
follows Student's t-distribution with $(n - 1)$ degrees of freedom ($df$).
Like the standard normal distribution, the t-distribution is bell-shaped and symmetric about 0.
The shape of the density curve of the t-distribution varies for each distinct value of $df$. Probabilities and t-scores depend on the specific value of $df$, i.e. on the sample size.
Compared with the standard normal distribution, the t-distribution has more density in the tails, and hence more variability. This is caused by replacing the fixed quantity $\sigma$ with the random quantity $S$. However, as the sample size $n$ increases, the shape of the t-distribution gradually approaches that of the standard normal distribution.
The t-Distribution Table (or t-table for short) lists t-scores for certain values of right-tail probabilities and $df$. We use $t_\alpha$ to denote the t-score that has a right-tail probability $\alpha$. For instance, for $df = 1$, $t_{.100} = 3.078$. This means $P(t > 3.078) = .100$.
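The $df = 1$ entry can be checked without any statistics library, because for $df = 1$ the t-distribution coincides with the standard Cauchy distribution, whose right-tail probability is $P(t > x) = \tfrac{1}{2} - \arctan(x)/\pi$. The sketch below uses that closed form; it is valid only for $df = 1$, and the function names are hypothetical.

```python
from math import atan, pi, tan


def t1_right_tail_prob(x):
    """P(t > x) for the t-distribution with df = 1 only."""
    return 0.5 - atan(x) / pi


def t1_right_tail_score(alpha):
    """t-score with right-tail probability alpha, for df = 1 only."""
    return tan(pi * (0.5 - alpha))


print(round(t1_right_tail_score(0.100), 3))  # 3.078
print(round(t1_right_tail_prob(3.078), 3))   # 0.1
```

For other values of $df$ there is no such simple formula, which is why a t-table or statistical software is used.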
IFrame("https://statcao.github.io/teaching/t-table.pdf", width=800, height=600)
Recall from the last lecture that the confidence interval (CI) for a population parameter is a collection of the most probable values of that parameter, and is of the form
point estimate ± margin of error
For the CI for the population mean $\mu$, the point estimate is the sample mean $\bar{x}$, and the margin of error is the standard error multiplied by a t-score.
Suppose we have a random sample of size $n$ with mean $\bar{x}$ and standard deviation $s$, ...
Third, find the t-score from the t-table according to the given confidence level and $df = n - 1$. Then, compute the margin of error (MOE) through $$ MOE = t \times se $$
Fourth, the CI is given by
$$CI = (\bar{x} - MOE, \bar{x} + MOE)$$
Fifth, the interpretation: we are ...% confident that the population mean is between $\bar{x} - MOE$ and $\bar{x} + MOE$.
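The steps for the t-based CI can be sketched in Python as follows. The sketch assumes the standard error is $se = s / \sqrt{n}$, consistent with the t-score formula earlier, and takes the t-score as an input read from the t-table rather than computing it. The data are made up for illustration, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_ci(sample, t_score):
    """Return (LL, UL) of a t-based CI for the population mean.

    t_score must be looked up in the t-table for df = len(sample) - 1
    and the chosen confidence level.
    """
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)   # sample standard deviation (divides by n - 1)
    se = s / sqrt(n)    # assumed standard error of the sample mean
    moe = t_score * se
    return x_bar - moe, x_bar + moe


# Hypothetical sample of size n = 5, so df = 4.
data = [4.8, 5.1, 5.0, 4.7, 5.4]
# For 95% confidence and df = 4, the t-table gives t_{.025} = 2.776.
ll, ul = mean_ci(data, t_score=2.776)
print(f"95% CI: ({ll:.3f}, {ul:.3f})")
```

With this sample, $\bar{x} = 5.0$ and the interval is roughly (4.660, 5.340).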
A hypothesis test that utilizes the t-distribution is usually referred to as a t-test.
Suppose we have a sample of size $n$ with sample mean $\bar{x}$ and standard deviation $s$.
Third, compute the test statistic $t^*$ through
$$t^* = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
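A minimal Python sketch of this computation is given below. It only computes $t^*$ and $df$; the P-value would then be read from the t-table. The data and the null value $\mu_0 = 4.6$ are made up for illustration, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def t_statistic(sample, mu0):
    """Return (t_star, df) for a one-sample t-test of H0: mu = mu0."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)  # sample standard deviation (divides by n - 1)
    t_star = (x_bar - mu0) / (s / sqrt(n))
    return t_star, n - 1


# Hypothetical sample of size n = 5 and null value mu0 = 4.6.
data = [4.8, 5.1, 5.0, 4.7, 5.4]
t_star, df = t_statistic(data, mu0=4.6)
print(f"t* = {t_star:.3f}, df = {df}")
```

Here $t^*$ is about 3.266 with $df = 4$. Since the t-table gives $t_{.025} = 2.776$ for $df = 4$, a right-tailed test on this sample would have a P-value below .025.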