20.3 Sampling distribution: Unknown proportion

In the die example (Sect. 20.1), an equation was given for computing the standard error for the sample proportion for samples of size n, when the value of p was known.

However, the value of p (the parameter) is usually unknown; after all, the reason for taking a sample is to estimate the unknown value of p. When p is unknown, the best available estimate, p^, is used in its place. The standard error of the sample proportion (written s.e.(p^)) is then approximately

\[ \text{s.e.}(\hat{p}) = \sqrt{\frac{\hat{p} \times (1 - \hat{p})}{n}}. \]

Definition 20.2 (Sampling distribution of a sample proportion when p is unknown) When the value of p is unknown, the sampling distribution of the sample proportion is described by

  • an approximate normal distribution,
  • centred around a mean of p (whose value is unknown),
  • with a standard deviation (called the standard error of p^) of

\[ \text{s.e.}(\hat{p}) = \sqrt{\frac{\hat{p} \times (1 - \hat{p})}{n}}, \tag{20.3} \]

when certain conditions are met, where n is the size of the sample, and p^ is the sample proportion.

In general, the approximation gets better as the sample size gets larger.

Let’s pretend for the moment that the proportion of even rolls of a fair die is unknown (to demonstrate some points). In this case, an estimate of the proportion of even rolls can be found by rolling a die n=25 times and computing p^.

Suppose 11 of the n=25 rolls produced an even number, so that p^=11/25=0.44. Then (from Definition 20.2),

\[ \text{s.e.}(\hat{p}) = \sqrt{\frac{0.44 \times (1 - 0.44)}{25}} = 0.099277. \]

(This is very similar to the value of 0.1, the value of the standard error when the value of p was known; see Eq. (20.2).)
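The arithmetic for the standard error can be checked with a few lines of Python. This is only a minimal sketch: the function name `standard_error` is an illustrative choice, not something defined in the text, and the comparison value assumes a fair die with p = 0.5.

```python
import math


def standard_error(proportion: float, n: int) -> float:
    """Approximate standard error of a sample proportion for samples of size n."""
    return math.sqrt(proportion * (1 - proportion) / n)


n = 25            # number of rolls
p_hat = 11 / n    # 11 even rolls observed, so p_hat = 0.44

# Standard error using the sample proportion (p treated as unknown)
se_estimated = standard_error(p_hat, n)

# Standard error using the known value p = 0.5 (assumes a fair die)
se_known = standard_error(0.5, n)

print(round(se_estimated, 6))  # 0.099277
print(round(se_known, 6))      # 0.1
```

The two values differ only in the third decimal place, which is why replacing p with p^ in the formula is a reasonable approximation here.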

Hence, the sample proportions would vary with an approximate normal distribution (Fig. 20.3), centred around the unknown value of p with a standard deviation of s.e.(p^)=0.099277.

FIGURE 20.3: The normal distribution, showing how the proportion of even rolls varies when a die is rolled 25 times

Using the 68–95–99.7 rule again:

About 95% of the values of p^ are expected to be between p − 0.199 and p + 0.199.
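Because the die is actually fair, this statement can be checked against the exact distribution of the number of even rolls, rather than the normal approximation. The sketch below assumes p = 0.5 and that the number of even rolls in 25 rolls follows a binomial distribution; the names `binomial_pmf` and `prob_within` are illustrative only.

```python
from math import comb

n = 25          # number of rolls
p = 0.5         # true proportion of even rolls (assumes a fair die)
margin = 0.199  # two standard errors, taken from the text


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k even rolls in n rolls."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Add up the probabilities of every count k whose sample proportion k/n
# falls within `margin` of p.
prob_within = sum(
    binomial_pmf(k, n, p)
    for k in range(n + 1)
    if abs(k / n - p) <= margin
)

print(round(prob_within, 3))  # 0.957
```

The exact probability is close to, but not exactly, 95%, as expected when a normal distribution is used to approximate a count from only 25 rolls.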

Though we are pretending that the value of p is unknown, the value of p^ is known. What if the roles of p and p^ were ‘reversed’? Then,

About 95% of the values of p are expected to be between p^ − 0.199 and p^ + 0.199.

Since p^=0.44, this is equivalent to:

About 95% of the values of p are expected to be between 0.24 and 0.64.

This interpretation is not quite correct, but the idea seems reasonable. An interval formed this way is called a confidence interval (or CI), and is based on ideas from Sect. 20.2.

In summary, using p^=0.44 and s.e.(p^)=0.0993, the (approximate) 95% CI is

0.44±(2×0.0993), or from 0.241 to 0.639. This CI straddles the population proportion of p=0.5, though we would not know this if p truly was unknown.
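The same interval can be reproduced in Python. This is a minimal sketch of the calculation described in this section, using the approximate multiplier of 2 from the 68–95–99.7 rule; the variable names are illustrative only.

```python
import math

n = 25             # number of rolls
even_rolls = 11    # number of even rolls observed
multiplier = 2     # approximate multiplier for 95%, from the 68-95-99.7 rule

p_hat = even_rolls / n                     # sample proportion, 0.44
se = math.sqrt(p_hat * (1 - p_hat) / n)    # standard error with p unknown

lower = p_hat - multiplier * se
upper = p_hat + multiplier * se

print(f"Approximate 95% CI: {lower:.3f} to {upper:.3f}")
# Approximate 95% CI: 0.241 to 0.639
```

Changing `even_rolls` or `n` shows how the interval moves and narrows for other samples.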

In this case, we know the value of the population parameter: p=0.5.

Usually we do not know the value of the parameter: that’s why we are taking a sample.