Confidence Intervals

When we want to describe the distribution of a large population, it is usually impractical or impossible to measure every member of the population. Instead, a random sample is taken to obtain information about the population.

The sample can be described by the sample mean and the sample standard deviation. If the sample is not Normally distributed, further descriptive statistics (for example, the median and interquartile range) can describe it. The sample statistics are then used to make inferences about the whole population.

The sample mean is an unbiased estimator of the population mean. However, every time you take a different sample from the population you will get a different mean. For sufficiently large samples, the distribution of these sample means is approximately Normal (central limit theorem), even if the population itself is not Normally distributed!
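
The central limit theorem can be checked with a small simulation. The sketch below is only an illustration: the exponential population, the sample size of 50, the number of repeated samples and the seed are all assumptions chosen for the demonstration, not values from the text.

```r
# Simulate the sampling distribution of the mean for a skewed population.
set.seed(1)            # arbitrary seed, used only for reproducibility
n_per_sample <- 50     # illustrative sample size
n_samples    <- 5000   # illustrative number of repeated samples

# Each sample comes from an exponential distribution (mean 1, SD 1),
# which is strongly right-skewed and clearly not Normal.
sample_means <- replicate(n_samples, mean(rexp(n_per_sample, rate = 1)))

mean(sample_means)     # close to the population mean, 1
sd(sample_means)       # close to 1 / sqrt(50), about 0.141
```

Calling hist(sample_means) afterwards should show a roughly bell-shaped histogram, even though the individual observations are skewed.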

The sampling distribution of the mean is approximately Normal and centred on the population mean, which we estimate with the sample mean. Its dispersion (spread) is given by the standard error of the mean (SEM), which is estimated as the sample standard deviation divided by the square root of the sample size (n):

\(SEM = \frac{SD(sample)}{\sqrt{n}} \)

Therefore, the larger the sample, the smaller the dispersion (spread) of the sampling distribution of the mean.
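
As a quick illustration of the formula, the sketch below evaluates the SEM for several sample sizes. The standard deviation of 0.95 and the sample sizes are illustrative assumptions.

```r
# SEM for a fixed sample standard deviation and increasing sample sizes.
sample_sd <- 0.95                  # illustrative value
n   <- c(25, 100, 400, 1600)       # each sample size is 4 times the previous
sem <- sample_sd / sqrt(n)

data.frame(n = n, sem = sem)       # sem: 0.19, 0.095, 0.0475, 0.02375
```

Because n enters through a square root, quadrupling the sample size only halves the SEM.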

Confidence intervals are constructed as the sample mean plus or minus a multiple of the standard error of the mean:

Mean ± 1 × SEM ≈ 68 %

Mean ± 2 × SEM ≈ 95 % (more accurately 1.96 × SEM)

Mean ± 3 × SEM ≈ 99.7 % (a 99 % interval uses 2.58 × SEM)
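
The multipliers come from the standard Normal distribution. The following sketch, using the base R functions pnorm and qnorm, shows the coverage of 1, 2 and 3 standard errors and the exact multipliers for some common confidence levels (the choice of levels is only an example).

```r
# Coverage of mean ± k × SEM under a Normal sampling distribution.
k <- c(1, 2, 3)
coverage <- pnorm(k) - pnorm(-k)
round(coverage, 4)                 # 0.6827 0.9545 0.9973

# Exact two-sided multipliers for common confidence levels.
level <- c(0.90, 0.95, 0.99)
multiplier <- qnorm(1 - (1 - level) / 2)
round(multiplier, 3)               # 1.645 1.960 2.576
```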

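Please note that the confidence level is NOT the probability that the population mean lies in one particular interval! The true value of the population mean is unknown but fixed, so it lies either within a given confidence interval (probability = 1) or outside it (probability = 0). The 95 % refers to the procedure: if we took many samples and built an interval from each, about 95 % of those intervals would contain the population mean. The width of the interval reflects the precision of the estimate and depends on the sample size: the larger the sample size, the narrower the confidence interval.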
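
The repeated-sampling meaning of a 95 % confidence interval can be illustrated by simulation. In the sketch below the population (Normal with mean 170 and SD 1), the sample size, the number of repetitions and the seed are assumptions made up for the demonstration.

```r
# Proportion of 95 % intervals that contain the true mean
# when the sampling is repeated many times.
set.seed(42)           # arbitrary seed, used only for reproducibility
true_mean <- 170       # assumed population mean
true_sd   <- 1         # assumed population standard deviation
n         <- 100       # size of each sample
n_rep     <- 2000      # number of repeated samples

covered <- replicate(n_rep, {
  x <- rnorm(n, mean = true_mean, sd = true_sd)
  half_width <- 1.96 * sd(x) / sqrt(n)
  abs(mean(x) - true_mean) <= half_width   # TRUE if the interval covers the truth
})

mean(covered)          # close to 0.95
```

Each individual interval either contains 170 or it does not; only the long-run proportion is close to 95 %.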

For example, the table below shows a sample of 100 men’s heights taken at random from the population:

Height (cm)	Number of People
168	5
169	25
170	40
171	25
172	5
Total	100

To calculate the sample mean of the variable height:

\(Mean(sample)= \frac{5 \cdot 168 + 25 \cdot 169 + 40 \cdot 170 +25 \cdot 171 + 5 \cdot 172}{100} = 170 \)

To calculate the sample standard deviation of the variable height:

The sum of the squares about the mean is:

\( SumSquares = \sum_{i=1}^{n} (Height(i) - Mean(Height))^2 \)

\(SumSquares = 5 \cdot (168-170)^2 + 25 \cdot (169-170)^2 + \) \(40 \cdot (170-170)^2 + \) \( 25 \cdot (171-170)^2 + 5 \cdot (172-170)^2 \)

\(SumSquares = 5 \cdot (-2)^2 + 25 \cdot (-1)^2 + \) \(40 \cdot (0)^2 + \) \( 25 \cdot (1)^2 + 5 \cdot (2)^2 \)

\(SumSquares = 5 \cdot 4 + 25 \cdot 1 + \) \(40 \cdot 0+ \) \( 25 \cdot 1+ 5 \cdot 4 \)

\(SumSquares =20 + 25 + 25 + 20 = 90 \)

So the sample variance is:

\(Variance(sample) = \frac{1}{100-1} \cdot 90 \approx 0.91 \)

And the sample standard deviation is:

\(SD(sample) = \sqrt{0.91} \approx 0.95 \)
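
The hand calculation above can be reproduced directly from the frequency table. This is a minimal sketch; the variable names height, count and so on are chosen here for illustration.

```r
# Sample mean, variance and standard deviation from the frequency table.
height <- c(168, 169, 170, 171, 172)   # heights in cm
count  <- c(5, 25, 40, 25, 5)          # number of people at each height

n           <- sum(count)                              # 100
sample_mean <- sum(count * height) / n                 # 170
sum_squares <- sum(count * (height - sample_mean)^2)   # 90
sample_var  <- sum_squares / (n - 1)                   # about 0.909
sample_sd   <- sqrt(sample_var)                        # about 0.953
```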

Using the central limit theorem, the sampling distribution of the mean of the variable height has:

Estimated population mean: 170 cm

Standard Error of the Mean:

\(SEM = \frac{0.95}{\sqrt{100}} = 0.095 \)

To calculate the 95% confidence interval of the mean:

Mean ± 1.96 × SEM = 170 ± 0.19

The 95% confidence interval therefore is:

(169.81, 170.19).

To calculate in R:

The data are stored in exampleheights.rda. The data frame is called heights and the variable height.
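
The file exampleheights.rda is not reproduced here. As an assumption, a data frame with the same name and variable can be rebuilt from the frequency table above, so that the commands in this section can be tried without the file:

```r
# Rebuild the example data from the frequency table (one row per person).
heights <- data.frame(
  height = rep(c(168, 169, 170, 171, 172), times = c(5, 25, 40, 25, 5))
)

nrow(heights)            # 100
table(heights$height)    # counts per height: 5, 25, 40, 25, 5
```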

Sample mean:

mean(heights$height)
[1] 170

Sample standard deviation:

sd(heights$height)
[1] 0.9534626

Estimate of the population mean:

The sample mean: 170 cm.

Standard Error of the mean:

sd(heights$height)/sqrt(nrow(heights))
[1] 0.09534626

The half-width (margin of error) of the 95% confidence interval is 1.96 times the SEM:

1.96*sd(heights$height)/sqrt(nrow(heights))
[1] 0.1868787

So, the 95% confidence interval is:

mean(heights$height)-1.96*sd(heights$height)/sqrt(nrow(heights))
[1] 169.8131
mean(heights$height)+1.96*sd(heights$height)/sqrt(nrow(heights))
[1] 170.1869

Or (169.81, 170.19)

Alternatively, perform a one-sample t-test, which also reports a 95% confidence interval:

t.test(heights$height)

	One Sample t-test

data:  heights$height
t = 1783, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 169.8108 170.1892
sample estimates:
mean of x 
      170
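
The t-test interval (169.8108, 170.1892) is slightly wider than the one obtained with 1.96 because t.test takes its multiplier from the t distribution with n - 1 = 99 degrees of freedom rather than from the Normal distribution. The sketch below reproduces the interval from the summary statistics; the variable names are chosen here for illustration.

```r
# Reproduce the t-based 95 % confidence interval from summary statistics.
n           <- 100
sample_mean <- 170
sample_sd   <- sqrt(90 / 99)            # about 0.9535, as computed earlier
sem         <- sample_sd / sqrt(n)

t_mult <- qt(0.975, df = n - 1)         # about 1.984, compared with 1.96
ci     <- sample_mean + c(-1, 1) * t_mult * sem
round(ci, 4)                            # 169.8108 170.1892
```

For a sample of this size the two multipliers are close, so both approaches give nearly the same interval; the difference grows as the sample gets smaller.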