Confidence Intervals

When we want to describe the distribution of a large population, it is often impractical or impossible to measure every member or item of the population. Therefore, a random sample is taken to obtain information about the population.

The sample can be described in terms of the sample mean and sample standard deviation. If the sample is not Normally distributed, further descriptive statistics can be used to describe it. The sample statistics are then used to make inferences about the whole population.

The sample mean is an unbiased estimator of the population mean. However, every time you take a different sample from the population you will get a different mean. For sufficiently large samples, the distribution of these sample means is approximately Normal (central limit theorem), even if the population itself is not Normally distributed.

The distribution of the sample mean is centred on the population mean, which we estimate by the sample mean. Its dispersion is given by the standard error of the mean (SEM), which is estimated as the sample standard deviation divided by the square root of the sample size (n):
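The central limit theorem can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings: the exponential population, the sample size of 50, the number of repetitions and the seed are all arbitrary choices, not part of the example in this text.

```r
# Hypothetical illustration: a strongly skewed population (exponential, mean 1, SD 1).
set.seed(1)   # arbitrary seed, only for reproducibility
n <- 50       # size of each sample (assumed value)

# Draw 10,000 samples of size n and keep the mean of each one.
sample_means <- replicate(10000, mean(rexp(n, rate = 1)))

# The means cluster around the population mean (1),
# with a spread close to 1 / sqrt(n), i.e. about 0.14.
mean(sample_means)
sd(sample_means)
```

Plotting `hist(sample_means)` should show a roughly bell-shaped histogram, even though the individual observations are far from Normal.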

SEM=\frac{SD(sample)}{\sqrt{n}}

Therefore, the bigger the sample, the smaller the dispersion of the distribution of the mean.
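The effect of the sample size on the SEM can be sketched in R. The helper name `sem` and the sample sizes below are assumptions made for illustration; the standard deviation is held fixed at 0.95 so that only n changes.

```r
# Standard error of the mean for a numeric vector x (hypothetical helper).
sem <- function(x) sd(x) / sqrt(length(x))

# With the standard deviation held fixed, the SEM shrinks as 1 / sqrt(n):
# quadrupling the sample size halves the SEM.
sd_fixed <- 0.95
n <- c(25, 100, 400)
sem_by_n <- sd_fixed / sqrt(n)
sem_by_n   # 0.1900 0.0950 0.0475
```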

Confidence intervals are constructed as the mean plus or minus a multiple of the standard error of the mean. The approximate coverage for each multiple is:

Mean ± 1 × SEM: about 68 %

Mean ± 2 × SEM: about 95 % (more accurately 1.96 × SEM)

Mean ± 3 × SEM: about 99.7 % (a 99 % interval uses 2.58 × SEM)
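These multipliers and coverages can be checked against the standard Normal distribution with the base R functions `pnorm` and `qnorm`. The variable names below are only illustrative.

```r
# Coverage of mean +/- k * SEM under a Normal distribution.
k <- c(1, 2, 3)
coverage <- pnorm(k) - pnorm(-k)
round(coverage, 4)   # 0.6827 0.9545 0.9973

# Multipliers that give exactly 90 %, 95 % and 99 % coverage.
level <- c(0.90, 0.95, 0.99)
z <- qnorm(1 - (1 - level) / 2)
round(z, 3)          # 1.645 1.960 2.576
```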

For example, the table below shows a sample of 100 men’s heights taken at random from the population:

Height (cm)    Number of People
168            5
169            25
170            40
171            25
172            5
Total          100
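If only the table is available, the 100 individual observations can be rebuilt from the counts. This is a minimal sketch that assumes the table is the complete data set; the names `height_values` and `counts` are introduced here for illustration.

```r
# Rebuild the 100 observations from the frequency table (heights in cm).
height_values <- 168:172
counts <- c(5, 25, 40, 25, 5)
heights <- data.frame(height = rep(height_values, times = counts))

nrow(heights)           # 100
table(heights$height)   # reproduces the counts in the table
```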

To calculate the sample mean of the variable height:

Mean(Height)=\frac{5\times168+25\times169+40\times170+25\times171+5\times172}{100}=170

To calculate the sample standard deviation of the variable height:

The sum of the squares about the mean is:

\sum_{i=1}^{n} (Height(i)-Mean(Height))^2 = 5\times(168-170)^2+25\times(169-170)^2+40\times(170-170)^2+25\times(171-170)^2+5\times(172-170)^2 = 5\times(-2)^2+25\times(-1)^2+40\times(0)^2+25\times(1)^2+5\times(2)^2 = 5\times4+25\times1+40\times0+25\times1+5\times4 = 20+25+0+25+20 = 90

So the sample variance is:

Variance=\frac{1}{100-1}\times90\approx 0.91

And the sample standard deviation is:

StandardDeviation=\sqrt{0.91}\approx 0.95
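The hand calculation above can be reproduced directly from the frequency table. The sketch below assumes the table values and uses illustrative variable names; it follows the same steps (weighted mean, sum of squares about the mean, division by n - 1, square root).

```r
# Frequency table: heights in cm and the number of men at each height.
height_values <- 168:172
counts <- c(5, 25, 40, 25, 5)

n <- sum(counts)                                      # 100
mean_height <- sum(counts * height_values) / n        # 170
sum_sq <- sum(counts * (height_values - mean_height)^2)   # 90
variance <- sum_sq / (n - 1)                          # about 0.91
std_dev <- sqrt(variance)                             # about 0.95
```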

Using the central limit theorem, the sampling distribution of the mean of the variable height has:

Mean (the estimate of the population mean): 170 cm

Standard Error of the Mean: SEM=\frac{0.95}{\sqrt{100}}=0.095

To calculate the 95% confidence interval of the mean:

Mean ± 1.96 × SEM = 170 ± 1.96 × 0.095 = 170 ± 0.19

The 95% confidence interval is therefore (169.81, 170.19).

To calculate in R:

The data are stored in exampleheights.rda. The data frame is called heights and the variable is called height.

Sample mean:

mean(heights$height)
[1] 170

Sample standard deviation:

sd(heights$height)
[1] 0.9534626

Estimate of the population mean:

The sample mean is used as the estimate: 170 cm.

Standard Error of the mean:

sd(heights$height)/sqrt(100)
[1] 0.09534626

The half-width of the 95% confidence interval is 1.96 times the SEM:

1.96*sd(heights$height)/sqrt(100)
[1] 0.1868787

So, the 95% confidence interval is:

mean(heights$height)-1.96*sd(heights$height)/sqrt(100)
[1] 169.8131
mean(heights$height)+1.96*sd(heights$height)/sqrt(100)
[1] 170.1869

Or (169.81, 170.19)

Alternatively, using a one-sample t-test:

one.sample.test(variables=d(height), data=heights, test=t.test, alternative="two.sided")
                One Sample t-test
       mean of x 95% CI Lower 95% CI Upper        t df       p-value
height       170     169.8108     170.1892 1782.975 99 6.651567e-225
  HA: two.sided
  H0:  mean = 0