# Sampling Distribution: Difference Between Means

Statistics problems often involve comparisons between sample means from two independent populations. This lesson describes the sampling distribution for the difference between sample means and explains how to compute the standard error for the difference between sample means.

## Sampling Distribution

Consider the following scenario. We have two
populations
with population means equal to μ_{1} and μ_{2}. We take all possible
simple random samples
of size n_{1} from population 1, and all possible simple random samples of size n_{2} from population 2. For each
sample from population 1, we compute the sample mean x_{1}; and for each
sample from population 2, we compute the sample mean x_{2}.
And finally, for every possible pairing of sample 1 with sample 2, we compute the difference between sample means; that is, we
compute x_{1} - x_{2}.

Assume that the samples are independent; that is, observations in one sample are not affected by observations in the other sample. Given that assumption, we know the following about the sampling distribution for the difference between sample means.

- If both sample sizes are sufficiently large, the sampling distribution for the difference between independent sample means will be approximately normally distributed. We know this from the central limit theorem.

  **Note:** If a population distribution is roughly bell-shaped, a sample size of 30 is big enough to justify an assumption of normality. If a population distribution is not bell-shaped, the sample size should be bigger.
- If both populations are approximately normally distributed, the sampling distribution will be described by a t distribution with (n_{1} + n_{2} - 2) degrees of freedom.
- The mean of the sampling distribution (μ_{d}) is the expected value of the difference between all possible sample means. Thus,

  μ_{d} = E(x_{1} - x_{2}) = μ_{1} - μ_{2}

  where x_{i} is the mean of sample *i*, and μ_{i} is the mean of population *i*.

**Note:** If both populations are large and normally distributed, you could use a t distribution or a normal distribution to describe
the sampling distribution. Use a t distribution when the population standard deviation is unknown, and use a normal distribution when
the population standard deviation is known.
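The claim that μ_{d} = μ_{1} - μ_{2} can be checked empirically. The sketch below is only an illustration under stated assumptions: both populations are taken to be normal, the parameter values (means 50 and 45, standard deviations 10 and 8, sample sizes 40 and 30) are hypothetical, and `simulate_mean_differences` is a helper name invented for this example. It draws many independent pairs of samples and averages the resulting differences between sample means.

```python
import random
import statistics


def simulate_mean_differences(mu1, sigma1, n1, mu2, sigma2, n2, trials=5000, seed=0):
    """Return a list of (sample mean 1 - sample mean 2) over repeated independent samples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    diffs = []
    for _ in range(trials):
        # Assumption: both populations are normal; values are illustrative only.
        sample1 = [rng.gauss(mu1, sigma1) for _ in range(n1)]
        sample2 = [rng.gauss(mu2, sigma2) for _ in range(n2)]
        diffs.append(statistics.fmean(sample1) - statistics.fmean(sample2))
    return diffs


diffs = simulate_mean_differences(mu1=50, sigma1=10, n1=40, mu2=45, sigma2=8, n2=30)

# The average of the simulated differences should land close to mu1 - mu2 = 5.
print(round(statistics.fmean(diffs), 2))
```

Because the simulation uses a finite number of trials, the printed average will be near 5 rather than exactly 5.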

## Standard Deviation of Sampling Distribution

The standard deviation of the difference between sample
means (σ_{d}) is approximately equal to:

σ_{d} = sqrt( σ_{1}^{2} / n_{1} + σ_{2}^{2} / n_{2} )

It is straightforward to derive this formula, based on material covered in previous lessons. The derivation starts with a recognition that the variance of the difference between independent random variables is equal to the sum of the individual variances. Thus,

σ_{d}^{2} = σ_{(x1 - x2)}^{2} = σ_{x1}^{2} + σ_{x2}^{2}

If the population sizes N_{1} and N_{2} are both large relative to the sample sizes n_{1} and n_{2}, respectively, then

σ_{x1}^{2} = σ_{1}^{2} / n_{1}

σ_{x2}^{2} = σ_{2}^{2} / n_{2}

σ_{d}^{2} = σ_{1}^{2} / n_{1} + σ_{2}^{2} / n_{2}

σ_{d} = sqrt( σ_{1}^{2} / n_{1} + σ_{2}^{2} / n_{2} )
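As a minimal sketch of the formula for σ_{d}, the function below adds the two variances of the sample means and takes the square root. The function name `sd_of_difference` and the input values are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import math


def sd_of_difference(sigma1, n1, sigma2, n2):
    """Standard deviation of (sample mean 1 - sample mean 2) for independent samples."""
    var_mean1 = sigma1 ** 2 / n1  # variance of the first sample mean
    var_mean2 = sigma2 ** 2 / n2  # variance of the second sample mean
    return math.sqrt(var_mean1 + var_mean2)


# Illustrative values: 16/64 + 9/36 = 0.25 + 0.25 = 0.5
sigma_d = sd_of_difference(sigma1=4, n1=64, sigma2=3, n2=36)
print(round(sigma_d, 4))  # 0.7071
```

Note that the variances are added even though the means are subtracted; this is the step that relies on the samples being independent.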

## Standard Error of Sampling Distribution

Typically, we don't know the values for population standard deviations, σ_{1} and σ_{2}.
And, if we don't know the population standard deviations, we cannot compute the standard deviation of the difference between sample means (σ_{d}).

However, we can estimate the population standard deviation from sample data, as shown below:

*s* = sqrt [ Σ ( x_{i} - x )^{2} / ( n - 1 ) ]

where *s* is the sample standard deviation (i.e., the sample estimate of the population standard deviation), x is
the sample mean, x_{i} is the *i*th element from the sample, and n
is the number of elements in the sample.
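The formula for *s* can be written out step by step. The following is a sketch with made-up data; `sample_sd` is a hypothetical helper name. The last line compares the result with `statistics.stdev` from the Python standard library, which also divides by n - 1.

```python
import math
import statistics


def sample_sd(values):
    """Sample standard deviation: sqrt of summed squared deviations over (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values only
s = sample_sd(data)

# The hand-written version should agree with the standard library.
print(math.isclose(s, statistics.stdev(data)))
```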

Substituting sample estimates of each population standard deviation into the equation for σ_{d}, we get:

SE_{d} = sqrt( s_{1}^{2} / n_{1} + s_{2}^{2} / n_{2} )

In this equation, SE_{d} is the standard error of the difference between sample means; that is, a sample estimate of the standard deviation of the difference between sample means (σ_{d}). Also, s_{1} is the standard deviation of sample 1 (i.e., the sample estimate of σ_{1}),
s_{2} is the standard deviation of sample 2 (i.e., the sample estimate of σ_{2}),
n_{1} is the sample size in sample 1, and n_{2} is the sample size in sample 2.
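To make the substitution concrete, here is a small sketch that computes SE_{d} directly from two samples. The data values and the function name `standard_error_of_difference` are assumptions made for illustration; `statistics.variance` returns the sample variance (the n - 1 version), which plays the role of s^{2}.

```python
import math
import statistics


def standard_error_of_difference(sample1, sample2):
    """Estimate the standard deviation of (mean 1 - mean 2) from sample data."""
    s1_squared = statistics.variance(sample1)  # sample variance, divides by n1 - 1
    s2_squared = statistics.variance(sample2)  # sample variance, divides by n2 - 1
    return math.sqrt(s1_squared / len(sample1) + s2_squared / len(sample2))


# Hypothetical observations from two independent samples.
sample1 = [12, 15, 9, 20, 14, 17]
sample2 = [8, 11, 10, 7, 13]

se_d = standard_error_of_difference(sample1, sample2)
print(round(se_d, 3))
```

With samples this small the estimate is quite noisy; the example is meant only to show how the pieces of the formula map onto data.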

In future lessons, you will see that being able to compute the standard error from sample data is essential for inferential statistics. It will allow us to compute confidence intervals for the difference between means and to test hypotheses about the difference between means.

## Difference Between Means: Sample Problem

In this section, we work through a sample problem to show how to apply the theory presented above. In this example, we will use Stat Trek's Normal Distribution Calculator to compute probabilities.

**Problem 1**

For boys, the average number of absences in the first grade is 15 with a standard deviation of 7; for girls, the average number of absences is 10 with a standard deviation of 6.

In a nationwide survey, suppose 100 boys and 50 girls are
sampled. What is the probability that the male sample
will average *at most* three more days of absences than
the female sample?

(A) 0.025

(B) 0.035

(C) 0.045

(D) 0.055

(E) None of the above

**Solution**

The correct answer is B. The solution involves three or four steps, depending on whether you work directly with raw scores or z-scores. The "raw score" solution appears below:

- Find the mean difference (male absences minus female absences) in the population.

  μ_{d} = μ_{1} - μ_{2} = 15 - 10 = 5
- Find the standard deviation of the difference.

  σ_{d} = sqrt( σ_{1}^{2} / n_{1} + σ_{2}^{2} / n_{2} )

  σ_{d} = sqrt( 7^{2} / 100 + 6^{2} / 50 )

  σ_{d} = sqrt( 49/100 + 36/50 )

  σ_{d} = sqrt( 0.49 + 0.72 ) = sqrt( 1.21 ) = 1.1
- Find the probability. This problem requires us to find the probability that the average number of absences in the boy sample minus the average number of absences in the girl sample is at most 3. To find this probability, we use Stat Trek's Normal Distribution Calculator. Specifically, we enter the following inputs: 3, for the normal random variable; 5, for the mean; and 1.1, for the standard deviation.

We find that the probability of the mean difference (male absences minus female absences) being 3 or less is about 0.035.
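For readers who prefer to check the raw-score result in code rather than with the calculator, the sketch below uses `statistics.NormalDist` from the Python standard library (available in Python 3.8 and later). This is a substitute for the calculator, not the lesson's own method; the variable names are chosen for this example.

```python
import math
from statistics import NormalDist

mu_d = 15 - 10                                    # mean difference (boys minus girls)
sigma_d = math.sqrt(7 ** 2 / 100 + 6 ** 2 / 50)   # standard deviation of the difference

# Cumulative probability that the difference between sample means is 3 or less.
probability = NormalDist(mu=mu_d, sigma=sigma_d).cdf(3)
print(round(probability, 3))  # 0.035
```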

Alternatively, we could have worked with z-scores (which have a mean of 0 and a standard deviation of 1). Here's the z-score solution:

- Find the mean difference (male absences minus female absences) in the population.

  μ_{d} = μ_{1} - μ_{2} = 15 - 10 = 5
- Find the standard deviation of the difference.

  σ_{d} = sqrt( σ_{1}^{2} / n_{1} + σ_{2}^{2} / n_{2} )

  σ_{d} = sqrt( 7^{2} / 100 + 6^{2} / 50 ) = sqrt( 49/100 + 36/50 )

  σ_{d} = sqrt( 0.49 + 0.72 ) = sqrt( 1.21 ) = 1.1
- Find the z-score that is produced when boys have three more days of absences than girls. When boys have three more days of absences, the number of male absences minus female absences is three. And the associated z-score is

  z = (x - μ)/σ = (3 - 5)/1.1 = -2/1.1 = -1.818
- Find the probability. To find this probability, we use Stat Trek's Normal Distribution Calculator. Specifically, we enter the following inputs: -1.818, for the z-score; 0, for the mean; and 1, for the standard deviation.

We find that the probability of a z-score being -1.818 or less is about 0.035. Of course, the result is the same, whether you work with raw scores or with z-scores.
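The z-score route can be sketched the same way. Here the difference of 3 is first standardized, and the probability is then read from the standard normal distribution (mean 0, standard deviation 1) via `statistics.NormalDist`; as before, this stands in for the calculator and the variable names are illustrative.

```python
from statistics import NormalDist

mu_d = 5       # mean difference from step 1
sigma_d = 1.1  # standard deviation of the difference from step 2

z = (3 - mu_d) / sigma_d            # standardize the observed difference of 3
probability = NormalDist().cdf(z)   # NormalDist() defaults to mean 0, sd 1

print(round(z, 3))            # -1.818
print(round(probability, 3))  # 0.035
```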

**Note:** Some analysts might have used the t distribution to compute probabilities
for this problem. We used the normal distribution because the population standard deviations were known
and the sample sizes were large. If the population standard deviations had been unknown, we would have
used the t distribution. In a previous lesson, we presented
guidelines for choosing between the normal distribution and the t distribution.