How to Analyze Survey Data for Hypothesis Tests

Traditionally, researchers analyze survey data to estimate population parameters. But very similar analytical techniques can also be applied to test hypotheses.

In this lesson, we describe how to analyze survey data to test statistical hypotheses.

The Logic of the Analysis

In a big-picture sense, the analysis of survey sampling data is easy. When you use sample data to test a hypothesis, the analysis includes the same seven steps:

Estimate a population parameter.
Estimate population variance.
Compute standard error.
Set the significance level.
Find the critical value (often a z-score or a t-score).
Define the upper limit of the region of acceptance.
Define the lower limit of the region of acceptance.

It doesn't matter whether the sampling method is simple random sampling, stratified sampling, or cluster sampling. And it doesn't matter whether the parameter of interest is a mean score, a proportion, or a total score. The analysis of survey sampling data always includes the same seven steps.

However, formulas used in the first three steps of the analysis can differ, based on the sampling method and the parameter of interest. In the next section, we'll list the formulas to use for each step. By the end of the lesson, you'll know how to test hypotheses about mean scores, proportions, and total scores using data from simple random samples, stratified samples, and cluster samples.

Data Analysis for Hypothesis Testing

Now, let's look in a little more detail at the seven steps required to conduct a hypothesis test, when you are working with data from a survey sample.

Estimate a population parameter. The first step in the analysis is to estimate the value of the population parameter that appears in the null hypothesis. To accomplish this, we compute a point estimate of the population parameter; that is, we compute a sample statistic. Here are formulas for different scenarios:
- Mean score (simple random sampling): Use this formula to estimate the population mean, using data from a simple random sample:
  Sample mean = x = Σx / n
  
  where x is a sample estimate of the population mean, Σx is the sum of all the sample observations, and n is the number of sample observations.
- Proportion (simple random sampling): A proportion is a special case of the mean. It represents the number of observations that have a particular attribute divided by the total number of observations in the group. To estimate the population proportion (P) from sample data, use this formula for the sample proportion (p):
  
  p = ( Sample observations with attribute ) / ( Total sample size (n) )
- Total score (simple random sampling): If we know the sample mean, we can estimate the population total (t) from the following formula:
  Population total = t = N * x
  
  where N is the number of observations in the population, and x is the sample mean.
  
  Or, if we know the sample proportion, we can estimate the population total (t) as:
  
  Population total = t = N * p
  
  where t is an estimate of the number of elements in the population that have a specified attribute, N is the number of observations in the population, and p is the sample proportion.
- Mean score (stratified sampling): Use this formula to estimate the population mean from a stratified sample:
  Sample mean = x = Σ( N_h / N ) * x_h
  
  where N_h is the number of observations in stratum h of the population, N is the number of observations in the population, and x_h is the mean score from the sample in stratum h.
- Proportion (stratified sampling): Use this formula to estimate the population proportion from a stratified sample:
  Sample proportion = p = Σ( N_h / N ) * p_h
  
  where N_h is the number of observations in stratum h of the population, N is the number of observations in the population, and p_h is the sample proportion in stratum h.
- Total score (stratified sampling): If we know the sample mean in each stratum, we can estimate the population total (t) from the following formula:
  Population total = t = ΣN_h * x_h
  
  where N_h is the number of observations in the population from stratum h, and x_h is the sample mean from stratum h.
  
  Or, if we know the sample proportion in each stratum, we can use this formula to estimate a population total:
  
  Population total = t = ΣN_h * p_h
  
  where t is an estimate of the number of observations in the population that have a specified attribute, N_h is the number of observations from stratum h in the population, and p_h is the sample proportion from stratum h.
- Mean score (cluster sampling): Use this formula to compute the sample mean (x) from a cluster sample:
  x = [ N / ( n * M ) ] * Σ ( M_h * x_h )
  
  where N is the number of clusters in the population, n is the number of clusters in the sample, M is the number of observations in the population, M_h is the number of observations in cluster h, and x_h is the mean score from the sample in cluster h.
- Proportion (cluster sampling): Use this formula to compute the sample proportion (p) from a cluster sample:
  p = [ N / ( n * M ) ] * Σ ( M_h * p_h )
  
  where N is the number of clusters in the population, n is the number of clusters in the sample, M is the number of observations in the population, M_h is the number of observations in cluster h, and p_h is the proportion from the sample in cluster h.
- Total score (cluster sampling): If we know the sample mean in each cluster, we can estimate the population total (t) from the following formula:
  Population total = t = N/n * ΣM_h * x_h
  
  where N is the number of clusters in the population, n is the number of clusters in the sample, M_h is the number of observations in the population from cluster h, and x_h is the sample mean from cluster h.
  
  And, if we know the sample proportion for each cluster, we can estimate a population total:
  
  Population total = t = N/n * ΣM_h * p_h
  
  where t is an estimate of the number of elements in the population that have a specified attribute, N is the number of clusters in the population, n is the number of clusters in the sample, M_h is the number of observations from cluster h in the population, and p_h is the sample proportion from cluster h.
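The point-estimate formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a survey-analysis library: the function names and the tuple layouts, `(N_h, x_h)` for strata and `(M_h, x_h)` for clusters, are assumptions made for this example.

```python
def srs_mean(values):
    """Sample mean from a simple random sample: sum(x) / n."""
    return sum(values) / len(values)


def srs_proportion(has_attribute):
    """Sample proportion: observations with the attribute / sample size."""
    return sum(1 for flag in has_attribute if flag) / len(has_attribute)


def stratified_mean(strata):
    """Stratified estimate of the mean.

    strata: list of (N_h, x_h) pairs, where N_h is the population size of
    stratum h and x_h is the sample mean (or proportion) in stratum h.
    """
    N = sum(N_h for N_h, _ in strata)
    return sum((N_h / N) * x_h for N_h, x_h in strata)


def stratified_total(strata):
    """Stratified estimate of the population total: sum of N_h * x_h."""
    return sum(N_h * x_h for N_h, x_h in strata)


def cluster_total(N, clusters):
    """Cluster estimate of the population total.

    N: number of clusters in the population.
    clusters: list of (M_h, x_h) pairs for the n sampled clusters, where M_h
    is the number of observations in cluster h and x_h is its sample mean.
    """
    n = len(clusters)
    return (N / n) * sum(M_h * x_h for M_h, x_h in clusters)


def cluster_mean(N, M, clusters):
    """Cluster estimate of the mean; M is the population size."""
    n = len(clusters)
    return (N / (n * M)) * sum(M_h * x_h for M_h, x_h in clusters)
```

The same functions work for proportions: pass the sample proportion p_h in place of the sample mean x_h.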

Estimate population variance. The formula(s) to estimate variance will vary, depending on the sampling method and the parameter in the null hypothesis.
- Proportions. If you are testing a hypothesis about a population proportion, use this formula to estimate population variance (s²):
  s² = P * (1 - P)
  
  where s² is an estimate of population variance, and P is the value of the proportion in the null hypothesis.
- Simple random sampling with means or totals. If you use a simple random sample to test a hypothesis about a mean or a total score, use this formula to estimate variance:
  s² = Σ ( x_i - x )² / ( n - 1 )
  
  where s² is a sample estimate of population variance, x is the sample mean, x_i is the ith element from the sample, and n is the number of elements in the sample.
- Stratified sampling. If you use a stratified sample to test a hypothesis about a mean or a total score, you will need to estimate variance within each stratum. Use this formula:
  s²_h = Σ ( x_i_h - x_h )² / ( n_h - 1 )
  
  where s²_h is a sample estimate of population variance in stratum h, x_i_h is the value of the ith element from stratum h, x_h is the sample mean from stratum h, and n_h is the number of sample observations from stratum h.
- Variance within clusters. If you use two-stage cluster sampling to test a hypothesis about a mean or total score, you need to estimate the variance within clusters. Use this formula:
  s²_h = Σ ( x_i_h - x_h )² / ( m_h - 1 )
  
  where s²_h is a sample estimate of population variance in cluster h, x_i_h is the value of the ith element from cluster h, x_h is the sample mean from cluster h, and m_h is the number of observations sampled from cluster h.
- Variance between clusters. If you use cluster sampling to estimate a total score, you need to estimate the variance between clusters. Use this formula:
  s²_b = Σ ( t_h - t/N )² / ( n - 1 )
  
  where s²_b is a sample estimate of the variance between sampled clusters, t_h is the total from cluster h, t is the sample estimate of the population total, N is the number of clusters in the population, and n is the number of clusters in the sample.
  
  You can estimate the population total (t) from the following formula:
  
  Population total = t = N/n * ΣM_h * x_h
  
  where M_h is the number of observations in the population from cluster h, and x_h is the sample mean from cluster h.
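As a rough sketch, the variance formulas above might look like this in Python. The function names are hypothetical. The same `sample_variance` function serves for a simple random sample, for one stratum, or for one cluster, as long as it receives only the observations from that group.

```python
def proportion_variance(P):
    """Variance for a proportion, using the value P from the null hypothesis."""
    return P * (1 - P)


def sample_variance(values):
    """Sample variance: sum of squared deviations divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def between_cluster_variance(cluster_totals, N):
    """Variance between sampled clusters.

    cluster_totals: t_h for each of the n sampled clusters (M_h * x_h).
    N: number of clusters in the population.
    """
    n = len(cluster_totals)
    t = (N / n) * sum(cluster_totals)  # estimated population total
    return sum((t_h - t / N) ** 2 for t_h in cluster_totals) / (n - 1)
```

Note that t / N is simply the average of the sampled cluster totals, so the between-cluster variance measures how far each cluster total falls from that average.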

Compute standard error. The right formula to compute standard error will vary, depending on the sampling method and the parameter under study.

Simple random sampling (mean or proportion). When we estimate a mean or a proportion from a simple random sample, the standard error (SE) of the estimate is:
SE = sqrt [ (1 - n/N) * s² / n ]

where n is the sample size, N is the population size, and s is a sample estimate of the population standard deviation.
Simple random sampling (total score). When we use a mean or a proportion to estimate a population total from a simple random sample, the standard error (SE) of the estimate is:
SE = sqrt [ N² * (1 - n/N) * s² / n ]

where N is the population size, n is the sample size, and s² is a sample estimate of the population variance.
Stratified sampling (mean or proportion). When we estimate a mean or a proportion from a stratified random sample, the standard error (SE) of the estimate is:
SE = (1 / N) * sqrt { Σ [ N²_h * ( 1 - n_h/N_h ) * s²_h / n_h ] }

where n_h is the number of sample observations from stratum h, N_h is the number of elements from stratum h in the population, N is the number of elements in the population, and s²_h is a sample estimate of the population variance in stratum h.
Stratified sampling (total score). When we estimate a total from a stratified random sample, the standard error (SE) of the estimate is:
SE = sqrt { Σ [ N²_h * ( 1 - n_h/N_h ) * s²_h / n_h ] }

where N_h is the number of elements from stratum h in the population, n_h is the number of sample observations from stratum h, and s²_h is a sample estimate of the population variance in stratum h.
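The four standard error formulas for simple random and stratified samples can be written compactly. The sketch below assumes each stratum is described by a hypothetical `(N_h, n_h, s2_h)` tuple; the function names are illustrative only.

```python
import math


def se_srs_mean(s2, n, N):
    """SE of a mean or proportion from a simple random sample."""
    return math.sqrt((1 - n / N) * s2 / n)


def se_srs_total(s2, n, N):
    """SE of a total from a simple random sample."""
    return math.sqrt(N ** 2 * (1 - n / N) * s2 / n)


def se_stratified_total(strata):
    """SE of a total from a stratified sample.

    strata: list of (N_h, n_h, s2_h) tuples, one per stratum.
    """
    return math.sqrt(
        sum(N_h ** 2 * (1 - n_h / N_h) * s2_h / n_h for N_h, n_h, s2_h in strata)
    )


def se_stratified_mean(strata):
    """SE of a mean or proportion from a stratified sample."""
    N = sum(N_h for N_h, _, _ in strata)
    return se_stratified_total(strata) / N
```

In both designs, the standard error of the total is the standard error of the mean multiplied by the population size N.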

Cluster sampling (mean). When we estimate a population mean from a cluster sample, the standard error (SE) of the estimate is:

SE =	( 1 / M ) * sqrt { [ N² * ( 1 - n/N ) / n ] * Σ ( M_h * x_h - t / N )² / ( n - 1 )
	+ ( N / n ) * Σ [ ( 1 - m_h / M_h ) * M²_h * s²_h / m_h ] }

where M is the number of observations in the population, N is the number of clusters in the population, n is the number of clusters in the sample, M_h is the number of elements from cluster h in the population, m_h is the number of elements from cluster h in the sample, x_h is the sample mean from cluster h, s²_h is a sample estimate of the population variance in cluster h, and t is a sample estimate of the population total. For the equation above, use the following formula to estimate the population total:

t = N/n * Σ ( M_h * x_h )

With one-stage cluster sampling, the formula for the standard error reduces to:

SE = ( 1 / M ) * sqrt { [ N² * ( 1 - n/N ) / n ] * Σ ( M_h * x_h - t / N )² / ( n - 1 ) }

Cluster sampling (proportion). When we estimate a population proportion from a cluster sample, the standard error (SE) of the estimate is:

SE = ( 1 / M ) * sqrt { [ N² * ( 1 - n/N ) / n ] * Σ ( M_h * p_h - t / N )² / ( n - 1 )
	+ ( N / n ) * Σ [ ( 1 - m_h / M_h ) * M²_h * p_h * ( 1 - p_h ) / ( m_h - 1 ) ] }

where M is the number of observations in the population, N is the number of clusters in the population, n is the number of clusters in the sample, M_h is the number of elements from cluster h in the population, m_h is the number of elements from cluster h in the sample, p_h is the value of the proportion from cluster h, and t is a sample estimate of the population total. For the equation above, use the following formula to estimate the population total:

t = N/n * Σ ( M_h * p_h )

With one-stage cluster sampling, the formula for the standard error reduces to:

SE = ( 1 / M ) * sqrt { [ N² * ( 1 - n/N ) / n ] * Σ ( M_h * p_h - t / N )² / ( n - 1 ) }

Cluster sampling (total score). When we estimate a population total from a cluster sample, the standard error (SE) of the estimate is:

SE = sqrt { N² * ( 1 - n/N ) * s²_b / n
	+ ( N / n ) * Σ [ ( 1 - m_h/M_h ) * M²_h * s²_h / m_h ] }

where N is the number of clusters in the population, n is the number of clusters in the sample, s²_b is a sample estimate of the variance between clusters, m_h is the number of elements from cluster h in the sample, M_h is the number of elements from cluster h in the population, and s²_h is a sample estimate of the population variance in cluster h.

With one-stage cluster sampling, the formula for the standard error reduces to:

SE = N * sqrt [ ( 1 - n/N ) * s²_b / n ]
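For cluster samples, the bookkeeping is easier to follow in code. The sketch below follows the cluster-sampling formula for a mean score given earlier, and it takes the standard error of the total to be M times the standard error of the mean (because the mean is the total divided by M). The function names and the `(M_h, m_h, x_h, s2_h)` tuple layout are assumptions for this example.

```python
import math


def se_cluster_total(N, clusters):
    """SE of a population total from a two-stage cluster sample.

    N: number of clusters in the population.
    clusters: list of (M_h, m_h, x_h, s2_h) tuples for the n sampled clusters:
    cluster size, number sampled, sample mean, and within-cluster variance.
    """
    n = len(clusters)
    totals = [M_h * x_h for M_h, _, x_h, _ in clusters]
    t = (N / n) * sum(totals)  # estimated population total

    # Variance between the sampled cluster totals.
    s2_b = sum((t_h - t / N) ** 2 for t_h in totals) / (n - 1)

    # Variance within clusters; zero when every element is observed.
    within = sum(
        (1 - m_h / M_h) * M_h ** 2 * s2_h / m_h for M_h, m_h, _, s2_h in clusters
    )
    return math.sqrt(N ** 2 * (1 - n / N) * s2_b / n + (N / n) * within)


def se_cluster_mean(N, M, clusters):
    """SE of a population mean; M is the number of population observations."""
    return se_cluster_total(N, clusters) / M
```

With one-stage cluster sampling, m_h equals M_h in every sampled cluster, so the within-cluster term drops out and only the between-cluster term remains.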

Choose a significance level. The significance level (denoted by α) is the probability of committing a Type I error. Researchers often set the significance level equal to 0.05 or 0.01.
Find the critical value. Often expressed as a t-score or a z-score, the critical value is a factor used to determine upper and lower limits of the region of acceptance.
When the null hypothesis is two-tailed, the critical value is the z-score or t-score that has a cumulative probability equal to 1 - α/2. When the null hypothesis is one-tailed, the critical value has a cumulative probability equal to 1 - α.

Researchers use a t-score when sample size is small; a z-score when it is large (at least 30). You can use the Normal Distribution Calculator to find the critical z-score, and the t Distribution Calculator to find the critical t-score.

If you use a t-score, you will have to find the degrees of freedom (df). With simple random samples, df is often equal to the sample size minus one.

Note: The critical value for a one-tailed hypothesis does not equal the critical value for a two-tailed hypothesis. The critical value for a one-tailed hypothesis is smaller.
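If you prefer code to an online calculator, the critical z-score can be found with Python's standard library. This sketch covers only the z-score case; a critical t-score requires a t-distribution routine that the standard library does not provide. The function name `critical_z` is an assumption for this example.

```python
from statistics import NormalDist


def critical_z(alpha, tails):
    """Critical z-score for a one-tailed (tails=1) or two-tailed (tails=2) test."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    # Cumulative probability is 1 - alpha for one tail, 1 - alpha/2 for two.
    cumulative_probability = 1 - alpha / tails
    return NormalDist().inv_cdf(cumulative_probability)


two_tailed = critical_z(0.05, tails=2)  # approximately 1.96
one_tailed = critical_z(0.05, tails=1)  # approximately 1.645
```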
Find the upper limit (UL) of the region of acceptance. There are two possibilities, depending on the form of the null hypothesis.
- If the null hypothesis is μ < M or if the null hypothesis is μ = M: The upper limit of the region of acceptance will be:
  UL = M + SE * CV
  where M is the parameter value in the null hypothesis, SE is the standard error, and CV is the critical value.
- If the null hypothesis is μ > M: The theoretical upper limit of the region of acceptance is plus infinity, unless the parameter in the null hypothesis is a proportion or a percentage. The upper limit is 1 for a proportion, and 100 for a percentage.
In a similar way, we find the lower limit (LL) of the region of acceptance. There are two possibilities, depending on the form of the null hypothesis.
- If the null hypothesis is μ > M or if the null hypothesis is μ = M: The lower limit of the region of acceptance will be:
  LL = M - SE * CV
  where M is the parameter value in the null hypothesis, SE is the standard error, and CV is the critical value.
- If the null hypothesis is μ < M: The theoretical lower limit of the region of acceptance is minus infinity, unless the test statistic is a proportion or a percentage. The lower limit for a proportion or a percentage is zero.

The region of acceptance is the range of values between LL and UL. If the sample estimate of the population parameter falls outside the region of acceptance, the researcher rejects the null hypothesis. If the sample estimate falls within the region of acceptance, the researcher does not reject the null hypothesis.

By following the steps outlined above, you define the region of acceptance in such a way that the chance of making a Type I error is equal to the significance level.
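The rules for the upper and lower limits can be summarized in one small function. This is a sketch under the lesson's conventions: the `null` argument takes "=", "<", or ">" to match the three forms of the null hypothesis described above, and the function and parameter names are hypothetical. For a proportion, pass `lower_bound=0.0` and `upper_bound=1.0`.

```python
import math


def region_of_acceptance(M, se, cv, null="=",
                         lower_bound=-math.inf, upper_bound=math.inf):
    """Return (LL, UL) for the region of acceptance.

    M: parameter value in the null hypothesis.
    se: standard error; cv: critical value.
    null: "=" (two-tailed), "<" or ">" (one-tailed).
    """
    if null == "=":
        return (M - se * cv, M + se * cv)
    if null == "<":
        # Only large estimates contradict the null hypothesis.
        return (lower_bound, M + se * cv)
    if null == ">":
        # Only small estimates contradict the null hypothesis.
        return (M - se * cv, upper_bound)
    raise ValueError('null must be "=", "<" or ">"')


def reject_null(estimate, region):
    """Reject when the sample estimate falls outside the region of acceptance."""
    lower, upper = region
    return not (lower <= estimate <= upper)
```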

Test Your Understanding

In this section, two hypothesis testing examples illustrate how to define the region of acceptance. The first problem shows a two-tailed test with a mean score; and the second problem, a one-tailed test with a proportion.

Sample Size Calculator

As you probably noticed, defining the region of acceptance can be complex and time-consuming. Stat Trek's Sample Size Calculator can do the same job quickly, easily, and error-free. The calculator is easy to use, and it is free. You can find the Sample Size Calculator in Stat Trek's main menu under the Stat Tools tab.

Problem 1

An inventor has developed a new, energy-efficient lawn mower engine. He claims that the engine will run continuously for 5 hours (300 minutes) on a single ounce of regular gasoline. Suppose a random sample of 50 engines is tested. The engines run for an average of 295 minutes, with a standard deviation of 20 minutes.

Consider the null hypothesis that the mean run time is 300 minutes against the alternative hypothesis that the mean run time is not 300 minutes. Use a 0.05 level of significance. Find the region of acceptance. Based on the region of acceptance, would you reject the null hypothesis?

Solution: The analysis of survey data to test a hypothesis takes seven steps. We work through those steps below:

Estimate a population parameter. For this problem, we are given the sample mean. It is 295 minutes.
However, if we had to compute the sample mean from raw data, we could do it, using the following formula:
Sample mean = x = Σx / n

where Σx is the sum of all the sample observations, and n is the number of sample observations.
Estimate population variance. For this problem, we are given a sample estimate of the standard deviation. It is 20 minutes. Since the variance is the square of the standard deviation, we can estimate that the population variance is 20² or 400.
If we hadn't been given the standard deviation, we could have computed it from the raw sample data, using the following formula:

s² = Σ ( x_i - x )² / ( n - 1 )

where s² is a sample estimate of population variance, x is the sample mean, x_i is the ith element from the sample, and n is the number of elements in the sample.
Compute standard error. The right formula to compute standard error will vary, depending on the sampling method and the parameter under study. The right equation for a mean score from a simple random sample is:
SE = sqrt [ (1 - n/N) * s² / n ]

where n is the sample size, N is the population size, and s is a sample estimate of the population standard deviation.

For this problem, we know that the sample size is 50, and the standard deviation is 20. The population size is not stated explicitly; but, in theory, the manufacturer could produce an infinite number of motors. Therefore, the population size is a very large number. For the purpose of the analysis, we'll assume that the population size is 100,000. Plugging those values into the formula, we find that the standard error is:

SE = sqrt [ (1 - n/N) * s² / n ]

SE = sqrt [ (1 - 50/100,000) * 20² / 50 ]

SE = sqrt(0.9995 * 8) = 2.828
Choose a significance level. The significance level (α) is chosen for us in the problem. It is 0.05. (Researchers often set the significance level equal to 0.05 or 0.01.)

Find the critical value. The critical value is a factor used to determine upper and lower limits of the region of acceptance. When the sample size is large (at least 30), researchers can express the critical value as a t-score or a z-score. Here, the sample size is much larger than 30 (n=50), so we will express the critical value as a z-score.

When the null hypothesis is two-tailed, the critical value has a cumulative probability equal to 1 - α/2. When the null hypothesis is one-tailed, the critical value has a cumulative probability equal to 1 - α.

For this problem, the null hypothesis and the alternative hypothesis can be expressed as:

Null hypothesis	Alternative hypothesis	Number of tails
μ = 300	μ ≠ 300	2

Since this problem deals with a two-tailed hypothesis, the critical value will be the z-score that has a cumulative probability equal to 1 - α/2. Here, the significance level (α) is 0.05, so the critical value will be the z-score that has a cumulative probability equal to 0.975.

We use the Normal Distribution Calculator to find that the z-score with a cumulative probability of 0.975 is 1.96. Thus, the critical value is 1.96.

Find the lower limit of the region of acceptance. The lower limit (LL) of the region of acceptance is:
LL = M - SE * CV

where M is the parameter value in the null hypothesis, SE is the standard error, and CV is the critical value. So, for this problem, we compute the lower limit of the region of acceptance as:

LL = 300 - 2.828 * 1.96

LL = 300 - 5.54

LL = 294.46
Find the upper limit of the region of acceptance. The upper limit (UL) of the region of acceptance is:
UL = M + SE * CV

where M is the parameter value in the null hypothesis, SE is the standard error, and CV is the critical value. So, for this problem, we compute the upper limit of the region of acceptance as:

UL = 300 + 2.828 * 1.96

UL = 300 + 5.54

UL = 305.54

Thus, given a significance level of 0.05, the region of acceptance is the range of values between 294.46 and 305.54. In the tests, the engines ran for an average of 295 minutes. That value is within the region of acceptance, so we cannot reject the null hypothesis that the engines run for 300 minutes on an ounce of fuel.
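As a check on the arithmetic, the steps of Problem 1 can be reproduced in Python. The variable names are illustrative, and the population size of 100,000 is the same assumption made in the solution above.

```python
import math
from statistics import NormalDist

n, N = 50, 100_000          # sample size, assumed population size
sample_mean, s = 295.0, 20.0
M0, alpha = 300.0, 0.05     # null hypothesis value, significance level

se = math.sqrt((1 - n / N) * s ** 2 / n)   # standard error, about 2.828
cv = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical z, about 1.96

ll = M0 - se * cv
ul = M0 + se * cv
reject = not (ll <= sample_mean <= ul)

print(f"Region of acceptance: {ll:.2f} to {ul:.2f}; reject null: {reject}")
```

The sample mean of 295 falls inside the region, so the script reaches the same conclusion as the hand calculation.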

Problem 2

Suppose the CEO of a large software company claims that at least 80 percent of the company's 1,000,000 customers are very satisfied. A survey of 100 randomly sampled customers finds that 73 percent are very satisfied. To test the CEO's hypothesis, find the region of acceptance. Assume a significance level of 0.05.

Solution: The analysis of survey data to test a hypothesis takes seven steps. We work through those steps below:

Estimate a population parameter. For this problem, we are interested in the population proportion; and we are given the sample proportion as an estimate. It is 0.73.
However, if we had to compute the sample proportion (p) from raw data, we could do it by using the following formula:

p = ( Sample observations with attribute ) / ( Total sample size (n) )
Estimate population variance. To compute the population variance when the true population proportion is P, we use the following formula:
s² = P * (1 - P)

where s² is the population variance when the true population proportion is P, and P is the value of the proportion in the null hypothesis.

For the purpose of estimating population variance, we assume the null hypothesis is true. In this problem, the null hypothesis states that the true proportion of satisfied customers is 0.8. Therefore, to estimate population variance, we insert that value in the formula:

s² = 0.8 * (1 - 0.8)

s² = 0.8 * 0.2 = 0.16
Compute standard error. The right formula to compute standard error will vary, depending on the sampling method and the parameter under study. The right equation for a proportion score from a simple random sample is:
SE = sqrt [ (1 - n/N) * s² / n ]

where n is the sample size, N is the population size, and s is a sample estimate of the population standard deviation.

For this problem, we know that the sample size is 100, the variance (s²) is 0.16, and the population size is 1,000,000. Plugging those values into the formula, we find that the standard error is:

SE = sqrt [ (1 - n/N) * s² / n ]

SE = sqrt [ (1 - 100/1,000,000) * 0.16 / 100 ]

SE = sqrt(0.9999 * 0.0016) = 0.04
Choose a significance level. The significance level (α) is chosen for us in the problem. It is 0.05. (Researchers often set the significance level equal to 0.05 or 0.01.)

Find the critical value. The critical value is a factor used to determine upper and lower limits of the region of acceptance. When the sample size is large (at least 30), researchers can express the critical value as a t-score or a z-score. Here, the sample size is much larger than 30 (n=100), so we will express the critical value as a z-score.

For this problem, the null hypothesis and the alternative hypothesis can be expressed as:

Null hypothesis	Alternative hypothesis	Number of tails
P ≥ 0.8	P < 0.8	1

Since this problem deals with a one-tailed hypothesis, the critical value will be the z-score that has a cumulative probability equal to 1 - α. Here, the significance level (α) is 0.05, so the critical value will be the z-score that has a cumulative probability equal to 0.95.

We use the Normal Distribution Calculator to find that the z-score with a cumulative probability of 0.95 is 1.645. Thus, the critical value is 1.645.

Find the lower limit of the region of acceptance. The lower limit (LL) of the region of acceptance is:
LL = M - SE * CV

where M is the parameter value in the null hypothesis, SE is the standard error, and CV is the critical value. So, for this problem, we compute the lower limit of the region of acceptance as:

LL = 0.8 - 0.04 * 1.645

LL = 0.8 - 0.0658 = 0.7342
Find the upper limit of the region of acceptance. For this type of one-tailed hypothesis, the theoretical upper limit of the region of acceptance is 1; since any proportion greater than 0.8 is consistent with the null hypothesis, and 1 is the largest value that a proportion can have.

Thus, given a significance level of 0.05, the region of acceptance is the range of values between 0.7342 and 1.0. In the sample survey, the proportion of satisfied customers was 0.73. That value is outside the region of acceptance, so the null hypothesis must be rejected.
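The one-tailed calculation in Problem 2 can be verified the same way. Again, this is only a sketch with illustrative variable names; it uses the proportion from the null hypothesis to estimate variance, as the solution does.

```python
import math
from statistics import NormalDist

n, N = 100, 1_000_000       # sample size, population size
p = 0.73                    # sample proportion
P0, alpha = 0.8, 0.05       # null hypothesis value, significance level

s2 = P0 * (1 - P0)                         # variance under the null, 0.16
se = math.sqrt((1 - n / N) * s2 / n)       # standard error, about 0.04
cv = NormalDist().inv_cdf(1 - alpha)       # one-tailed critical z, about 1.645

ll = P0 - se * cv   # lower limit of the region of acceptance
ul = 1.0            # a proportion cannot exceed 1
reject = not (ll <= p <= ul)

print(f"Region of acceptance: {ll:.4f} to {ul:.1f}; reject null: {reject}")
```

Because 0.73 is just below the lower limit, the null hypothesis is rejected, but only narrowly; a sample proportion of 0.74 would not have led to rejection.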
