Using a Simple Alternative Hypothesis to Increase Statistical Power in Sparse Categorical Data

There are numerous statistical hypothesis tests for categorical data, including Pearson's Chi-Square goodness-of-fit test and discrete versions of other goodness-of-fit tests. For these hypothesis tests, the null hypothesis is simple, and the alternative hypothesis is the composite negation of the simple null hypothesis. For a power calculation, a researcher specifies a significance level, a sample size, a simple null hypothesis, and a simple alternative hypothesis. In practice, an experienced researcher with deep and broad scientific knowledge may still suffer from a lack of statistical power because only a small sample size is available. In such a case, we may formulate the hypothesis test with a simple alternative hypothesis instead of the composite alternative hypothesis. In this article, we investigate how much statistical power can be gained via a correctly specified simple alternative hypothesis and how much statistical power can be lost under a misspecified alternative hypothesis, particularly when the available sample size is small.


Introduction
A researcher formulates a hypothesis based on scientific knowledge and then presents data to support the hypothesis. In this article, we focus on categorical data with three or more levels. In an expensive experiment or an observational study with a small sample size, despite the researcher's deep and broad knowledge, the researcher may fail to provide empirical evidence due to a lack of statistical power. In such a case, many researchers may wish to increase statistical power without increasing the sample size.
There are numerous methods of hypothesis testing for categorical data. Some tests are based on statistics whose null sampling distributions follow Chi-Square distributions (Pearson, 1900; Wilks, 1935; Neyman, 1949; Kullback, 1959), and some tests are based on discrete versions of goodness-of-fit statistics (Cramer, 1928; Kolmogorov, 1933; Smirnov, 1939; Anderson & Darling, 1952). These methods are implemented in statistical computing tools and widely used in practice.
We can perform a power analysis by specifying a sample size, a significance level, a simple null hypothesis (denoted by H_0), and a simple alternative hypothesis (denoted by H_1). The power analysis can be done analytically (often based on asymptotic theory) or numerically (by simulation). The general operating characteristic is that statistical power increases with a larger sample size, a larger significance level, and a larger degree of discrepancy between the simple H_0 and the simple H_1 (Cohen, 1988). Ampadu (2008) and Steel et al. (2009) compared various hypothesis tests for categorical data, and their results showed that the best test (in terms of statistical power) depends on H_1 when H_1 is true. For example, Pearson's goodness-of-fit (GOF) test is outperformed by other tests when H_1 follows a monotonic trend, but it is competitive with the other tests when H_1 follows a triangular shape (Ampadu, 2008; Steel et al., 2009). Suppose a researcher can afford only a small sample size. When there are multiple hypothesis tests under consideration, it is reasonable to choose the most powerful test under a specified H_1. The objective of our study is to compare statistical power when H_1 is simple and when H_1 is composite. For large sample sizes, we provide examples of power calculation and sample size calculation based on asymptotic theory. For small sample sizes (n ≤ 50), we use simulation to study how much statistical power can be gained via a correctly specified H_1 and how much statistical power can be lost under a misspecified H_1 relative to the tests based on a composite H_1.

Method
We define the following notation. Let K denote the number of levels in the categorical data. Let π_j denote the probability of observing the j-th level, where Σ_{j=1}^K π_j = 1. Let H_0 denote the null hypothesis, H_0: π_j = p_{0j} for j = 1, ..., K. Let H_1 denote a simple alternative hypothesis, H_1: π_j = p_{1j} for j = 1, ..., K. The sample size is denoted by n, and the significance level is denoted by α. Let O_j denote the random variable which counts the number of cases in the j-th level for j = 1, ..., K. Pearson's Chi-Square GOF statistic is

X² = Σ_{j=1}^K (O_j − E_j)² / E_j,

where E_j = n p_{0j} for j = 1, ..., K. The asymptotic null distribution is the Chi-Square distribution with K − 1 degrees of freedom. Given simple H_0: π_j = p_{0j} and simple H_1: π_j = p_{1j} for j = 1, ..., K, Cohen (1988) defined the effect size as

w = √( Σ_{j=1}^K (p_{1j} − p_{0j})² / p_{0j} ).

If H_1 is true, the asymptotic distribution of the test statistic is the non-central Chi-Square distribution with K − 1 degrees of freedom and non-centrality parameter λ = n w² (Ferguson, 1996).
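The asymptotic power calculation described above can be sketched in a few lines of Python. This is a minimal illustration (the helper name gof_power is ours), assuming scipy is available:

```python
import numpy as np
from scipy.stats import chi2, ncx2

def gof_power(p0, p1, n, alpha=0.05):
    """Asymptotic power of Pearson's Chi-Square GOF test.

    Uses Cohen's effect size w^2 = sum((p1 - p0)^2 / p0) and the
    non-central Chi-Square distribution with nc = n * w^2.
    """
    p0, p1 = np.asarray(p0, float), np.asarray(p1, float)
    K = len(p0)
    w2 = np.sum((p1 - p0) ** 2 / p0)
    crit = chi2.ppf(1 - alpha, df=K - 1)            # rejection threshold under H0
    return 1 - ncx2.cdf(crit, df=K - 1, nc=n * w2)  # P(reject | H1 true)

# A uniform H0 over K = 7 levels versus an unequal H1
p0 = np.full(7, 1 / 7)
p1 = np.array([.15, .10, .10, .10, .15, .20, .20])
print(gof_power(p0, p1, n=100))
```

When p1 equals p0 the non-centrality parameter is zero and the power reduces to the significance level α, as expected.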

Numeric Transformation
In this section, we discuss an alternative perspective on the standardized test statistic Z_{0,Λ} of Equation (2). The standardized statistic can be viewed as a test statistic for H_0: μ = μ_0, where μ_0 = Σ_{j=1}^K c_j p_{0j}, by transforming the j-th categorical value to the numeric value c_j = 2 ln(p_{1j}/p_{0j}). From this perspective, under H_0, the sample mean of the transformed values is approximately normal with mean μ_0 for large n. Let Φ denote the CDF of N(0, 1), and let z_t be the percentile such that Φ(z_t) = t. Assume μ = μ_1 is true. When n is large, in either case μ_1 > μ_0 or μ_1 < μ_0, the statistical power is approximately

Φ( (√n |μ_1 − μ_0| − z_{1−α} σ_0) / σ_1 ).

The proof of the proposition is provided in the appendix (Section 6). The proposition has three implications. First, the statistical power depends on the distance between the null value μ_0 and the alternative value μ_1 when μ = μ_1. (In Section 4, using simulation, we show that the statistical power is maintained closely even when the true value of μ is not exactly equal to μ_1.) Second, the statistical power also depends on the standard deviation σ_1, and we shall prefer a smaller σ_1. Third, assuming ⃗π = ⃗p_1 is true, consider replacing c_j by another real number x_j for j = 1, ..., K. Then, for given ⃗p_0 and ⃗p_1, the means (μ_0 and μ_1) and the standard deviations (σ_0 and σ_1) depend on the choice of (x_1, ..., x_K), so we write the approximate power as Φ(h(x_1, ..., x_K)). Let x*_1, ..., x*_K be the values which maximize h; the test based on the optimized scores must be more powerful than Z_{0,Λ} asymptotically (at least not less powerful) at significance level α.
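A minimal sketch of this computation, assuming the transformed scores are c_j = 2 ln(p_{1j}/p_{0j}) and the approximate power is Φ((√n|μ_1 − μ_0| − z_{1−α} σ_0)/σ_1):

```python
import numpy as np
from scipy.stats import norm

p0 = np.full(7, 1 / 7)                              # H0: uniform over K = 7 levels
p1 = np.array([.15, .10, .10, .10, .15, .20, .20])  # simple H1
n, alpha = 100, 0.05

c = 2 * np.log(p1 / p0)           # numeric score assigned to each categorical level
mu0, mu1 = c @ p0, c @ p1         # means of the score under H0 and H1
sd0 = np.sqrt(p0 @ (c - mu0) ** 2)
sd1 = np.sqrt(p1 @ (c - mu1) ** 2)

# Approximate power of the one-sided test of H0: mu = mu0
power = norm.cdf((np.sqrt(n) * abs(mu1 - mu0) - norm.ppf(1 - alpha) * sd0) / sd1)
print(mu0, mu1, power)
```

With these scores, μ_1 − μ_0 = 2 Σ_j (p_{1j} − p_{0j}) ln(p_{1j}/p_{0j}) is always non-negative, so the one-sided direction is determined by ⃗p_0 and ⃗p_1.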
Example 3. Continuing from Example 2, under the same ⃗p_0 = (1/7, ..., 1/7)^T, ⃗p_1 = (.15, .1, .1, .1, .15, .2, .2)^T, α = .05 and n = 100, our goal is to find a set of maximizers for the objective function h(x_1, ..., x_K) in Equation (5). There is no closed-form solution, and we can use a numerical method (e.g., the optim function in R). We find x*_1 = 4.416653, x*_2 = x*_3 = x*_4 = 7.474607, and x*_6 = x*_7 = 2.164017 to be a set of maximizers. The transformation from categorical data to numeric data (e.g., Monday → 4.416653) leads to μ_0 = 5.083594, σ_0 = 2.238887, μ_1 = 4.432985, and σ_1 = 2.198819. Therefore, we can formulate the hypothesis test as H_0: μ = 5.083594 versus H_1: μ < 5.083594 according to the numeric transformation. Figure 1 presents the statistical power of the Chi-Square GOF, the log-likelihood ratio, and the numeric transformation methods using the asymptotic calculations. There is nearly no difference between the log-likelihood ratio and the numeric transformation, while both methods yield significantly greater power than the Chi-Square GOF when the specified H_1: ⃗π = ⃗p_1 is true. From a practical perspective, our interest should be in the case when the specified H_1 is not true and when n is small. This is investigated numerically in the following section.
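A numerical maximization analogous to the R optim call can be sketched in Python. Here we take the objective h to be the asymptotic-power argument (√n|μ_1 − μ_0| − z_{1−α} σ_0)/σ_1, which is our reading of Equation (5); because h is invariant to shifting and rescaling the scores, the maximizer found may differ from the values reported above while attaining the same power:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

p0 = np.full(7, 1 / 7)
p1 = np.array([.15, .10, .10, .10, .15, .20, .20])
n, alpha = 100, 0.05

def h(x):
    # Asymptotic-power criterion for candidate scores x (to be maximized)
    mu0, mu1 = x @ p0, x @ p1
    sd0 = np.sqrt(p0 @ (x - mu0) ** 2)
    sd1 = np.sqrt(p1 @ (x - mu1) ** 2)
    return (np.sqrt(n) * abs(mu1 - mu0) - norm.ppf(1 - alpha) * sd0) / sd1

x0 = 2 * np.log(p1 / p0)          # start from the log-ratio scores
res = minimize(lambda x: -h(x), x0, method="Nelder-Mead",
               options={"maxiter": 5000, "xatol": 1e-8, "fatol": 1e-8})
x_star = res.x
print(x_star, h(x_star))
```

Since the optimizer starts at the log-ratio scores, the optimized h can only improve on the numeric transformation of Section 3, mirroring the asymptotic claim above.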

Simulation
We observed a remarkable increase in statistical power from the use of a simple alternative hypothesis, and the log-likelihood ratio and the numeric transformation yielded nearly the same statistical power according to the asymptotic calculations. For practical purposes, we considered scenarios with a discrepancy between the alternative ⃗p_1 and the true ⃗π, particularly in small samples. We simulated data and approximated statistical power to investigate the impact of a wrongly assumed H_1 in the numeric transformation (NT) and the simple-versus-simple log-likelihood ratio test (S-LR) for n ≤ 50. We also compared NT and S-LR to other hypothesis tests studied in Ampadu (2008) and Steel et al. (2009), including the Chi-Square GOF (χ² GOF), discrete Kolmogorov-Smirnov (DKS), log-likelihood ratio (LR), Freeman-Tukey (FT), power divergence (PD), discrete Cramer-von Mises (DCM), and discrete Anderson-Darling (DAD) tests. The test statistics are given in the respective references, where E_j = n p_{0j} is the expected count under H_0.
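A Monte Carlo power approximation of this kind can be sketched as follows for the S-LR statistic (a minimal illustration of the approach, not the authors' exact protocol): simulate the null sampling distribution under H_0, take the empirical critical value, and estimate the rejection rate when data are generated from the true ⃗π:

```python
import numpy as np

rng = np.random.default_rng(1)

def s_lr_stat(counts, p0, p1):
    # Simple-versus-simple log-likelihood ratio; large values favor H1
    return 2 * np.sum(counts * np.log(p1 / p0))

def mc_power(p0, p1, pi_true, n, alpha=0.05, B=20000):
    # Null sampling distribution (respects the discreteness at small n)
    null = np.array([s_lr_stat(rng.multinomial(n, p0), p0, p1) for _ in range(B)])
    crit = np.quantile(null, 1 - alpha)
    # Rejection rate when data are generated from the true pi
    alt = np.array([s_lr_stat(rng.multinomial(n, pi_true), p0, p1) for _ in range(B)])
    return np.mean(alt > crit)

p0 = np.full(7, 1 / 7)
p1 = np.array([.15, .10, .10, .10, .15, .20, .20])
print(mc_power(p0, p1, pi_true=p1, n=50))
```

Misspecification is studied by letting pi_true differ from p1; an opposite-trend pi_true drives the estimated power toward zero.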

Simulation Result
Table 2 presents the simulation results, and it addresses three key points. First, we could increase statistical power with the S-LR or the NT even when the specified H_1 was not exactly equal to the truth (Cases 1 to 3 under all Scenarios A to D). The increase in statistical power was sometimes more than double when compared to the other seven tests (χ² GOF, DKS, LR, FT, PD, DCM and DAD). Second, when the specified H_1 was in the opposite trend from the truth, statistical power was close to zero (Case 4 under all Scenarios A to D). Third, NT and S-LR showed similar statistical power in many cases (with a difference less than .05), but they showed significantly different statistical power in some cases (e.g., Cases 1 to 3 under Scenario D with n = 20). The discreteness of the test statistics in small samples could play a role in such a remarkable difference. We could not generalize the outperformance of NT over S-LR in Scenario D because we have not exhausted all simple alternative hypotheses.

Multiple-Choice Questions
An exam writer often designs multiple-choice questions to reduce the burden of grading. For a four-choice question (one correct answer and three distractors), let π_A, π_B, π_C and π_D denote the probabilities that each letter (A, B, C, and D) is the correct answer. Let H_0: π_A = π_B = π_C = π_D = .25, which is the ideal distribution. A common conception among students is that "C" is the most common answer to four-choice questions. Based on this common conception, let H_1: π_A = .1, π_B = .25, π_C = .4, π_D = .25. We analyzed a mathematics test written by a college professor which consists of n = 40 four-choice questions (one correct answer and three distractors). In the answer key, there were 5 A's, 11 B's, 13 C's, and 11 D's. The significance level of hypothesis testing was fixed at α = .05, and we implemented each hypothesis test discussed in Section 3. As done in the simulation study, we generated the null sampling distribution of each test statistic and then calculated the p-value. The resulting p-values are given in Table 4. The two tests based on the simple alternative hypothesis (S-LR and NT) achieved statistical significance, while the other tests did not. Varying the alternative hypothesis (H_1) after calculating the p-value is not allowed in practice. For illustration purposes only, we considered another alternative hypothesis, H_1: π_A = .1, π_B = .2, π_C = .5, π_D = .2. The resulting p-values were .045 for S-LR and .044 for NT. This example illustrates that S-LR and NT serve as efficient tests when we have a plausible alternative hypothesis based on accumulated experience before observing the data.
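The simulated-null p-value for S-LR in this example can be approximated with the sketch below (our implementation, not the authors' code); the Monte Carlo p-value varies slightly with the seed:

```python
import numpy as np

rng = np.random.default_rng(0)

p0 = np.full(4, 0.25)                 # H0: each letter equally likely
p1 = np.array([.10, .25, .40, .25])   # H1: "C" most common
observed = np.array([5, 11, 13, 11])  # answer-key counts for A, B, C, D
n = observed.sum()                    # n = 40 questions

def s_lr_stat(counts):
    return 2 * np.sum(counts * np.log(p1 / p0))

obs_stat = s_lr_stat(observed)
# Null sampling distribution of the statistic, then a one-sided p-value
null = np.array([s_lr_stat(rng.multinomial(n, p0)) for _ in range(50000)])
p_value = np.mean(null >= obs_stat)
print(p_value)
```

The same template applies to the distractor analysis below by replacing p0, p1, and the observed counts.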

Distractor Analysis
A multiple-choice question can be an effective method to assess students' conceptual thinking (if well designed), and it reduces the burden of grading. The effectiveness of a multiple-choice question depends on its distractors, the choices which serve as wrong answers (University of Wisconsin Oshkosh Testing Services, 2017). For example, in a four-choice question (one correct answer and three distractors), if two distractors are easily identified by students as wrong answers, the four-choice question effectively becomes a true-or-false question. An ideal (conditional) distribution of students' choices over the three distractors would be one-third for each distractor.
To assess students' understanding of the interpretation of a confidence interval, the following sentence was given in a quiz: "Based on a sample of size 132, a 95% confidence interval is calculated as (.48, .72) for the proportion of female students on the campus." Students were asked to select the correct interpretation among the following four choices: (A) 95% of the 132 students in the sample were female; (B) before collecting the sample, a 5% chance was allowed for missing the population proportion of female students, and an estimated proportion of female students is from .48 to .72 based on the collected sample; (C) there is a 95% chance that the true proportion of female students is between .48 and .72; and (D) if we take a sample from the population a large number of times, the true population proportion will fall between .48 and .72. If the three distractors (A), (C) and (D) are plausible, the null hypothesis H_0: π_A = π_C = π_D = 1/3 could be a reasonable assumption. Assuming that (C) is the most common misconception and (A) is not as strong a distractor, the simple alternative hypothesis was specified as H_1: π_A = .1, π_C = .6, π_D = .3. Among the 68 students who took the test, n = 26 students selected one of the three distractors; 4 selected (A), 13 selected (C), and 9 selected (D), so the respective observed proportions are .154, .500, and .346.
The significance level of hypothesis testing was fixed at α = .05, and we implemented each hypothesis test discussed in Section 3. As done in the simulation study, we generated the null sampling distribution of each test statistic and then calculated the p-value. The resulting p-values are given in Table 4. The two tests based on the simple alternative hypothesis (S-LR and NT) achieved statistical significance, while the other tests did not. For illustration purposes only, we considered other alternative hypotheses. When H_1: π_A = .25, π_C = .5, π_D = .25, the resulting p-values were .047 for S-LR and .033 for NT. When H_1: π_A = .3, π_C = .4, π_D = .3, the resulting p-values were .056 for S-LR and .028 for NT. When H_1: π_A = .4, π_C = .3, π_D = .4, which is not supported by the observed data, the resulting p-values were .947 for S-LR and .946 for NT. This example illustrates the benefit of using S-LR and NT for experienced and knowledgeable researchers, but not for all researchers.

Discussion
It is difficult to reject H_0 against a composite H_1 when n is small, particularly when K is large. When a researcher has specific scientific rationale and/or experience to justify a simple alternative hypothesis, statistical power can be significantly increased by the use of a simple H_1 instead of the composite H_1. The simulation results show that we can gain statistical power when a researcher specifies a correct trend, such as decreasing, step, triangular, or platykurtic. For NT and S-LR, the simple H_1: ⃗π = ⃗p_1 does not have to be exactly the truth, and the loss of statistical power due to a small degree of discrepancy between the simple alternative ⃗p_1 and the truth ⃗π was negligible. In particular, a researcher can gain statistical power (relative to other tests based on the composite H_1) when the direction of the one-sided H_1 in terms of μ is consistent with the true value of μ. In other words, if we denote the null, alternative and true values of μ by μ_0, μ_1 and μ_T, respectively, NT and S-LR have consistently shown higher statistical power than the other tests when (μ_1 − μ_0)(μ_T − μ_0) > 0. On the other hand, when (μ_1 − μ_0)(μ_T − μ_0) < 0, NT and S-LR have resulted in nearly zero power. The benefit of using a simple alternative hypothesis is (i) for those who know their scientific problems reasonably well and/or (ii) for those who have a practically meaningful simple H_1 to be tested.

Table 1 .
Four simulation scenarios (A, B, C and D) and four cases (1, 2, 3 and 4) in each scenario