A Competitor of the Kolmogorov–Smirnov Test for the Goodness-of-fit Problem



Introduction
A random sample $x_1, \ldots, x_n$ is drawn from a population $X$ with continuous distribution function $F$, to test the null hypothesis $H_0: F(x) = F_0(x)$ against the alternative $H_1: F(x) \neq F_0(x)$, $x \in \mathbb{R}$, where $F_0$ is completely specified. This common goodness-of-fit problem is usually addressed by three classes of tests: the chi-square test, tests based on spacings, and tests based on the empirical distribution function (edf). In the latter class several test statistics can be considered, usually by adapting their versions for the two-sample problem.
The best-known test based on the edf $F_n$ is surely the Kolmogorov-Smirnov test, which rejects $H_0$ for large values of the test statistic

$$K_n = \sup_t |F_0(t) - F_n(t)|. \tag{1}$$

Other test statistics can be defined by considering the square of the difference $|F_0(t) - F_n(t)|$, as in the Cramér-von Mises test

$$C_n = \int_{-\infty}^{+\infty} [F_0(t) - F_n(t)]^2 \, dF_0(t). \tag{2}$$

Notice that in the above test statistics $F_0$, a continuous model, is compared with $F_n$, which has discontinuities at $x_1, \ldots, x_n$. However, in (1) the supremum of the difference $|F_0(t) - F_n(t)|$ is taken, while in (2) the squared difference $[F_0(t) - F_n(t)]^2$ is integrated with respect to the continuous function $F_0$. Because of these choices, no particular care is needed in defining the value taken by the edf at its points of discontinuity. This means that one can use the usual definition

$$F_n(x) = \frac{i}{n}, \qquad x_{(i)} \le x < x_{(i+1)}, \quad i = 0, \ldots, n, \tag{3}$$

(where $x_{(1)}, \ldots, x_{(n)}$ denotes the ordered sample, $x_{(0)} = -\infty$ and $x_{(n+1)} = +\infty$), which makes $F_n$ right-continuous, or equivalently use the definition (4), in which a constant $c$ chosen in $[0,1]$ lets $F_n$ take any value of its jump at $x_{(i)}$ ($i = 1, \ldots, n$).
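As a minimal sketch of how the statistics in (1) and (2) can be evaluated in practice, the following Python functions work on the transformed order statistics $u_{(i)} = F_0(x_{(i)})$. The function names are illustrative, no normalizing factor in $n$ is applied (an assumption; any such factor leaves the tests unchanged), and the closed form used for (2) is the standard computing formula rather than anything specific to this paper.

```python
def ks_statistic(sample, cdf):
    """Supremum of |F_0(t) - F_n(t)| for the right-continuous edf (no scaling by n)."""
    u = sorted(cdf(x) for x in sample)
    n = len(u)
    # The supremum is reached at, or just before, one of the order statistics.
    return max(max((i + 1) / n - ui, ui - i / n) for i, ui in enumerate(u))


def cvm_statistic(sample, cdf):
    """Integral of [F_0(t) - F_n(t)]^2 dF_0(t) (no scaling by n)."""
    u = sorted(cdf(x) for x in sample)
    n = len(u)
    # Standard computing formula: squared deviations from the jump mid-points
    # (i - 1/2)/n, plus a constant that depends on n only.
    s = sum((ui - (2 * i + 1) / (2 * n)) ** 2 for i, ui in enumerate(u))
    return s / n + 1.0 / (12 * n * n)


if __name__ == "__main__":
    uniform_cdf = lambda x: x  # hypothetical null model: uniform on (0, 1)
    data = [0.12, 0.35, 0.41, 0.77, 0.93]
    print(ks_statistic(data, uniform_cdf), cvm_statistic(data, uniform_cdf))
```

Both functions only need the values $F_0(x_{(i)})$, which is why the value assigned to the edf at its jumps plays no role here.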
Turning back to (2), the function with respect to which the squared difference $[F_0(t) - F_n(t)]^2$ is integrated could be replaced by the edf itself. This choice simplifies the test statistic to the finite sum in (5), but the definition of $F_n(x_{(i)})$ now becomes relevant. However, Anderson (1962) pointed out that, when $c = 1/2$ is taken in (4), the statistic $C_n$ in (2) and the statistic in (5) are equivalent, as the former can also be written in the form (6). Besides this equivalence, setting $c = 1/2$ in (4) is, as a matter of fact, a natural choice. Indeed, forcing the edf to take the mid-point of its jump at $x_{(i)}$ seems less arbitrary than choosing any other value in the jump (including the extremes $i/n$ and $(i-1)/n$, $i = 1, \ldots, n$).
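The equivalence noted by Anderson (1962) can be checked with a short calculation. The following is a sketch under two assumptions: no normalizing factor in $n$ is attached to the statistics, and the notation $u_{(i)} = F_0(x_{(i)})$, with $u_{(0)} = 0$ and $u_{(n+1)} = 1$, is introduced here only for convenience.

```latex
\begin{align*}
\int [F_0(t) - F_n(t)]^2 \, dF_0(t)
  &= \sum_{i=0}^{n} \int_{u_{(i)}}^{u_{(i+1)}} \left( u - \frac{i}{n} \right)^2 du \\
  &= \frac{1}{n} \sum_{i=1}^{n} \left[ u_{(i)} - \frac{i - 1/2}{n} \right]^2 + \frac{1}{12 n^2}.
\end{align*}
```

The sum in the last line is exactly what is obtained by integrating the squared difference with respect to $F_n$ when the edf takes the mid-point $(i - 1/2)/n$ of its jump at $x_{(i)}$. The two statistics therefore differ only by a constant depending on $n$, and lead to the same test.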
Notice again that any choice of $F_n(x_{(i)})$, made to give a final form to the statistic in (5), does not affect the usual definition of the edf in the open intervals $(x_{(i)}, x_{(i+1)})$, $i = 1, \ldots, n-1$. However, some modifications of the edf in such intervals have also been proposed in the literature. For instance, Green and Hegazy (1976) pointed out that, when the edf is re-defined as in (7), the criterion $C_n$ in (2) reduces, up to a multiplicative constant, to the statistic in (8), which is shown to lead to a powerful test under some circumstances. Notice that the test statistic in (8) can also be obtained from the statistic in (5) by re-defining accordingly the value of the edf at its discontinuities, that is by setting $F_n(x_{(i)})$ as in (9), which is again the mid-point of the jump of $F_n$ at $x_{(i)}$. Other modifications of the definition of the edf in the open intervals $(x_{(i)}, x_{(i+1)})$ are known. By noticing that the term $i/(n+1)$ is actually the expectation of $F_0(x_{(i)})$ under the null hypothesis, Pyke (1959) proposed a new version of the Kolmogorov-Smirnov criterion (1), which in turn induces a further modification of the definition of the edf (see also Brunk, 1962).
The above remarks will be used in this paper to propose a goodness-of-fit version of the Girone-Cifarelli test, which has mainly been studied for the two-sample problem. The definition of the test statistic for goodness-of-fit purposes raises some questions, which will be addressed in the next section, where the sample properties of the newly proposed test statistic will also be analyzed. Section 3 will report some results of a simulation study, where the proposed test is compared with its most important competitors based on the edf. Section 4 will provide a real-data example and some conclusions.

Girone (1964) proposed a test for the equality of two populations $X$ and $Y$, based on the statistic

$$\int_{-\infty}^{+\infty} |F_n(t) - G_m(t)| \, dH_{m+n}(t), \tag{10}$$

where $F_n$, $G_m$ and $H_{m+n}$ denote respectively the edf's of an $n$-sample from $X$, an $m$-sample from $Y$ and the pooled $(m+n)$-sample. The test was originally proposed in the special case $n = m$ and its sample properties were studied by Cifarelli (1974, 1975). Generalizations for the case $n \le m$ were proposed by Goria (1972), by Borroni (2001) and independently by Schmid and Trede (1995). For the two-sample problem, the Girone-Cifarelli test proved to be superior to other common tests, notably the Kolmogorov-Smirnov test, under a wide set of circumstances. This fact is far from unexpected, as in (10) the whole behavior of the difference $|F_n(t) - G_m(t)|$ is considered, while in the Kolmogorov-Smirnov test only its supremum is taken.
A goodness-of-fit version of the Girone-Cifarelli test would be useful. Using the same setting as in Section 1, the edf $F_n$ of the single $n$-sample is now to be compared with the null model $F_0$. The function with respect to which the difference $|F_n(t) - F_0(t)|$ is integrated could then be the null model $F_0$ or the edf $F_n$. As remarked above, the latter choice greatly simplifies the structure of the test statistic, as

$$\int_{-\infty}^{+\infty} |F_n(t) - F_0(t)| \, dF_n(t) = \frac{1}{n} \sum_{i=1}^{n} |F_n(x_{(i)}) - F_0(x_{(i)})|; \tag{11}$$

as a consequence, the definition of the value taken by the edf at its discontinuities becomes relevant. Following the above suggestion by Anderson (1962) for $C_n$, we can then take $F_n(x_{(i)}) = (i - 1/2)/n$ (12) and define the statistic $A_n$ as in (13). The sample properties of $A_n$ are easily derived from its two-sample equivalent. First of all notice that, $F_0$ being a continuous model, the variables $F_0(x_{(i)})$, $i = 1, \ldots, n$, are distributed as the order statistics of a uniform sample over $[0,1]$ and hence $A_n$ is distribution-free under $H_0$. For small sample sizes, the null distribution of $A_n$ can then be determined by simulation, as pointed out in the next section. Moreover, following Cifarelli (1975), $n^{-1/2} A_n$ is asymptotically distributed as the r.v. in (14).
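To make the definition concrete, the following Python sketch computes a goodness-of-fit statistic of this kind. It assumes that $A_n$ is the plain sum $\sum_i |F_0(x_{(i)}) - (i - 1/2)/n|$; this scaling is inferred from the statement that $n^{-1/2} A_n$ has a limiting distribution and may differ from (13) by a factor depending on $n$ only. The function name and the example data are illustrative.

```python
import math


def girone_cifarelli_stat(sample, cdf):
    """Sum of |F_0(x_(i)) - (i - 1/2)/n| over the ordered sample.

    Assumed scaling: a plain sum, so that n**-0.5 * A_n stays bounded in
    probability under H_0 as n grows.
    """
    u = sorted(cdf(x) for x in sample)
    n = len(u)
    # enumerate() starts at 0, so (i + 0.5) / n is the mid-point of the i-th jump.
    return sum(abs(ui - (i + 0.5) / n) for i, ui in enumerate(u))


if __name__ == "__main__":
    # Hypothetical data, tested against a unit exponential null model.
    exp_cdf = lambda x: 1.0 - math.exp(-x)
    data = [0.21, 0.05, 1.73, 0.88, 0.40, 2.95, 0.64]
    a_n = girone_cifarelli_stat(data, exp_cdf)
    print(a_n, a_n / math.sqrt(len(data)))
```

Only the transformed values $F_0(x_{(i)})$ enter the computation, which is the reason why the statistic is distribution-free under $H_0$.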
Unlike $C_n$, $A_n$ is not equivalent to the statistic obtained by using $F_0$ as the integrating function in (11). This is shown by the expression of the latter statistic, given in (15). Schmid and Trede (1996) considered $\sqrt{n}$ times the statistic in (15) as a test statistic and reported a small simulation study to evaluate its performance. They concluded that its power is quite close to that of the Cramér-von Mises test, without reporting situations where it performs definitely or uniformly better than $C_n$. It should be pointed out that $A_n$, which has a rather simpler form, is not equivalent to the statistic in (15), even if the two tests often have similar powers. Consequently, the next section will first present some results of a simulation study without distinguishing between $A_n$ and the statistic in (15). Afterwards, some insights will be given into the situations where the two tests are likely to perform differently. In a sense, the reported simulation study can be considered an extension of the one by Schmid and Trede (1996), because it locates some alternatives where the test based on $A_n$, along with the one based on the statistic in (15), performs definitely better than the Cramér-von Mises test.

Simulation Study
The first task in developing a goodness-of-fit test based on $A_n$ is to determine its critical values. As mentioned above, since $F_0$ is completely specified and continuous, the transformation $F_0(X)$ gives a uniform distribution over $(0,1)$. Hence the null distribution of $A_n$ can be simulated by randomly generating a large number of samples of fixed size $n$ from such a distribution. The critical values of the test can then be determined by computing the value taken by $A_n$ for each simulated sample, as well as the related frequency distribution. For a selected range of sample sizes and some common significance levels, Table 1 reports the critical values of $n^{-1/2} A_n$ based on $10^6$ simulated samples. For comparison, the last column of Table 1 reports the critical values of the asymptotic distribution of $n^{-1/2} A_n$ (see Section 2). The fast convergence to the asymptotic approximation can be easily appreciated. To get further indications about the accuracy of the asymptotic distribution and the sample sizes needed to use it, the simulated cdf's obtained for fixed values of $n$ were compared with the asymptotic cdf, whose expression is found in Johnson and Killeen (1983). Figure 1 reports the results obtained for $n = 10$. As seen, the asymptotic cdf is very close to the "real" one, even if a certain difference is observed, especially for small values of the variable. However, to develop a goodness-of-fit test based on $A_n$, only the right tail of its null distribution is relevant. Indeed, when the last part of the distribution is considered (say for $x$ such that $\Pr\{A_n \le x\} > 0.8$), the finite-sample cdf is rather close to the asymptotic cdf. In more detail, Table 2 reports, for some selected sample sizes, the greatest absolute difference between the two cdf's and the same difference restricted to the right tail of the distribution. From this table, a minimum value of $n = 50$ is advisable to get a correct approximation of the null distribution.
After computing the critical values of the test based on $A_n$, its power can be estimated by simulation as well. This section reports some results of a wide simulation study conducted to this end. Notice that the power of a goodness-of-fit test depends on the model $F_0$ chosen under $H_0$ as well as on the real cdf of $X$ under $H_1$, which will be denoted by $F_1$. Generally $F_1$ will belong to a family of distributions containing $F_0$ itself, which is hence obtained by an appropriate choice of the parameter(s) of the family. In this paper we focus on three models for $H_0$ that are widely used in applications: the standard Normal distribution, the unit exponential distribution and the uniform distribution on the unit interval. For each null model, $F_1$ will belong to three different families of distributions containing $F_0$.
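The simulation of the critical values described above can be sketched as follows. This is a small-scale illustration, not the code used for Table 1: the function name, the default number of replications and the seed are assumptions, and $A_n$ is taken to be the plain sum of the absolute deviations $|F_0(x_{(i)}) - (i - 1/2)/n|$.

```python
import math
import random


def simulate_critical_value(n, alpha, reps=20000, seed=0):
    """Monte Carlo estimate of the upper-alpha critical value of n**-0.5 * A_n."""
    rng = random.Random(seed)
    stats = []
    for _ in range(reps):
        # Under H_0 the values F_0(x_(i)) behave like ordered uniform variates.
        u = sorted(rng.random() for _ in range(n))
        a_n = sum(abs(ui - (i + 0.5) / n) for i, ui in enumerate(u))
        stats.append(a_n / math.sqrt(n))
    stats.sort()
    # Empirical (1 - alpha) quantile of the simulated null distribution.
    k = min(reps - 1, int((1 - alpha) * reps))
    return stats[k]


if __name__ == "__main__":
    for alpha in (0.10, 0.05, 0.01):
        print(alpha, simulate_critical_value(10, alpha))
```

With a fixed seed the same simulated samples are reused for every significance level, so the estimated critical values are automatically ordered across levels. Many more replications than the default used here are needed to stabilize the extreme quantiles.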
Consider first the standard Normal as the null model. A suitable family for $F_1$ is the skew-normal (SN) distribution (see Azzalini, 1985) with density

$$f(x; a) = 2\,\phi(x)\,\Phi(a x), \qquad x \in \mathbb{R}, \tag{16}$$

where $\phi(\cdot)$ and $\Phi(\cdot)$ denote the density and the cdf of a standard Normal respectively. The parameter $a \in \mathbb{R}$ regulates the skewness of the distribution, giving a standard Normal when set to zero. Using family (16) for $F_1$ in a simulation study can therefore provide a useful analysis of situations where the researcher needs to test normality against possible asymmetry of the data. It is known, however, that data may also depart from normality because of heavy tails. To simulate such situations, the Student's T density can be used as a family for $F_1$:

$$f(x; a) = \frac{\Gamma\left(\frac{a+1}{2}\right)}{\sqrt{a\pi}\,\Gamma\left(\frac{a}{2}\right)} \left(1 + \frac{x^2}{a}\right)^{-\frac{a+1}{2}}, \qquad x \in \mathbb{R}, \tag{17}$$

where $\Gamma(\cdot)$ denotes the gamma function. The family (17) gives only symmetric distributions with heavy tails, which become lighter as the parameter $a > 0$ increases; as is well known, the family converges to the normal distribution as $a \to \infty$. Finally, to simulate cases where the normality of data relies on the central limit theorem, one can choose for $F_1$ the gamma (GA) density with unit scale:

$$f(x; a) = \frac{x^{a-1} e^{-x}}{\Gamma(a)}, \qquad x > 0. \tag{18}$$

As an effect of the above theorem, when $a \to \infty$, family (18) gives a normal density (which can then be standardized to be consistent with the null model $F_0$). However, in applications, $a$ may not be large enough to guarantee a good convergence; the researcher may then need a powerful test to detect such a failed convergence. Table 3 reports the results of a set of simulations, each based on $10^5$ replications, for the null standard normal model. Some selected alternative distributions, all belonging to the families described above, are chosen.
In the Anderson-Darling statistic $D_n$, the squared difference $[F_n(t) - F_0(t)]^2$ is weighted to gain sensitivity in the tails of the distributions.
The powers reported in Table 3 were obtained by fixing three different values of the significance level $\alpha$: 0.01, 0.05 and 0.1 (for each entry, the corresponding powers are listed in this order; the best power is also highlighted). It seems, however, that the performance of none of the considered tests is really affected by the choice of $\alpha$. Moreover, the first row of Table 3 reports the estimated actual significance level, which is always very close to the nominal one, even for a sample size as small as $n = 10$ (similar results, not reported here for the sake of brevity, were obtained for larger sample sizes). Notice that, for each considered alternative distribution, the values of the related parameter were set to allow a meaningful comparison of the estimated powers; this implies, incidentally, that in most cases the same value of the parameter cannot be chosen for all sample sizes. However, Table 3 (along with the following tables) was built so that at least one common value of the parameter is chosen for adjacent sample sizes. Table 3 shows that, when used as a test of normality, the Girone-Cifarelli test has good power against some kinds of alternatives. More specifically, the test outperforms all the other considered tests (notably the Kolmogorov-Smirnov and the Cramér-von Mises tests) when the alternative distribution belongs to the skew-normal class. The superiority of the Girone-Cifarelli test for skewed alternatives seems indeed to be a general rule, at least among the considered tests, as further shown in the following simulations. When normality is tested against heavy-tailedness, as for the Student's T alternatives considered in Table 3, the performance of the Girone-Cifarelli test gets worse, notably with respect to the Anderson-Darling test. This result is far from unexpected, but it should be underlined that $A_n$ still keeps its superiority over the Cramér-von Mises test (and the Kolmogorov-Smirnov test). The superiority of the Anderson-Darling test also characterizes Gamma alternatives. In this case, however, the Girone-Cifarelli test performs worse even than the Cramér-von Mises and the Kolmogorov-Smirnov tests. A global evaluation of Table 3 shows that, as expected, the performances of the considered tests become similar as the sample size increases, even if all the above conclusions still hold. Notably, $A_n$ outperforms the other considered tests for skew-normal alternatives, as $D_n$ does for Student's T and Gamma alternatives. However, in the latter case, the Girone-Cifarelli test seems to improve relative to its competitors as the sample size increases.
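A power estimate of the kind reported in Table 3 can be sketched as below for a skew-normal alternative. This is an illustration under stated assumptions, not the code behind the tables: the function names, the number of replications and the seed are hypothetical, $A_n$ is taken as the plain sum of absolute deviations, and skew-normal variates are generated through the well-known representation $\delta |Z_0| + \sqrt{1 - \delta^2}\, Z_1$ with $\delta = a / \sqrt{1 + a^2}$ and $Z_0$, $Z_1$ independent standard Normal variables.

```python
import math
import random


def normal_cdf(x):
    """Standard Normal cdf, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def a_n(u_sorted):
    """Assumed form of A_n, computed on sorted values of F_0(x_(i))."""
    n = len(u_sorted)
    return sum(abs(ui - (i + 0.5) / n) for i, ui in enumerate(u_sorted))


def skew_normal_draw(rng, a):
    """One draw from the skew-normal family with skewness parameter a."""
    delta = a / math.sqrt(1.0 + a * a)
    return delta * abs(rng.gauss(0, 1)) + math.sqrt(1.0 - delta * delta) * rng.gauss(0, 1)


def estimate_power(n, a, alpha=0.05, reps=5000, seed=1):
    """Share of simulated skew-normal samples for which H_0 (standard Normal) is rejected."""
    rng = random.Random(seed)
    # Step 1: critical value from the simulated null distribution.
    null = sorted(a_n(sorted(rng.random() for _ in range(n))) for _ in range(reps))
    crit = null[min(reps - 1, int((1 - alpha) * reps))]
    # Step 2: rejection frequency under the alternative.
    rejections = 0
    for _ in range(reps):
        u = sorted(normal_cdf(skew_normal_draw(rng, a)) for _ in range(n))
        if a_n(u) > crit:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    for a in (0.0, 1.0, 2.0, 5.0):
        print(a, estimate_power(20, a))
```

For $a = 0$ the alternative coincides with the null model, so the estimate returned is the actual significance level of the test, which is the quantity shown in the first row of the tables.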
Another set of simulations was conducted with the unit exponential distribution as the null model. This assumption is typical for many datasets in reliability analyses. In this kind of application, exponentiality often has to be tested against more complicated distributional assumptions. To this end, a natural choice for $F_1$ is again the gamma (GA) density with unit scale: when $a = 1$, (18) reduces to the unit exponential. Another family of distributions, also widely used in reliability analyses, is the Weibull (W) density with unit scale,

$$f(x; a) = a\,x^{a-1} e^{-x^a}, \qquad x > 0, \tag{20}$$

which was used as a family for $F_1$, after noticing that it reduces to the unit exponential when $a = 1$. Finally, a third family was used to shape the alternative hypothesis, namely the generalized-Pareto (GP) density with zero location and unit scale,

$$f(x; a) = (1 + a x)^{-(1/a + 1)}, \qquad x > 0, \quad 1 + a x > 0, \tag{21}$$

which gives the unit exponential density as $a \to 0$. The best results for the Girone-Cifarelli test were obtained for the Gamma alternatives, as shown by Table 4, which has the same settings as Table 3. $A_n$ outperforms all the other considered tests, notably the Cramér-von Mises test. The Anderson-Darling test generally has a lower power than $A_n$, even if it becomes its main competitor as $n$ increases. It should be emphasized that the simulated powers reported for Gamma alternatives in Table 4 cannot be compared with those reported in Table 3, as the null distribution differs between the two sets of simulations. More specifically, when the null model is the Normal distribution, the power of each considered test is a decreasing function of the parameter $a$ in (18); conversely, when the null model is the unit exponential distribution, the power is an increasing function of $a$, at least if $a > 1$. This fact explains why, even if the sample size and the value of $a$ may coincide, Table 4 and Table 3 report quite different values of the estimated powers.
In the simulations reported in Table 4, both the null and the alternative distributions are skewed. To consider other cases where a symmetric null model is tested against skewed alternatives, as for the standard Normal case above, a third set of simulations is finally reported. The uniform distribution on the unit interval $(0, 1)$ is used as the null model. Some "modifications" of the uniform density are considered as alternatives. The first, with density (22) (where $a > 0$), is labeled MU; the second,

$$f(x; a) = \frac{1}{1 - 2a}, \qquad a < x < 1 - a, \tag{23}$$

is essentially a "compressed" uniform (CU) distribution over the interval $(a, 1 - a)$, where $0 \le a < 1/2$. Both densities reduce to the uniform distribution on the unit interval when $a = 0$. They were taken from the study by Schmid and Trede (1996). To complete this investigation, a third alternative family is considered here for $F_1$:

$$f(x; a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\, x^{a-1} (1 - x)^{b-1}, \qquad 0 < x < 1. \tag{24}$$

The Beta (B) density in (24) reduces to the uniform distribution on $(0, 1)$ when $a = b = 1$. In the reported simulation study, $b$ is set to 1 and $a > 0$ is left to vary. Notice that, as $a$ grows above 1, the distribution becomes more skewed, giving exactly the needed kind of alternatives. Table 5 reports some results of this last set of simulations. The alternatives of kind (22) and (23) show that all the considered tests suffer from a problem of bias, to which the Anderson-Darling test seems to be the most exposed. A second remark is that the performance of $A_n$ is similar to that of $C_n$, even if the former has a higher estimated power almost everywhere. Both the Girone-Cifarelli and the Cramér-von Mises tests are outperformed by the Anderson-Darling test for sample sizes as large as $n = 100$ (for alternatives (23) even from $n = 25$). These conclusions add a few extra details to those obtained by Schmid and Trede (1996) for the test based on the statistic in (15), whose performance is actually similar to that of the Girone-Cifarelli test. However, the alternatives of the Beta class represent a considerable addition to the evaluation of such tests of uniformity: the power of $A_n$, as well as that of the statistic in (15) (unreported), is here steadily above that of the other considered tests, notably above that of $C_n$, with some minor exceptions for $D_n$. Even if Table 5 reports just the results for selected values of the parameter $a$ in (24), the simulation study showed the Girone-Cifarelli test to be uniformly more powerful than the Cramér-von Mises test for $a > 1$.
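Several of the alternative families used for the exponential and uniform null models have closed-form quantile functions, so samples can be generated by the inverse-transform method. The sketch below assumes the standard parametrizations (Weibull and generalized-Pareto with unit scale, Beta with $b = 1$, compressed uniform on $(a, 1 - a)$); the function names are illustrative and the paper does not state how its samples were generated.

```python
import math
import random


def weibull_quantile(u, a):
    """Quantile of the unit-scale Weibull with shape a (unit exponential when a = 1)."""
    return (-math.log(1.0 - u)) ** (1.0 / a)


def gen_pareto_quantile(u, a):
    """Quantile of the generalized Pareto with zero location, unit scale and shape a != 0."""
    return ((1.0 - u) ** (-a) - 1.0) / a


def beta_a1_quantile(u, a):
    """Quantile of the Beta(a, 1) distribution, whose cdf is x**a on (0, 1)."""
    return u ** (1.0 / a)


def compressed_uniform_quantile(u, a):
    """Quantile of the uniform distribution on (a, 1 - a), with 0 <= a < 1/2."""
    return a + (1.0 - 2.0 * a) * u


def draw_sample(quantile, a, n, rng):
    """Inverse-transform sampling: apply the quantile function to uniform variates."""
    return [quantile(rng.random(), a) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(2)
    print(draw_sample(weibull_quantile, 1.5, 5, rng))
    print(draw_sample(beta_a1_quantile, 2.0, 5, rng))
```

Each generator returns the null model for the parameter value at which the family reduces to it ($a = 1$ for the Weibull and Beta cases, $a = 0$ for the compressed uniform, and the limit $a \to 0$ for the generalized Pareto).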
The discussion of Table 5 raises an important issue to be considered before giving some general conclusions in the next section. As stated from the very beginning, the Girone-Cifarelli test often performs similarly to the test based on the statistic in (15); this fact emerged clearly from the simulation study and is essentially the reason why no separate results for the latter are reported in the above discussion. However, $A_n$ and the statistic in (15) are not equivalent, as shown by the simple decomposition in (25), which involves the set $\mathcal{A} \equiv \{ i : (i-1)/n < F_0(x_{(i)}) < i/n \}$. Notice that the set $\mathcal{A}$ is not empty (and thus the two statistics are not equivalent) as long as the empirical distribution function is not dominated by the null model $F_0$ (or conversely). Hence possible differences in the powers of the two tests are likely to be observed when the alternative distribution does not dominate the null model (or conversely), a condition that can be partially guaranteed by letting the two distributions have the same location. A last set of simulations was then conducted where the alternative distribution was forced to have the same mean as the null model. Indeed, some of the alternative distributions reported above do not meet this requirement. In addition, small values of the sample size were chosen, as the effect of the second summand in (25) is likely to decrease with $n$. Table 6 reports some results when $A_n$ and the statistic in (15) are used to test unit exponentiality against other skewed alternatives, a situation which proved favorable to both tests against their classical competitors. On average, $A_n$ still performs similarly to the statistic in (15), even if there are cases where the difference in their powers becomes relevant. Notice that, with some minor exceptions, the Girone-Cifarelli test never has a lower power than the test based on the statistic in (15). $A_n$ outperforms it for Gamma and notably for generalized-Pareto alternatives. The Weibull case is less clear-cut, as $A_n$ has only a minor advantage, and not for very small sample sizes.
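The role of the set $\mathcal{A}$ can be illustrated numerically. The sketch below works on the uniform scale $u_{(i)} = F_0(x_{(i)})$ and compares the integral of $|F_n - F_0|$ with respect to $F_0$, computed piece by piece, with a decomposition into $A_n / n$ plus a correction over the indices in $\mathcal{A}$. The correction term used here, $(1/(2n) - |u_{(i)} - (i - 1/2)/n|)^2$, was derived for this illustration and may differ in form from (25); $A_n$ is again assumed to be the plain sum of absolute deviations, and all function names are hypothetical.

```python
def abs_integral(c, lo, hi):
    """Integral of |c - t| for t between lo and hi (lo <= hi)."""
    if c <= lo:
        return ((hi - c) ** 2 - (lo - c) ** 2) / 2
    if c >= hi:
        return ((c - lo) ** 2 - (c - hi) ** 2) / 2
    return ((c - lo) ** 2 + (hi - c) ** 2) / 2


def l1_wrt_f0(u):
    """Integral of |F_n - F_0| dF_0 on the uniform scale, for sorted u in (0, 1)."""
    n = len(u)
    knots = [0.0] + list(u) + [1.0]
    # The edf equals i/n between the i-th and the (i+1)-th order statistic.
    return sum(abs_integral(i / n, knots[i], knots[i + 1]) for i in range(n + 1))


def a_n(u):
    """Assumed form of A_n: sum of |u_(i) - (i - 1/2)/n| for sorted u."""
    n = len(u)
    return sum(abs(ui - (i + 0.5) / n) for i, ui in enumerate(u))


def decomposition(u):
    """A_n / n plus a correction over the indices whose u_(i) falls inside its own jump."""
    n = len(u)
    correction = 0.0
    for i, ui in enumerate(u):
        if i / n < ui < (i + 1) / n:  # index belongs to the set A
            correction += (0.5 / n - abs(ui - (i + 0.5) / n)) ** 2
    return a_n(u) / n + correction


if __name__ == "__main__":
    u = [0.05, 0.40, 0.45, 0.90]
    print(l1_wrt_f0(u), decomposition(u))
```

When no order statistic falls inside its own jump, the correction vanishes and the two statistics coincide up to the factor $1/n$. Each correction term is at most $1/(4n^2)$, which is consistent with the remark that the difference between the two tests fades as $n$ grows.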
This paper presents a simulation study which gives some new insights into goodness-of-fit tests based on the empirical distribution function. The main conclusion is that a good analysis should never neglect tests based on the averaged absolute difference $|F_n(t) - F_0(t)|$. The tests based on $A_n$ and on the statistic in (15) both serve this purpose, even if the former can give some slight advantages over the latter, at least for small sample sizes. Moreover, the test statistic $A_n$ has a rather simple form and can be computed very easily. A second important conclusion is that $A_n$ (like the statistic in (15)) very often performs differently from $C_n$, which is based on the averaged squared difference $[F_n(t) - F_0(t)]^2$. The reported simulations give good evidence of alternatives where $A_n$ outperforms $C_n$. Specifically, this seems to happen more frequently for skewed alternatives. Concerning the Kolmogorov-Smirnov test $K_n$, which considers the supremum and not an average of the difference $|F_n(t) - F_0(t)|$, the reported study shows that there are few practical situations where it performs better than the other considered tests, notably $A_n$. The superiority, under some circumstances, of the Girone-Cifarelli test over the Kolmogorov-Smirnov test has indeed been shown in other studies concerning the two-sample problem. As a last issue, one can claim that the real competitor of $A_n$ (and similarly of the statistic in (15)) is the Anderson-Darling test $D_n$, rather than $K_n$ or $C_n$. The discussion in this paper shows that there are cases where $D_n$ outperforms all other considered tests, leaving $A_n$ as second best. These are mainly cases of alternatives with heavy tails, probably thanks to the weighting function in the definition of $D_n$. An important topic for future research could then be to evaluate the effect of introducing suitable weighting functions in the definition of $A_n$ as well.

Figure 1. Comparison of the "real" cdf and the asymptotic cdf of $A_n$ under $H_0$ ($n = 10$)

Table 2. Greatest absolute difference between the "real" cdf and the asymptotic cdf of $A_n$ under $H_0$ (whole distribution and right tail)

Turning to the other distributions considered in Table 4, one can notice that the above conclusions are reversed for alternatives of the generalized-Pareto kind, as $D_n$ here outperforms all the other tests; $A_n$ has a power similar to that of the Cramér-von Mises test, but it seems to worsen as $n$ increases. The Weibull alternatives reveal a problem of bias for some tests under some circumstances; apart from this fact, this case resembles the Student's T alternatives of Table 3: except for $n = 5$, the test based on $D_n$ has definitely the best power, but $A_n$ clearly outperforms the Cramér-von Mises and the Kolmogorov-Smirnov tests.

Table 5. Simulated powers when the null model is a uniform distribution on the unit interval ($\alpha$ = 0.01, 0.05, 0.1)

Table 5 (continued). Simulated powers when the null model is a uniform distribution on the unit interval ($\alpha$ = 0.01, 0.05, 0.1)

Table 6. Comparison between the powers of $A_n$ and of the statistic in (15) when the null and the alternative distributions have the same location ($H_0$ = unit exponential, $\alpha$ = 0.01, 0.05, 0.1)

Table 6 (continued). Comparison between the powers of $A_n$ and of the statistic in (15) when the null and the alternative distributions have the same location ($H_0$ = unit exponential, $\alpha$ = 0.01, 0.05, 0.1)