A New Transformed t-test for Skewed Data: A Goodness-of-fit Approach

A new transformed two-sample t-test is proposed for testing the equality of two population means for skewed distributions, in which the transformation is chosen by a univariate normal goodness-of-fit test applied to the combined sample. The small-sample performance of the proposed test is compared empirically with the untransformed t-test and its nonparametric analogue, the Wilcoxon rank sum test, using real-life examples and simulations from skewed distributions with varying skewness. The results indicate that the proposed test keeps the estimated level of significance close to the nominal level and is more powerful than the untransformed t-test and the Wilcoxon rank sum test for skewed distributions.


Introduction
Let $X = (X_1, X_2, \ldots, X_m)$ and $Y = (Y_1, Y_2, \ldots, Y_n)$ be two independent random samples from two populations having means $\mu_X = E(X)$ and $\mu_Y = E(Y)$, respectively. We wish to test the null hypothesis $H_0: \mu_X = \mu_Y$, that is, that the two populations from which the samples are drawn have the same mean. For testing $H_0$, the standard statistical models usually assume that the two population distributions are normal with a common unknown variance $\sigma^2$. Under this assumption, a pooled estimator of $\sigma^2$ is given by
$$s_p^2 = \frac{(m-1)\,s_X^2 + (n-1)\,s_Y^2}{m+n-2},$$
where $s_X^2$ and $s_Y^2$ are the two sample variances, and, under $H_0$, the test statistic
$$t = \frac{\bar{X} - \bar{Y}}{s_p\sqrt{1/m + 1/n}}$$
follows Student's t-distribution with $m + n - 2$ degrees of freedom. This test is the uniformly most powerful unbiased test (see, e.g., Lehmann 1994), and is ubiquitous in statistical practice for making inference about the difference of two population means.
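As a minimal sketch of the pooled statistic described above, the following Python fragment computes the t statistic and its degrees of freedom from two samples. The function name `pooled_t_statistic` and the numbers are illustrative assumptions only; they do not come from the article, which uses R for its computations.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(x, y):
    """Return (t, df) for the equal-variance two-sample t-test."""
    m, n = len(x), len(y)
    # Pooled estimate of the common variance; statistics.variance uses divisor size - 1.
    sp2 = ((m - 1) * variance(x) + (n - 1) * variance(y)) / (m + n - 2)
    t = (mean(x) - mean(y)) / math.sqrt(sp2 * (1.0 / m + 1.0 / n))
    return t, m + n - 2


# Made-up numbers, for illustration only.
t_stat, df = pooled_t_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(round(t_stat, 4), df)  # -1.2247 4
```

The statistic is then compared with the critical value of the t distribution with `df` degrees of freedom.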
In real life, the assumption of normality is often invalid or unmet. One option is then to use the nonparametric analog of the t-test, namely, the Wilcoxon rank sum test (Wilcoxon, 1945) or the Mann-Whitney U test (Mann & Whitney, 1947), which does not require normality of the data for the validity of the inference. Alternatively, one may apply the t-test to the data after an appropriate transformation. With transformation as an option, the common practice is to re-express the data to achieve normality and then implement the t-test (Mosteller & Tukey, 1977; Atkinson, 1985). In an oft-cited paper, Box and Cox (1964) suggested a power transformation for positive observations to achieve normality. Since then, the Box-Cox transformation has been widely used in problems of statistical inference.

Methods
In this section, we review some popular tests for comparing two groups with respect to their locations (means or medians). Section 2.1 presents a brief review of the nonparametric Wilcoxon rank sum test for completeness of the comparison. A Box-Cox transformed t-test obtained via the maximum likelihood method is discussed in section 2.2. The new transformation using the univariate normal goodness-of-fit is discussed in section 3. Examples from a real-life situation and from simulated data appear in section 4 to demonstrate the application and performance of the proposed test as compared with the other tests described. A simulation study is carried out in section 5 to compare the finite sample performance of all tests considered in this article. Results and discussion from the examples and the simulation study appear in section 6. The concluding remarks of the study appear in section 7.

Wilcoxon Rank Sum Test
The nonparametric Wilcoxon rank sum test, also known as the Mann-Whitney U test, is well known and is preferable to the two-sample t-test when the two populations from which the samples come depart from normality. Let $\{X_1, \ldots, X_m\}$ and $\{Y_1, \ldots, Y_n\}$ be two independent samples from two populations with continuous cdfs $F_X$ and $F_Y$ and location parameters $\theta_X$ and $\theta_Y$, respectively. Then, the basic null hypothesis of the Wilcoxon rank sum test is that the two populations have an identical distribution (Gibbons & Chakraborti, 2014; Kvam & Vidakovic, 2007; Desu & Raghavarao, 2004). That is, $H_0: F_X(x) = F_Y(x)$ for all $x$. Note that when the two random variables $X$ and $Y$ have the identical distribution, they have the same median or mean, say, $\theta_X = \theta_Y$. Then, one can test the equality of the two location parameters using the test $H_0: \theta_X = \theta_Y$ or, equivalently, $H_0: \mu_X = \mu_Y$.
In order to test $H_0$, the Mann-Whitney test ($U_X$) compares each $X_i$, $i = 1, 2, \ldots, m$, with each $Y_j$, $j = 1, 2, \ldots, n$, and is defined as follows:
$$U_X = \sum_{i=1}^{m}\sum_{j=1}^{n} I(Y_j < X_i) = \sum_{i=1}^{m} (R_i - i),$$
where $R_1 < R_2 < \cdots < R_m$ are the ordered ranks of the "X"-observations in the combined sample. On the other hand, the Wilcoxon rank sum test ($W_X$) is defined in terms of the "sum of ranks in the combined sample": $W_X = \sum_{i=1}^{m} R_i$. It is easy to verify that $U_X$ and $W_X$ are connected by the equation $W_X = U_X + m(m+1)/2$ (e.g., see Gibbons & Chakraborti, 2014; Kvam & Vidakovic, 2007; Desu & Raghavarao, 2004). In view of this relationship, one can use either of the statistics $U_X$ or $W_X$, or the similarly defined $U_Y$ or $W_Y$, for testing $H_0$.
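For example, given a level of significance $\alpha$, inference using the Wilcoxon rank sum statistic is made by comparing the observed statistic with the critical values of its null distribution or, equivalently, by comparing its p-value with $\alpha$. We implement this test using the statistical software R.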
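The link between the Mann-Whitney count and the Wilcoxon rank sum can be checked numerically. The sketch below is illustrative only: it assumes no ties in the combined sample, the function name `rank_sum_statistics` is hypothetical, and the data are made up.

```python
def rank_sum_statistics(x, y):
    """Return (U, W) for sample x, assuming no ties in the combined sample."""
    m = len(x)
    combined = sorted(list(x) + list(y))
    # 1-based ranks of the X-observations in the combined sample.
    ranks_x = sorted(combined.index(v) + 1 for v in x)
    w = sum(ranks_x)
    # U counts the pairs (X_i, Y_j) with Y_j < X_i.
    u = sum(1 for xi in x for yj in y if yj < xi)
    # Identity linking the two statistics: W = U + m(m + 1) / 2.
    assert w == u + m * (m + 1) // 2
    return u, w


# Made-up data: the X ranks in the combined sample are 1, 4 and 5.
print(rank_sum_statistics([1.5, 3.2, 4.8], [2.1, 2.9, 5.0, 6.3]))  # (4, 10)
```

Because the two statistics differ only by a constant that depends on the sample size, either one leads to the same test decision.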

Box-Cox Transformed Test
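As an alternative to the Wilcoxon rank sum test, one can use the Box-Cox transformation (Box & Cox, 1964) to achieve normality before applying the t-test when the data deviate from normality. For simplicity of presentation, let $X = (X_1, \ldots, X_m)$ and $Y = (Y_1, \ldots, Y_n)$ be positive random variables having a positively skewed distribution or otherwise deviating from normality. Given a scalar $\lambda$, the Box-Cox power transformation of the sample $X$, denoted $X(\lambda)$, is defined by
$$X_i(\lambda) = \begin{cases} \dfrac{X_i^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\ \log X_i, & \lambda = 0, \end{cases} \qquad i = 1, \ldots, m. \tag{1}$$
The transformation of $Y$, denoted $Y(\lambda)$, is defined in a similar way.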
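Let $\bar{X}(\lambda) = \frac{1}{m}\sum_{i=1}^{m} X_i(\lambda)$ be the mean of the transformed sample $X(\lambda)$. Let $\bar{Y}(\lambda)$ be defined similarly. Let $\hat{\sigma}^2(\lambda)$ be the pooled maximum likelihood estimate of the variance of the transformed data, given by
$$\hat{\sigma}^2(\lambda) = \frac{1}{m+n}\left[\sum_{i=1}^{m}\left(X_i(\lambda) - \bar{X}(\lambda)\right)^2 + \sum_{j=1}^{n}\left(Y_j(\lambda) - \bar{Y}(\lambda)\right)^2\right].$$
Given that the transformation (1) succeeds in transforming the data to fit a normal model, the profile log-likelihood function for the transformation parameter $\lambda$ is
$$\ell(\lambda) = -\{(m+n)/2\}\log\hat{\sigma}^2(\lambda) + (\lambda - 1)\sum_{i=1}^{m}\log X_i + (\lambda - 1)\sum_{j=1}^{n}\log Y_j. \tag{2}$$
The maximum likelihood estimate $\hat{\lambda}_{ML}$ is the value of $\lambda$ that maximizes (2). Box and Cox (1964), Hinkley (1975), and Hernandez and Johnson (1980) investigated the asymptotic properties of the parameter estimates. Bickel and Doksum (1981) critically examined the behavior of the asymptotic variances of the parameter estimates for regression and analysis of variance situations. Chen and Loh (1992) and Chen (1995) proved that the Box-Cox transformed t-test is typically more efficient asymptotically than the t-test without transformation. Islam and Chen (2007) justified the use of the transformed t-test by fitting a t distribution to the transformed data.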
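To make the likelihood-based choice of the transformation parameter concrete, the following sketch evaluates the Box-Cox transformation and the two-sample profile log-likelihood on a grid from -1 to 1 in steps of 0.1, the search range the article reports using. The function names and the data are hypothetical; this is an illustration under those assumptions, not the article's R code.

```python
import math


def box_cox(values, lam):
    """Box-Cox power transformation of strictly positive values."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def profile_loglik(x, y, lam):
    """Profile log-likelihood of lam for two samples sharing one variance."""
    xt, yt = box_cox(x, lam), box_cox(y, lam)
    mean_x, mean_y = sum(xt) / len(xt), sum(yt) / len(yt)
    ss = sum((v - mean_x) ** 2 for v in xt) + sum((v - mean_y) ** 2 for v in yt)
    total = len(x) + len(y)
    sigma2 = ss / total  # pooled ML variance: divisor m + n, not m + n - 2
    log_jacobian = (lam - 1.0) * (sum(map(math.log, x)) + sum(map(math.log, y)))
    return -0.5 * total * math.log(sigma2) + log_jacobian


def lambda_ml(x, y, grid=None):
    """Grid value of lam that maximizes the profile log-likelihood."""
    grid = grid or [round(-1.0 + 0.1 * k, 1) for k in range(21)]
    return max(grid, key=lambda lam: profile_loglik(x, y, lam))


# Made-up positive, right-skewed data (illustration only).
x = [0.8, 1.1, 1.9, 2.4, 7.5]
y = [0.5, 1.3, 1.6, 3.0, 4.2, 9.8]
print(lambda_ml(x, y))
```

A grid search is used here instead of a numerical optimizer so that the estimate is restricted to the same candidate set as the goodness-of-fit search.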

The New Proposed Transformed T-Test
In this article, we propose a new transformed t-test obtained by applying a univariate normal goodness-of-fit test to the transformed combined sample data. This method is easy to implement using any standard statistical software, and it outperforms the other tests considered in this study when applied to real-life problems and simulations. Below we describe the new method along with an algorithm to implement it.
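Given that the transformation (1) is successful or nearly successful in achieving normality, it is expected that the combined standardized transformed sample $Z(\lambda)$ of size $N = m + n$ comes from a N(0,1) distribution, which for simplicity of presentation we write as
$$Z(\lambda) = \left(Z_1(\lambda), Z_2(\lambda), \ldots, Z_N(\lambda)\right).$$
We propose to estimate $\lambda$ by $\hat{\lambda}$ in such a way that $Z(\hat{\lambda})$ is as close as possible to the true N(0,1) distribution. Viewing this problem as a goodness-of-fit to the normal distribution, we test the hypothesis $H_0$: $Z_1(\lambda), Z_2(\lambda), \ldots, Z_N(\lambda)$ come from a N(0,1) distribution, against $H_1$: $Z_1(\lambda), Z_2(\lambda), \ldots, Z_N(\lambda)$ do not come from a N(0,1) distribution. Following Shapiro and Wilk (1965), we use the test statistic $W(\lambda)$ to test $H_0$, which is given by
$$W(\lambda) = \frac{\left(\sum_{i=1}^{N} a_i Z_{(i)}(\lambda)\right)^2}{\sum_{i=1}^{N}\left(Z_i(\lambda) - \bar{Z}(\lambda)\right)^2}, \qquad (a_1, \ldots, a_N) = \frac{c^{\top}V^{-1}}{\left(c^{\top}V^{-1}V^{-1}c\right)^{1/2}},$$
where $Z_{(1)}(\lambda) \le \cdots \le Z_{(N)}(\lambda)$ are the order statistics, $c$ is the vector of expected values of the order statistics of a sample of size $N$ from N(0,1), $V$ is the corresponding variance-covariance matrix of order $N \times N$, and $v_{i,j} = \mathrm{Cov}\left(Z_{(i)}(\lambda), Z_{(j)}(\lambda)\right)$, $i, j = 1, \ldots, N$, is the covariance between the $i$th and $j$th order statistics.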
The value of $W(\lambda)$ lies between zero and one; a small value of $W(\lambda)$ leads to the rejection of normality, whereas a value close to one indicates normality. Given a level of significance $\alpha$, one may reject the null hypothesis if the p-value $p(\lambda) = P(W \le W(\lambda)) \le \alpha$ and accept it otherwise. We propose to estimate $\lambda$ by observing the maximum p-value associated with $W(\lambda)$ over all possible values of $\lambda$ to achieve the desired normality of the transformed data.
In other words, the new estimate $\hat{\lambda}_{GF}$ using the univariate goodness-of-fit to the N(0,1) distribution satisfies the equation
$$p(\hat{\lambda}_{GF}) = \max_{\lambda \in \Lambda} p(\lambda),$$
where $\Lambda$ is a pre-specified set of values of $\lambda$ considered in the search. In this article, the search for $\hat{\lambda}_{GF}$ is made over the interval [−1, 1] with an increment of 0.1, and therefore, hereafter, we express it by $\Lambda = \{-1: 0.1: 1\}$. Once $\hat{\lambda}_{GF}$ is obtained, we re-express the original samples and apply Student's t-test to the transformed data. We employ the software R in all examples and simulations to obtain the optimum $\hat{\lambda}_{GF}$ and for other computational purposes.
An algorithm for the estimate $\hat{\lambda}_{GF}$ and the transformed test using $\hat{\lambda}_{GF}$ is as follows. Given $X$ and $Y$ and a fixed $\lambda \in \Lambda = \{-1: 0.1: 1\}$,
i. …
ii. …
vii. $\hat{\lambda}_{GF}$ is the value of $\lambda$ corresponding to the maximum p-value in step (vi).
viii. Obtain the transformations $X(\hat{\lambda}_{GF})$ and $Y(\hat{\lambda}_{GF})$.
ix. Perform the usual t-test on the transformed data from step (viii) and decide on the acceptance or rejection of the null hypothesis by comparing the test statistic with the critical value of the t distribution.
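The search over the grid can be sketched in Python with SciPy as follows. This is a hedged illustration, not the article's R implementation: the way the combined sample is standardized (each transformed sample centered at its own mean, both scaled by the pooled standard deviation) is an assumption, since the exact standardization is not recoverable from the text, and the function names and simulated data are hypothetical.

```python
import numpy as np
from scipy import stats


def box_cox(values, lam):
    """Box-Cox power transformation of strictly positive values."""
    values = np.asarray(values, dtype=float)
    return np.log(values) if abs(lam) < 1e-12 else (values ** lam - 1.0) / lam


def standardized_combined(xt, yt):
    # ASSUMPTION: center each sample at its own mean, scale by the pooled SD.
    rx, ry = xt - xt.mean(), yt - yt.mean()
    pooled_sd = np.sqrt((np.sum(rx ** 2) + np.sum(ry ** 2)) / (len(xt) + len(yt) - 2))
    return np.concatenate([rx, ry]) / pooled_sd


def gof_transformed_t_test(x, y, grid=None):
    """Choose lambda by the largest Shapiro-Wilk p-value, then run the pooled t-test."""
    if grid is None:
        grid = np.round(np.linspace(-1.0, 1.0, 21), 1)
    best_lam, best_p = None, -1.0
    for lam in grid:
        z = standardized_combined(box_cox(x, lam), box_cox(y, lam))
        _, p = stats.shapiro(z)  # Shapiro-Wilk statistic and p-value
        if p > best_p:
            best_lam, best_p = float(lam), float(p)
    t_stat, t_p = stats.ttest_ind(box_cox(x, best_lam), box_cox(y, best_lam), equal_var=True)
    return best_lam, best_p, float(t_stat), float(t_p)


# Freshly simulated skewed data for illustration; these are not the article's data.
rng = np.random.default_rng(0)
x = rng.gamma(2.0, 1.0, size=20)
y = 0.8 + rng.gamma(2.0, 1.0, size=20)
print(gof_transformed_t_test(x, y))
```

Note that the Shapiro-Wilk statistic is unchanged by a common shift or rescaling of the combined sample, so only the per-sample centering in `standardized_combined` affects which value of lambda is selected.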

Applications
In this section, we present two examples, one with real-life data and the other with simulated data from a skewed distribution, to show the application and performance of the various tests in deciding whether to accept or reject the equality of two population means.

Example 1
In this example, $X$ and $Y$ refer to sample data of checkout times, in minutes, of two grocery checkers. These data are due to Verzani (2005). Histograms and boxplots of the two samples are displayed in Figure 1 to show the shape of the data. Based on the values of skewness and the shapes seen in the histograms and boxplots, both samples $X$ and $Y$ seem to have positively skewed distributions. Let the population mean difference be $\Delta = \mu_X - \mu_Y$. The results of the test $H_0: \Delta = 0$ against the two-sided alternative $H_1: \Delta \neq 0$ using the various tests discussed in this article appear in Table 1. From Table 1, it follows that all four tests lead to the same conclusion, acceptance of the null hypothesis at the 5% level of significance, with the two transformed tests, $t(\hat{\lambda}_{ML})$ and $t(\hat{\lambda}_{GF})$, outperforming the other two tests with p-values 0.4229 and 0.4111.
It should be noted that the conclusion of Student's t-test, whatever it is, may be misleading because the data do not provide any evidence of normality, which violates the assumption underlying the t-test. The Wilcoxon test assumes that the two distributions are identical under the null hypothesis, and is a popular alternative to Student's t-test for comparing two populations with respect to locations (medians). On the other hand, the conclusions of both transformed t-tests appear to be valid because the transformations were intended to achieve normality.

Example 2
For this example, we simulate the sample $X$ from a $\text{Gamma}(2, 1)$ distribution and the sample $Y$ from a $0.8 + \text{Gamma}(2, 1)$ distribution. Thus, in the population distributions of $X$ and $Y$, the absolute mean difference is $|\Delta| = \mu_Y - \mu_X = 0.8$.
In other words, we simulate the two samples $X$ and $Y$ under the alternative hypothesis $H_1: \Delta \neq 0$. For convenience of presentation, the simulated values are rounded to two decimal places. Histograms and boxplots of the two samples are displayed in Figure 2 to show the shape of the simulated data.

Figure 2. Histograms and boxplots of samples $X$ and $Y$ in Example 2 with their shapes
Since the samples $X$ and $Y$ come from populations with identical variance but different means, we expect the various test statistics to detect the inequality of the two means with strong evidence. The results of the various tests with the corresponding p-values are reported in Table 2. Based on the performance in the two examples presented, it seems reasonable to recommend the new transformed test for skewed data.

Simulation Study
In this section, we carry out a simulation study to compare the finite sample performance of the various tests described in this article, including the proposed t-test. All simulations are performed using the statistical software R, with $\lambda \in \Lambda = \{-1: 0.1: 1\}$. Under the null model, the samples $X$ and $Y$ are simulated from a $\text{Gamma}(a, b)$ population, where $a$ is the shape parameter and $b$ is the scale parameter. Under the alternative model, the samples $X$ and $Y$ are simulated from $\text{Gamma}(a, b) + \Delta$ and $\text{Gamma}(a, b)$ populations, respectively, with mean difference $\Delta > 0$. The mean difference $\Delta$ is arbitrarily chosen from the set {0.15, 0.25, 0.50, 0.65, 1.25} to keep the testing power away from 0 and 1 for the purpose of the comparisons. Note that the skewness of the $\text{Gamma}(a, b)$ distribution is $\gamma = 2/\sqrt{a}$. In the simulations, we choose different values of the parameter $a$ to allow varying levels of skewness in the simulated samples. We fix the value of the parameter $b$ at 1 since it does not affect the skewness of the simulated data.
In all simulations, the Monte Carlo size is 5,000. The power of each test is estimated by the proportion of rejections of the null hypothesis under the alternative, at the 5% level of significance. In a similar manner, the level of significance is estimated by the proportion of rejections of the null hypothesis, at the 5% level of significance, when the null hypothesis is true. Table 3 provides the values of the parameter $a$ used in the simulation of the samples $X$ and $Y$ to allow varying values of the skewness. Table 4 provides the estimated power from the simulation study for varying values of the shape parameter $a$, the sample sizes $(m, n)$ and the mean difference $\Delta = \mu_X - \mu_Y$.
Vol. 9, No. 5; 2020
Table 5 provides the estimated rejection rates under the null distributions at the 5% level of significance, along with the mean and standard deviation of the transformation parameter estimated by maximum likelihood ($\hat{\lambda}_{ML}$) and by the univariate goodness-of-fit technique ($\hat{\lambda}_{GF}$) over a Monte Carlo simulation of size 5,000.
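The Monte Carlo design can be sketched in Python as follows for the two untransformed tests. This is an illustrative sketch under stated assumptions, not the article's R code: the function name `rejection_rate` is hypothetical, the scale is fixed at 1 as in the study, and the default number of replications is far smaller than the 5,000 used in the article so that the example runs quickly.

```python
import numpy as np
from scipy import stats


def rejection_rate(shape, m, n, delta=0.0, reps=500, level=0.05, seed=1):
    """Estimate rejection rates of the pooled t-test and the Wilcoxon rank sum test."""
    rng = np.random.default_rng(seed)
    reject_t = reject_w = 0
    for _ in range(reps):
        x = rng.gamma(shape, 1.0, size=m) + delta  # X ~ Gamma(shape, 1) + delta
        y = rng.gamma(shape, 1.0, size=n)          # Y ~ Gamma(shape, 1)
        _, p_t = stats.ttest_ind(x, y, equal_var=True)
        _, p_w = stats.mannwhitneyu(x, y, alternative="two-sided")
        reject_t += int(p_t <= level)
        reject_w += int(p_w <= level)
    return reject_t / reps, reject_w / reps


# Theoretical skewness 2 / sqrt(shape) for a few shape values mentioned in the text.
for shape in (0.25, 0.5, 1.0, 2.0, 10.0):
    print(shape, round(2.0 / np.sqrt(shape), 3))

# delta = 0 estimates the level of significance; delta > 0 estimates the power.
print(rejection_rate(0.5, 15, 15, delta=0.0))
print(rejection_rate(0.5, 15, 15, delta=0.5))
```

The two transformed tests can be added to the loop in the same way once a function returning their p-values is available.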

Results and Discussion
The results of Example 1 in section 4 suggest that all four tests applied to compare the mean checkout times of the two grocery checkers lead to the same conclusion, acceptance of the equality of the two locations, with p-values 0.3716 (Student's t), 0.2394 (Wilcoxon test), 0.4229 (transformed t-test by the maximum likelihood method) and 0.4111 (proposed transformed t-test). However, given that $X$ and $Y$ seem to have positively skewed distributions, as is evident from the histograms and boxplots in Figure 1 and from the skewness (1.71 for $X$ and 0.86 for $Y$), one may doubt the conclusion of Student's t-test. Regarding the conclusions of the four tests applied to Example 2, only the proposed new test $t(\hat{\lambda}_{GF})$ reached the correct conclusion of rejecting the null hypothesis, given that the data were generated under the alternative. Thus, the proposed test outperforms the other tests in making the right decision.
From Table 4, it is evident that the new transformed test $t(\hat{\lambda}_{GF})$ provides the maximum power for all sample sizes, equal ($m = n$) and unequal ($m \neq n$), among all four tests considered. We consider equal sample sizes ($m = n$) of 10, 15, 20 and 25. Note that a lower value of the shape parameter $a$ corresponds to a higher value of the skewness. To evaluate the performance for varying values of skewness, we consider values of $a$ from 0.25 to 10, with arbitrary increments, so that the skewness decreases from 4 to 0.6, as shown in Table 3. It is also evident that all tests demonstrate higher power as the mean difference $\Delta$ and the sample size increase. The new test $t(\hat{\lambda}_{GF})$ has always performed the best in terms of estimated testing power; the second best has been the $t(\hat{\lambda}_{ML})$ test. As expected, the nonparametric test has demonstrated higher power than Student's t-test. Also, the differences in power among the four tests have decreased as the skewness of the distribution has decreased.
This makes sense because the Wilcoxon and transformed tests are expected to perform better for skewed distributions; the higher the skewness, the better their performance with respect to the testing power. Overall, the proposed new test $t(\hat{\lambda}_{GF})$ outperforms the other three tests in terms of power for all sample sizes and skewness values considered in the simulation.
From the simulated results presented in Table 5, the estimated level of significance for Student's t-test ranges from 0.031 to 0.052, for a 5% nominal level of significance, throughout the simulation under the null hypothesis. Indeed, the estimated levels of significance fall below the nominal level for all sample sizes for highly skewed distributions (e.g., $a$ = 0.25, 0.50) and approach the nominal level as the skewness decreases ($a$ = 1, 2, 10). The estimated rejection rates for the Wilcoxon test are close to the nominal level of 5%, with estimated values ranging from 0.036 to 0.052, under the null hypothesis. On the other hand, the estimated rejection rates for both versions of the transformed tests are comparable at the 5% level of significance, with estimated values ranging from 0.045 to 0.057 for the $t(\hat{\lambda}_{ML})$ test and from 0.048 to 0.058 for the $t(\hat{\lambda}_{GF})$ test, under the null hypothesis.
The estimated average and standard deviation of $\hat{\lambda}_{ML}$ and $\hat{\lambda}_{GF}$ over 5,000 simulations under the null hypothesis are also reported in Table 5, where the search for $\hat{\lambda}_{ML}$ and $\hat{\lambda}_{GF}$ is made in the interval [-1, 1] with an increment of 0.1. It follows that the average and standard deviation of $\hat{\lambda}_{ML}$ and $\hat{\lambda}_{GF}$ depend on the level of skewness of the distributions, with the standard deviation of both decreasing as the sample sizes increase for a given value of skewness. In terms of the average and standard deviation of $\hat{\lambda}_{ML}$ and $\hat{\lambda}_{GF}$, similar conclusions apply under the alternative hypothesis, where powers are calculated; these values are therefore not reported in Table 4 to avoid redundancy.

Concluding Remarks
This article proposes a new transformed t-test in which the Box-Cox transformation to normality is achieved via a univariate normal goodness-of-fit test. To this end, we i) apply the Shapiro-Wilk test to the combined standardized transformed samples to fit the N(0,1) distribution, ii) estimate the best transformation to normality by observing the maximum p-value from the Shapiro-Wilk test over all possible values of $\lambda \in \{-1: 0.1: 1\}$ and iii) apply Student's t-test to the best normal transformed samples to compare the location parameters (means). The performance of the new test relative to Student's t-test, the Wilcoxon test and an existing transformed t-test obtained via the likelihood method has been illustrated by two examples and by simulations in which the data come from skewed distributions (gamma distribution). It is evident that the new test keeps the estimated level of significance close to the nominal level and is more powerful than the other three tests considered for skewed distributions. It is also clear that the higher the skewness, the better the transformed t-tests perform in terms of testing power, with the new transformed $t(\hat{\lambda}_{GF})$ test performing the best. This makes sense because if the data are less skewed or hardly skewed at all, the power transformation is not needed or appropriate. It follows that the power of all tests is sensitive to the mean difference and the sample size; the power of all tests increases with the difference between the two population means and with the size of the samples. Given the performance of the proposed new t-test, in terms of estimated power under the alternative hypothesis and estimated level of significance under the null hypothesis, researchers can use the proposed test with confidence. Overall, the Wilcoxon test is better in power than Student's t-test, and the transformed t-tests are better than the Wilcoxon test, with the new proposed test $t(\hat{\lambda}_{GF})$ demonstrating the highest power.
If researchers are mainly concerned about the estimated level of significance, they might consider the Wilcoxon test because of its robustness. However, if power is the main concern, the new test performs the best.