Comparative Performance of Pseudo-Median Procedure, Welch's Test and Mann-Whitney-Wilcoxon at Specific Pairing

The objective of this study is to investigate the performance of a two-sample pseudo-median based procedure in testing differences between groups. The procedure is a modification of the one-sample Wilcoxon procedure that uses the pseudo-median of the differences between group values as the central measure of location. The test was conducted in a two-group setting with moderate sample sizes under symmetric and asymmetric distributions. The performance of the procedure was measured and evaluated in terms of Type I error and power rates obtained via Monte Carlo methods. These rates were then compared with those of alternative parametric and nonparametric procedures, namely the Welch's test and the Mann-Whitney-Wilcoxon test. The findings reveal that the pseudo-median procedure is capable of controlling its Type I error close to the nominal level when heterogeneity of variances exists. In terms of robustness, the pseudo-median procedure outperforms the Welch's and Mann-Whitney-Wilcoxon tests when distributions are skewed. The pseudo-median procedure is also capable of maintaining high power rates, especially under negative pairing.


Introduction
Testing the equality of central tendency (location) parameters or differences between two groups is a common statistical problem. Under traditional parametric test statistics, it is well known that Student's two-independent-sample t-test (Student, 1908) can be highly unsatisfactory when the distribution of the data is non-normal and variances are unequal (Teh & Othman, 2009; Zimmerman, 2004; Zimmerman & Zumbo, 1993). This test also produces low power under arbitrarily small departures from normality (Keselman, Othman, Wilcox & Fradette, 2004). In cases where distributions are normal but population variances are unequal, Welch (1938) gave a solution to this problem: an approximate degrees of freedom t test. However, Welch's test still has problems in controlling Type I error under non-normal distributions (Algina, Oshima & Lin, 1994; Zimmerman & Zumbo, 1993).
A popular alternative for analyzing data from non-normal populations is to use nonparametric test statistics such as the Mann-Whitney-Wilcoxon test. Nonparametric statistics are insensitive to deviations from normality. However, even though nonparametric methods are distribution free, they are not assumption free; usually the underlying distribution has to be symmetric (Gibbons & Chakraborti, 2003). Nonparametric procedures are more appropriate for data based on weak measurement scales and for symmetric shapes (Syed Yahaya, Othman & Keselman, 2004). In addition, nonparametric procedures are less powerful than parametric ones and therefore require larger sample sizes to reject false hypotheses. Thus, choosing nonparametric tests as alternatives to the classical tests might not guarantee a reliable method, owing to these weaknesses.
To circumvent the effects of assumption violations on the classical procedures, researchers have been advised to adopt heteroscedastic test statistics, replace the conventional methods with permutation tests, or transform their data to achieve normality and/or homogeneity. Some studies suggested substituting robust estimators (e.g. trimmed means and Winsorized variances) for the least squares estimators (i.e. the usual mean and variance). Robustness to non-normality and variance heterogeneity in unbalanced independent group designs can be achieved by using robust estimators with heteroscedastic test statistics, as demonstrated by a number of papers (Keselman, Algina, Wilcox & Kowalchuk, 2000; Keselman, Kowalchuk & Lix, 1998; Wilcox, Keselman, Muska & Cribbie, 2000). This literature also indicates that by applying robust estimators with heteroscedastic test statistics, distortion in rates of Type I error can generally be eliminated. However, the use of trimmed means, for example, requires a percentage of observations to be discarded from the data, which may cause some useful information to be lost.
Over the years, many procedures have been developed to handle violations of these assumptions. However, each of the aforementioned procedures can only handle certain violations and, so far, no single statistical method can be considered ideal. In this study, we propose a statistical procedure based on the pseudo-median to deal with multiple violations such as non-normality, variance heterogeneity and unbalanced group sizes occurring simultaneously. This study also investigates the performance of the pseudo-median procedure in terms of controlling Type I error and maintaining high power rates under these multiple violations. The performance of this procedure was then compared with parametric and nonparametric tests, namely the Welch's test and the Mann-Whitney-Wilcoxon test, respectively. Optimistically, this method will help researchers conduct their research in more flexible situations without having to worry about rigid assumptions.
The rest of the paper is organized as follows. The second section briefly explains the criteria for evaluating the performance of a statistical test. The third section elaborates on the methods used in this study. The design specifications of the data are described in the fourth section, while the fifth section discusses the results. The final section concludes our study.

Performance Evaluation of the Statistical Test
The evaluation of any statistical test involves two attributes, namely Type I error and power rates, both estimated as proportions of rejections. A Type I error happens when a true null hypothesis H0 is incorrectly rejected. The power of a statistical test of H0 is the probability that H0 will be rejected when it is false, that is, the probability of obtaining a statistically significant result or of the test concluding that the phenomenon exists (Cohen, 1988; 1992). A procedure whose Type I error rate stays close to the nominal level is considered robust. If a procedure is able to control its Type I error rates close to the nominal value and simultaneously generates good statistical power, then it is deemed the procedure of choice. These properties are usually used as the criteria for evaluating the performance of a statistical test.

Methods
The pseudo-median procedure is generated from the modification of the one-sample nonparametric Wilcoxon procedure, with the pseudo-median of the differences between group values incorporated as the statistic of interest in a two-group setting. As stated in Hoyland (1965), the pseudo-median of a distribution F is defined as the median of the distribution of (X1 + X2)/2, where X1 and X2 are independently and identically distributed according to F. Similarly, Hollander and Wolfe (1999) noted that the pseudo-median of a distribution F is the median of (Z1 + Z2)/2, where Z1 and Z2 are independent, each with the same distribution F.
In this procedure, suppose X11, X12, ..., X1n1 and X21, X22, ..., X2n2 are samples from distributions F1 and F2, respectively. Let the differences between the observations from both samples be Dij = X1i - X2j, i = 1, 2, ..., n1 and j = 1, 2, ..., n2. Let |Dij| denote the absolute value of the differences and Rij the rank of |Dij|. An indicator function, eij, is defined as in Equation 1.1:

eij = 1 if Dij > 0, and eij = 0 if Dij < 0.   (1.1)

Then the Wilcoxon statistic is defined as in Equation 1.2:

W = sum over i and j of Rij eij.   (1.2)
The pseudo-median is a location parameter and its value has to be estimated. The estimation is done using the Hodges-Lehmann estimator (Hollander & Wolfe, 1999). The Hodges-Lehmann estimator of the pseudo-median, denoted d^, is given in Equation 1.3, where the Zi are the differences between the observations from both samples:

d^ = median{(Zi + Zj)/2, 1 <= i <= j <= n}.   (1.3)

The modification of the Wilcoxon procedure is performed by adding the pseudo-median value to all observations in the second sample. A bootstrap procedure was employed to test the hypothesis given in Equation 1.4, where d is the pseudo-median.
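As a concrete illustration of Equations 1.1 to 1.3, the statistic W and the Hodges-Lehmann estimate of the pseudo-median can be sketched in Python. This is only an illustration (the study itself was programmed in SAS/IML), and ties in |Dij| are broken arbitrarily rather than by midranks.

```python
import numpy as np

def wilcoxon_w(x1, x2):
    """Wilcoxon statistic of Equation 1.2: sum of the ranks Rij of |Dij|
    over the pairs with Dij > 0 (the indicator eij of Equation 1.1)."""
    d = (np.asarray(x1, float)[:, None] - np.asarray(x2, float)[None, :]).ravel()
    ranks = np.empty(d.size)
    ranks[np.argsort(np.abs(d))] = np.arange(1, d.size + 1)  # rank the |Dij|
    return float(ranks[d > 0].sum())

def pseudo_median(z):
    """Hodges-Lehmann estimator of Equation 1.3: the median of the
    pairwise averages (z_i + z_j)/2 over i <= j."""
    z = np.asarray(z, float)
    i, j = np.triu_indices(z.size)
    return float(np.median((z[i] + z[j]) / 2.0))

# The Zi of Equation 1.3 are the differences between the two samples:
x1, x2 = np.array([4.1, 5.3, 6.0]), np.array([1.2, 2.8])
d_hat = pseudo_median((x1[:, None] - x2[None, :]).ravel())
```

For a single-element example, pseudo_median([1, 2, 3]) returns 2.0, the median of the six pairwise averages.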
The algorithm of the bootstrap procedure is enumerated below.
1. Based on the two samples, find W and estimate the pseudo-median, d^.
2. Shift the second sample by adding d^ to all of its members.
3. Draw a bootstrap sample, with replacement, from each of the two samples, taking the second sample in its shifted form.
4. Compute the Wilcoxon statistic W* on the bootstrap samples.
5. Set U = 1 if W* > W and U = 0 otherwise.
6. Set L = 1 if W* < W and L = 0 otherwise.
7. Repeat steps 3 to 6 B times.
8. Count the number of U and the number of L over the B repetitions, where U = 1 or 0 and L = 1 or 0.
9. Calculate the p-value as 2 x minimum (number of L, number of U)/B.
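A minimal Python sketch of the bootstrap test follows, assuming a resampling scheme in which each bootstrap draw resamples both samples (the second one shifted) with replacement and compares W* against the observed W. Function names are illustrative and this is not the authors' SAS/IML code.

```python
import numpy as np

def wilcoxon_w(x1, x2):
    """W = sum of ranks of |Dij| over pairs with Dij > 0 (Equations 1.1-1.2)."""
    d = (np.asarray(x1, float)[:, None] - np.asarray(x2, float)[None, :]).ravel()
    ranks = np.empty(d.size)
    ranks[np.argsort(np.abs(d))] = np.arange(1, d.size + 1)
    return float(ranks[d > 0].sum())

def pseudo_median(z):
    """Hodges-Lehmann estimate: median of the averages (z_i + z_j)/2, i <= j."""
    z = np.asarray(z, float)
    i, j = np.triu_indices(z.size)
    return float(np.median((z[i] + z[j]) / 2.0))

def bootstrap_pm_test(x1, x2, B=599, seed=None):
    """Two-sided bootstrap p-value for the pseudo-median procedure."""
    rng = np.random.default_rng(seed)
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    w_obs = wilcoxon_w(x1, x2)                         # step 1
    d_hat = pseudo_median((x1[:, None] - x2[None, :]).ravel())
    x2_shifted = x2 + d_hat                            # step 2
    n_u = n_l = 0
    for _ in range(B):                                 # steps 3-7
        b1 = rng.choice(x1, size=x1.size, replace=True)
        b2 = rng.choice(x2_shifted, size=x2.size, replace=True)
        w_star = wilcoxon_w(b1, b2)
        n_u += w_star > w_obs                          # step 5
        n_l += w_star < w_obs                          # step 6
    return 2.0 * min(n_l, n_u) / B                     # step 9
```

Since min(n_l, n_u) can never exceed B/2, the returned p-value always lies in [0, 1].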

Design Specifications
This study focused on a completely randomized design containing two groups with moderate sample sizes. The total sample size was set to 40 and then split to form an unbalanced design with sample sizes (15, 25). The test was conducted under heterogeneous group variances, as variance heterogeneity can affect both the Type I error and the power of the analysis (Wilcox, Charlin & Thompson, 1986). Luh and Olejnik (1990) stated that when the population variances differ, the actual statistical power could be less than desired. To examine the effect of variance heterogeneity on the procedure, the group variances were set at a ratio of 1:36. This ratio was chosen as it reflects extreme variance heterogeneity and has been used by a number of researchers for the two-group case (Keselman, Wilcox, Lix, Algina & Fradette, 2007; Othman, Keselman, Padmanabhan, Wilcox & Fradette, 2004; Luh & Guo, 1999).
Unequal group sizes, when paired with unequal group variances, will produce either positive or negative pairings.
A positive pairing occurs when the largest group size is associated with the largest group variance and the smallest group size with the smallest group variance. A negative pairing, on the other hand, refers to the case in which the largest group size is paired with the smallest group variance and the smallest group size with the largest group variance. These conditions were chosen since tests for the equality of central tendency parameters typically produce conservative results under positive pairings and liberal results under negative pairings (Syed Yahaya et al., 2004; Othman et al., 2004; Keselman et al., 2004). According to Cribbie and Keselman (2003), when variance and sample size are directly paired, Type I error estimates can be conservative and power correspondingly deflated; when they are inversely paired, Type I error estimates can be liberal and power correspondingly inflated. Therefore, all the tests were examined under these two types of pairings to appraise their ability to control Type I error and maintain good power.
In terms of distributions, we chose the g = 0, h = 0.225 distribution (Hoaglin, 1985) to represent a symmetric leptokurtic shape and the chi-square distribution with three degrees of freedom (chi-square(3)) to represent a skewed leptokurtic shape. The former distribution has zero skewness and kurtosis equal to 154.84, while the latter has skewness and kurtosis equal to 1.63 and 4.0, respectively. Both distributions have positive kurtosis, indicating a peaked distribution with heavy tails. The normal distribution was used as the basis of comparison.
This study was based on simulated data. The simulation was carried out using the random-number-generating function in SAS, and the simulation program was written in SAS/IML (SAS, 2006). Pseudo-random standard normal variates were generated with the SAS generator RANDGEN through the straightforward call RANDGEN(Y, 'NORMAL'). To generate the chi-square variates with three degrees of freedom, we used RANDGEN(Y, 'CHISQUARE', 3). To generate data from a g-and-h distribution, standard normal variates Z were converted to g-and-h variates via Y = Z exp(hZ^2/2) (the g = 0 case), where the Z values were generated using RANDGEN with the normal distribution option.
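The g-and-h transformation of standard normal variates can be sketched in Python (illustrative only; the study used SAS RANDGEN). For the symmetric case g = 0 used here, the transform reduces to Y = Z exp(hZ^2/2); the general form from Hoaglin (1985) is included for completeness.

```python
import numpy as np

def g_and_h(z, g=0.0, h=0.0):
    """Transform standard normal variates z into g-and-h variates
    (Hoaglin, 1985). g controls skewness, h controls tail weight."""
    z = np.asarray(z, float)
    tail = np.exp(h * z**2 / 2.0)
    if g == 0.0:
        return z * tail                      # symmetric case: Y = Z exp(hZ^2/2)
    return (np.expm1(g * z) / g) * tail      # general g-and-h transform

rng = np.random.default_rng(42)
y = g_and_h(rng.standard_normal(40), g=0.0, h=0.225)  # one simulated group
```

With g = 0 the transform is an odd function of Z, so the resulting distribution remains symmetric while h inflates the tails.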
The effect size, or shift parameter, used in this study is not a single value; its values range from 0.2 to 2.0 in increments of 0.2 units, so for each condition ten power values were obtained. The effect size is computed based on the common language (CL) statistic proposed by McGraw and Wong (1992) and Vargha and Delaney (2000). In this study, 0.80 was used as the standard of adequacy in power analysis. There are no hard and fast rules about how much power is enough, but according to Murphy and Myors (2004), power of 0.80 or above is usually judged to be adequate, and most power analyses specify 0.80 as the desired level; this convention seems to be widely accepted. For each condition examined, 599 bootstrap samples were generated and 5000 data sets were simulated. The nominal level of significance was set at alpha = 0.05.
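The per-condition Monte Carlo evaluation reduces to a simple loop: simulate a data set, obtain a p-value, and record whether it falls below alpha; the rejection proportion over the 5000 data sets estimates the Type I error rate (under the null) or the power (under a shift). A sketch in illustrative Python, where `simulate_p_value` is a hypothetical stand-in for any of the tested procedures:

```python
import numpy as np

def rejection_rate(simulate_p_value, n_sims=5000, alpha=0.05, seed=0):
    """Fraction of simulated data sets whose p-value falls below alpha.
    `simulate_p_value(rng)` draws one data set and returns its p-value."""
    rng = np.random.default_rng(seed)
    hits = sum(simulate_p_value(rng) < alpha for _ in range(n_sims))
    return hits / n_sims

# Under a true null, p-values are (approximately) uniform, so the
# estimated rejection rate should sit near the nominal alpha:
rate = rejection_rate(lambda rng: rng.uniform(), n_sims=5000, alpha=0.05)
```

With 5000 replications the binomial standard error of an estimated rate near 0.05 is about 0.003, which is why rates such as 0.0486 versus 0.0492 should be read as essentially equivalent.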

Results and Discussion
The simulation results of Type I error for the pseudo-median (PM), Welch's test (W) and Mann-Whitney-Wilcoxon (MWW) procedures are presented in Table 1. This study uses Bradley's (1978) liberal criterion of robustness to quantify the ability of a statistical test to control its probability of Type I error. According to this criterion, a test can be considered robust if its empirical rate of Type I error lies within the interval [0.5 alpha, 1.5 alpha]. Thus, with the nominal level set at alpha = 0.05, a procedure or test is considered robust if its Type I error rate lies between 0.025 and 0.075. Type I error rates greater than 0.075 are considered liberal and those less than 0.025 conservative.
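Bradley's liberal criterion is simple to state in code; a one-line illustrative helper:

```python
def bradley_robust(rate, alpha=0.05):
    """Bradley's (1978) liberal criterion: an empirical Type I error rate
    counts as robust if it lies within [0.5 * alpha, 1.5 * alpha]."""
    return 0.5 * alpha <= rate <= 1.5 * alpha
```

For example, at alpha = 0.05 a rate of 0.0486 is judged robust, while a rate of 0.1142 falls outside the [0.025, 0.075] interval and is judged liberal.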
Under the normal distribution, all procedures are able to control their Type I error rates close to the nominal level of 0.05 for positive pairing; the error rates are 0.0486, 0.0492 and 0.0458 for the pseudo-median procedure, the Welch's test and the Mann-Whitney-Wilcoxon, respectively. Under negative pairing, the pseudo-median procedure shows outstanding control of Type I error, with a recorded rate of 0.0492, very close to the nominal value; regardless of pairing, the procedure produces consistent Type I error rates under the normal distribution. In contrast, under negative pairing Welch's test produced a Type I error rate of 0.0514, slightly greater than 0.05 but still very close to the nominal level, while the Mann-Whitney-Wilcoxon produced a rate of 0.1142, beyond Bradley's liberal criterion. Under the g-and-h distribution, both the Welch's test and the pseudo-median procedure produced Type I error rates within Bradley's liberal criterion, good and consistent across both pairings. For positive pairing, the values for the pseudo-median procedure and the Welch's test are 0.0518 and 0.0448, respectively; for negative pairing, the value for the pseudo-median is slightly inflated to 0.0532 while the value for the Welch's test remains around 0.044. Meanwhile, the Mann-Whitney-Wilcoxon has a good Type I error rate (0.0436) for positive pairing but a very liberal one (0.108) for negative pairing.
Under the skewed distribution, the pseudo-median procedure produced Type I error rates within Bradley's liberal criterion. The results follow the usual pattern, with the positive and negative pairings producing the smaller and larger rates, respectively: 0.0476 and 0.055. Welch's test produced Type I error rates considerably greater than 0.05 for both pairings, 0.0654 and 0.0736, although these are still within the robustness criterion. Unfortunately, the Mann-Whitney-Wilcoxon produced very liberal Type I error rates for both pairings, with values of 0.1812 (positive pairing) and 0.2398 (negative pairing).
The last row of Table 1 displays the "Average" values, obtained by averaging the Type I error rates of the two pairings for each procedure and distribution. Underlined average values denote averages within Bradley's liberal criterion. As we can observe, regardless of distribution, the "Average" values for the pseudo-median procedure and the Welch's test are within the robustness criterion, whereas the Mann-Whitney-Wilcoxon shows liberal "Average" values for all distributions.
In the statistical power analysis, we only considered procedures identified as being in control of their Type I error rates, since comparisons of statistical power are only meaningful between procedures capable of controlling Type I error. The results of the power analysis are tabulated in Table 2 and illustrated in Figure 1. Table 2 is divided into two parts (upper and lower) based on the pairings. The first column of Table 2 gives the shift parameter used in the study, and the remaining columns record the power rates for each of the procedures under each type of distribution.
As Table 2 shows, the power rates of all the tests fail to achieve the desired level for both pairings. Comparing the two pairings (upper and lower parts of the table), all the procedures produce greater power rates under positive pairing than under negative pairing. Scrutinizing the results under positive pairing, the analysis reveals that under the normal distribution the power of the pseudo-median procedure is just slightly below that of the Welch's procedure, but much better than that of the Mann-Whitney-Wilcoxon procedure. The power of the pseudo-median procedure improves under the g-and-h distribution but declines again as the skewness of the distribution gets larger, as shown in the second-to-last column. Under negative pairing, even though the power values for the pseudo-median procedure drop slightly relative to the positive pairing, the procedure performs better than the Welch's test under the g-and-h and chi-square distributions. Under this pairing, we did not include the Mann-Whitney-Wilcoxon because of its inability to control Type I error.

Conclusion
The objective of this study was to investigate the performance of the pseudo-median procedure in terms of controlling its Type I error rates and maintaining high power. With respect to robustness, the pseudo-median procedure is capable of controlling its Type I error close to the nominal level when heterogeneity of variances exists. The pseudo-median procedure also outperforms the Welch's test and the Mann-Whitney-Wilcoxon under skewed distributions. The popular Mann-Whitney-Wilcoxon controls its Type I error only for positive pairing under the symmetric distribution and fails to control it under the asymmetric distribution. The study also reveals that the pseudo-median procedure performs better than the other procedures, especially under the influence of negative pairing.
Figure 1. Power curves for all distributions under specific pairing

Table 1. Type I error rates for all procedures under specific pairing

Table 2. Power rates for all procedures under specific pairings