Testing Simultaneous Marginal Homogeneity for Clustered Matched-Pair Multinomial Data

For matched-pair data with a polychotomous outcome, the Stuart-Maxwell test (1955) and the Bhapkar test(1966) are commonly used to test marginal homogeneity. When the outcome is ordinal, the test proposed by Agresti (1983) can be used to test the marginal homogeneity against stochastic order. In practice, we often face the need to consider multiple categorical outcomes simultaneously to insure Type I error protection. In this paper, we propose three statistics to test simultaneous marginal homogeneity for multiple multinomial outcomes in two dependent samples. Furthermore, when the outcome is ordinal, we also propose a transformed version of the three statistics for testing simultaneous marginal homogeneity against stochastic order in two dependent samples. We then prove their asymptotic properties. Finally, Monte Carlo simulations are conducted to evaluate their performance in small samples with respect to empirical size and power.


Introduction
In phase II-III clinical trial of pharmaceutical products, the analysis of adverse event(AE) data is an important aspect of examining the safety of a new drug.The investigator are often interested in looking at if the severity distributions of the AEs patients experienced are different under two treatments or two dosages.The simplest way is to analyze each AE separately and combine individual p−values through various multiple adjustment techniques.However, many AEs are correlated to a huge extent.Without a proper adjustment, type I error will not be adequately controlled.Therefore, analyzing the incidence of related AEs simultaneously is ideal and preferable to treating them separately.Analyzing the severity distribution of multiple AEs simultaneously is challenging, as it leads to dealing with clustered and thus correlated matched-pair multinomial data.
Chuang-Stein and Mohberg (1993) developed the Wald and score-type tests for comparing the incidence of multiple AEs simultaneously in two independent groups.The score-type test used the pooled covariance estimator under two treatments.Agresti and Klingenberg (2005) recommended the score-type test over the Wald test because the empirical size of the score-type test tends to be closer to the nominal level than the Wald test does.Klingenberg and Agresti (2006) developed the Wald and score-type test for comparing the incidence of multiple AEs simultaneously in two dependent groups.Both statistics asymptotically follow a Chi-square distribution.The score-type statistic is preferred because it maintains the nominal level much better than the Wald statistic.They also pointed out that for sample sizes less than 100, neither statistic is well approximated by a χ 2 distribution when the number of AE considered is 2 and 4. Therefore, the score-type statistic using bootstrap method is recommended in small samples.In situations when bootstrap method is too computationally intensive, a permutation test is recommended.But the permutation test is to test the identical joint distribution (IJD), which is a stronger hypothesis than SMH.Thus, the permutation test will lead to inflated type I error rates when testing SMH.Klingenberg et al. (2008) developed the score-type test and a score-free statistic for testing SMH against stochastic order in two independent groups.The permutation test and bootstrap method based on these two tests were proposed.For small sample sizes, the bootstrap method for the two tests is either too conservative or liberal, while the permutation test has the empirical size closer to the nominal level.However, the permutation is only valid for testing SMH when a prior condition (stochastic order) is assumed or two samples are balanced.
In this paper, we focus on the methods for testing SMH in matched-pair multinomial data.In section 2, we define the hypothesis of SMH for matched-pair multinomial data and develop the statistics to test SMH.In section 3, we focus on a special case of multinomial data, which is ordered.For ordinal data, an alternative hypothesis, stochastic order, is defined.Then we present the statistics to test SMH against stochastic order.In section 4, Monte Carlo simulations are performed to evaluate the performance of the statistics introduced in section 2 and 3. Section 5 discusses the drawbacks of the methods proposed and possible extensions.The paper is entirely framed in terms of a safety analysis comparing the marginal proportions of each AE severity categories under two treatments or dosages.However, our methods can be applied to any paired or repeated multinomial response.

Simultaneous Marginal Homogeneity
T be a K × 1 vector of multivariate responses for subject i at dose j = 1, 2, where K is the number of multinomial responses (AE severity), Y i jk is the multinomial response with used for all k which denotes the 4 severity levels of AE, i.e. none, mild, moderate and severe.If subject i experienced AE k with severity c k at dose j, then Y i jk = c k .For each subject, let ′ , where π jk (c k ) denotes the probability Pr( The null hypothesis of SMH is defined as ) T denote the marginal proportions of each AE at dose j, where j = 1, 2 and π jk (c k ) denotes the sample proportion of subjects with severity ) T denote the difference of the marginal sample proportions at two dosages.
Under the assumption of multinomial distribution, the covariance matrix V of d has elements: where π k (c k , c k ) is the probability of experiencing AE k of severity c k at both dosages, where π 12k (c ′ k , c k ) is the joint probability of experiencing AE k of severity c ′ k at dose 1 and AE k of severity c k at dose 2, and where Let V denote the sample version of V.Then, a Wald statistic to test SMH is and based on the Central Limit Theorem, W has an asymptotic null χ 2 distribution with By replacing π 1k (c k ) and π 2k (c k ) by the pooled estimator, π0k (c k ) = (π 1k (c k )+ π2k (c k ))/2, we have Vs as the pooled estimator of V, which has elements as: For the Wald and score-type statistics, it is assumed that (Y 1 , Y 2 , • • • , Y n ) are n independently and identically distributed random variables from a multinomial distribution.However, this assumption may not be feasible in practice.Therefore, a non-parametric covariance estimator of d is considered in this section.The non-parametric covariance estimator Vnp of d is given as follows: where Next, we will show (5) and ( 6) are consistent estimators of Var( dk (c k )) and Cov( dk (c k ), dk ′ (c k ′ )), respectively.First, we will show are independently and identically distributed with mean µ and variance σ 2 .It is wellknown that sample variance is a consistent estimator of σ 2 .Furthermore, we have Hence, and Similarly, it can be shown that ( 6) is a consistent estimator of Cov( dk (c k ), dk ′ (c k ′ )).Since (5) and ( 6) are both consistent, Vnp is a consistent estimator of variance of d.
From the Central Limit Theorem, we have a non-parametric Wald test statistic W np = d T  Vnp d, which asymptotically follows a χ 2 distribution with d f = K(C − 1) when n −→ ∞.

Stochastic Order
All the statistics introduced in section 2.2 treat the outcome as nominal.They can be used to test marginal homogeneity against any alternatives.When the outcome is ordinal, however, they ignore the ordinal nature of the outcome.Quite naturally, the tests that utilize the ordinal information will be more powerful.For instance, for ordinal outcome, one is usually interested if the classifications based on one variable are higher than those based on the other variable.For a I × I square table, let Y 1 denote the observation from the row marginal distribution {π i+ } and Y 2 denote the observation from the column marginal distribution {π + j }.Y 1 is stochastically higher than Y 2 (Agresti, 2010) if the cumulative density function of Y 1 is uniformly below the cumulative density function of Y 2 , i.e.
This means that Y 1 is more likely to have larger values than Y 2 .

Multivariate Tests of SMH against Stochastic Order
For clustered matched-pair AE severity data, it may be of interest to test whether one margin is stochastically higher than the other for each AE.Motivated by the statistic proposed by Agresti (1983), a statistic for testing SMH against stochastic ordering is formed by comparing the marginal mean scores under two treatments.
, where π jk (c k ) denotes the sample proportion of subjects with severity denote the difference of the marginal sample proportions at two dosages.The difference of the marginal mean scores at two dosages is formed by S = Ad, where Under the assumption of multinomial distribution, the covariance matrix V of d is given in (1), ( 2) and (3).From S = Ad, we have the covariance matrix of S as Σ = AV A T .Let Σ denote the sample version of Σ.A multivariate Wald test is constructed by Based on the Central Limit Theorem, W o asymptotically follows a Chi-square distribution with d f = K when n −→ ∞, where K is the number of AEs considered simultaneously.When K = 1, it reduces to the statistic proposed by Agresti (1983).
By replacing π 1k (c k ) and π 2k (c k ) with the pooled estimator π0k (c k ) = (π 1k (c k ) + π2k (c k ))/2, we have Vs as the pooled estimator of V, which is given in (4).Then we have a score-type statistic W os = S T Σ−1 s S , which also asmptotically follows the χ 2 distribution with d f = K when n −→ ∞, where Σs = A Vs A T .
The above two statistics assume that (Y 1 , Y 2 , • • • , Y n ) are n independently and identically distributed random variables from a multinomial distribution.Similar to the non-parametric statistic in section 2.2, a non-parametric covariance estimator of d can be considered, which is as given in ( 5) and ( 6).Then we have a non-parametric Wald test statistic W onp = S T Σ−1 np S which also asymptotically follows a χ 2 distribution with d f = K when n −→ ∞, where Σnp = A Vnp A T .

Asymptotics of W, W s and W np
In this section, the empirical size of W, W s and W np are compared for sample sizes n = 25; 50; 100; 200; 300; 500 and for the number of AEs K = 2; 3; 4, under the nominal level = 0.05.And we use C = 4 categories for each AE, To simulate a dataset under the null hypothesis (SMH), a 4 2K × 1 random vector of (a length of 4 2K is used because two treatments are considered) multinomial probabilities is generated.Then the iterative proportional fitting procedure (Deming and Stephan, 1940) is performed to adjust the vector to make each one of K pairs of marginal probabilities under two treatments equivalent to a specified 4 × 1 vector of probabilites.

Note:
The bold text indicates that the empirical size falls outside the 95% confidence interval (0.044, 0.056) of the nominal level 0.05 .
When the simulated data set was very sparse and the test failed to work, the data set was eliminated from the summary analysis.
Table 1 shows the empirical size of W, W s and W np at a nominal type I error rate of 0.05.The W and W np are too liberal and the empirical size improves as the sample size increases.On the contrary, the W s is too conservative and the empirical size improves as the sample size increases.When K = 2, the W s maintain the nominal level well when n ≥ 100 and a larger sample size is required for W and W np to control the empirical size at the nominal level.When K = 3, 4, none of the W, W s and W np seem to perform reasonably if n ≤ 200.A minimum sample size of 300 is required for W s when number of outcomes is greater than 2.

Asymptotics of W o , W os and W onp
Table 2 shows the empirical size of W o , W os and W onp with scores (1,2,3,4).It shows that W os maintains the nominal level much better than W o and W onp and is therefore recommended to use.When K = 2, W os can be used even at n = 25.When K = 3, 4, W os can be used when sample size is at least 50.The bold text indicates that the empirical size falls outside the 95% confidence interval (0.044, 0.056) of the nominal level 0.05.

Power Comparison of W s and W os in Testing SMH against Stochastic Order
For the case of one adverse event, Agresti (1983) showed that the test using ordinal scales outperforms the test ignoring its ordinal nature when they are used to test SMH against stochastic ordering.In this section, a simulation is performed to verify if the test using ordinal scales is also more powerful to test SMH against stochastic ordering for the multiple AEs case.
Based on the simulation results in section 4.1 and 4.2, it appears that the W s and W os performs the best.Thus, the W s and W os are contrasted in this section.To ensure the nominal level can be well controlled for W s and W os , the sample size of 100 and 200 are utilized in the simulation.
Their power performance is investigated by extending the simulation design introduced by Agresti (1983).We randomly sample from an underlying multivariate normal distribution having mean 0 and within-AE correlation ρ 1 = 0.6 and between-AE correlation ρ 2 = 0.2.A half of the dimensions of the multivarite random vector are divided as the severity levels c = 1, 2, 3, 4 of all AEs under treatment 1 and the other half are divided as the severity levels of all AEs under treatment 2. The boundries for AE categories under treatment 1 are set as −0.6, 0 and 0.6.The boundries for AE categories under treatment 2 are obtained by placing a shift ∆ = 0.2 relative to the boundaries for treatment 1. Hence, the boundaries of the AE categories under treatment 2 are −0.4,0.2 and 0.8.The division produces the marginal probabilities of AE categories under treatment 1 to be (0.2743, 0.2257, 0.2257, 0.2743), and the marginal probabilities of AE categories under treatment 2 to be (0.3446, 0.2347, 0.2089, 0.2119).Our simulation includes the settings representing the combinations of sample size n = 100, 200 and number of AEs K = 2, 3, 4.
Table 3 shows the power of the W s and W os under the nominal levels of 0.05.From table 3, we have the following observations: 1.The power of the W os is consistently greater than that of W s for all the combinations of K and n we considered.
2. The ratio (1−power of the W s )/(1−power of the W os ) is consistently greater at n = 200 than n = 100 for all K.
3. As K increases, the ratio (1−power of the W s )/(1−power of the W os ) consistently increases for all n.
Therefore, the W os outperforms the W s with respect to the power when they are used to test SMH against stochastic order.Furthermore, as the sample size or number of AEs increases, the advantage of the W os compared to the W s increases.

Example
The data used in this article was part of a longitudinal study of criminal career patterns of former California youth authority wards between 1965and 1984(Haapanen, 1990)).This study investigated the patterns of criminal behavior that occurred over 10 to 15 years for 1308 subjects whose early criminal involvement was serious enough to result in commitment to California Youth Authority institutions.The outcomes selected from this study are yearly arrest rates for 12 types of offenses, including murder, rape, felony assault, misdemeanor assault, armed robbery, strong-arm robbery, other personal offenses(extortion, kidnapping), burglary, receiving stolen property, grand theft, forgery and grand theft auto, over four four-year age blocks (18-21, 22-25, 26-29, 30-33).The 12 types of offenses are categorized according to the definition used in the earlier study (Haapanen, 1990), which are as follows: Violent-aggressive: murder, rape, felony assault and misdemeanor assault.
Property: burglary, receiving stolen property, grand theft, forgery and grand theft auto.Our main interest is to test if the yearly arrest rates for the three categories decline over time as indicated in Haapanen (1900).To apply the method we proposed, the yearly arrest rates for the 12 types of offenses are all converted to a scale of 1 to 4 (represents the offense rate as none, mild, moderate and severe), using the range: 0, (0, 1], (1,2] and (2, ∞).Scores (0.1, 0.5, 1.5, 2.5) are assigned to the outcome categories which are approximately midpoints of the above ranges.Table 4 shows P-values from our method and Hotelling's Paired T 2 test.Note that the Hotelling's Paired T 2 test is applied to continuous variables before categorization.
Then we have a score-type test statistic W s = d T V−1 s d, which also has an asymptotic null χ 2 distribution with d f = K(C−1) when n −→ ∞.In the binary case (C = 2), the W and W s reduces to the multivariate McNemar's tests (Klingenberg and Agresti, 2006).
the subject i's AE severity profile.Assume that we have n subjects in the study, where (Y 1 , Y 2 , • • • , Y n ) are n independently and identically distributed random variables from a multinomial distribution with probability 2 Multivariate Tests of SMH Motivated by the statistic proposed by Agresti and Klingenberg (2005) and Klingenberg and Agresti (2006), a statistic to test SMH is constructed by comparing the marginal proportions of each AE at two dosages.Let π j

Table 1 .
Empirical size of the W, W s and W np in 5000 simulated data sets under the nominal level 0.05

Table 2 .
Empirical size of the W o , W os and W onp with scores (1, 2, 3, 4) in 5000 simulated data sets under the nominal level 0.05

Table 3 .
Empirical power of the W os and W s in 5000 simulated data sets under the nominal level of 0.05