On Multiple Hypothesis Testing Maximizing the Average Power

Introduction
There is continuing interest in multiple hypothesis testing procedures, resulting from the desire to maximize efficiency in the simultaneous testing of large numbers of data sets such as microarray data and, more recently, RNA-Seq data (Wang, Gerstein, & Snyder, 2009), which after extensive pre-processing (Givan, Bottoms, & Spollen, 2012) give simultaneously the expression levels of a very large number of genes from RNA samples. Such experiments are capable of extracting information about how much each gene is involved (i.e. transcribed and potentially translated into a functional protein) in different organisms and tissues and under different experimental conditions, including for example many applications in crop and human disease. Genes of interest for further study show unusual expression patterns indicating a possible relevance to the study. Probability models for these patterns underlie the statistical analysis used in the search for such genes. Much recent theoretical work involves correction for effects not explicitly modelled that cause correlation among the data for individual tests (Leek & Storey, 2007; Leek & Storey, 2008; Lunceford et al., 2011; Chakraborty, Datta, & Datta, 2012), while the simpler problem of handling independent tests (Storey, 2007; Hwang & Liu, 2010) does not seem to have been fully explored in its practical implementation when one or both hypotheses have unknown (hyper)parameters (Nixon, 2012). This may be because the perceived need to deal with dependence makes such a study seem almost irrelevant. However, it has been shown theoretically that data sets after the extraction of the surrogate variables are independent (Leek & Storey, 2008), undermining this perception and in fact demonstrating that a thorough analysis of the simpler case, where the separate data sets are independent, is of fundamental importance.
In this paper multiple testing procedures are analysed assuming no statistical dependence amongst the data for the individual tests, in order to develop optimal procedures of this type. It is also assumed that the data set is very large and can be modelled as continuously infinite, so the distributions of test statistics and p-values will be assumed to be continuous, i.e. to have probability densities, and all sums are represented as integrals; for example the data set D is defined by the number m of data points and the density t(x) of points in x space, the data space for a single data set. This makes the notation relatively simple. The view is also taken that a multiple hypothesis testing problem can be regarded as a model to be fitted to the entire data set, and that multiple hypothesis testing procedures (MTP's) are model fitting procedures (MFP's) applied to a model that describes multiple hypothesis testing (MHT). Such procedures will be called solutions of the model.
In Section 2 the probability models to be used are introduced, and in Section 3 hypothesis testing as implied by these models, i.e. in the "multiple" scenario, is developed. Section 4 discusses multiple hypothesis testing from the point of view of decision theory and the assumption that the number of tests (m) is large, introduces some notation, demonstrates the equivalence of different optimisation criteria, and shows that there is always a trade-off between certain expected error rates. Section 5 explains the relationship between Storey's (2007) ODP in general and tests maximizing the average power (MAP tests). In Section 6 arguments are given establishing multiple testing procedures appropriate for different probability models. The procedures themselves are described step by step in Section 7, and the important special case when only information about the null hypothesis is available leads to the surprising conclusion in Section 8 that the density of p-values from these tests is the most efficient final test statistic. The remainder of the paper discusses the implementation of this and the results, showing that the density of p-values can generate a very efficient testing procedure when the alternative model is only known empirically.

Probability Models and Data Structures
As in all the general models in this paper, the symbols denoting variables will be assumed to have as many dimensions as needed in any specific case, for example the variable x will represent the collection of all measured variables including any covariates.
Apart from simply adding more parameters to a model for a data set to make it potentially fit better, there is a basic way in which complex models can be built from simpler ones; it applies to complex data sets whose structure is essentially a multiple replication of the structure of a simpler data set. If the simpler data set has been modelled with several parameters, then the entire data set should be modelled by the joint distribution of these parameters (whether they be continuous or discrete). If this model itself involves parameters, they are sometimes known as hyperparameters, and the hyperparameters describe the entire data set. It is implicit here that the separate data sets are independently generated from this model according to the joint distribution of the parameters. If this is not so, modelling on another level is involved that is beyond the scope of this paper.
The simplest example of these ideas is as follows. Suppose there are two probability models for some phenomenon, A and B, each having parameters to be fitted. A model C could specify that there is a probability π0 that model A holds and a probability 1 − π0 that model B holds, where π0 is the extra parameter to be fitted in model C that describes the distribution of the binary variable in the entire data set. Model C would be appropriate for a data set that consisted of many instances of data to which either model A or model B could be applied. In this case, models A and B would be said to be submodels of the composite model C. An MFP for model C should include an estimate of the partition of the data set into two parts, each of which fits one of the submodels A and B, i.e. π0 as well as the parameters in A and B. The two-groups model is specified by the density t(x) = π0 f(x) + (1 − π0)g(x), which defines the term "multiple hypothesis testing" as used in this paper. The rest of this paper will be concerned only with the analysis of models of this sort. The following forms of this model will be considered: 1) The densities f(x) and g(x) are uniquely defined.
2) The densities f(x) and/or g(x) have known functional forms with unknown parameters, or are unspecified (though obviously not both f and g). Here f and g do not depend on the test instance i.
3) The densities f(·) and g(·) are dependent on the test being carried out, and are then uniquely defined, so they are denoted by f_i(x) and g_i(x) respectively for 1 ≤ i ≤ m.
For each of these cases the assignment of the data to the two hypotheses may be (A) known or (B) unknown.
In case A, an analysis can only determine the order in which the assignment of the data to the hypotheses may be in error, i.e. which data points x seem most unlikely to have come from the model they actually came from, and order the data most unlikely first. In the practical case B, not only is the partition of the data between the hypotheses estimated, but the parameter π0, which is the proportion of null hypotheses (those that come from f(x) or f_i(x)) in the data, is an unknown that must be estimated from the data. For the models where the assignment to the hypotheses is known, the numbers of null hypotheses (mπ0) and alternative hypotheses (m − mπ0) are also known. Clearly not all of these models may be useful practically, particularly the models where the assignment of the data to the hypotheses is known, but they are instructive when developing statistical procedures because the relevant methods used for deciding whether the models fit the data, and if so estimating their free parameters, are closely related.
The models of type (3) reduce to models of type (1) when all the densities are the same, and models of type (2) may be considered to reduce to models of type (1) when the number of free parameters is zero.Further, models of type B reduce to models of type A when the assignment of the data to the hypotheses is specified.

Single Hypothesis Testing
In the simplest hypothesis testing situation there are two normalised densities f(x) and g(x) representing two hypotheses ("simple" hypotheses are represented by fixed probability density functions that have no parameters, as is the case here), the null hypothesis H0 and the alternative hypothesis H1 respectively, and an observation x. The object is to determine which model is most appropriate in the light of the data x, on the assumption that one or other of these models is the origin of the observation. Analysis shows that there is always a trade-off between the two types of errors: type I, choosing g(·) when f(·) is correct, and type II, choosing f(·) when g(·) is correct. Minimizing the probability of errors of type II for a given value of the probability of an error of type I (or vice versa) occurs when the decision is made based on the ratio of the probability densities, as the Neyman Pearson (NP) lemma states (Dudewicz & Mishra, 1988). The critical value of the ratio is determined by which point on the trade-off curve is chosen. In the case when one or both hypotheses have parameters to be fitted (composite hypotheses), this can be done by maximum likelihood before invoking the NP lemma.
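The NP trade-off can be illustrated numerically. The sketch below is only an illustration, not part of the paper's procedures: the densities N(0,1) and N(2,1) and the level α = 0.05 are arbitrary choices. It thresholds the ratio f(x)/g(x), with small values most significant, and estimates the type II error rate β for the fixed type I error rate α.

```python
import numpy as np
from scipy.stats import norm

# Illustrative sketch: simple hypotheses H0: N(0,1) vs H1: N(2,1).
# Following the NP lemma, reject H0 when f(x)/g(x) falls below a
# critical value chosen to give the desired type I error rate.
rng = np.random.default_rng(0)

def ratio(x):
    # f(x)/g(x): small values are the most significant
    return norm.pdf(x, 0, 1) / norm.pdf(x, 2, 1)

alpha = 0.05
null_sample = rng.normal(0, 1, 100_000)
crit = np.quantile(ratio(null_sample), alpha)   # type I error rate = alpha

alt_sample = rng.normal(2, 1, 100_000)
beta = np.mean(ratio(alt_sample) > crit)        # type II error rate
print(round(beta, 3))
```

Moving the critical value along its range traces out the trade-off curve between the two error probabilities.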

Multiple Hypothesis Testing
Now suppose that m such observations x_i have been made. This situation is quite different: the multiple testing scenario, where many statistical tests have to be carried out, one for each data point. Here there are models on two levels: instances of the simple model describing single data points x_i, and another model describing the distribution of parameter values in this simple model (which could specify fixed values or be another distribution with known or unknown parameter values). The two-groups model with simple hypotheses has two submodels that can be represented by an overall model for the data distribution t(x) of the form

t(x) = π0 f(x) + π1 g(x),    (1)

where π0 is the probability of the null hypothesis, i.e. that x is drawn from the distribution f(x), and π1 = 1 − π0 is the probability of the alternative hypothesis, which is given by the distribution g(x). In the general case to be considered, with composite hypotheses, the functions f and g will depend on parameters θ and φ (which can be multidimensional) to be estimated from the data, i.e. the overall model is

t(x) = π0 f(x, θ) + π1 g(x, φ).    (2)

Therefore the total number of points is #{D} = m, with the density normalised so that ∫ dx t(x) = 1.
If w(x) is the test statistic to be derived (the smaller w(x) is, the more significant the data x is), the condition that the p-values are uniformly distributed for the set of null hypotheses is consistent with defining the p-values by

p(x) = ∫_{w(x′) ≤ w(x)} dx′ f(x′, θ).    (4)

To see this note that Equation (4) implies that p(x_1) = p(x_2) for any two points x_1 and x_2 such that w(x_1) = w(x_2), and if w(x) increases so does p(x). Therefore a relationship of the form p(x) = h(w(x)) holds where h(·) is a monotone increasing function. Thus the condition determining the range of integration, w(x′) ≤ w(x), is equivalent to h(w(x′)) ≤ h(w(x)) and to p(x′) ≤ p(x). So Equation (4) can be written as

p(x) = ∫_{p(x′) ≤ p(x)} dx′ f(x′, θ).    (5)

By the definition of H0, P_{H0}(s < p(x) ≤ s + ds) = ∫_{s < p(x′) ≤ s+ds} dx′ f(x′, θ), and by (5) this reduces to ds. Therefore the distribution of p(x) under the null hypothesis H0 is uniform, as required.
The p-value to be associated with a single data point x will in general depend on the entire data set via π0 and θ.
For simplicity of notation this has been left implicit on the left hand sides of Equations (4) and (5), and this will be done throughout the paper for p-values, q-values and test statistics.
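As a numeric check on the definition (4), one can simulate null data and confirm that the resulting p-values are uniform. The following sketch is illustrative only: the null N(0,1) and the statistic w(x) = −x are arbitrary choices for which (4) has a closed form.

```python
import numpy as np
from scipy.stats import norm

# Illustrative sketch: H0 density f = N(0,1) and test statistic
# w(x) = -x, so smaller w (larger x) is more significant.
# Equation (4) then gives p(x) = P_H0(-X' <= -x) = 1 - Phi(x).
rng = np.random.default_rng(1)
x_null = rng.normal(0, 1, 50_000)
p = 1 - norm.cdf(x_null)   # p-values for null data

# under H0 the p-values should be uniform on [0, 1]
print(round(p.mean(), 2), round(np.mean(p < 0.5), 2))
```

The sample mean and the fraction below 0.5 are both close to 0.5, as the uniformity argument requires.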
The Decision Theoretical View of Multiple Testing Procedures (MTP's) and the Equivalence of Optimization Criteria

The results of this section assume that the hypotheses are simple, or that the parameter values (θ and φ) can be considered fixed, while considering the relationship between the error frequencies n01 and n10 defined below. This is related to the optimal multiple testing procedure, which first fixes values of both these parameters (see Section 7). In terms of the cut-off x for the statistic w(·), the expected error frequencies are

n01 = π0 ∫_{w(x′) ≤ w(x)} dx′ f(x′, θ),    n10 = π1 ∫_{w(x′) > w(x)} dx′ g(x′, φ).

These could be estimated by simulation, for example by simulating m data points in each of d data sets, where d is a suitably large number, using w(x) as the cut-off for the statistic w(·). In the situation in which m is very large, the value of d will make practically no difference to the probabilities. In each simulated data set each point is null with probability π0 (then an instance of the distribution with density f(·) is simulated), and alternative with probability π1 = 1 − π0 (then an instance of the distribution with density g(·) is simulated). Therefore as x takes on each possible value in turn, the points (n01, n10) trace out a curve in Figure 1 determined by the test statistic w(·), which characterises the MTP.
The performance of any MTP can be compared with the following trivial procedure that does not depend on the data: accept H0 (the null hypothesis f(·)) with probability λ and reject it with probability 1 − λ. Then n10 = π1λ and n01 = π0(1 − λ). Elimination of λ gives n10π0 + n01π1 = π0π1, which is the line joining (0, π1) and (π0, 0) shown in Figure 1. Therefore any MTP should have a risk point (n01, n10) that is not above this line, otherwise each of the errors n01 and n10 can be reduced keeping the other fixed by using the trivial MTP with an appropriate value of λ. Further, for any two MTP's a and b with different risk points below this line, it is possible to define an MTP (MTPλ) as MTPa with probability λ and MTPb with probability 1 − λ, and the risk point P for MTPλ is the fraction 1 − λ along the line from MTPa to MTPb. Therefore the optimal MTP with the same value of n01 as P must have n10 less than or equal to the n10 for P. This argument shows that optimal MTP's must fall on a convex curve (i.e. with second derivative d²n10/dn01² ≥ 0) in (n01, n10) space, representing a trade-off between the two error rates.
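A quick numeric check (with an arbitrary illustrative value of π0) confirms that the risk points of the trivial randomized procedure all lie on the line n10π0 + n01π1 = π0π1 joining (0, π1) and (π0, 0):

```python
import numpy as np

# Trivial procedure: accept H0 with probability lam regardless of the
# data, giving n10 = pi1*lam and n01 = pi0*(1 - lam).
pi0 = 0.8          # assumed illustrative value
pi1 = 1 - pi0
for lam in np.linspace(0, 1, 11):
    n10 = pi1 * lam
    n01 = pi0 * (1 - lam)
    # eliminating lam gives the line n10*pi0 + n01*pi1 = pi0*pi1
    assert abs(n10 * pi0 + n01 * pi1 - pi0 * pi1) < 1e-12
print("risk points lie on the line joining (0, pi1) and (pi0, 0)")
```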
If n01 = 0, i.e. no null hypothesis instance in the data is rejected, then because the statistical test depends on a probability being less than a threshold, the only way this can be certain is if for every test the null hypothesis is not rejected. This implies that n11 = 0, so n10 = π1. Likewise if n10 = 0, for every alternative hypothesis instance the null hypothesis is rejected; then for all tests H0 is rejected, i.e. n00 = 0 and n01 = π0, so the points (0, π1) and (π0, 0) are always on the optimal curve. Two error rates are α = n01/π0, the fraction of null hypothesis instances in the data that are rejected in error (the type I error rate), and β = n10/π1, the fraction of alternative hypothesis instances in the data that are not rejected (the type II error rate). These two error rates play a role analogous to the case of a single statistical test. Two further error rates are the 'false discovery rate' (FDR) (Storey & Tibshirani, 2003), given by

q = n01/(n01 + n11),    (7)

which is the fraction of rejections of the null hypothesis that are in error, and the 'missed discovery rate' (MDR) (Storey, 2007), the fraction of acceptances of the null hypothesis that are in error, given by

s = n10/(n10 + n00).    (8)

The error rates q and s are introduced here because criteria involving them have been suggested for defining optimization of MTP's. Storey (2007) used the criterion that MTP's should be optimized such that the expected number of true positives (ETP = n11m) is maximized for each fixed expected number of false positives (EFP = n01m). This is referred to as maximizing the average power (MAP), and is the same as maximizing n11 for fixed n01, i.e. minimizing n10 for fixed n01, i.e. minimizing β for fixed α. The second criterion suggested (the MDR criterion) was that the MDR must be minimized for each fixed value of FDR, i.e. s must be minimized for fixed q. A third criterion (criterion II) was suggested (Hwang & Liu, 2010) in which EFN/(EFN + ETP), i.e. n10/π1, is minimized for fixed FDR, i.e. q, where EFN is the expected number of false negatives = n10m. Finally, the criterion used in this paper for deriving the optimal MTP is to minimize q for fixed n1. Because n10 = n01 + π1 − n1, this criterion is also easily illustrated in Figure 1, as are the others. In fact it turns out that all four criteria are equivalent, i.e. equivalent to the MAP criterion. Lines of constant q and s can be considered in the space of points (n01, n10); they can be expressed by rearranging (7) and (8) respectively to be solved for n10:

n10 = n01(1 − 1/q) + π1,    (9)

n10 = (n01 − π0)s/(s − 1).    (10)

Using (9) to eliminate n10 in (8) gives s = [n01(1 − 1/q) + π1]/(1 − n01/q), from which ds/dn01|_q = (q − π0)/[q(1 − n01/q)²] follows. Then ds/dn01|_q = 0 ⇔ q = π0 ⇔ n10 = n01(1 − 1/π0) + π1, which is the line passing through (0, π1) and (π0, 0), and by examining the sign of z = ds/dn01|_q near (0, 0), where q → 0, it follows that z < 0 for all points below that line (which is the only region of interest). From (9) it follows that lines of constant q all pass through (0, π1) and have slope 1 − 1/q < 0, and from (10) the lines of constant s all pass through (π0, 0) and have slope s/(s − 1), so for fixed q, by (9), minimizing n10 is equivalent to maximizing n01, which is equivalent to minimizing s. Therefore criterion II is equivalent to the MDR criterion, and the MDR criterion for optimal MTP's is that for fixed q, the point with the largest n01, i.e. with the smallest n10, must be chosen.
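The algebra behind the lines of constant q and s can be verified numerically. The sketch below (with arbitrary illustrative values of π0, n01 and n10) checks that Equations (9) and (10) reproduce n10 from q and s:

```python
# Illustrative check of Equations (9) and (10), using the definitions
# q = n01/(n01 + n11) and s = n10/(n10 + n00), with n11 = pi1 - n10
# and n00 = pi0 - n01.  The numeric values are arbitrary but consistent.
pi0, pi1 = 0.9, 0.1
n01, n10 = 0.05, 0.03

q = n01 / (n01 + (pi1 - n10))            # FDR
s = n10 / (n10 + (pi0 - n01))            # MDR

# Equation (9): line of constant q through (0, pi1) with slope 1 - 1/q
assert abs(n10 - (n01 * (1 - 1 / q) + pi1)) < 1e-12
# Equation (10): line of constant s through (pi0, 0) with slope s/(s-1)
assert abs(n10 - (n01 - pi0) * s / (s - 1)) < 1e-12
print(round(q, 3), round(s, 3))
```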
The convexity property of the optimal curve in Figure 1 allows inequalities to be derived as follows. Using dq/ds and dn10/dn01 to denote the derivatives along the theoretical optimal curve representing the optimal MTP in Figure 1, and the subscript P to denote an arbitrary point on this curve, convexity gives

1 − 1/q_P ≤ (dn10/dn01)_P ≤ 0.    (13)

This shows that the optimal curve (which must pass through (0, π1)) always has a slope of smaller magnitude at a point than the line of constant q that intersects it at that point. Therefore the Storey (2007) criterion is equivalent to the MDR criterion for optimal MTP's. These criteria are also equivalent to the criterion that q must be minimized for all tests with the same value of n1, because the slope of the lines of constant n1 is 1, so these lines intersect the optimal curve just once. Likewise

(dn10/dn01)_P ≤ s_P/(s_P − 1),    (15)

which shows that the optimal curve always has a slope of higher magnitude (it is < 0) at a point than the line of constant s that intersects it at that point. In order to relate the inequalities (13) and (15) to the possible range of values of dq/ds on the optimal curve, the relationship between these derivatives is needed, which is as follows:

dq/ds = (∂q/∂n01 + (∂q/∂n10)(dn10/dn01)) / (∂s/∂n01 + (∂s/∂n10)(dn10/dn01)).    (16)

After inserting the partial derivatives of q and s with respect to n01 and n10 and using the definitions in Table 1, this becomes a ratio of the form (A + B dn10/dn01)/(C + D dn10/dn01) (17), and the upper bound (15) for dn10/dn01 coincides with the singular point −C/D = −n10/(π0 − n01). Therefore the RHS of (17) is decreasing with dn10/dn01 in its allowed range, and this range is mapped to dq/ds ≤ 0, which shows that there is in general a trade-off between FDR and MDR, as indicated in Figure 1. There is also a trade-off between α and β, as indicated in Equation (15). The hypothetical tests mentioned by Hwang and Liu (2010) illustrate the general principle that the most powerful statistical procedures result from correctly modelling the data. These MAP tests involve hypotheses that are fixed, because they involve fixed probability distributions of the parameters appropriate to each test, and can therefore be expressed as integrals. The next section describes extensions of these arguments to models involving undetermined parameters other than π0 (sometimes referred to as hyperparameters if they refer to the whole model rather than parameters that could be estimated from the individual data sets denoted by x here). Established techniques for solving this kind of model involve estimation of the parameters by maximum likelihood (Storey, 2007), or the use of iterative techniques (Nixon, 2012).

Arguments Establishing MTP's for Different Models
Storey (2007) made a fundamental advance in the theory of multiple hypothesis testing by showing that the optimal solution for models of types 3A and 3B (Section 2) is essentially unique. These results are generalisations of the Neyman Pearson (NP) lemma, and their proofs rely essentially on this lemma.
Specifically, he showed for model 3A that to use the MAP criterion, i.e. in order to maximize n11 for fixed n01, the test statistic can be chosen as

Σ_{i=m0+1}^{m} g_i(x) / Σ_{i=1}^{m0} f_i(x),

where without loss of generality the null hypotheses are data points 1 to m0, and the alternative hypotheses are data points m0 + 1 to m. Storey (2007) also showed, by a very similar argument in the case that the assignment of the hypotheses to the data is not known, i.e. the model is of type 3B, that under the same optimization the test statistic can be chosen as

Σ_{i=1}^{m} g_i(x) / Σ_{i=1}^{m} f_i(x).

The word "can" is used instead of "must" because any other statistic giving the same order of the test statistics, and hence the same order of the p-values, can be used, where the p-values are calculated according to Equation (4). (Note that the statistic w(x) is smaller the more significant it is, which is the reverse of the convention used by Storey (2007) in his test statistics. Therefore the reciprocal of his statistics may be taken if the values are to be interpreted in the reverse order, i.e. smallest values most significant.) These statistics are referred to as the ODP's for their respective models. For models of types 1A and 1B, Storey's (2007) test statistic becomes mg(x)/(mf(x)), which is equivalent to g(x)/f(x). These results are essentially the same as the NP lemma. Proceeding systematically to develop MTP's, note that for model 1A the two subsets of the data corresponding to each hypothesis can be tested against their corresponding submodels to find out how good the fits are. If they are satisfactory, the test statistic and p-values can be calculated for each data point, giving the order of significance of the data. This only serves to show which data points that are actually null hypotheses look most likely to have been alternatives and vice versa, and to order the data accordingly. This is obviously of limited practical value.
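A minimal sketch of the type-3B ODP statistic follows. It is illustrative only: the null densities are all taken as N(0,1) and the alternative means are assumed values, not anything from the paper.

```python
import numpy as np
from scipy.stats import norm

# Sketch of the ODP statistic sum_i g_i(x) / sum_i f_i(x) for a
# type-3B model.  Illustrative assumptions: f_i = N(0,1) for all i
# and g_i = N(mu_i, 1) with assumed means mu_i.
mu = np.array([1.0, 2.0, 3.0])

def odp(x):
    num = norm.pdf(x, mu, 1).sum()        # sum_i g_i(x)
    den = len(mu) * norm.pdf(x, 0, 1)     # sum_i f_i(x)
    return num / den

# Large values are significant in Storey's convention; the statistic
# w(x) used in this paper is ordered the opposite way, so its
# reciprocal could be taken instead.
print(round(odp(2.0), 3), round(odp(0.0), 3))
```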
For model 1B, the statistic g(x)/f(x) and p-values could be calculated for each data point x. After a p-value cut-off λ has been selected, the data can be partitioned into two parts, leading to an estimate of π0 and two goodness-of-fit (GOF) tests, one for each of the probability densities. If a value of λ can be found that satisfies both these GOF tests, then that estimate of π0 can be selected. This procedure does have the consequence that the p-value distributions of the data subsets from which f(·) and g(·) are estimated each have a sharp cut-off, and the result depends on the unknown parameter λ. Therefore in this case, with the assignment of the data to the hypotheses unknown, it seems more appropriate simply to compare the overall density π0 f(x) + (1 − π0)g(x), with π0 replaced by its estimate, with the observed distribution of the entire data set D, to determine the best estimate of π0 and how well this model fits.
If the GOF test shows a fit for some value of the estimate of π0 then that estimate can be taken, but if there is more than one plausible region for it, that would probably indicate that the model is not satisfactory. If the fit is satisfactory, the obvious question is to estimate which data D0 come from f(x) and which data D1 come from g(x), where D = D0 ∪ D1. The obvious estimate of D0 is the first π0 fraction of the m data points taken in decreasing order of p-value.
Because the cut-off depends on the data set as a whole, the statistical test to be carried out for each data point clearly has an interpretation that can depend on properties of the whole data set; i.e. the interpretation of the statistical test for data point i could have been different if the same data x_i had been part of a different data set. This should not be seen as a paradox; it is simply a consequence of fitting a model with an unknown parameter (in this case π0) that describes the whole data set.
The only difference between models of type (2) and models of type (1) is that in the former one or both hypotheses are arbitrary or have parameters, and therefore have to be estimated from the data. For model 2A, again both GOF tests have to be done and are used to fit the parameters or estimate the distribution of the data non-parametrically. In the latter case this might involve data smoothing techniques (see for example Simonoff, 1996). If the fits are satisfactory, the test statistic can be written as ĝ(x)/f̂(x), which can be used to sort the two data subsets into order of significance and generate p-values. This is of limited practical value, as in model 1A.
Model 2B is the most practical case, because neither the assignment of the hypotheses to the data nor the precise functional forms of one or both hypotheses are known. This case obviously presents the most challenges. Again a GOF test could be done to try to fit π̂0 f̂(x) + (1 − π̂0)ĝ(x) to the observed distribution of the data, to obtain estimates π̂0, f̂(x) and ĝ(x). If it is successful, the test statistic ĝ(x)/f̂(x) can be used to sort the data into order of significance, as for model 1B. There is clearly a problem here, because this test should determine which data should contribute to the estimation of f(·) and which should contribute to the estimation of g(·), both of which are involved in the test statistic; this circularity suggests using an iterative algorithm.
In contrast with the models of types (1) and (3), there is no theoretical reason to suppose that such a converged solution obtained by an iterative algorithm for models of type 2B is optimal in any sense. This is a consequence of the estimation required to obtain the density functions of the models. A similar conclusion was found in Nixon (2012), where the calculation was done by letting the weight w_i with which data x_i contributes to the null hypothesis be given in terms of p(x_i), the p-value for data point i in the previous cycle, iterating the procedure to convergence, and then repeating this for a range of values of the free parameter λ. It was found that the power, expressed as the fraction of the data declared significant n1 for a given q-value q = n01/n1, varies with the parameters (λ, m0), and it was not clear how this could be optimized. This calculation gives rise to artificial cut-offs in the p-value distributions of the data subsets, suggesting that there is a better solution based on the optimization criteria discussed above. An obvious approach is to try to minimize q = n01/n1, which is the probability of errors amongst those x for which the null hypothesis is rejected, conditional on (i) a fixed value of n1 and (ii) the GOF test being satisfactory. For the purpose of this optimization, the function w: D → {0, 1} can be varied. Here the function w(·) has the following interpretation: w_i ≡ w(x_i) = 0 means the null hypothesis for data x_i is rejected, and w(x_i) = 1 means the null hypothesis for data x_i is not rejected. However, when implementing this approach for a hypothetical continuous, theoretically infinite (m → ∞) data set, it clearly makes no sense for w to be an arbitrary function of x. Instead the transition point where w changes from 0 to 1 should be determined by a continuous function of x, or, for greater flexibility, w should vary continuously with x. Therefore w(·) will be allowed to take any value in the interval [0, 1].
The function w(x) is to be thought of as the weight or probability associated with data x in the estimation of f (the null hypothesis), and 1 − w(x) is the weight or probability associated with x in the estimation of g (the alternative hypothesis); i.e. w(x) is the probability that data point x is from the null hypothesis. This ensures that all the data are represented with equal weight, which could be adjusted if necessary, as Storey (2007) suggested, if for example some data were known to be more reliable than others. The function w(x) is also a measure of the significance of the data x, i.e. the smaller w(x) is, the more significant the data x is.
With this change to w(·), for an arbitrary data point x, q must now be rewritten as

q(x) = E[#{y | w(y) < w(x) and y is from f(·)} / #{y | w(y) < w(x)}]

and

n1(x) = E[#{y | w(y) < w(x)}]/m.

Both q = n01/n1 and n1 = n01 − n10 + π1 are determined in terms of the cut-off point x, i.e. q is determined as a function of n1; this could be plotted in Figure 1 after elimination of n1 (π1 is considered a known constant). Now w(·) should be varied so as to simultaneously minimize q for each fixed value of n1, if possible. For simplicity this will initially be applied to models of type 1B under the assumption that this model, as a whole, fits the data. It will turn out to be an alternative argument equivalent to that given by Storey (2007), but it is instructive in developing the final MTP for model 2B.
In the following, s represents w(x) for an arbitrary data point x, with the hypotheses listed in order of significance, most significant first. The variable y represents another arbitrary data point, and q, now expressed as a function of s, is

q(s) = E[#{y | w(y) < s and y is from f(·)} / #{y | w(y) < s}].

In order to write this in terms of integrals, the following result will be used: the probability that an event Q holds, if (1) the probability of Q depends on x and (2) x is drawn randomly from the probability distribution with density t(x), is ∫ dx t(x) Prob(Q|x). Now the denominator can be written as Prob{y satisfies w(y) < s} = ∫ dy t(y) Prob(w(y) < s), and the numerator, after a little tricky manipulation, becomes ∫_{w(y)<s} dy t(y) γ(y). In these equations, the fact that y is randomly selected from t(·) does not have to be spelled out on every line, and the restriction of the integral to w(y) < s, which is known under the integral sign (i.e. its probability is 0 or 1), means this does not have to be given as a condition under the integral sign. Here γ(y) is the probability that the given value y is from f(·), given that model 1B as a whole fits the data (which come from the distribution with density t(x)), i.e. γ(x) = π0 f(x)/t(x). From this it follows that γ(x) increases monotonically with f(x)/g(x), and

q*(s) = ∫_{w(y)<s} dy t(y) γ(y) / ∫_{w(y)<s} dy t(y).

Now q*(s) has to be minimized simultaneously for each value of s between 0 and 1 by choosing the function w(·).
If s = 0 the formula for q becomes 0/0. Suppose s is a small quantity δs; then

q*(δs) = ∫_{w(y)<δs} dy t(y) γ(y) / ∫_{w(y)<δs} dy t(y).

Both the numerator and denominator refer to the same small region of data space, and the value of the quotient is the average of the fixed function γ(y) over the region {y : 0 ≤ w(y) < δs}. To make this as small as possible requires that w(·) be such that the smallest values of γ occur in this region where w is smallest, i.e. the region where f/g is smallest. This argument can be extended to other values of s, but another simple result is needed first.

Suppose (A + x)/(B + y) must be minimized by varying x and y, such that x > 0 and y > 0 are both much smaller than A and B, which are both fixed and positive. The line (A + x)/(B + y) = c can be written as y = (x + A)/c − B. Therefore these lines for different values of c intersect at x = −A, y = −B and have slope 1/c. Minimizing c means maximizing the slope, so this implies maximizing y/x, i.e. minimizing x/y. This fact is obvious from the geometry and holds regardless of the precise allowed region of x and y.
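A small numeric illustration of this lemma (with arbitrary values of A and B and a few candidate increments):

```python
# Minimizing (A + x)/(B + y) for fixed positive A, B and small
# positive x, y: the level lines all pass through (-A, -B) with
# slope 1/c, so the minimum is attained where x/y is smallest.
A, B = 5.0, 10.0                               # arbitrary fixed values
pairs = [(0.1, 0.3), (0.2, 0.2), (0.3, 0.1)]   # candidate (x, y)

vals = [(A + x) / (B + y) for x, y in pairs]
ratios = [x / y for x, y in pairs]

# the pair with the smallest x/y also gives the smallest (A+x)/(B+y)
assert vals.index(min(vals)) == ratios.index(min(ratios))
print(round(min(vals), 4))
```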

Now consider minimizing q*(s + δs), assuming that q*(r) for 0 ≤ r ≤ s has already been minimized and a region of smallest γ, i.e. γ(y) < h, corresponds to a region of smallest w, i.e. 0 ≤ w(y) < s. This requires minimizing

∫_{s≤w(y)<s+δs} dy t(y) γ(y) / ∫_{s≤w(y)<s+δs} dy t(y),

which is the average of γ(·) over the region {y : s ≤ w(y) < s + δs}. To minimize this subject to the above constraint, w(·) must be chosen such that γ(·) is next smallest in the region s ≤ w(y) < s + δs. Then a region of smallest γ(·) is {y : 0 ≤ w(y) < s + δs}. This argument is peculiar because it resembles induction, but the variables are continuous. The conclusion is that the function w(·) that minimises q for each value of s with 0 ≤ s ≤ 1 is such that w(x) depends on γ(x) only, and w(y) is a monotone increasing function of γ(y). Therefore, to take the data in order, most significant first, the data should be taken in the order given by increasing w(x), i.e. increasing γ, i.e. increasing f/g. This argument shows that Storey's analysis procedure for solving models of type 1B based on the NP lemma (Storey, 2007) is equivalent to the analysis based on simultaneously minimizing q for each fixed n1 by varying w(·) such that w(·) is continuous and takes any value between 0 and 1. It is interesting to note that this argument does not determine w(·) precisely, but this is sufficient for determining the analysis procedure.
Returning to models of type 2B, i.e. the model (2), the functions f and g are now dependent on parameters, i.e. f = f(x, θ) and g = g(x, φ), where θ and φ represent sets of parameters whose values are to be determined; f and g are not fixed functions of x and must be estimated. Therefore, because of its meaning as a weight function, the function w(•) could be associated with two GOF tests, i.e. (a) fitting f(x, θ) to t(x)w(x)/π̂₀ (with the best θ) and (b) fitting g(x, φ) to t(x)(1 − w(x))/(1 − π̂₀) (with the best φ). Thus w(•) needs to be chosen such that both these fits are good. This suggests that a combined condition should be optimized in the choice of w(•), and simultaneously an estimate of π₀ needs to be found. There are difficulties with this because (1) owing to the presence of the factor w(•), the likelihood is not strictly a valid concept here, since the product t(x)w(x) can have fractional values, which should be frequencies for the likelihood to have a meaning, and (2) a way of combining the two GOF statistics is needed, together with an optimization method.
An alternative related approach is to use (a) to determine w(x), by choosing w(x) so that the fit in (a) is exact, i.e. w(x) = π̂₀ f(x, θ̂)/t(x). Then, after substituting for w(x), the GOF test for g can be written as comparing t(x) with π̂₀ f(x, θ̂) + (1 − π̂₀)g(x, φ̂), which is the GOF test for the whole composite model. A satisfactory fit is required for the remainder of this analysis to be valid. Thus in this approach only one GOF test has to be done, for the whole model, after simultaneously estimating π₀, θ, and φ, which will typically be done by maximum likelihood.
Minimizing q using the argument above with arbitrary fixed estimates of θ, φ and π₀ shows that the order in which the data are declared significant (i.e. are from g) is in increasing order of γ, i.e. increasing order of f(x, θ)/g(x, φ). It is then clear from Equation (28) that, assuming the model fits, w(x) = γ(x).

The Proposed Multiple Testing Procedures
The above arguments suggest that the following procedure (MTP1) should be used to test and fit the model (2) to a data set, where f and g are alternative submodels that could apply for each data point x.
(1) Use a fitting procedure to search simultaneously for optimal values of θ, φ and π₀ (their estimates being denoted by hats) such that the overall model π̂₀ f(x, θ̂) + (1 − π̂₀)g(x, φ̂) best fits the entire data set. The fitting could be done by maximum likelihood (ML) using the Nelder and Mead (1965) simplex method (also described in Press, Teukolsky, Vetterling, & Flannery, 1997) to search for the optimum point.
(2) Do a GOF test for this model, to test that this best fitting model actually fits the data. For example, simulate data (for the same number m of x values as in the observed data) based on the model r times, calculate the likelihood for each simulated data set, and suppose that on b occasions the likelihood is less than or equal to the likelihood of the observed data (in practice the logarithms of the likelihood values would be used). Then the p-value should be estimated as (b + 1)/(r + 1), taking into account the correction for the fact that this Monte Carlo technique generates an exact uniform discrete frequency distribution of b under the null hypothesis (Phipson & Smyth, 2010). This one-sided test is appropriate because simulated data sets based on the model are expected to fit the model better than any other data sets.
(3) If the model fits, choose data xᵢ as "significant" in the order of increasing f(xᵢ, θ̂)/g(xᵢ, φ̂), i.e. smallest values taken first. Otherwise exit the procedure.
(4) In order to estimate the null data subset, the first fraction 1 − π̂₀ of the data can be assumed to be from g while taking the data in the order of increasing w(x). If required, the following steps may be taken.
(5) For each x calculate the p-value (i.e. the smallest value of n₀₁/π₀ that can be obtained while calling the data point xᵢ significant) according to Equation (4), using the discreteness correction (Phipson & Smyth, 2010). This would typically be done using a large simulated population of xᵢ from the density f(x, θ̂).
(6) Finally, calculate the estimates of the q-values q(xᵢ), i.e. estimates of the smallest value of FDR = n₀₁/n₁ that can be obtained by calling data point xᵢ significant, according to Storey and Tibshirani (2003), which only uses the definition of the FDR, using the estimate π̂₀ obtained above.
In step 2 of MTP1, the number r of simulations may not need to be very great in practice. For example, if the true p-value was 0.5, indicating a good fit, and r = 10 was chosen, then there would be a probability of only 1/2¹⁰ = 1/1024 that b = 0, giving the estimated p-value as 1/11 and necessitating more simulations to estimate it more precisely. Therefore if the true p-value was not close to zero it could be inferred relatively inexpensively, and accurate estimation of its value is unimportant.
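The GOF step can be sketched as follows. The fitted model is stood in for here by the mixture 0.5·N(0,1) + 0.5·N(2,1); this mixture, m = 200 and r = 50 are illustrative assumptions, not the paper's values, but the (b + 1)/(r + 1) correction of Phipson and Smyth (2010) is applied exactly as in step 2.

```python
# Minimal sketch of the Monte Carlo GOF test of step 2 of MTP1, with the
# (b + 1)/(r + 1) discreteness correction of Phipson and Smyth (2010).
import math
import random

random.seed(1)

def log_density(x):
    # illustrative stand-in for the fitted model pi0*f + (1 - pi0)*g
    a = math.exp(-0.5 * x * x)
    b = math.exp(-0.5 * (x - 2.0) ** 2)
    return math.log(0.5 * (a + b) / math.sqrt(2 * math.pi))

def simulate(m):
    return [random.gauss(0.0, 1.0) + (2.0 if random.random() < 0.5 else 0.0)
            for _ in range(m)]

def log_likelihood(data):
    return sum(log_density(x) for x in data)

observed = simulate(200)              # stands in for the observed data set
obs_ll = log_likelihood(observed)

r = 50                                # number of simulations; need not be large
b = sum(log_likelihood(simulate(200)) <= obs_ll for _ in range(r))
p_gof = (b + 1) / (r + 1)             # one-sided Monte Carlo GOF p-value
print(0.0 < p_gof <= 1.0)
```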
MTP1 could be compared with a method (MTP2), an obvious modification of the method Nixon (2012) used, which is appropriate for the model (2):
1) Choose λ between 0 and 1.
2) Do an initial test generating p-values for each member x of the dataset D. This splits it into two subsets D₀ = {D: p > λ} and D₁ = {D: p ≤ λ}, which are estimates of the null data and alternative data subsets respectively.
3) Repeat the following cycle while ndiff > 0:
(a) Estimate the parameters θ by fitting the model f(x, θ) to D₀, and φ by fitting g(x, φ) to D₁, by maximum likelihood.
(b) Calculate the likelihood ratio test statistic for each data point as l(x) = g(x, φ)/f(x, θ).
(c) Calculate the p-values as p(x) = ∫_{x′: l(x′) ≥ l(x)} f(x′, θ) dx′.
(d) Redefine D₀ and D₁ as in step 2 using the new p-values.
(e) Calculate ndiff as the number of hypotheses that have changed status in two consecutive cycles of the procedure.
4) Determine whether the two fits in step (a) are satisfactory, as in step 2 of MTP1.
5) If they are both satisfactory, calculate q-values as in step 6 of MTP1 otherwise exit the procedure.
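The MTP2 cycle can be sketched on a toy discrete instance of model (2): the null for N₁ given N = 20 is Binomial(20, 0.3) and the alternative is Binomial(20, q) with q re-estimated from D₁ on each cycle. The binomial family, λ = 0.1 and the sample sizes are illustrative assumptions, not the paper's model.

```python
# Sketch of the MTP2 iteration (steps 2-3) on a toy discrete two-groups model.
from math import comb
import random

random.seed(3)
N, r0, lam = 20, 0.3, 0.1
binpmf = lambda k, q: comb(N, k) * q**k * (1 - q) ** (N - k)

def draw(q):
    # inversion sampling from Binomial(N, q)
    u, c = random.random(), 0.0
    for k in range(N + 1):
        c += binpmf(k, q)
        if u <= c:
            return k
    return N

data = [draw(0.3) for _ in range(300)] + [draw(0.7) for _ in range(100)]

# step 2: initial split using the exact binomial test against r0
def init_p(k):
    ref = binpmf(k, r0)
    return sum(binpmf(i, r0) for i in range(N + 1) if binpmf(i, r0) <= ref)

d1 = [x for x in data if init_p(x) <= lam]

# step 3: iterate (a)-(e) until no hypothesis changes status (guarded at 50 cycles)
for _ in range(50):
    q_hat = sum(d1) / (N * len(d1)) if d1 else 0.5            # (a) ML fit of g to D1
    lr = lambda k: binpmf(k, q_hat) / binpmf(k, r0)           # (b) likelihood ratio
    pval = lambda k: sum(binpmf(i, r0) for i in range(N + 1)
                         if lr(i) >= lr(k))                   # (c) p-values under f
    new_d1 = [x for x in data if pval(x) <= lam]              # (d) redefine D1
    ndiff = len(set(new_d1) ^ set(d1))                        # (e) changed statuses
    d1 = new_d1
    if ndiff == 0:
        break

print(0 < len(d1) <= len(data))
```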
Note that if the null and alternative hypotheses were reversed, then the likelihood ratio statistic would be l*(x) = f(x, θ)/g(x, φ) and p*(x) = ∫_{x′: l*(x′) ≥ l*(x)} g(x′, φ) dx′ = ∫_{x′: l(x′) ≤ l(x)} g(x′, φ) dx′. Similarly to the above argument, p* increases monotonically as l increases and p decreases monotonically as l increases; therefore p decreases monotonically as p* increases. In other words, p and p* determine complementary subsets of D that are significant, as should be the case. So analysing the data this way will give the reverse order of significance of the data. This argument works for both MTP1 and MTP2.
For the case when the model for g is saturated, ĝ(•) is entirely data-dependent and describes the distribution of data that do not fit the model f(•). The fitting is in general non-unique, because all the data could be described by an appropriate g(•), i.e. a solution with π̂₀ = 0 always fits perfectly, while with π̂₀ > 0 there could be no solutions, or one, or more. Therefore a new fitting procedure is needed that modifies the first step and omits the second step of MTP1.
Because the model fits perfectly for any π̂₀ up to this upper limit, the likelihood is independent of π̂₀ up to this point, so the parameters are ML estimates and the procedure to be defined below (MTP3) is a limiting case of MTP1.
If the data are first smoothed, giving say t*(x), then because f ≥ 0 and the fitted ĝ must be a nonnegative density, the parameters π₀ and θ must be such that for each data point x, π₀ f(x, θ) ≤ t*(x) (Equation (31)). The estimate π̂₀ should be as large as possible to be maximally informative, and it depends on θ. Therefore from Equation (31), for fixed θ, π̂₀ = inf_x t*(x)/f(x, θ), and further, θ should be chosen to maximize π̂₀, i.e. Equation (32).
(An alternative may be to first fix π̂₀ and then fit θ by maximum likelihood, doing this repeatedly to find the largest value of π₀ that allows an adequate fit.) Because the model fits, we can write t*(x) = π̂₀ f(x, θ̂) + (1 − π̂₀)ĝ(x) (Equation (33)), so the test statistic is w(x) = π̂₀ f(x, θ̂)/t*(x) (Equation (34)), from which the p-values can be calculated from Equation (4).
Therefore the modified procedure (MTP3) would be in outline as follows:
1) Use Equation (32) to determine both π̂₀ and θ̂.
2) Use Equation (34) to determine the test statistic w(x).
3) Use Equation (4) to determine the p-values as in step 5 of MTP1.
4) Estimate the null data subset and q-values according to steps 4 and 6 of MTP1.
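The first step of MTP3 can be sketched numerically. For each θ, the inner step of Equation (32) is π̂₀(θ) = inf_x t*(x)/f(x, θ), and θ is then chosen to maximize π̂₀. The smoothed density t* (a two-component normal mixture) and the null family f (normal with unknown mean and unit variance) are illustrative assumptions evaluated on a grid.

```python
# Sketch of the MTP3 estimate when g is saturated: maximize over theta the
# infimum over x of t*(x) / f(x, theta).
import math

def norm_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def t_star(x):
    # stands in for a smoothed estimate of the data density
    return 0.7 * norm_pdf(x, 0.0) + 0.3 * norm_pdf(x, 3.0)

grid = [i / 10 for i in range(-60, 80)]

def pi0_hat(theta):
    return min(t_star(x) / norm_pdf(x, theta) for x in grid)

thetas = [i / 20 for i in range(-20, 21)]
best_theta = max(thetas, key=pi0_hat)
print(best_theta, round(pi0_hat(best_theta), 3))
```

For this t*, the true null component is 0.7·N(0, 1), and the estimate recovers θ̂ = 0 with π̂₀ ≈ 0.7.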

Applying This Method When the Null Hypothesis Is Not Explicit
Suppose that the null hypothesis is not explicitly given, and is replaced by a statement which implies a known probability distribution of a test statistic, i.e. there is a known procedure for associating p-values (denoted by y) with data such that they are uniformly distributed under the null hypothesis, and the alternative distribution is not known a priori. Thus y is one-dimensional and is uniformly distributed in the interval [0, 1]. Then model (2) applies to the variable y with f(y) = 1, the alternative hypothesis g(•) is not known a priori, and θ is absent. The above arguments suggest that MTP3 applies, so Equation (32) implies π̂₀ = inf_y t*(y), where t*(y) is an estimate of the density of p-values at y,
and the test statistic would be s = (1 − π̂₀)/(t*(y) − π̂₀), which is equivalent to using t*(y) as the test statistic because they are inversely related by a monotonic function (s decreases as t* increases). The quantity t*(y), which in practice is an estimate of the probability density of the p-values at y, could be used as the test statistic when evaluated at each p-value, i.e. at each data point. The largest values of this test statistic are now the most significant. After sorting the data into the new order, most significant first, the final p-value for the data point x (if needed) corresponding to y is obtained as the fraction of data at least as significant as x, according to Equation (4).
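The procedure of this section can be sketched end to end. With f(y) = 1 on [0, 1], Equation (32) gives π̂₀ = inf_y t*(y), and the data are ranked by t*(y), largest values most significant. A simple histogram stands in for the kernel density estimator described later; the mixture generating the p-values (70% uniform nulls, 30% alternatives piled up near 0) is illustrative.

```python
# Sketch of the density-based test: estimate the p-value density t*, take its
# minimum as pi0_hat, and rank the data by t*(y), largest first.
import random

random.seed(4)
pvals = [random.random() for _ in range(700)] + [random.random() ** 6 for _ in range(300)]

bins = 20
counts = [0] * bins
for p in pvals:
    counts[min(bins - 1, int(p * bins))] += 1
density = [c * bins / len(pvals) for c in counts]      # histogram estimate of t*

pi0_hat = min(density)                                 # Equation (32) with f = 1

def t_star(p):
    return density[min(bins - 1, int(p * bins))]

ranked = sorted(pvals, key=t_star, reverse=True)       # most significant first
print(round(pi0_hat, 2), ranked[0] < 0.05)
```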

Methods and Models
In order to test this statistical method, the problem of classifying frequency ratios (Nixon, 2012), where the frequencies are small and the number of instances is large, will be considered. The parameters for the null hypothesis are also the same as in Nixon's paper so that the results can be compared. Three simulation models were investigated that differ only in the distribution of N₁ in the alternative data, so the same theoretical result of Section 8 still applies. The model for the simulation is as follows: both the alternative and null data (500 each) have N = N₁ + N₂ Poisson distributed with a mean of 20, but truncated and renormalised so that N cannot be 0 or 1. The null data are such that, given N, N₁ is binomially distributed with expected fraction N₁/(N₁ + N₂) = 0.3, but truncated and renormalised such that N₁ cannot be zero or N. The alternative data were chosen according to three models conditional on the total N. In model (a), N₁ is uniformly distributed in [1, N − 1]; in models (b) and (c), N₁ has the fixed values max(1, N − 10) and N − 1 (implying N₂ = 1) respectively. Model (a) was chosen to be the minimally informative model that could apply if nothing is known about the alternative data that are sought except that they do not follow the null hypothesis, which represents independent random assignment of the clones to the two libraries in proportion to their sizes. Models (b) and (c) are quite the opposite; model (b) represents a specific pattern of non-random behaviour (despite being unrealistic in this case) and model (c) is the most unlikely possible behaviour under the null hypothesis. They were introduced to test the statistical procedure described above based on the density of p-values of individual tests.
For this purpose, 100 samples of 1000 frequency pairs were independently generated as above for each scenario.
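The simulation set-up above can be sketched as follows. N = N₁ + N₂ is Poisson with mean 20, truncated so that N ≥ 2; for null data N₁ given N is Binomial(N, 0.3) truncated so that 0 < N₁ < N; for alternative data N₁ follows model (a) (uniform on [1, N − 1]), model (b) (max(1, N − 10)) or model (c) (N − 1). The truncations are done by rejection, which is equivalent to renormalising.

```python
# Sketch of the three simulation models for (N1, N2) frequency pairs.
import math
import random

random.seed(5)

def trunc_poisson(mean=20.0):
    while True:
        u, k, p, c = random.random(), 0, math.exp(-mean), 0.0
        while k < 200 and c + p < u:   # inversion sampling of Poisson
            c += p
            k += 1
            p *= mean / k
        if k >= 2:                     # reject N = 0 or 1
            return k

def null_n1(N, r=0.3):
    while True:
        n1 = sum(random.random() < r for _ in range(N))
        if 0 < n1 < N:                 # reject N1 = 0 or N
            return n1

def alt_n1(N, model):
    if model == "a":
        return random.randint(1, N - 1)
    if model == "b":
        return max(1, N - 10)
    return N - 1                       # model (c), so N2 = 1

def simulate(model, n_null=500, n_alt=500):
    data = []
    for _ in range(n_null):
        N = trunc_poisson()
        n1 = null_n1(N)
        data.append((n1, N - n1))
    for _ in range(n_alt):
        N = trunc_poisson()
        n1 = alt_n1(N, model)
        data.append((n1, N - n1))
    return data

data = simulate("a")
print(len(data), all(n1 >= 1 and n2 >= 1 for n1, n2 in data))
```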
In order to make this an example of the general theory of Section 8, the only parameter, r = E{N₁/N} = 0.3, i.e. the expected fraction of the total frequency N in the first library, has to be known in the analysis procedures (i.e. 0.3, except where the deliberately wrong value 0.4 was used), because this theory requires no undetermined parameters. The purpose of the analysis is to estimate π₀ and arrange the data in the order of significance, taking the most significant first, such as to minimise the expected frequency of errors at whatever point in the ordered data set is chosen as the cut-off for the significant data.
The initial p-values for the density calculations were obtained using the binomial test (Nixon, 2012), which calculates the p-value of (N₁, N₂) by summing bin(r, i, N) over the outcomes i that are no more probable under the null than the observed N₁, where bin(r, i, N) = (N choose i) rⁱ(1 − r)^{N−i}; this test is appropriate to the above null distribution, being related to the quantile function of a continuous random variable, which is always uniformly distributed. For the case when there is no parameter to be fitted, the iterated ODP (Nixon, 2012) reduces to ordering the data by the likelihood ratio g(x)/f(x) given by the NP lemma, where f is the null density and g is the alternative density (or equivalently the total density). This statistic was computed for each of the three alternative models, and these statistics were compared with the other statistics for analysing the simulated data.
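The binomial test can be sketched directly from the definition of bin(r, i, N). The exact summation rule used here (sum over outcomes no more probable than the observed N₁, i.e. the standard exact two-sided binomial test) is a hedged reconstruction, since the displayed equation is not available in the source.

```python
# Hedged sketch of the initial binomial test: p-value of (N1, N2) under
# expected fraction r, summing bin(r, i, N) over outcomes i with
# bin(r, i, N) <= bin(r, N1, N).
from math import comb

def binpmf(r, i, N):
    return comb(N, i) * r**i * (1 - r) ** (N - i)

def binom_pvalue(n1, n2, r=0.3):
    N = n1 + n2
    ref = binpmf(r, n1, N)
    return sum(binpmf(r, i, N) for i in range(N + 1)
               if binpmf(r, i, N) <= ref + 1e-12)

print(round(binom_pvalue(6, 14), 3))   # N1 at the null mode: p-value 1
print(binom_pvalue(19, 1) < 1e-6)      # extreme ratio: tiny p-value
```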

Estimation of Probability Densities
To implement the method described in Section 8, the first step is to construct a probability density estimate based on a one-dimensional set of data having values in [0, 1]. The method chosen was based on Wand and Jones (1995) and consists basically of five steps.
1) The m values pᵢ were treated as quantiles and converted to instances of a normal random variable z with zero mean and standard deviation σ given by the "slope" c divided by √(2π), using the inverse normal cumulative distribution function (c.d.f.) routine available online (Acklam, 2003), i.e. zᵢ is the value whose c.d.f. under the normal(0, σ) distribution equals pᵢ.
2) The bandwidth parameter h = (4/(3m))^0.2 σ̂ (Wand & Jones, 1995, Equation 3.2) was calculated, where σ̂ is the sample standard deviation obtained from the set of z values. This value of h minimises the AMISE (asymptotic mean integrated squared error) for estimating the normal probability density with sample standard deviation σ̂ from m data points, using the kernel density estimate with the normal(0, 1) kernel function K (implying R(K) ≡ ∫(K(x))² dx = 1/(2√π) and μ₂(K) ≡ ∫x²K(x) dx = 1).
3) The initial estimate of the density was the kernel density estimate f̂(z) = (1/(mh)) Σ_{j=1}^{m} K((z − z_j)/h).
4) This estimate was used as the pilot estimate of the density to obtain the bandwidth α(zᵢ) = h(f̂(zᵢ))^−0.5 associated with the point zᵢ, i.e. according to the square root law (Abramson, 1982).
5) Finally this bandwidth was used to construct the variable kernel density estimate (Wand & Jones, 1995, Equation 2.31).
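The five steps can be sketched as follows, assuming c = 0.01 and a uniform sample of p-values; `statistics.NormalDist` stands in for the inverse normal c.d.f. routine of Acklam (2003).

```python
# Sketch of the five-step variable-kernel density estimate on [0, 1].
import math
import random
from statistics import NormalDist, stdev

random.seed(6)
c = 0.01
sigma = c / math.sqrt(2 * math.pi)          # step 1: "slope" c over sqrt(2*pi)
nd = NormalDist(0.0, sigma)

m = 1000
pvals = [random.random() for _ in range(m)]
eps = 1e-12
z = [nd.inv_cdf(min(1 - eps, max(eps, p))) for p in pvals]   # step 1

h = (4.0 / (3.0 * m)) ** 0.2 * stdev(z)     # step 2: AMISE-optimal bandwidth

inv_sqrt2pi = 1.0 / math.sqrt(2 * math.pi)
def K(u):                                   # normal(0, 1) kernel
    return math.exp(-0.5 * u * u) * inv_sqrt2pi

def pilot(x):                               # step 3: fixed-bandwidth estimate
    return sum(K((x - zj) / h) for zj in z) / (m * h)

alpha = [h * pilot(zi) ** -0.5 for zi in z] # step 4: Abramson square-root law

def f_hat(x):                               # step 5: variable-kernel estimate
    return sum(K((x - zj) / a) / a for zj, a in zip(z, alpha)) / m

# density of p recovered by the change of variables through step 1
zq = nd.inv_cdf(0.5)
print(round(f_hat(zq) / nd.pdf(zq), 1))     # near 1 for uniform p-values
```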
To assess the bias and variance of this estimate and to fix the one free parameter c, giving a reasonable point in the trade-off between bias and variance, 100 populations were simulated with m = 1000 points from the uniform density and their density estimates were computed as above. The results for the mean and standard deviation of these estimates are shown in Figure 2. The parameter value c = 0.01 was chosen in the following calculations as a reasonable compromise between the high variance and low bias that result when c is small and the low variance and high bias that result when the slope is large. The main effect is near the ends of the range, i.e. 0 and 1 (note that for c = 0.3 the mean density at p = 0.0001 was 1.51). Of particular concern was to neither underestimate nor overestimate the density near 0, which corresponds to the most significant data.
There is another aspect to this that arises from the fact that the underlying data are small frequencies. Hence there are many repetitions of the data of individual tests in the entire data set, so the data, after the initial analysis, are best represented by a set of pairs (p-value, frequency). For the applications where the density estimate is required for each test (as opposed to the above case where the density was estimated on an equally spaced grid), the manner in which the frequency information was included in the density calculation had a major effect on the result. There seemed to be two reasonable ways to do this (here the sum is over the distinct values z_j with frequency s_j):
1) Use the same formula for the density as if all the p-values were distinct, i.e.

f̂₁(z) = (1/m) Σ_j s_j (1/α(z_j)) K((z − z_j)/α(z_j)).
2) Use the above formula but with each distinct p-value used once only, then correct the resulting density estimate for frequencies by multiplying the density estimates by their corresponding frequencies afterwards, i.e.

f̂₂(zᵢ) = sᵢ (1/m*) Σ_j (1/α(z_j)) K((zᵢ − z_j)/α(z_j)).
To these an option was added as a curiosity.
3) Use the uncorrected density estimate from step (2) above, i.e.

f̂₃(z) = (1/m*) Σ_j (1/α(z_j)) K((z − z_j)/α(z_j)),
where m* is the number of distinct p-values. Note that correct normalisation will not affect the order in which the tests are considered to be significant, and so will not affect the expected q-values, but it will affect the estimates of π₀. f̂₁ and f̂₃ are correctly normalised while f̂₂ is not, and it is not clear how it could be normalised, because it appears to be an estimate of a discrete density with the scale factor changing from point to point; this clearly requires more research. To speed up the calculations by avoiding adding large numbers of terms that are very close to zero, the sums were started from j = i and continued to larger values of j while the terms were greater than a fraction 10⁻¹⁰ of the largest term. Then similarly the terms for j < i were added in decreasing order of j. Finally, the density for p was calculated from the density for z, for each of the density estimators, by the change of variables used in step 1.
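The three frequency-handling variants can be contrasted in a small sketch. The explicit formulas for f̂₁, f̂₂ and f̂₃ are the hedged reconstructions given above, and a fixed bandwidth a stands in for the variable bandwidths α(z_j); f̂₁ weights each distinct point by its frequency inside the sum, f̂₃ ignores the frequencies, and f̂₂ multiplies f̂₃ by the frequency afterwards.

```python
# Sketch contrasting the f1, f2, f3 frequency-handling variants; a fixed
# bandwidth replaces the variable bandwidths alpha(z_j) for simplicity.
import math

def K(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

pairs = [(-1.0, 5), (0.0, 3), (0.5, 2)]    # (distinct z_j, frequency s_j)
m = sum(s for _, s in pairs)               # total number of p-values
m_star = len(pairs)                        # number of distinct p-values
a = 0.3                                    # illustrative fixed bandwidth

def f1(z):
    return sum(s * K((z - zj) / a) / a for zj, s in pairs) / m

def f3(z):
    return sum(K((z - zj) / a) / a for zj, _ in pairs) / m_star

def f2(zj, sj):
    return sj * f3(zj)

# f1 and f3 integrate to 1 (Riemann sum on a fine grid); f2 does not in general
grid = [i / 100 for i in range(-500, 501)]
print(round(sum(f1(z) for z in grid) / 100, 3),
      round(sum(f3(z) for z in grid) / 100, 3))
```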

Calculation of q-Values
For each simulated data model and each test procedure, a set of tuples was constructed containing a tuple for each distinct data point: N₁ᵢ, N₂ᵢ, the number of times sᵢ that the data point i (i.e. (N₁ᵢ, N₂ᵢ)) occurred in the simulated data set, and the test statistic. The set of tuples for the whole dataset was constructed as a list in the Standard Template Library of C++ to enable it to be sorted efficiently. Then the number of times each distinct data point was found in the null and alternative data subsets was obtained and added to the appropriate tuple. After sorting the tuples, most significant first, the running totals of the numbers of data points in the null and alternative data subsets were obtained, and hence the fraction of the data that are true nulls. This fraction is an estimate of the expected fraction of errors (the q-value) that would occur if the given fraction of the data were assumed to be alternative hypotheses.
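The running-total calculation can be sketched as follows; the tuples (test statistic, null frequency, alternative frequency) below are illustrative, not simulated from the paper's models.

```python
# Sketch of the q-value calculation from running totals: after sorting most
# significant first, the running fraction of true nulls among the selected
# data is the q-value estimate at that cut-off.
tuples = [(0.30, 4, 3), (0.02, 0, 7), (0.95, 8, 0), (0.10, 1, 5), (0.70, 6, 1)]
tuples.sort(key=lambda t: t[0])            # smaller statistic = more significant

qvals, n0_run, n_run = [], 0, 0
for _, n0, n1 in tuples:
    n0_run += n0                           # running total of true nulls selected
    n_run += n0 + n1                       # running total of all data selected
    qvals.append(n0_run / n_run)

print(qvals[0], round(qvals[-1], 3))       # 0.0 at the top; overall null fraction last
```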

Results
For the analysis of each simulated data set, the q-values were linearly interpolated onto an equally spaced grid of fractions n₁ of the data selected, from 0.001 to 0.999 with spacing 0.001, because repetitions of data points (N₁, N₂) in the data sets imply that the n₁ spacing for the q-values directly calculated from the simulated data is irregular and changes for each simulated data set. The means and standard deviations of the interpolated q-values for 100 data sets at each n₁ point were compared. In many cases independent sets of data sets are compared on the same graphs. To assess the practicality of the p-value density method for doing multiple hypothesis testing, three versions of this method, corresponding to the three density estimates (46, 47 and 48) based on the binomial test, were compared with using the p-values from the underlying binomial test, with the likelihood ratio test (based on the NP lemma), and with the ODP (Storey, 2007) and its iterative extension (Nixon, 2012), both using the actual and estimated values of r in the binomial test on which these methods are based. There is a basic difficulty in estimating anything by the smallest value of a quantity rather than some sort of average, because the random errors would be expected to be relatively large. Apart from the estimator f̂₁, which performs poorly, the estimates for model (b) are better than those for model (a), because it is easier to distinguish model (b) data points from the null hypothesis data points than to distinguish model (a) data points from them, as the LR test performance shows. Statistics for minimum probability density as an estimator for π₀ = 0.5, from 100 simulations based on the binomial test with given r = 0.3 and with the two alternative models for N₁, are tabulated (see text).

Discussion
What has been developed in this paper is the multiple testing procedure (MTP1) for a general class of multiple hypothesis testing problems, namely the two-groups model where the data are a mixture of those that come from two submodels, each with their own unknown parameters, and its specialisation (MTP3) that is appropriate when one of the submodels is effectively vacuous and must be estimated from the data. These procedures are all based on a test statistic and sort the data into order of significance such that, whatever cut-off point is chosen to determine the fraction of the data selected as significant, the fraction of true null hypotheses amongst this subset is as small as possible on average. (Another procedure, MTP2, was also mentioned, based on a method Nixon (2012) used, but there is no theoretical reason for its optimality.) In the situation where both hypotheses are known without any unknown parameters, all three multiple testing procedures MTP1, MTP2 and MTP3 reduce to testing based on the NP lemma. In the case where the null hypothesis is specified implicitly by a known probability distribution of a test statistic t and the alternative is not specified, MTP3 reduces to using as the final test statistic the probability density of the p-values derived from t. This of course leaves open the question of what density estimation method should be used. This procedure implies a two-stage calculation of test statistics and p-values: once for the individual data sets, and once for the entire data set. This multiple testing procedure therefore adapts to the unspecified distribution of the alternative data to minimise the error rate whatever fraction of the data are selected as significant, but it could be computationally more demanding per test than a direct test (because of the density estimation, which is not yet optimised) and is subject to additional random error, decreasing with the size of the data set, arising from the probability density estimation. Of these
three procedures (MTP1, MTP3, and the density-based method), the density-based test seems likely to be most important in applications because explicit models for alternative distributions may not be available.
In the numerical test of this method, the probability density was estimated by three methods that differ only in the manner in which the frequencies of occurrence of the data points (themselves frequency pairs) in the entire data set are handled. Surprisingly, it turned out that for testing with the two different models for the alternative hypothesis, different versions of the density-based test (with different density estimators) performed best. For the estimation of π₀ (for either model) the same density estimator, f̂₂, had the smallest magnitude of bias.
The numerical results show that the p-value density method can improve the efficiency of multiple testing procedures, as a result of producing a different order of significance of the data than would be obtained by simply using the p-values directly. That the order of significance was changed is confirmed by the estimated p-value distributions, which are not monotonic functions of p. This would be expected to happen, for example, if there is a peak in the p-value density resulting from a cluster of p-values a little away from zero, together with some smaller, more isolated p-values. Tests close to this peak could have final effective p-values very close to zero and may be more significant than isolated p-values nearer to 0. Moreover, the results as a whole confirm that the theoretically optimal test based only on the test statistics for the null distribution can outperform, provided an appropriate density estimator is supplied, the binomial test and under some conditions the ODP (Storey, 2007) and its iterated extension (Nixon, 2012) (both of which know the null hypothesis model), and can come very close to the theoretically optimal LR test that requires both the null and alternative hypotheses to be specified.
In general once a null hypothesis is specified, it is possible to find a test statistic having a known distribution

Figure 2. Estimates of the uniform probability density from 100 samples of 1000 points

Figure 3. Probability density estimates using the three estimators from samples of 1000 points using model (a)
Figure 4.

In Table 1, π₀ and π₁ should be considered fixed as different tests are considered, giving different values of n_ij. There are therefore 2 degrees of freedom in this table, and each such table is defined by the point (n₀₁, n₁₀) representing the two error frequencies (Figure 1). After carrying out the tests with w(•) as the test statistic and with the data point x just significant, the four outcomes in Table 1 have the definite probabilities: n′₀₁ < n₀₁(P) because of the convexity property of the theoretical optimal curve. Therefore Hwang and Liu (2010) 19/21, and test 1 has n₀₁ = 2/21 and n₁₀ = 1/21, and test 2 has n₀₁ = 1/21 and n₁₀ = 10/21. Therefore AD − BC < 0 in each case, so the risk points for both tests are outside the region of interest, i.e. above the line joining (0, 19/21) and (2/21, 0). Therefore both tests result from test procedures that can be bettered by a randomizing test procedure that does not depend on the data, and consequently no conclusions regarding optimization criteria should be drawn from them.

5. Storey's ODP and MAP Tests

Storey (2007) proposed one of the criteria for optimizing an MTP which were shown above to be equivalent to each other. He also proved that this optimization is satisfied by the "optimal discovery procedure" (ODP), which is an MTP for a class of models based on the NP lemma. The ODP maximizes the average power (MAP) for the two-groups model Equation (1) involving any two hypotheses that are represented by known probability functions f(•) and g(•) of the data. Tests based on this principle are often referred to as MAP tests. However, as Hwang and Liu (2010) pointed out, Storey (2007) did not model the variation in the variances, only the variation in the means, in connection with the application of the ODP to microarray data. When Hwang and Liu (2010) did this, they derived MAP tests that are more robust, and approximate MAP tests appropriate for the microarray analysis problem, in particular the useful approximation, numerically almost exact, the Fss test, which can be computed very fast.

Table 2. Estimates of the fraction of null hypotheses (π₀)