The Just-About-Right Pilot Sample Size to Control the Error Margin

Incorrect Estimation of Sample Sizes
A pilot study is a small-scale investigation designed to test the feasibility of methods and procedures for later use on a larger scale (Thabane et al., 2010). In clinical studies, a pilot randomised controlled trial (RCT) can be used to help plan a proposed substantive RCT (power = 0.8) or definitive RCT (power >= 0.9). The pilot RCT provides a means to collect preliminary data on safety, to assess the recruitment rate and the degree of participant retention, to gauge willingness to be randomised, and, crucially, to provide estimates of variation in outcome measures to inform the sample size decision for the follow-on trial (Lancaster et al., 2004; Ln, 2005; Arnold et al., 2009). This latter consideration raises the question of how to determine the optimal pilot RCT sample size for any given context, with the aim of accurately estimating the sample size requirements of the proposed follow-on RCT.
One of the most common errors in any type of empirical scientific research is an insufficient sample size (Makin, 2019). Small sample sizes can lead to Type II errors (false negatives), and in practice this is especially true when combined with low or moderately low effect sizes. Small sample sizes can leave a research community in doubt as to whether effects are real. There is also the position that it is unethical to ask participants to commit to a study which is insufficiently powered to meet its objectives (Altman, 1980; Halpern et al., 2002), and any such study would be an uneconomic use of resources. Conversely, having too large a sample size can also be problematic. A sample size might be considered too large if conclusions of the same quality could have been obtained with a much smaller sample. An overly large sample is likewise an uneconomic use of resources, and it may be deemed unethical to randomly allocate the excess participants to control or intervention, irrespective of whether the intervention confers a benefit. In summary, for any substantive or definitive trial, the sample size should be sufficient to achieve worthwhile results, but not so large as to involve unnecessary recruitment of participants. Guidance is needed to allow research teams, ethics committees, funding panels, data monitoring committees, and protocol reviewers to evaluate whether a study intends to recruit too many participants (overpowered) or too few (underpowered): the aim is a just-about-right (JAR) sample size, not too small and not too large.
In terms of context-specific considerations, Browne (1995) considered determining the sample size for a two-arm parallel RCT when (a) a pilot study is used to collect preliminary data on outcome variation, (b) the minimum clinically important difference (MCID) is pre-specified, (c) the follow-on study is to be adequately powered to detect an effect, and (d) an assumption of normally distributed outcome data can be made. For these situations, the required per-arm sample size, $n$, is given by

$$n = \frac{2(z_{1-\alpha/2} + z_{1-\beta})^2 \sigma^2}{(\mu_1 - \mu_2)^2} \quad (1)$$

where $\mu_1 - \mu_2$ is the true mean difference or MCID, $z_{1-\alpha/2}$ and $z_{1-\beta}$ are standardised normal deviates for two-sided significance testing with nominal significance level $\alpha$ and required power $1-\beta$, and $\sigma^2$ is the population variance for the outcome measure, assumed equal between arms. Although the MCID might be specified by hypothesis, the true population variance $\sigma^2$ would be unknown. The pilot study would provide a sample estimate $s^2$ for $\sigma^2$, but this sample estimate would most likely underestimate $\sigma^2$, since $(m_1 + m_2 - 2)s^2/\sigma^2 \sim \chi^2_{m_1+m_2-2}$, where $m_1$ and $m_2$ denote the sample sizes in the two arms of the pilot study.
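Formula (1) can be checked directly. The following is a minimal sketch (Python is used here for illustration, though the article's simulations were run in R; the function name and parameter choices are ours, not the article's):

```python
from math import ceil
from statistics import NormalDist

def per_arm_sample_size(mcid, sigma2=1.0, alpha=0.05, power=0.8):
    """Per-arm sample size from formula (1):
    n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / (mu1 - mu2)^2,
    rounded up to the next whole participant."""
    z = NormalDist().inv_cdf
    n = 2 * (z(1 - alpha / 2) + z(power)) ** 2 * sigma2 / mcid ** 2
    return ceil(n)

# MCID = 0.2, sigma^2 = 1, alpha = 0.05, power = 0.9:
print(per_arm_sample_size(0.2, 1.0, alpha=0.05, power=0.9))  # 526
```

With these inputs the function reproduces the n = 526 per arm figure quoted in the discussion below.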
It is well known that chi-square distributions are positively skewed; hence using $s^2$ in place of $\sigma^2$ in the above formula would typically produce an estimated sample size lower than truly required. For this reason, Browne (1995) cautiously suggested replacing $\sigma^2$ in the sample size formula with the estimated $100(1-\gamma)$ per cent one-sided upper confidence limit (UCL) for $\sigma^2$. Specifically, the estimated sample size per arm for 1:1 randomisation under Browne's suggested approach is given by

$$\hat{n} = \frac{2(z_{1-\alpha/2} + z_{1-\beta})^2 s^2_{UCL}}{(\mu_1 - \mu_2)^2}$$

where $s^2$ is the pooled sample variance and $s^2_{UCL}$ is the $100(1-\gamma)$ per cent one-sided upper confidence limit (UCL) for $\sigma^2$. The quantity $100(1-\gamma)$ is the "coverage", i.e., the percentage of times that the predicted sample size per arm, $\hat{n}$, would exceed the true required sample size per arm, $n$. From a practical perspective, Browne advocated a coverage of 80% (0.8) or 90% (0.9).
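The one-sided UCL for $\sigma^2$ follows from the chi-square relation given earlier: the UCL is the pooled sum of squares divided by the lower $\gamma$-quantile of the chi-square distribution. A stdlib-only Python sketch (the Wilson-Hilferty approximation stands in here for an exact chi-square quantile function, which is our substitution, adequate for the degrees of freedom arising in pilot studies):

```python
from statistics import NormalDist

def chi2_quantile(p, df):
    """Wilson-Hilferty approximation to the chi-square p-quantile."""
    z = NormalDist().inv_cdf(p)
    return df * (1 - 2 / (9 * df) + z * (2 / (9 * df)) ** 0.5) ** 3

def variance_ucl(s2_pooled, m1, m2, coverage=0.8):
    """100*coverage% one-sided upper confidence limit for sigma^2,
    given the pooled pilot variance s2_pooled from arms of size m1, m2."""
    df = m1 + m2 - 2
    return df * s2_pooled / chi2_quantile(1 - coverage, df)

# A pilot of 30 per arm with pooled variance 1.0 inflates the variance
# estimate by roughly 19% at 80% coverage before it enters formula (1).
```

Feeding the UCL into the sample size formula in place of $\sigma^2$ gives Browne's estimate $\hat{n}$.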

Browne's Method
Simulation work conducted by Browne (1995) and Obodo et al. (2021) confirms that Browne's approach has merit, achieving the required coverage of 0.8 or 0.9 as appropriate, for $\alpha$ = 0.01, 0.05, $\beta$ = 0.1, 0.2, for a range of effect sizes (small, medium, large) and for pilot sample sizes between 5 and 100. However, Obodo et al. (2021) show that the procedure can produce underpowered studies, or frequently produce an intolerably large degree of excess, and that the extent of the problem depends on the pilot sample size per arm ($m$) and the level of coverage ($1-\gamma$), but not on the significance level ($\alpha$ = 0.01, 0.05), the power ($1-\beta$ = 0.8, 0.9), or the MCID. Both coverage and pilot sample size are under the control of the investigator at the trial planning stage. We therefore sought to quantify the relationship between pilot sample size and JAR requirements, for coverage of 0.8 and 0.9 separately.
We operationalise an investigator-chosen JAR interval as $[n - \lambda_1 n, n + \lambda_2 n]$, where $\lambda_1, \lambda_2 \in [0, 1]$ are investigator-chosen parameters limiting the degree of underpowering ($\lambda_1$) and overpowering ($\lambda_2$). We aim for trialists to be able to justify the pilot sample size and to make a statement to the effect of: "The proposed two-group pilot study will have a sample size of $m$ per arm. This sample size is chosen so that the resultant power calculation for a larger study will have a $100(1-\gamma)$% chance of exceeding the minimum required sample size, which in a two-sided test with significance level $\alpha$ gives $100(1-\beta)$% power for detecting a difference between arms assuming an MCID of $(\mu_1 - \mu_2)$. This proposed pilot sample size of $m$ per arm will ensure that the estimated sample size lies in the interval $(1-\lambda_1)n$ to $(1+\lambda_2)n$ with probability $\pi$, providing a safeguard against under- and over-powering." In this statement we consider $\alpha$ = 0.01, 0.05, power ($1-\beta$) = 0.8, 0.9, coverage ($1-\gamma$) = 0.8, 0.9, any value for the MCID, lower bounds $\lambda_1$ = 0.1, 0.2 and upper bounds $\lambda_2$ = 0.1, 0.2, 0.3, 0.4, for any chosen level of $\pi$.
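The JAR interval itself is simple to operationalise. A minimal sketch (the numerical $\lambda$ values in the comments are the illustrative values considered in the text):

```python
def jar_interval(n, lam1, lam2):
    """Just-about-right interval [n - lam1*n, n + lam2*n] around the true n."""
    return (n * (1 - lam1), n * (1 + lam2))

def in_jar(n_hat, n, lam1=0.1, lam2=0.2):
    """True if an estimated sample size n_hat avoids both the stated degree
    of underpowering (lam1) and of overpowering (lam2)."""
    lo, hi = jar_interval(n, lam1, lam2)
    return lo <= n_hat <= hi

# With n = 100, lam1 = 0.1, lam2 = 0.2 the JAR interval is [90, 120];
# an estimate of 110 is just-about-right, an estimate of 125 is overpowered.
```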

Monte Carlo Simulation Design
The Monte Carlo simulations are informed by Browne (1995) and mimic the design given by Obodo et al. (2021). In brief, we consider the two-arm parallel RCT with 1:1 randomisation, to be analysed using the independent-samples t-test (equal variances assumed, two-sided, $\alpha$ = 0.05, 0.01). The true sample size for the RCT is calculated for the desired power (0.8 or 0.9) and a specified MCID corresponding to a small, medium or large effect (0.1, 0.4, 0.75), assuming equal variances ($\sigma^2 = 1$) under an assumption of normality.
For pilot sample sizes ($m$ = 5, 10, 30, 50, 100) the 80% and 90% upper one-sided confidence limits for the pooled sample variance are used in Browne's formula. The percentage of times that the estimated sample size, $\hat{n}$, falls in the interval $[n - \lambda_1 n, n + \lambda_2 n]$ is recorded for $\lambda_1$ = 0.1, 0.2 and $\lambda_2$ = 0.1, 0.2, 0.3, 0.4, 0.5. Table 1 summarises the factor levels for the 2 × 2 × 2 × 3 × 5 fully crossed design. Simulation was carried out in the R programming language with 100,000 replicates per cell of the design (against the 2,000 replicates used by Browne, 1995), to obtain more precise simulation values.

Table 2 summarises the percentage of times the estimated sample size, $\hat{n}$, for the follow-on study would be in the interval $[n - \lambda_1 n, n + \lambda_2 n]$ for $\lambda_1$ = 0.1, 0.2, $\lambda_2$ = 0.1, 0.2, 0.3, for $m$ = 5(5)100, and for coverage ($1-\gamma$) = 0.8, 0.9. Simulation percentages are aggregated over significance level $\alpha$ = 0.01, 0.05, prior reasoned statistical power ($1-\beta$) = 0.8, 0.9 and assumed effect size $\mu_1 - \mu_2$ = 0.1, 0.4, 0.75, as it is known that these factors do not affect the estimated sample size (Obodo et al., 2021). Table 2 and Figure 1 clearly show that the percentage within any given interval increases monotonically with pilot sample size, for coverage 0.8 and coverage 0.9 alike. It is also clear that the percentage in any given interval is greater for coverage 0.8 than for coverage 0.9; this is only to be expected since, for any pilot dataset, the estimated sample size under coverage 0.9 must exceed the estimate under coverage 0.8. Table 2 and Figure 1 also show that the percentage of instances within an interval is particularly sensitive to the upper bound $\lambda_2$, which follows naturally from the positively skewed chi-square distribution used in the estimation process.
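As a sanity check, the per-cell computation can be sketched as follows (Python stdlib only, with the Wilson-Hilferty chi-square quantile approximation and far fewer replicates than the article's 100,000; the defaults shown are one illustrative cell, not the full factorial design):

```python
import random
from math import ceil
from statistics import NormalDist

def jar_hit_percentage(m, mcid=0.4, coverage=0.8, alpha=0.05, power=0.8,
                       lam1=0.1, lam2=0.2, reps=2000, seed=1):
    """Percentage of replicates in which Browne's estimate n_hat lands in the
    JAR interval [n - lam1*n, n + lam2*n], for a pilot of m per arm."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf
    k = 2 * (z(1 - alpha / 2) + z(power)) ** 2 / mcid ** 2
    n_true = ceil(k)                       # sigma^2 = 1 under the simulation design
    df = 2 * m - 2
    zg = z(1 - coverage)                   # gamma-quantile deviate for the UCL
    chi2_gamma = df * (1 - 2 / (9 * df) + zg * (2 / (9 * df)) ** 0.5) ** 3
    hits = 0
    for _ in range(reps):
        a = [rng.gauss(0.0, 1.0) for _ in range(m)]
        b = [rng.gauss(0.0, 1.0) for _ in range(m)]
        ma, mb = sum(a) / m, sum(b) / m
        s2 = (sum((x - ma) ** 2 for x in a)
              + sum((x - mb) ** 2 for x in b)) / df   # pooled sample variance
        s2_ucl = df * s2 / chi2_gamma                 # one-sided UCL for sigma^2
        n_hat = ceil(k * s2_ucl)                      # Browne's estimated per-arm n
        if n_true * (1 - lam1) <= n_hat <= n_true * (1 + lam2):
            hits += 1
    return 100.0 * hits / reps
```

Looping this function over the factor levels in Table 1 reproduces the structure, if not the precision, of the tabulated percentages.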
The monotonic trends between $\hat{\pi}$ (the estimated probability of falling within an interval) and pilot per-arm sample size $m$, for each interval $[n - \lambda_1 n, n + \lambda_2 n]$ and each level of coverage, have been modelled using linear regression with the functional form $\ln(\hat{\pi}) = \beta_0 + \beta_1\sqrt{m}$. Thus, for instance, when coverage = 0.8 and the interval $n \pm 0.1n$ is considered, it is readily verified that $\ln(\hat{\pi}) = -2.745 + 0.297\sqrt{m}$ and that the overall goodness-of-fit, $100R^2$, is 96.3%. Table 3 provides the estimated intercepts, gradients and goodness of fit for $\lambda_1$ = 0.1, 0.2 and $\lambda_2$ = 0.1, 0.2, 0.3, 0.4, 0.5, for coverage 0.8 and coverage 0.9.

Table 3. Regression equations of the form $\ln(\hat{\pi}) = \beta_0 + \beta_1\sqrt{m}$, giving estimated intercept ($\beta_0$), gradient ($\beta_1$) and coefficient of determination ($R^2$) for $\lambda_1$ = 0.1, 0.2, and $\lambda_2$ = 0.1, 0.2, 0.3, 0.4, 0.5.

For any level of coverage and any interval, any regression equation in Table 3 may be rewritten in terms of pilot sample size, i.e., $m = ([\ln(\hat{\pi}) - \beta_0]/\beta_1)^2$. Solving this gives the estimated pilot sample size per arm, $m$, for any required percentage for the given interval. Table 4 shows the pilot sample size per arm ($m$) needed to have a required probability ($\pi$) of being in a given interval $[n - \lambda_1 n, n + \lambda_2 n]$ for coverage 0.8 or 0.9. Thus, for instance, if an investigator requires an 80% chance of not being underpowered for a definitive trial (coverage = 0.8) and requires a 70% chance ($\pi$ = 0.7) of being within ±10% of the true required sample size ($\lambda_1$ = 0.1, $\lambda_2$ = 0.1), then a sample size per arm ($m$) of 65 is needed, for any given MCID.
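The inversion of the regression is equally direct. A sketch using the coefficients quoted above (-2.745 and 0.297, for coverage 0.8 and the ±10% interval; the function name is ours):

```python
from math import ceil, log

def pilot_size_per_arm(pi, b0, b1):
    """Invert ln(pi) = b0 + b1*sqrt(m) to m = ((ln(pi) - b0) / b1)^2,
    rounded up to the next whole participant per arm."""
    return ceil(((log(pi) - b0) / b1) ** 2)

# 70% chance of landing within +/-10% of the true n, at coverage 0.8:
print(pilot_size_per_arm(0.7, -2.745, 0.297))  # 65
```

This reproduces the worked example of m = 65 per arm.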

Discussion and Conclusion
Pilot studies are conducted for a variety of reasons. One is to estimate variation in outcome measures so that the required sample size for a large-scale substantive or definitive follow-on study can be planned. The preceding sections consider the situation where the MCID can be pre-specified for a scale outcome variable and an assumption of normality is reasonable.
Sample size may be calculated if parameters are either known or can be reasonably estimated. For instance, in a two-arm study with MCID = 0.2, variance = 1, $\alpha$ = 0.05 and $\beta$ = 0.10, the required sample size may be verified to be n = 526 per arm (complete data after any missing data). In practice the variation of the outcome measure may not be known, but it may be estimated by collecting pilot data. In this regard, Browne's method may be used to estimate a required sample size with either 80% or 90% coverage, i.e. the estimated sample size has an 80% or 90% chance of exceeding the required minimum sample size. A problem with this approach is the chance of underestimating the required sample size, or of producing an estimated sample size which far exceeds the required one (see Obodo et al., 2021). We considered a strategy to curb these excesses so that estimated sample sizes would be "not too small" and "not too large" in comparison to the true but unknown required sample size, by considering a just-about-right (JAR) sample size. The chosen coverage (say 80% or 90%) does not depend on pilot sample size. However, for a given level of coverage a researcher may wish to pre-specify the probability that the estimate falls within a stated interval around the true required sample size, e.g. a 70% chance of being within 10% of the required sample size. By inspection of Table 4, if 80% coverage is required with a 70% chance of being within ±10% of the true sample size, then the pilot study would require at least m = 65 per arm. The protocol may then contain a summary: "The proposed two-group pilot study will have a sample size of 65 per arm.
This sample size is chosen so that the resultant power calculation for the sample size of a larger study will have an 80% chance of exceeding the minimum required sample size, which in a two-sided test with significance level $\alpha$ gives $100(1-\beta)$% power for detecting an effect assuming an MCID of $(\mu_1 - \mu_2)$. This proposed pilot sample size of 65 per arm will ensure that the estimated sample size has a 70% chance of being within ±10% of the true required sample size, providing a safeguard against under- and over-powering." The pilot sample sizes given in this article (Table 4) are predicated on an MCID. If the true effect size exceeds the MCID, then the follow-on study is likely to be overpowered to detect a difference (the lesser of the two possible errors). If the true effect is smaller than the MCID, then it is not of clinical interest and may go undetected.
The pilot sample sizes given in this article are based on assumptions of normality and equal variance. The practical utility of the recommendations therefore needs further investigation under variance heterogeneity and non-normal distributions, including binary outcomes. Similarly, further simulations might consider the two-arm pre-post RCT design with repeated-measures ANCOVA as the analysis strategy. As such, the given pilot sample sizes are restricted to the stated assumptions, with a direct parametric comparison between the two groups.