Bayesian Estimation With Flexible Prior for the Covariance Structure of Linear Mixed E ff ects Models

Linear mixed effects models arise quite naturally in a number of settings. Two of the more prominent uses are in experimental designs and multilevel models. Furthermore, Bayesian analysis has also been utilized with respect to such models. Here we will consider such an approach with emphasis placed on estimation of the covariance matrix for the random effects. With respect to the covariance structure, however, we depart from the traditional Bayesian prior usage of the Inverse Wishart distribution. The rationale for such a departure is that this distribution is somewhat constraining. Instead, we employ a multivariate Normal approximation procedure for the likelihood of the matrix logarithm of the random effects covariance matrix. Such an approximation allows us to use a multivariate Normal prior for the logarithm of the random effects covariance matrix and still maintain the tractability of conjugacy, at least in an approximate sense. All posterior moments are calculated via Markov Chain Monte Carlo (MCMC) techniques. The Metropolis–Hastings accept reject algorithm is utilized to appropriately account for the approximation procedures. As a particular application we consider a multilevel model where student grade point average relate to a number of standardized test scores.


Introduction
The goal here in this research is estimation of the covariance matrix of a multivariate response variable.This allows practitioners to appropriately model correlation structures.It is well known that within Bayesian statistics that the Inverse Wishart is a conjugate prior for the classic multivariate Normal model (Evans, 1965;Dickey et al., 1985;Chen, 1979).The Inverse Wishart prior is completely characterized by a scalar degree of freedom hyperparameter and a location matrix hyperparameter, that is equal in dimension to the underlying covariance matrix.As the name implies, the prior location matrix contains information concerning the location of each element of the original covariance matrix.The degree of freedom hyperparameter possesses all of the information surrounding the strength of prior beliefs in the location values.This choice of prior distribution is mainly driven by ease of analytical exposition.However, the Inverse Wishart is restrictive in two key ways.First, the scalar degree of freedom hyperparameter contains all the information surrounding strength of prior beliefs.This single value does not allow for heterogeneity across the elements of the prior location matrix.If a diffuse prior is desired for certain elements of the location matrix hyperparameter, while a more a concentrated prior is desired for other elements, then the Inverse Wishart cannot accommodate this.The second is that since the prior location matrix only models locational information, we cannot model any correlation structure in the underlying covariance.Please note some noninformative priors for a covariance matrix, such as the Jeffreys prior, developed by Jeffreys (1961), Geisser (1965) and Villegas (1969) and the reference prior, proposed by Yang and Berger (1994).Shrinkage priors based on scale mixtures of uniform distributions were proposed by Wang and Pillai (2013).Please also note, Zeithammer and Lenk (2006) discussed a related work in a special case when dimensions are absent.
In their original work Leonard and Hsu (1992), illustrated another method of Bayesian analysis for a covariance matrix that solves both of the problems delineated above.Berger (1985, p. 400) outlines that if we consider the logarithm of the variance of a Normal distribution, then a Normal can be utilized as a prior.As a multivariate extension to this technique, Leonard and Hsu use an eigenvalue decomposition to take the logarithm of a covariance matrix.This matrix logarithm transformation is then coupled with a result from Bellman (1970, p. 171).In particular, Leonard and Hsu show that the exponential terms can be written as a Volterra integral.Bellman's iterative solution to the Volterra integral can then be utilized to approximate the likelihood.Quite conveniently, it turns out that the resulting approximate likelihood is in the form of a multivariate Normal.Thus, the Normal will serve as a conjugate.The approximate posterior distribution will also be of this same form.It should be obvious that a multivariate Normal distribution overcomes the main shortcomings outlined above of the Inverse Wishart prior.Varying degrees of confidence can now be expressed.Additionally, interdependency amongst the covariance parameters can also be model a priori.
Throughout we will make use of such a novel Bayesian procedure for estimation purposes of the covariance matrix of a linear mixed effects model.The general outline of the article is to begin by introducing the linear mixed effects model and performing the requisite Bayesian analysis.We then will outline a Markov Chain Monte Carlo method with a Metropolis-Hastings algorithm.We will consider one particular application to the High School and Beyond survey, which was administered by the National Center for Education Statistics (NCES, 1980).Since the study is representative of a multilevel model a mixed effects model is appropriate.Standardized test scores will serve as explanatory variables and the response variable will be student grade point average.We will also produce the best linear unbiased predictor (BLUP) for the grade point average of a hypothetical student.

Linear Mixed Effects Model
We begin our analysis by considering the linear mixed effects model.Consider the i-jth observation which is given by y × 1] fixed effect vector of explanatory variables and z i j = z i j0 , . . ., z i jp T is a (p + 1) × 1 random effect vector of explanatory variables.
T is a (p + 1) × 1 random effects vector for the ith group.
Assume that each group has n i observations for i = 1, . . ., N. Thus, we are able to express the model for the ith group in the following matrix vector form.
This can be succinctly written in matrix vector notation as We are then able to stack all the observations in the following fashion where n = N i=1 n i .The entire model can thus be succinctly written as y = Xβ + Zb + .

Likelihood Function
The first distributional assumption we make is with respect to the random error vector .In particular, the distribution of the random error vector conditional on the parameter φ is multivariate Normal such that | φ ∼ N n (0, φI n ).Based upon this distributional assumption we can specify the functional form of the likelihood function.
The likelihood function as presented in Equation (1) will be used extensively throughout the Bayesian analysis in the subsequent sections.

Conditional Distribution for Fixed Effects
Since our main concern lies with the random effects covariance matrix we make the simplifying assumption of vague prior information for the fixed effects.In particular, we assume π (β) ∝ 1.If there is greater interest in modeling the fixed effects one can use a multivariate Normal as a prior for β as in Searle et al. (1992, p. 319).This will only slightly increase the complexity of the analysis.Now let u = y − Zb.Then the full conditional posterior for β is given by This last result is easily recognizable as the well known multivariate Normal probability density function.
From a computational standpoint the fact that the posterior distribution for the fixed effects is of a multivariate Normal form greatly facilitates the numerical procedures.
Incidentally, had we assumed a Normal prior for β we would have maintained a multivariate Normal posterior distribution due to the multivariate Normal distribution's conjugacy.However, the mean vector and covariance matrix would be slightly modified.The main drawback would be the need to further model the additional hyperparameters associated with the multivariate Normal prior distribution for β.

Conditional Posterior Distribution for Random Effects
It is fairly common practice to model the random effects with a multivariate Normal distribution (Pinheiro & Bates, 2000, p. 58).We do the same here, however, with a slight modification.In experimental design problems a random effect intercept term is usually not included.However, in mixed effects multilevel models it is not uncommon to include a random effect intercept term.Since our particular application will be with respect to a multilevel model we will in fact include a random effect intercept term.The inclusion of such a random effect intercept term accounts for the group level variability.Additionally, the Normal prior for the random effects covariance matrix requires that an exchangeability condition be met.For this reason we will model the random effect intercept terms separately from the random effect slope terms.However, we stress that the Bayesian analysis and subsequent computational procedures are fully generalizable to include models both with and without random effect intercept terms.Here we will assume that b 1 , . . ., b N | Σ * iid ∼ N (p+1) (0, Σ * ) where Σ * is a (p + 1) × (p + 1) block diagonal matrix given by Here the scalar parameter τ is the variance of the random effect intercept terms and Σ is the (p × p) covariance for the random effect slope terms.
Note that if we let w i = y i − X i β then the likelihood for the ith group is Then the posterior distribution for the random effects of the ith group is given by the following.
In this last step here we have made use of a multivariate technique for completing the square (Leonard & Hsu, 1999, p. 245).We recognize the following posterior distribution for the random effects

Conditional Posterior Distribution for φ
For the variance parameter of the error terms we utilize a diffuse prior distribution, that is π (φ) ∝ φ −1 .We can then derive the following posterior for the variance of the random error.
From this last result we recognize the Inverse Gamma probability density function.Therefore, we have following posterior distribution for the random error variance parameter.
Again, we have a highly tractable posterior distribution in the Inverse Gamma, which will facilitate the computational procedures.

Conditional Posterior Distribution for τ
Denote the set of random effect intercept terms as b 0 = [b 10 , . . ., b N0 ] T , that is b 0 is the (N × 1) vector of all the random effect intercept terms for all N groups.Then we have b 0 | τ ∼ N N (0, τI N ) where I N is an (N × N) identity matrix.We assume a vague prior specification for τ, that is π (τ) ∝ τ −1 .Combining this prior for τ with the distribution of the random effect intercept terms we have the following posterior Therefore, we recognize the following posterior distribution for the variance of the intercept random effect regression coefficients.
In the subsequent subsections we derive a novel Bayesian estimation technique for the covariance matrix of the random effect slope terms.

Conditional Posterior Distribution for
Here we make use of an approximation technique first detailed by Leonard and Hsu (1992).A more detailed exposition can be found in Hsu et al. (2012).Here we merely outline the procedure.Begin by defining b , for all i = 1, . . ., N, as the (p × 1) vector of just the random effect slope terms for the ith group.
That is, b i is the vector of all the random effect terms for the ith group except for the random effect intercept term b i0 .Furthermore, define the (p × p) sample covariance matrix of the random effect slope terms as Based on the eigen-decomposition we can define the matrix logarithm transformation where the (p × p) matrix of orthonormal eigenvectors is given by E and the diagonal D matrix contains the normalized eigenvalues for Σ.Analogously, we define E 0 and D 0 for S.
Noting that |Σ| = exp {tr [A]} we can then write, Now, we define the following function to obtain the upper triangular elements of a matrix in a very specific order, starting with the main diagonal and progressively moving upward.If a i j is the (i, j)th element of A, then, where q = 1 2 p (p + 1).Analogously, λ = Vec * (Λ).Using this definition and a result for a Volterra integral from (Bellman, 1970, p. 175), Leonard and Hsu (1992) show how the following approximation can be made to Equation (7), where b = b 1 , . . ., b N T .The (q × q) symmetric almost surely positive definite matrix Q is the so called information matrix of α and is a function of the normalized eigenvalues and normalized eigenvectors of S. In particular, where and d j is the jth normalized eigenvalue of S for j = 1, . . ., p. f i j = e i * e j is a (q × 1) vector where α T e i * e j = e i T Ae j .
The approximate probability density function for α, given in ( 8) is Normal.Specifically, when viewed as a function of α the approximate probability density function ( 8) is a q dimensional Normal with mean λ and covariance Q −1 .Equation (8) will be used for the Bayesian analysis for α.Refer to Leonard and Hsu (1992) for the details on the approximation.

Hierarchical Prior Specification for α
As stated above, the approximate likelihood for α is Normal.Therefore, we can use a multivariate Normal prior along with the approximate likelihood to obtain a multivariate Normal approximate posterior.Formally, we assume α η, Υ follows a q dimensional multivariate normal distribution with a (q × 1) prior mean location hyperparameter vector η and a (q × q) covariance hyperparameter matrix Υ.
In terms of capturing a priori information, the multivariate Normal prior provides an extremely rich class of prior specifications.Here we opt to only utilize a portion of the flexibility afforded us by the multivariate Normal.Specifically, we consider η = η (μ) and Υ = Υ (σ), where μ and σ are of smaller dimension than η and Υ, respectively.In certain situations, a priori we may want to only model a subset of the parameters.In particular, suppose we consider the variance terms separate from the covariance terms.
To this end, we utilize the intra-class matrix model.Specifically, we model the first p elements of α versus from the remaining (q − p) terms.Interested readers are referred to Hsu et al. (2012) for more on this specification.
Formally, a priori α | μ, Δ ∼ N q (Jμ, Δ).Then, where the (2 × 1) vector μ = μ 1 , μ 2 T and To be clear, the first p elements of the first column of J are all equal to one and the remaining (q − p) terms are equal to zero.Analogously, the first p elements of the second column of J are all zero and the remaking (q − p) elements are all one.Thus, μ 1 and σ 2 1 are the location and variance hyperparameters, respectively, for the variance components of α, and μ 2 and σ 2 2 are the location and variance hyperparameters for the covariance components of α.By choosing to model the random effect intercept terms apart from the slope terms we satisfy an exchangeability condition for b 1 , . . ., b N .

Approximate Posterior Distribution for α
For the hyperparameters μ = μ 1 , μ 2 T and Δ = h σ 2 1 , σ 2 2 we will assume the following Inverse Gamma prior distribution where δ 1 , δ 2 , γ 1 and γ 2 are fixed constants to be chosen by the practitioner in order to implement a sensitivity analysis.This topic will be discussed in greater detail in the computational section below.Note that a vague prior specification for μ and Δ will be investigated and can be obtained by letting The joint prior distribution for α, μ and Δ is given by the product of ( 9) and (10) the prior distribution for μ and Δ.In order to facilitate the Markov Chain Monte Carlo procedure we consider the joint prior distribution for just α and Δ.After integrating over the joint prior distribution for α, μ and Δ with respect to μ we have where and I q is a (q × q) identity matrix.We make heavy use of (11) throughout §2.6.4 and §2.7.
The approximate posterior for α and Δ will be proportional to the product of ( 8) and (11).
We can then apply the matrix version of completing the square to (12) with respect to the terms in the exponent involving α.After completing the square, we ignore all terms that do not involve α.
where the (q × 1) vector This demonstrates the conjugacy of utilizing the approximate likelihood function (8), and numerical methods such as MCMC can readily be implemented by making use of ( 14).In short, we have developed a highly flexible while at the same time tractable Bayesian methodology for the covariance structure.

Exact Posterior Distribution for α
In order to carry out the Metropolis-Hastings step we require the exact posterior for α, which will be the product of ( 7) and ( 11).Here we will use α = Vec * (A) interchangeably.

Posterior Distribution for Δ
Conditional on α, the posterior for Δ = h σ 2 1 , σ 2 2 will be proportional to (11), q−p q i=p+1 α i are the means of the variance and covariance terms of α, respectively.It can be observed that the posteriors for σ 2 1 and σ 2 2 , conditional on α, are Inverse Gamma and independent of one another.
We should note that with respect to the hyperparameters δ 1 , δ 2 , γ 1 and γ 2 we will not conduct the usual Bayesian analysis in terms of prior distribution specification and subsequent derivation of the associated posterior distributions.Rather, the purpose of these hyperparameters will be to serve in the sensitivity analysis.Specifically, we wish to investigate how sensitive the posterior distribution and subsequent Bayesian estimates of α will be for various choices of δ 1 , δ 2 , γ 1 and γ 2 .That is, δ 1 , δ 2 , γ 1 and γ 2 will be predetermined to be equal to specific numerical values in order to reflect the strength of believe in the prior specification for α This completes the analytical portion of the both the Bayesian and hierarchical Bayesian derivations.We now turn to the computational and numerical analysis related to the estimation procedures.

Markov Chain Monte Carlo Procedures
Equations ( 2), (4), ( 5), ( 6), ( 14), ( 15), ( 16) and ( 17) provide the necessary distributions for a Markov Chain Monte Carlo procedure with a Metropolis-Hastings step.The goal is to produce posterior means for all relevant model parameters.We now outline how to implement such a procedure.
Initial starting values for β and b can be obtained by a number of methods.Maximum likelihood estimation or restricted maximum likelihood estimation are two widely used estimation techniques (Pinheiro & Bates, 2000, p. 75).
Alternatively, we obtained initial starting values by iteratively applying ordinary least square estimation.In the first stage, we applied ordinary least square estimation on the entire data set using the original response variables and the fixed effects explanatory variables to obtain initial starting values for β (please refer to Table 1).In the second stage, we again applied ordinary least square estimation, however now in this case on the group level data, using the residuals from the first stage as the response variables and the random effect explanatory variables.In this fashion we obtained initial starting values for each of the b i for all i = 1, . . ., N. One of the chief advantages of Bayesian estimation via Markov Chain Monte Carlo techniques is the robustness with respect to initial starting values.Coupling this feature with the analytical and computational ease of ordinary least square estimation offers a secondary advantage.
This last procedure in step ( 6) is of course the well known Metropolis-Hastings accept reject algorithm.We employ this procedure since we are utilizing an approximation to the exact posterior distribution (Gelman et al., 2005, p. 325).Having laid out the groundwork for the Markov Chain Monte Carlo we now move on to how the posterior moments were produced.

Posterior Means and Standard Errors
Bayesian estimates of the posterior means and standard errors can be calculated quite easily based upon the Markov Chain Monte Carlo results.We define l to be the number of iterations used as a burn in.Furthermore, we define there to be a total of T iterations of the Markov Chain Monte Carlo algorithm and simply for notational simplicity let M = T − l.Then for a general parameter vector θ we have the following posterior mean and variance.
Standard errors will simply be calculated as the square root of V.In the case of any matrix estimates it is understood that the vectorized matrix will be estimated.In practice, we found that a total iteration size of T = 200000 was quite sufficient to achieve stability and we used l = 2000.

Sensitivity Analysis
According to the intra-class prior specification all the variance terms of α have the same mean and variance a priori.Similarly, all the covariance terms of α have the same mean and variance a priori.Depending upon the strength of the information contained in the data the posterior distribution for α will reflect this fact more or less to some degree.In practice, we found that with a vague prior for both σ 2 1 and σ 2 2 the Bayesian estimates for these hyperparameters converged to zero.In turn, this led to all the the variance terms of α being nearly identical as well as all the covariance terms also being nearly numerically equivalent.That is, the posterior covariance matrix for the slope random effect regression coefficients assumed the classic intra-class form from Dickey et al. (1985).
In order to better investigate and understand this phenomenon we implemented a sensitivity analysis wherein we chose various values for δ 1 , δ 2 , γ 1 and γ 2 .We did so in manner that is consistent with attempting to counteract the degree of strength of the information present in the data.Specifically, we always chose a common value for the four hyperparameters.In fact, to simplify the notation of this section we shall henceforth let δ 1 = δ 2 = γ 1 = γ 2 = δ, a common value.By choosing values for δ in this fashion we maintain the posterior mean for σ 2 1 and σ 2 2 close to unity.This procedure will also lead to a richer class of posterior means for α.
We considered two different cases δ = 0 and ρ = 0, which is the limiting case representative of vague prior information with respect to σ 2 1 and σ 2 2 .We now move on to address some more of the specifics of the computational procedures outlined above.

Other Computational Concerns
As mentioned above in §2.8 starting values for β and b can be obtained through a number of methods.In our particular case we opted to employ an iterative ordinary least square procedure.This produced very reasonable initial starting values for β.In fact, the ordinary least square estimates for the fixed effect regression coefficients were not too different from the estimates produced using more sophisticated techniques such as restricted maximum likelihood.
Despite the robustness of Markov Chain Monte Carlo techniques there are cases when certain parameter simulations exist in some sense in the tails of the respective probability density functions.This was the case with the α parameter vector.Specifically, the Metropolis-Hastings accept reject algorithm rejected all initial candidate values of α.In order to combat this we employed a training set methodology.That is, the first handful of the α candi-dates were accepted with certitude.Essentially, the Metropolis-Hastings accept reject algorithm was overridden.In practice, we found that accepting the first fifty α candidates with certitude was more than adequate to allow the Markov Chain to relocate to a space with higher probability of occurrence.
Along these same lines we also found that under the case when δ = 0 the Markov Chain with respect to α also had a tendency to migrate to a state space of very low probability and was unable to recover.To alleviate this we employed a very similar procedure as outlined above for the initial first fifty iterations.That is, when the value of ρ from the Metropolis-Hastings accept reject algorithm fell below the threshold value of 10 −15 we employed a so called subroutine.Within the subroutine the exact same Markov Chain Monte Carlo procedure as outlined above was invoked, however, the Metropolis-Hastings accept reject algorithm was overridden and in addition none of the simulated values for any parameters were included in the estimates.Upon completion of the subroutine the program returned to the main Markov Chain Monte Carlo procedure with the last simulated values of all parameters being passed forward as initial starting values.We found that this proved to be quite an effective technique for allowing the Markov Chain to migrate away from areas of the probability density function for α of very low probability.

High School and Beyond Survey
In 1980 the National Education Longitudinal Studies program of the National Center for Education Statistics administered the High School and Beyond (HSB) survey NCES (1980).The HSB study contains both a 1980 senior class cohort and a 1980 sophomore class cohort.Within the senior class cohort we focused on schools with six or more students of which there were a total of 679.The maximum number of students any one school had was twenty eight.The total number of seniors at all 679 schools investigated was 6671.
The HSB study contains a myriad of data and variables.In particular, for the senior class cohort two exams each in the areas of vocabulary and mathematics and one exam in the area reading were administered.The test scores were standardized to have a mean of fifty and a standard deviation equal to ten.We will use these standardized test scores for the senior class cohort as our explanatory variables.Additionally, grade point average data was obtained from official transcripts that were included in the survey.The grade point average data will serve as our response variable.Below is the summary Table 1 of the initial ordinary least square fitted regression.The adjusted R 2 value was a modest 0.187, however, we feel that in this context this value is reasonable.Of course working with a complete data set is preferable to simply omitting any students who have one or more missing variables.To this end we employed a data augmentation technique.The specific procedure was invoked using the statistical and mathematical software program R using the "mice " library (Buurin & Oudshoorn, 2000).This particular data augmentation procedure fits nicely within the context of our research here since it employs a multivariate imputation by chained equations technique.That is, it uses a multivariate Gibbs sampler procedure to augment the missing data set.In particular, incomplete columns of data, that is variables with missing data, are augmented by generating appropriate values of data given the values of the other columns of variables.In this way, the other columns act as predictors.
Tables 2-7 present Bayesian estimates and the associated standard errors for all relevant parameters of interest.As discussed above a sensitivity analysis was implementing for investigating the impact of the prior specification for σ 2 1 and σ 2 2 .The sensitivity analysis varies from the extreme limiting case when δ = 0 which is representative of vague prior information for σ 2 1 and σ 2 2 up to when δ = 50 representing greater prior information strength.Recall that δ = δ 1 = δ 2 = γ 1 = γ 2 is the common hyperparameter for both σ 2 1 and σ 2 2 .Notice from Table 3 that in the limiting case when δ = 0 the matrix A assumes the intra-class form as described above and in Dickey et al. (1985).Furthermore, for nonzero values of δ the Bayesian estimates of σ 2 1 and σ 2 2 are roughly centered around one.This is due to the mean of the Inverse Gamma posterior distributions for σ 2 1 and σ 2 2 which is on the order of the ratio of the scale parameter divided by the shape parameter.Moreover, as δ increases in magnitude we observe the departure away from the intra-class form for A as we should expect to occur.
It is common practice to produce a best linear unbiased predictor (BLUP) when working with mixed effects models (Pinheiro & Bates, 2000, p. 94).From a classical frequentist perspective such a quantity can be computed quite easily and is given by x T i j β + z T i j b i .However, under our Bayesian framework where posterior means are calculated via Markov Chain Monte Carlo procedures we require a slightly different computation.Specifically, for the posterior best linear unbiased predictor we calculate where as before T is the total number of iterations of the Markov Chain Monte Carlo algorithm, l is the number of burn in trials and for notational simplicity M = T − l.Furthermore, the posterior variance of the best unbiased linear predictor can be computed as As a particular illustration of such a calculation we consider a student at a given school whose explanatory standardized exam scores were equal to the mean test scores.We calculated a posterior best linear unbiased predicted grade point average value of 2.4312 with a standard error equal to 0.0697393.

Conclusions
We investigated a linear mixed effects model.In particular, we modeled the covariance matrix of the random effect slope coefficients via the flexible prior paradigm that has been developed in Leonard and Hsu (1992), and further discussed in Hsu et al. (2012).Additionally, we implemented a sensitivity analysis in order to examine the behavior of the hyperparameters as they relate to the matrix intra-class prior for the covariance structure.All posterior means were calculated via a Markov Chain Monte Carlo procedure.As we have done throughout the text we utilized a Metropolis-Hastings step.
Our application here was made with respect to the High School and Beyond survey.Students were nested in a multilevel model via the various schools included in the survey.Due the multilevel nature of the data set a mixed effects model is an appropriate approach.Specifically, we considered grade point average as a response variable and a number of standardized test scores as the explanatory variables.We concluded by calculating the best linear unbiased predictor for a student at a given school with average exam scores.

Table 1 .
GPA regressed on standardized test scores Below we outline the specific Markov Chain Monte Carlo procedure invoked in our estimation algorithm.As is the case with Markov Chain Monte Carlo techniques the ordering of draws is somewhat arbitrary, however, it does seem logical to simulate α last since the Metropolis-Hastings accept reject algorithm is employed at this step.Initial starting values can be obtained in a number ways as mentioned above.

Table 2 .
Posterior mean and standard deviation of β when δ = 0As can be expected with educational data there was some moderate degree of missing data.In particular, out of a total sample size of 6671 students 1661 did have a recorded grade point average, 841 did not have a recorded VOCAB1 test score, 853 did not have a recorded VOCAB2 test score, 845 did not have a recorded READ test score, 867 did not have a recorded MATH1 test score and 963 did not have a MATH2 test score.