Estimating Statistical Measures of Pleiotropic and Epistatic E ff ects in the Genomic Era

Recent developments in the technology for sequencing the genomes of various species has had a profound effect of the working paradigms of various fields of genetics. Included among these fields is the classical field of quantitative genetics, which is a subfield of statistical genetics, that is devoted to traits that can be quantified on some continuous scale and are often influenced by alleles at many loci. In recent years, many investigators have conducted genome wide sweeps and have used a variety of statistical criteria to judge whether identified regions of the human genome have a significant influence on the expression of some quantitative trait such as measurements on patients with Alzheimer’s disease. From the point of view of quantitative genetics, the regions of a genome that have some influence on a quantitative trait may be viewed as loci, and variations among these loci at the DNA level, such as nucleotide substitutions or other markers, may be used as working definitions of alleles, and, therefore, can be used to determine whether an individual carries a particular allele at some locus. Given such data, an investigator can identify the genotype of each individual in a study, with respect to the loci under consideration as well as the two alleles present at each locus in a diploid species such as man. This ability to use these working definitions to identify the genotype of each individual in a sample results in a significant change in the working paradigm of sub-field of quantitative genetics, called variance and covariance analysis, because effects and components of variance and covariance may be estimated directly in a sense that will be described in detail in the paper.


Introduction
Following the sequencing of the human genome in 2000 as well as that of other species, sequencing technology has advanced to the point at which it is becoming financially feasible, see Church (2006), to sequence the genomes of individuals in samples under study by an investigator or teams of investigators.This technological advance has led investigators to conduct genome wide sweeps to search for regions of the human genome as well as those of other species such that there is evidence to suggest that these regions are implicated in the expression of complex quantitative traits.In an interesting paper by Stranger et al. (2010), the impact of genome wide association studies on the genetics of complex traits is discussed in depth.Among these complex traits are Alzheimer's disease (AD) and immune-mediated diseases such as rheumatoid arthritis.For the case of AD, in a recent paper Raj et al. (2012) have reported that 11 regions of the human genome are involved in susceptibility to this disease, and, moreover, there is evidence that four of these regions form a protein-protein interaction network that is under natural selection.Similarly, in a paper by Rossin et al. (2011), it has been found that proteins encoded in genomic regions associated with immune-mediated disease physically interact and this interaction may also suggest some biological mechanisms underlying such diseases.In some samples of patients with AD, the genome of each individual in the sample has been sequenced so that, in principle, the genotype of each individual may be determined with respect to 11 loci, regions of the human genome, or in combinations of these loci, whenever there are working definitions of at least two alleles at each locus.This ability to identify the genotype of each individual in a sample with respect to some quantitative trait provides a concrete basis for extending some of the techniques of classical quantitative genetics into the era of sequenced genomes.In classical quantitative genetics, however, the loci under consideration as well as the alleles at each locus were unknown to an investigator so that these notions were treated abstractly and the parameters of a model could only be estimated indirectly.Briefly, a variety of analysis of variances and covariance procedures were used to estimate parameters of interest indirectly.A concrete example of such procedures may be found in the material accompanying Table 1 in Mode and Robinson (1959).But, as will be shown in subsequent sections, whenever the genotype of each individual in a sample may be identified with respect to some loci with at least two alleles at each locus, the parameters of the model be estimated in a straight forward manner based on elementary methods of statistical estimation.
As is recognized by many who at sometime during their careers have worked in the field of quantitative genetics, the subject known as components of variance analysis began with the publication of a paper on correlations among relatives on the supposition of Mendelian inheritance by Fisher (1918).In this paper, Fisher attempted to reconcile existing biometrical theories with Mendelian genetics that led to describing genetic variation in terms of components of variance.Evidently at that time, the use of the word "variance" was relatively new, because in the beginning of the paper Fisher emphasized the word by representing it in upper case letters.Subsequently, the ideas of Fisher were extended to include epistasis and other interactions among alleles in the seminal papers of Cockerham (1954) and Kempthorne (1954).The techniques introduced in these papers have been applied in the current genomic era.Examples of the ideas introduced by Cockerham have used in the paper Kao and Zeng (2002), and those of Kempthorne have been used and extended in the paper Mao et al. (2006).The ideas of Kempthorne were also used and extended in the paper of Mode and Robinson (1959) as well as in unpublished lecture notes by the author written and presented during the period 1960 to 1966.Furthermore, the roots of the ideas presented in this paper are extensions of the some of the unpublished material in the lecture notes complied by the author during the period 1960 to 1966.Many of the themes of statistical genetics as they existed during the 1950s were summarized and extended in Kempthorne's well known book (Kempthorne, 1957).
During the years following Fisher's seminal work and those cited above by Cockerham and Kempthorne, an extensive literature on quantitative genetics has evolved.It is beyond the scope of this paper to review this literature and in what follows a few books on the subject will be cited.A book that has been very popular with quantitative geneticists is that of Falconer and MacKay (1996) as well as earlier editions.Another book of interest on quantitative genetics is that of Bulmer (1980).Both of these books contain extensive lists of references on quantitative genetics.A more recent book on genetics and analysis of quantitative traits is that of Lynch and Walsh (1998).This is an influential tome, consists of over 900 pages and contains what seems to be the most extensive treatment of the subject of quantitative genetics published in the 20th century.The principal focus of this book is a biological and evolutionary point of view along with an extensive use of applied statistical methods.There is also an extensive list of papers on quantitative genetics that a reader, who is interested in quantitative genetics, may wish to peruse.The book by Liu (1998) on statistical genetics focuses on statistical genetics along with linkage, mapping and quantitative trait linkage (QT L) analysis.Two recent books on statistical genetics are those of Laird and Lange (2011) and Wu, Ma and Casella (2010).It is suggested that if a reader is interested in an overview of the material that is being taught in courses in quantitative and statistical genetics or simply an introduction to these subjects, that the books cited above be consulted.

Pleiotropism and the Phenotypic, Genetic and Environmental Covariance Matrices for the Case of One Autosomal Locus
Pleiotropism is a term used by geneticists to describe cases in which several traits, discrete or quantitative, seem to be governed by alleles at a single autosomal locus or a locus on chromosomes that govern the sex of an individual.Because quantitative traits are important in agriculture and medicine, the focus of attention in this paper will be quantitative traits, whose genetics seems to be governed by alleles at a single autosomal locus.Let A denote the set of alleles at some autosomal locus and let the symbols x and y denote alleles in the set A. In what follows, the number of alleles in this set will be assumed to be finite.The genotype of an individual will be denoted by the symbol (x, y), where x and y are alleles in A contributed by the maternal and paternal parents, respectively.Let G denote the set of genotypes under consideration so that for every genotype (x, y), it follows that (x, y) ∈ G.To take into account pleiotropic effects for k ≥ 2 quantitative traits, let the W 1 , W 2 , ... W k be k random variables characterizing the phenotypic variation in these k traits among individuals in a population or sample under consideration.It will be assumed that these random variables take values in the sets R 1 , R 2 , ... R k of quantitative measurements expressed in terms of real numbers.As a first step in developing a succinct notation that will be used extensively in what follows, let W denote a k × 1 column vector whose components are the random variables W 1 , W 2 , ... W k .In what follows, the symbol R k will denote the set k ≥ 2 dimensionally set real numbers, which is the set of all possible values that may be realized by the vector W, and let the symbol T denote the transpose of a vector or matrix.
As the technology for sequencing genomes of individuals continues to develop, it will become financially feasible to sequence the genomes of all individuals in a sample.Furthermore, in some cases, it has been possible to find evidence that some region of a genome has been implicated in the phenotypic expression of the k ≥ 2 quantitative traits under consideration.Moreover, it may also be possible to differentiate the alleles in this genomic region so that the genotype (x, y) of each individual in a population or sample may be identified in terms of the bases or other characteristics of the DNA in a genomic region by identifying each of the alleles x and y.It will at the discretion of an investigator as to whether the maternal and paternal alleles are identified if such information is indeed available.Actually in what follows all that is necessary is that two alleles carried by an individual at a particular locus be identifiable.In any event, whenever it is possible to identify the genotype of each individual in a population or sample, it will be feasible to develop methods of statistical estimation that can take advantage of this information and provide more direct methods for estimating and drawing inferences about the parameters in a quantitative genetic model accommodating pleiotropic effects.
From the statistical point of view, the model under consideration will be a mixture of a discrete and a continuous multivariate distributions.Let denote the genotypic distribution of the population under consideration, where p (x, y) ≥ 0 for all (x, y) ∈ G and By way of interpreting the distribution D Geno , let the random pair (X, Y) denote a genotype of an individual chosen at random from a population.Then, It should be mentioned that if a sample of individuals resulted from matings of parents whose genotypes were known, then, for the case of one locus, the theoretical genotypic distribution could be predicted by well known Mendelian theory.But, in general, in most samples of sequenced individuals, the genotypes of their parents are not known so that there would not be a theoretical basis for predicting the form of the genotypic distribution.
Given that (X, Y) = (x, y), let f (w | (x, y)) denote the conditional density of the random vector W. It will be assumed that this distribution has the expectation vector with finite elements for all genotypes (x, y) ∈ G. From these definitions, it follows that the unconditional k × 1 expectation vector of the population is and the k × k unconditional covariance matrix of the population is By definition, the marginal or gene frequency of the maternal allele x in the population is for all x ∈ A. The marginal frequency p (y) of the paternal allele y is defined similarly.A population is said to be in Hardy-Weinberg equilibrium if p (x, y) = p (x) p (y) (2.8) for all genotypes (x, y) ∈ G.
In the formulation under consideration, vectors that are measures of first and second order effects of alleles x and y on the k quantitative traits under consideration will be defined in terms conditional expectations of the mean vector μ (x, y) for each genotypic (x, y) ∈ G with respect to the genotypic distribution.In general, one would not expect that a population would be in a Hardy-Weinberg equilibrium so that for any particular sample an investigator may wish to test the hypothesis that a population was in this type of equilibrium.To accommodate the case in which a population is not in a Hardy-Weinberg equilibrium, vectors of first order effects will be defined in terms of conditional distributions.By definition, for p (x) 0, the conditional distribution of the allele y ∈ A, given x, is (2.9) Thus, the conditional expectation of the conditional mean vector μ (x, y), given x, is (2.10) It is interesting to note that if the population is in a Hardy-Weinberg equilibrium, then (2.11) The conditional expectation of μ (x, y), given y such that p (y) 0, is defined similarly.
The first order effect of the allele x is defined by the vector equation for all x ∈ A. Similarly, the first order effect of allele y is defined by for all y ∈ A. The second order effect of alleles x and y due to their interacting is defined by the vector equation for all genotypes (x, y) ∈ G. Equivalently, for all genotypes (x, y) ∈ G.This equation reminds one of an analysis of variance model of a multivariate experimental design with two factors.If α (x, y) = 0, the zero vector, for all genotypes (x, y) ∈ G, then the alleles x and y act additively with respect to the k quantitative measurements.But, if α (x, y) 0 for some genotype (x, y), then there is interaction among the alleles.
In a similar fashion, with respect to the random vector W with k phenotypic measurements for genotypes (x, y), it can be seen that the vector equation is valid all genotypes (x, y) ∈ G.By definition, the total phenotypic covariance matrix with respect to k quantitative traits in the population is where the conditional expectation is taken with respect to the multivariate conditional density f (w | (x, y)) for each genotype (x, y).The covariance matrix (2.18) measuring the covariation of the mean vectors μ (x, y) for genotypes governing k ≥ 2 traits around the mean vector μ for the population, is called the genetic covariance matrix.Finally, the covariance matrix is called the environmental covariance matrix in the context of Equation (2.16).Observe that it is the same matrix as that defined in Equation (2.6) .
From Equation (2.16) it follows that (2.20) But, Therefore, Equation (2.20) reduces to (2.22) By multiplying this equation by p (x, y) and summing over all genotypes (x, y) ∈ G, it follows that It is interesting to note that this equation was derived by using properties of conditional expectations and is free as to any assumptions that may be made about the distributions of the terms of the right in Equation (2.16).For example, in classical quantitative genetics, for either one trait or many with pleiotropic effects, it was often assumed that the genetic effects (μ (x, y) − μ) and environmental effects (W − μ (x, y)) were distributed independently.But, as can be seen from the derivation of (2.23) just described, such assumptions are not necessary.It is also important to note that, even though equation was derived under the assumption that only one autosomal locus was under consideration, Equation (2.23) would also be valid under the assumption that two or more autosomal loci were under consideration, but the formal details of a proof this statement will not be given here.
As is well known, the principal diagonal elements of any covariance matrix are the variances of the elements of a random vector W under consideration.In particular, let var P [W ν ] denote the phenotypic variance for trait ν in the random the random vector W. Similarly, let var G [W ν ] and var E [W ν ] denote, respectively, the genetic and environmental variances of trait for every trait ν.A measure of heritability of a trait that has used extensively by many investigators is the ratio (2.25) for for every trait ν = 1, 2, • • •, k.In an actual application involving the analysis of data based in the structure presented in this section, an investigator may wish to estimate the fraction H ν for every trait ν = 1, 2, • • •, k.
Even though many investigators have used a ratio of form (2.25) to estimate heritability of a quantitative trait, the details in the calculations may vary among investigators.It is suggested that if a reader is interesting in pursuing these details, the books cited in the introduction be consulted.Three recent papers containing applications of the concept of heritability are those of Yang et al. (2010), Zaitlen et al. (2013) and Price et al. (2011).It should also be mentioned that, even though only one autosomal locus has been under consideration is this section, Equations (2.23) and (2.24) are also be valid if two or more autosomal loci were under consideration.Hence, the methods outlined in this section are also valid for any combination of two or more autosomal loci.

Partitioning the Genetic Covariance Matrix Into Component Covariance Matrices for the Case of One Autosomal Locus
From Equation (2.15), it can be seen from Equation (2.16) that the k × 1 vector expression μ (x, y) − μ may be expressed as the k × 1 vector equation for every genotype (x, y) ∈ G. From the definition of the vector α (x) in Equation ( 2.12), it can be seen that its expectation with respect to the genotypic distribution D Geno is a k × 1 zero vector.Similarly, it can be shown that where 0 is k × 1 vector of zeros.
By definition, the additive genetic covariance matrix is and the intra-allelic interaction covariance matrix is defined as Observe that the matrices in (3.4) and (3.5) are of order k × k.
If a population is not in a Hardy-Weinberg equilibrium, then it is necessary to define cross-covariance matrices.For example, the covariance matrix of the vectors α (x) and α (y) is defined by the matrix expression The covariance matrices of the pairs of vectors α (x) and α (x, y) as well as α (y) and α (x, y) are defined similarly.If the population is in a Hardy-Weinberg equilibrium, then it can be shown that all cross covariance matrices are zero matrices, and it follows that the genetic covariance matrix may be partitioned into and additive and intra-allelic interaction covariance matrices.Thus, when a population is in a Hardy-Weinberg equilibrium, the If, however, the population is not in a Hardy-Weinberg equilibrium, then the analogue of Equation (3.8) is more complicated.In this case, it will be helpful to express Equation (3.8) in a more general form.To that end, consider the 3k × 1 column vector denote a 3k × 3k matrix of k × k sub-matrices on the principal diagonal and k × k cross-products matrices off the principal diagonal.The expectation Ψ G of this matrix with respect to the genotypic distribution is defined by In general, when the population is not in a Hardy-Weinberg equilibrium, it will be helpful to represent the matrix Ψ G in the succinct partitioned form where every sub-matrix is of order k × k.In terms of this matrix, it follows from the vector in (3.9) that the additive covariance matrix cov and the intra-allelic interaction covariance matrix in (3, 9) is Given these definitions, the desired extension of Equation (3.9) to the case a population is not in a Hardy-Weinberg equilibrium has the form Let X denote any k × 1 random vector with a covariance matrix cov [X] = cov i j [X] such that all its elements are finite.
, where X i is the i-th component of the vector X.In terms of this notation, the ν-th component on the principal diagonal is matrix Equation (3.16) has the form as a measure the contribution of the additive genetic variance var A [W ν ] to the total genetic var G [W ν ].Moreover, it is interesting to note that the ratio and may be interpreted as a measure of the contribution of the intra-allelic interaction variance to the total genetic variance for trait ν.An investigator may also be interested in computing the ratio as a measure of departure from a Hardy-Weinberg equilibrium.Observe that if cov G [W ν ] is close to zero, then a Hardy-Weinberg disequilibrium would have a minimal effect on estimating r A (ν is relatively large, then the effect of a disequilibrium could be significant in estimating either of these ratios. Correlation matrices may also be of interest to some investigators.For example, for the case of the genetic covariance matrix cov G [W] in (3.16), an investigator may wish to look at the correlation matrix derived from this covariance matrix.Let the covariance of traits ν and ν such that ν ν .Then, the genetic correlation coefficient of traits ν and ν has the form Then, the k × k genetic correlation has the form Observe that all the components of the principal diagonal of this matrix are 1.If an investigator were also interested in the correlation matrices corresponding to the covariance matrices cov A [W] and cov IAI [W] in (3.16) , then the correlation matrices for these covariance matrices could also be computed.For all the matrices just mentioned, as well as those that may arise in subsequent sections of this paper, the value of correlation coefficients may be interpreted as measure of pleiotropic effects.

Estimation of Mean Genetic Vector and Covariance Matrices From Data for the Case of One Locus
It should be stated at the outset that one could assume that the multivariate k × 1 random vectors W (x, y) under consideration for each genotype were distributed independently with multivariate normal distributions with expectation vector μ (x, y) and covariance matrix Ψ (x, y) for all genotypes (x, y) ∈ G. But, even though such assumptions may be a valuable in coming to grips with some of the problems that will be encountered in subsequent sections, particularly those involving multiple comparisons, a decision was made to minimize the formalism presented in this paper with the hope that a less involved notation would attract more readers with interests in statistical genetics and related disciplines.Thus, for the most part, the methods of estimation to be described in this section are extension of the method of moments, which is among the most primitive methods used is statistical estimation, but is, nevertheless, still effective in many situations in which problems of estimation of parameters arise.If, however, a reader is interested in pursuing theoretical treatments of multivariate statistics, it is suggested that the books by Anderson (1984) and Muirhead (1982) be consulted.
For each genotype (x, y) ∈ G, suppose there are n (x, y) ≥ 2 individuals of genotype (x, y) that are measured with respect to the k quantitative traits under consideration, and let the random k × 1 vectors W ν (x, y) for ν = 1, 2, • • •, n (x, y) denote the observed data for the n (x, y) individuals of genotype (x, y).It will be assumed that these random vectors are distributed independently and that each vector has the same distribution as the phenotypic vector W (x, y) for individuals of genotype (x, y) defined in section 2.Then, the random vector is an estimator of the k × 1 expectation vector μ (x, y) defined in (2.3) for all genotypes (x, y) ∈ G. Given the estimator μ (x, y), it can be shown that the random matrix is an estimator of the covariance matrix Ψ (x, y) defined in (2.4) for all genotypes (x, y) ∈ G.It also is well known that both these estimators are unbiased in the sense that denote that total number of observed individuals.Then, is an estimator of the frequency of genotype (x, y) in the population.If it is assumed that the observations {n (x, y) | (x, y) ∈ G)} are a sample from a multinomial distribution with the genotypic distribution D Geno in (2.1) as its probabilities with sample size n, then for all genotypes (x, y) ∈ G so that p (x, y) is an unbiased estimator of p (x, y) under these assumptions.From (2.5) it follows that the random variable μ = (x,y) is an estimator of the unconditional expectation vector μ defined in (2.5) .Similarly, the random matrix is an estimator of the unconditional covariance matrix Ψ defined in (2.6).
Given the estimators described above, it follows that the covariance matrices on the right side of Equation ( 2 is an estimator of the genetic matrix in (2.23) .From these results, it follows from Equation (2.23) that the random is an estimator of the phenotypic covariance matrix in (2.23).Given the estimates of covariance matrices on the right in (4.10) , an investigator may wish to estimate the measure of heritability of each trait, using the formula in Equation (2.25) .
Although the formal details will not be given here, it is easy to see that the effects defined in section 2 as well as the variance components defined in section 3 could be estimated directly without recourse to any analysis of variance procedure.This estimation procedure could be carried out either under the assumption that the population was in a Hardy-Weinberg equilibrium or was not in such an equilibrium.In either case, all the covariance matrices defined in section 3 could be estimated.
It is easy to see from (4.9) that the genetic covariance is symmetric, because for every genotype (x, y) ∈ G. Therefore, since the transpose of a sum of matrices is the sum of the transpose of each matrix in (4.9) in the sum, it follows that so that cov G [W] is a symmetric matrix, and will thus have real eigenvalues.As suggested in section 3, suppose an investigator wished to estimate the genetic correlation matrix corr G [W] defined in (3.22) .As will be shown subsequently, if an estimate of genetic covariance matrix is positive definite, then the corresponding estimates of the genetic correlation matrix will be such that all correlation coefficients ρ will have values in the open interval (−1, 1).
Recall that for k ≥ 2, a k × k symmetric real matrix A is positive definite if, and only if, w T Aw > 0 for all k × 1 vectors w ∈ R k such that w 0, where R k is the set of k × 1 of vectors real numbers and 0 is the k × 1 vector of zeroes.If for some vector w ∈ R k , such that w 0 and w T Aw = 0, then the matrix A is said to be positive semi-definite.From this definition, it also follows that any square sub-matrix B ν of A consisting of ν rows and ν columns is also positive definite for A positive definite matrix may be characterized in a number of ways.For example, it is well known that a matrix A is positive definite if, and only, its eigenvalues are all positive.If a reader is interested in finding a proof of this statement as well as a treatment of positive semi-definite matrices, it is suggested that the key phrase, "positive definite matrices" be typed into an internet search engine, where a wealth of material on linear algebra may be found in brief but reliable accounts.Among the many examples on the internet is Wikipedia article with this title.
It is interesting to observe that it follows from its construction that the covariance matrix corr G [W] is positive semidefinite.For example, for any w ∈ R k it can be seen from (4.11) that w T μ (x, y) − μ = u (x, y) = μ (x, y) − μ T w, where u (x, y) ∈ R 1 for all genotypes (x, y) ∈ G. Therefore, for all w ∈ R k .
At this juncture, it seems fitting to point out that a necessary property of any covariance matrix A is it must be positive semi-definite.To see that this statement is indeed true, let W be a k × 1 column vector of random variables with values in R k with covariance matrix A and let a ∈ R k .Then, let Z = a T W denote a random variable taking values in R 1 .Then, from the definition of the variance of a random variable, it follows that var [Z] ≥ 0 for all a ∈ R k .But, it is well known that var [Z] = a T Aa ≥ 0 so that the matrix A must be positive semi-definite.Because the estimate of the genetic covariance matrix will always be positive semi-definite when the estimation procedure outlined in this section is used, it has a definite advantage when compared with that used by Mode and Robinson (1959), which was carried out within a framework of an analysis of covariance table, because when using such procedures one could not, in general, guarantee that an estimated genetic covariance matrix was indeed always be positive semi-definite so on some occasions an investigator may get negative estimates of a variance component.If a reader is interested in the method of estimation just mentioned, it is suggested that the formal procedures accompanied Table 1 of the paper by Mode and Robinson just cited be consulted.It should also be mentioned that at time this work was done in the 1950s, the presumed set of loci as well as the alleles at each locus governing the quantitative genetics of the traits under consideration were treated abstractly and without any knowledge of their locations in the genome of a species under study in an experiment.Consequently, the estimation procedure suggested in this section was not even conceivable the during the 1950s.
But, as mentioned above, it would be desirable to know whether the estimated genetic covariance matrix is indeed positive definite so that all estimated correlation coefficients would lie in the open interval (−1, 1).It is recommended, therefore, that if an investigator is interested in estimating the genetic correlation matrix and using it to draw genetic inferences about a population, a first step would be that of computing the eigenvalues of the estimated genetic covariance matrix to determine whether all is eigenvalues are positive.Many software packages contain programs for calculating the eigenvalues of a square symmetric matrix so that a software developer would have little difficulty in finding a program to compute the eigenvalues of a covariance matrix.It should also be mentioned that if one or more of these eigenvalues are near zero, then the genetic covariance matrix would be nearly singular.
Among the several ways of characterizing a real positive definite matrix is a procedure known as the Sylvester's criterion.For any symmetric k × k matrix A = a i j , define k determinants as follows: Then, the matrix A is positive definite if, and only if, D ν > 0 for all ν = 1, 2, • • •, k.From the computational point of view, the Sylvester criterion could also be used to check whether a square matrix is positive definite, but to carry out such a procedure a software developer would need to have access to a program based on vary efficient algorithms for finding the numerical value of the determinants in (4.13).As will be shown below, this criterion also has interesting implications for testing whether all estimated correlation coefficients ρ belong to the open interval (−1, 1).
For suppose, A is the estimate of the genetic covariance cov G [W] matrix in (4.12) and suppose it has been determined that the matrix is positive definite.To simplify the notation, represent the element of this matrix by cov G [W] = σ i j .To further simplify the notation, the symbol •, indicating estimates of parameters are under consideration, has been omitted for the elements of the estimated genetic covariance matrix.In this matrix σ ii = σ 2 i , the estimated variance of trait i for i = 1, 2, • • •, k, and the covariance σ i j = ρ i j σ i ρ j , where i j, ρ i j is an estimate of the genetic correlation between traits i and j, and σ i and σ j are, respectively, the estimated genetic standard deviations for traits i and j.Because of symmetry, ρ i j = ρ ji for all i j.Because the estimated genetic covariance has, by assumption, been shown to be positive definite, it follows the Sylvester's criterion will be satisfied.In this notation, the first two determinates in Sylvester's criterion are D 1 = σ 2 1 > 0 and But, because σ 2 1 σ 2 2 > 0, this equation, implies that 1 − ρ 2 12 > 0. Thus, ρ 2 12 < 1, which implies −1 < ρ 12 < 1.By preceding systematically in this way by choosing 2 × 2 sub-matrices corresponding to two rows and two columns, it can be shown the all estimated correlation coefficients ρ i j for all i j satisfy the condition −1 < ρ i j < 1.For example, the 2 × 2 sub-matrix corresponding to rows and columns 1 and 3 of a k × k positive definite matrix has the form which has the same form as the matrix in (4.14), which implies that −1 < ρ 13 < 1. (4.17)

Measures of Pleiotropism and Epistasis for the Case of Two Autosomal Loci
Let A 1 denote the set of alleles at locus 1 and let A 2 denote the set of alleles at locus 2. It will be assumed that each of these sets contains at least two alleles but the total number of alleles at each locus is finite.Any genotype will be denoted by a symbol of the form (x 1 , y 1 , x 2 , y 2 ), where the subscript, 1 or 2, denotes and locus, the symbols x and y denote alleles contributed the maternal and paternal parent, respectively.In order to simplify the notation, a genotype will often be denoted by the single letter symbol z = (x 1 , y 1 , x 2 , y 2 ) .Let G denote the set of genotypes under consideration.The number of genotypes in the set G may be quite large, particularly if there are several alleles at each locus.Because the set of genotypes contains all possible combinations of alleles at each locus, the set G may be represented as the product set (5.1) For every genotype z ∈ G, let p (z) denote the frequency of genotype z in the population.Then, p (z) ≥ 0 for all z ∈ G and z∈G p (z) = 1. (5.2) Just as in the case on one locus, the genotypic distribution will be denoted by As in the foregoing sections, to accommodate pleiotropism it will also be supposed that k ≥ 2 traits are under consideration and that the random k × 1 phenotypic vector W along with its distribution characterize the observed variation with respect to k traits in a population or sample that is under consideration.Just as in section 2, the k × 1 conditional expectation or genetic vector along with the conditional covariance matrix which are defined for all genotypes z ∈ G, will play essential roles in this section.
In section 3, where only one autosomal locus was under consideration, an effect was introduced that was a measure of interactions of alleles at one locus.When two of more autosomal loci are under consideration, however, interactions among alleles at different loci will need to accommodated in a model that will form the basis for a statistical analysis of data gathered in an experiment devoted to quantitative genetics.In classical genetics, interactions among alleles at different loci are referred to as epistasis.Briefly, the primary focus of attention in this section is to extend the results in section 3 by defining effects, which are functions of the genetic expectations in (5.4), that are measures of epistasis as well as other interactions among sets of alleles.Just as in the foregoing sections, the unconditional genetic expectation vector As a first step in defining effects for the case of two autosomal loci, it will be necessary to represent the genotypic distribution in a more explicit form.For every genotype z = (x 1 , y 1 , x 2 , y 2 ) ∈ G, let p (x 1 , y 1 , x 2 , y 2 ) denote its frequency in the population.Then, by definition, is the marginal frequency distribution of alleles x 1 ∈ A 1 in the population.The marginal frequencies of alleles y 1 , x 2 and y 2 are defined defined similarly.To lighten the notation, no subscripts, such as p 1 (x 1 ) , will be attached to marginal distributions, because the subscript on each allele will denote the locus under consideration.Let p (y 1 ) , p (x 2 ) and p (y 2 ) denote, respectively, the marginal frequencies of alleles y 1 , x 2 and y 2 .For the case of two autosomal loci, a population is said to be in linkage equilibrium if p (x 1 , y 1 , x 2 , y 2 ) = p (x 1 ) p (y 1 ) p (x 2 ) p (y 2 ) (5.9) for all genotypes z = (x 1 , y 1 , x 2 , y 2 ) ∈ G.
One would not, in general, expect that a population would be in linkage equilibrium for the case under consideration so that it becomes necessary to define k × 1 vectors of effects using conditional distributions.For example, the conditional distribution of the alleles (y 1 , x 2 , y 2 ), given x 1 , is (5.10) for p (x 1 ) 0. It is easy to see that if (5.9) is satisfied, it follows that p (y 1 , x 2 , y 2 | x 1 ) = p (y 1 ) p (x 2 ) p (y 2 ) (5.11) for all triples of alleles (y 1 , x 2 , y 2 ) ∈ A 1 × A 2 × A 2 .Let the k × 1 vector μ (x 1 ) denote conditional expectation (5.12) for all x 1 ∈ A 1 .Then, just as in the one autosomal locus case, the vector α (x 1 ) of effects for allele x 1 in the population is defined by α (x 1 ) = μ (x 1 ) − μ (5.13) for all a k × 1 vector of zeroes.The first order effects just defined can easily be extended to define the k × 1 vectors α (y 1 ) , α (x 2 ) and α (y 2 ), for all alleles y 1 ∈ A 1 and all pairs (x 2 , y 2 ) ∈ A 2 × A 2 .
For the case of two autosomal loci, it is possible to define many more effects than for the case of one autosomal locus.To provide a framework for defining and classifying these effects, it will be helpful to consider a set S = (1, 2, 3, 4) of four positions that are occupied by alleles at the two loci under consideration.For example, for any genotype, positions 1 and 2 are occupied by alleles at locus 1, and positions 3 and 4 are occupied by alleles at locus 2. To get a grasp of how many effects that can be defined for the case under consideration, it is helpful to think of the class T of all subsets of the set S. Among the sets in T is ϕ, the empty set.To provide a means for describing effects that are measures of interactions of alleles at one locus or effects that are measures of epistatic interaction among alleles at two or more loci, it is useful to enumerate classes of subsets the the class T. For ν = 0, 1, 2, 3, 4, let T ν denote the class of subset containing ν positions.
Thus, the class T 0 = {ϕ} contains only the empty set, but the class T 1 consists of the singletons (5.15) which were used to define the first order effects listed above.From elementary combinatorics, it is easy to see that the number of sets in the class T 2 is 4 2 = 6.
In particular, the subsets in the class T 2 are Observe that the subclass of sets of positions will form a basis for defining second order effects that are measures of intra-allelic interactions at loci 1 and 2, respectively.But the subclass of sets provide a basis for defining second order epistatic effects among the two loci under consideration.There are subsets in the class T 3 , which consists of the subsets (5.19) As will be shown subsequently, the third order effects corresponding to the set in this class provide a basis for defining various types of effects that are measures of epistatic interactions.Finally, the class of sets contains all positions and provides a basis for defining a fourth order effects that are measures of epistatic and intralocus interactions among any set of four alleles at the two loci under consideration.
From the foregoing discussion, it can be seen that a total of 15 effects may be defined for the case of two autosomal loci, and to derive a formula for computing each effect, it would be necessary to consider 15 conditional probability distributions.Rather than deriving a formula for each of these 15 subsets of S, it will be helpful to set down general formulas for marginal and conditional distributions.Let denote the union of the classes of subsets of S corresponding to effects under consideration.For every set A ∈ E, let A c denote its complement with respect to S, and let z (A) and z (A c ) denotes the sets of alleles in genotype z corresponding, respectively, to the positions in the sets A and A c .For any genotype z ∈ G, z = z ((A) , z (A c )), where, by definition, the positions in the sets A are fixed in the operations that follow.For example, the formula denotes the marginal distribution of the alleles in the position of the set A, where the sum runs over all alleles in the positions of the set A c .Thus, in this succinct notation is the conditional distribution of the alleles in the positions in set A c , given the fixed alleles in the positions of the set A such that p (z (A)) 0. Let μ (z(A)) denote the conditional expectation of the vector μ (z) , given the fixed alleles in the positions corresponding to the set A. Then, for every where the sum runs over all the alleles in the positions of the set A c .
To illustrate using formula (5.24) , consider those sets A ∈ T 1 .If A = {1}, then μ (z (A)) = μ (x 1 ) so that α (x 1 ) = μ (x 1 ) − μ could be estimated for for all alleles x 1 ∈ A 1 .The remaining first order effects, α (y 1 ) , α (x 2 ) and α (y 2 ) could also be derived and estimated in a similar way, using formula (5.24).The number of sets in the class T 2 is 6 so that there are 6 conditional expectations of the form μ (A) for A ∈ T 2 .Suppose, for example, A = {x 1 , y 1 }.Then μ (A) = μ (x 1 , y 1 ), and, just as in the case of one autosomal locus, the effect α (x 1 , y 1 ) would be defined as for all pairs of alleles (x 1 , y 1 ) ∈ A 1 × A 1 .Observe this effect is a measure of intralocus interactions of alleles.If the alleles at locus 1 acted in a purely additive manner, one would expect that α (x 1 , y 1 ) would be small, particularly if x 1 = y 1 were the same allele.But, if x 1 y 1 , then there may be intra-allelic interaction between the two alleles so that this effect may be greater than that for homozygous genotypes.
is a measure of interloci or epistatic interactions for all pairs of alleles ( It is of interest to note that among the four remaining effects for sets A ∈ T 2 , three would be measures on interloci interactions and one would be an effect similar to that in (5.25) corresponding to positions {3, 4}.
For every set A ∈ T 3 , there corresponds and effect α (z (A)) and, as can be seen from (5.19), there are four subsets in T 3 .To illustrate the procedure used to define each of these effects, suppose A = {1, 2, 3}.Then, μ (x 1 , y 1 , x 2 ) is the conditional expectation that needs to be derived, using formula (5.24).Briefly, for each subset B of A, such that A B, there will be an effect that is used in defining the effect α (z(A)).In particular, The procedure illustrated in (5.27) may also be used to derive an expression for α (z (A)) for any set A ∈ T 3 such that A {1, 2, 3}.
The last effect that needs to be defined is that for the class Let α (z), where z = (x 1 , y 1 , x 2 , y 2 ), denote this effect.Then, α (z) is determined by solving the equation for α (z) for all genotypes z ∈ G.The number of solutions to this equation depends on the number of alleles at each locus.For example, if there are two alleles at each locus, then there are 16 possible values of α (z) = α ((x 1 , y 1 , x 2 , y 2 )).Moreover, each solution is a k × 1 column vector, corresponding to the k ≥ 2 traits under consideration.Let α ν (x 1 , y 1 , x 2 , y 2 ) denote the effect for trait ν = 1, 2, • • •, k in the k × 1 vector α (z).An investigator may be interesting in estimating the component of variance But, because this is a summary statistic, it may mask the most interesting effects, i.e. those with the largest values, making up this variance component.Furthermore, because each effect in the sum (5.29) has been estimated, it would be possible to inspect the numerical values of each term in the sum of Equation (5.29) for indications of usually large values that would be indicative of significant interactions among the alleles at the two loci under consideration.In the next section, a procedure for inspecting the numerical values of all the estimated effects will be suggested.
Before proceeding to the next section, it is interesting to note that Hemani et al. (2013) have provided an evolutionary perspective on epistasis and the missing heritability and have also suggested that genome-wide association studies would be improved by searching directly for epistatic effects.It seems plausible, therefore, that the measures of epistatic effects defined in this section as well those for multi-loci effects that will be introduced in a subsequent section may provide a means for searching for epistatic effects not only in quantitative genetics but for also in measuring these effects in genome-wide association studies.

Searching for Unusual Effects and Interactions Among the Alleles at Two Autosomal Loci
The strategy for searching unusual effects and interactions at the two autosomal loci will be that of inspecting the squares of each effects and then picking out the largest of them as indicators of unusual effects of alleles or interactions among the alleles at the two loci.It will be assumed that k ≥ 2 traits are under consideration, but to simplify the notation each set of effects will not be distinguished by the subscript ν, indicating the trait, but it will be tacitly be assumed that any search procedure would be carried out for each trait.Let denote the set of squared first order effects.For the case there are two alleles at each locus, any position may be occupied by one of two alleles, and because there are four positions under consideration, there 8 effects in the set E 1 .Consequently, in this case, an investigator could easily find the effect with the largest value by inspection, but in cases with a large number of effects, it would be necessary to write a computer program or use existing software to find largest of them.
The set of second order effects may be partitioned into two subsets, see (5.17) and (5.18).The set of squared effects for intra-allelic interactions is In this case each of the two sets in A ∈ T 2IAI contains two positions and each position may be occupied by two alleles.Hence, the number of effects in the set E 2IAI is 8.The set of squared effects for second order epistatic interactions is For this case, there are 4 sets of two positions in the set T 2EPI , and each position occupied by two alleles.Consequently, the number of effects in the set E 2EPI is 16.
The set of squared effects for third order epistatic interactions is Each of the four sets in T 3 consists of 3 positions and each position may be occupied with one of two alleles.Consequently, for each set of three positions there will correspond 2 3 = 8 effects and, because there of 4 sets in T 3 , there will be a total of 32 effects in the set E 3EPI .Rather than relying on a visual inspection of 32 effects, in this case, as well as in cases in which 3 or more autosomal loci are under consideration, an investigator may prefer to write a computer program or use existing software to find the largest squared effect.On the other hand, an investigator may not want to focus on the largest effect in each case just enumerated, but have a computer order the squared effects from the smallest to the largest so that, for example, attention could be focused on the three largest squared effects.For the case of two autosomal loci, the set of squared fourth order effects is As mentioned in section 5, the number of squared effects in this set is 16 so that in this case an investigator may wish to select the largest one in a search for unusual epistatic interactions of fourth order.
If an investigator were interested in estimating a components of the total genetic variance for any trait, it would be possible to do so for any of the sets of squared effects listed above.For example, suppose attention was focused on sum of the variance components corresponding to the squared effects in the set E 3EPI .For every A ∈ T 3 , let p (z (A)) denote the corresponding marginal probability.Then, for the trait ν = 1, 2, ... k under consideration, the component of the genetic variance corresponding to the set E 3EPI is defined by (6.6) and the variance components corresponding to the other sets listed above could be estimated similarly.On the other hand, one could proceed as in the one locus case discussed in section and define a 15k × 1 column vector Φ (x, y) as in (3.9) and a 15k × 15k covariance matrix Ψ G as in (3.11), but the formal details of this derivation will be left to the reader as an exercise.This covariance matrix may be particularly interesting when the sample or population is not in linkage equilibrium at the two loci under consideration.
At this point in the discussion, it is appropriate to mention the pioneering paper by Cockerham (1954), who was among the first to consider variance components for epistatic interactions in quantitative genetics and introduced a distinctive nomenclature for such interactions.In his terminology but in a different notation, the genetic variance component corresponding to the set T 1 , would be called the additive component with subscript (A) for trait ν = 1, 2, ) , representing a partition of the genetic covariance matrix into component matrices for the case of one autosomal locus, could also generalized to the case to two autosomal loci using the set-based classification of effects outlined in this section, but this matrix partition will also be left to the reader as an exercise.
As will be discussed in the next section, when many loci are under consideration, the potential number of effects that may be considered can be very large.In such cases, to find unusually large effects and interactions even for the relatively simple case of only two alleles per locus may entail searches of hundreds of thousands or even millions of squared effects along with testing for their statistical significance.As has been widely recognized, when large numbers of statistical tests are considered, there is a risk of false discovery rates.It is beyond the scope of this paper to discuss the technicalities underlying false discovery rates, but it is suggested that an interested reader consult the papers Benjamini and Hochberg (1995), Benjamini andYekutieli (2001, 2005).

Overview of Cases for l > 2 Autosomal Loci
Let l > 2 denote the number of autosomal loci under consideration in a some diploid species such as man.Then, the number of positions that may be occupied by alleles at each locus is N = 2l.Let S the set of N positions numbered 1, 2, • • •, N, let T denote the class of all subsets of S and let T ν denote that class of subsets containing ν positions for ν = 0, 1, 2, • • •, N.Then, as is well known, the number of subsets in the class T is 2 N and the number of subsets in the class T ν is N ν (7.1) From the point of view of judging the feasibility of defining effects based on ν ≥ 1 positions, this number is essential, because it gives us the number of effects that may be defined.In what follows, it will also be helpful to recall the well known equation from combinatorics, that is sometimes listed in text books on introductory probability theory.Because, no effect is associated with the empty set ϕ, which is the only member of the class T 0 and N 0 = 1, it follows that for the case of l autosomal loci, the number of effects that may be defined is 2 N − 1.
By way of an illustrative example, suppose the number of autosomal loci under consideration is l = 4 so that N = 8.Thus, in this case it would be possible to define 2 8 − 1 = 255 (7.3) effects.It is also interesting to note that if there were 2 alleles per locus, then the number of possible genotypes would be 2 8 = 256.If an investigator had access to a large sample of individuals whose genomes had be sequenced, it may be feasible to consider this many possible genotypes.But the study of 256 genotypes in small samples would be problematic, because even in a sample of size n = 256, all possible genotypes may be not represented and in rare samples there may only one observation for each genotype.If a sample were sufficiently large, an investigator may undertake a four autosomal loci study, but as a first approximation, a decision may be made of to define and estimate only first, second and third order effects rather dealing all the 255 possible effects.
It is clear that the number of sets in the class T 1 is 8 so that 8 first order effects could be defined and estimated, using the principles outlined in section 5 for the case of two loci.The number of second order effects would be 8 2 = 28 (7.4) and the number of third order effects would be 8 3 = 56.(7.5) If an investigator were skilled in writing computer code, it would be possible to write programs to enumerate the sets in the classes T 2 and T 3 as well as to compute the effects associated with each class.Ideally, it would be close to optimal if a team of investigators were working on the project.At a minimum, a team should consists of at least two individuals: one individual should have expertise in genetics and managing data bases consisting of individuals whose genomes have been sequenced and a second individual would expertise in writing software.It is interesting to note that with respect to the class T 2 , there would be 4 sets of two positions such that effects that were measures of intra-allelic interactions could be estimated, but the remaining 24 sets would from a basis for defining and estimating effects that are measures of epistatic interactions among the alleles at four loci taken two at a time.
All 56 effects corresponding to subsets in the class T 3 could be interpreted as measures of epistatic interactions among the alleles at the four loci under consideration.In order to include the phenomenon of pleiotropism in the formulation, it will be assumed, as in previous sections, that all effects are k × 1 column vectors, where k ≥ 2.
To provide a succinct overview of considering only first, second and third order effects in a variance component model, it will again to helpful to let z denote a genotype in the set G of all possible genotypes, and let z where the remainder effect is determined by solving Equation (7.7) for α R (z) for every genotype z ∈ G.For any trait ν in the vector valued terms in this equation, an investigator could search for usually large squared effects following the search procedures suggested in section 6 with a goal of finding epistatic interactions that would be of interesting for interpreting the data.In this connection, and difference in these effects among the k traits under consideration would attributable to pleiotropism.
To provide a measure of the adequacy of the approximation in (7.7) based on only first, second and third order effects is would be of interest to estimate the variances for each trait in vector of remainder effects α R (z).For example, let α At this point in the discussion of variance component models that are used in quantitative genetics, it is appropriate to mention that the number of genotypes under consideration may be significantly reduced if only three genotypes are identified at any locus for the case of two alleles per locus.To demonstrate this idea, it is helpful to resort to methods for representing genotypes used in classical Mendelian genetics.Suppose at some locus there are to alleles B and b.Then, there are four genotypes BB, Bb, bB and aa.If, however, the heterozygotes Bb and bB are lumped into one class, then only three genotypes would be distinguishable at each locus.Thus, if this idea were used, the number of distinguishable genotypes with respect to four autosomal loci would be 3 4 = 81.(7.10) This number is significantly less than 256, which was the number of genotypes in which 8 positions were used in defining genotypes.If this classification of genotypes were used, the procedures for estimating effects would need to be modified by taking into account the lumping of heterozygotes into one class.
During the past five to ten years, a rather large number of papers have been published by members of the genetic community in which genome wide sweeps have been made of the human genome with goals of finding genomic regions that are implicated in such neurological conditions as Alzheimer's and Parkinson's diseases.In an interesting paper, Raj et al. (2012) considered 11 regions (loci) in the human genome that have been implicated with Alzheimer's disease and present evidence that four of these loci are involved in protein interaction network that has been maintained in the population by positive natural selection.It is also stated in this paper that 12 loci have been implicated in Parkinson's disease.It is of interest, therefore, to consider the extent the methodology presented in this paper may be applied to traits whose genetics are governed by 11 or 12 loci with two alleles per locus.If three genotypes were distinguished at each locus, then, for the case of 11 loci the number of genotypes that could be distinguished for 11 loci is 3 11 = 177, 147,  (7.11)   and for the case of 12 loci, this number is 531, 441.When it is required that the genomes of all individuals in a sample have been sequenced, it is doubtful at the present time whether of samples sizes of n > 177, 147 or n > 531, 441 would be available to an investigator or a team of investigators who are interested in applying the ideas presented in this paper.
Even though sample sizes this magnitude may not be available to an investigator, their research, however, could proceed by using the available data to detect epistatic and pleiotropic interactions among the alleles represented in the sample.For example, for the case of 11 loci, it is suggested that an investigator perform a preliminary survey of the data to estimate the number n (x, y) of individuals of genotype (x, y) are available, where the alleles in x and y have been ascertained with respect to 11 or fewer loci.If some number or numbers n (x, y) are too small to yield reliable statistical information, then an investigator may make a decision to restrict attention to a subset of the 11 loci such that the sample sizes for each genotype are judged sufficiently large to draw statistically reliable inferences from the data as to the presence of epistatic and pleiotropic interactions.
(A) denote a set of alleles corresponding to the positions in any set A ∈ T. Next suppose that all the effects in the classes{α (z (A)) | A ∈ T ν } (7.6)have been defined and estimated for ν = 1, 2, 3, using the principles outlined in section 5.Then, the linear model expressing a genetic value μ (z) as a function of effects would have the form μ (z) = μ + (ν)  R (z) the element in this vector for trait ν for ν = 1, 2, • • •, k and let var R [W ν ] variance component corresponding to the remainder term in (7.7).Then, the ratiovar R [W ν ] var G [W ν ] , (7.9) where var G [W ν ] = cov νν [W ν ] is element νν on the principal diagonal of the genetic covariance matrix cov G [W] in (2, 23), is a measure of the contribution of the variance component in (7.8) to the total genetic variance for trait ν = 1, 2, • • •, k.In relative terms, small values of this fraction for each trait ν would be indicators of the goodness approximation in (7.7), using only first, second and third order effects.
.21) where, as indicated, 0 k×1 is a k × 1 of zeros and 0 k×k is a k × k matrix of zeros.By way of validating this last step in (2.21), let X and Y one dimensional random variables, taking values in the set of real numbers.Then, it is wellknown that E [XY | X] = XE [Y | X].Moreover, it can be shown that if X and Y are matrices such that the product XY is well defined, then the equationE [XY | X] = XE [Y | X] also is valid for matrices.Furthermore, it can be shown that E [XY | Y] = E [X | Y]Y is also valid for matrices.It was this well known property that was used to justify the second expression in Equation (2.21).These properties were also be used to show that the third term on the right in Equation (2.20) is equal to 0 k×k , a k × k matrix of zeros.If f (X) is a matrix valued function of a matrix X, then it can also be shown .23) can be estimated.For example, from the definition of the environmental matrix in Equations (2.22) and (2.23) , it follows that cov E [W] = Ψ E = Ψ, (4.8)see (4.7).Similarly, the random matrix • • •, k.The component of variance var D [W ν ], corresponding to the set E 2IAI would be called the dominance component with subscript (D), and the variance component var AA [W ν ] corresponding to the set E 2EPI would be called the additive by additive component with subscript (AA).Similarly, the variance components corresponding to the set E 3EPI would be designated as the additive by dominant component, with subscript (AD), Finally, that corresponding to the set E 4EPI would be labeled the dominant by dominant, with subscript (DD), component of the genetic variance for any trait ν = 1, 2, • • •, k.It should also be mentioned that Equation (3.16