Inference of Specific Gene Regulation by Environmental Chemicals in Human Embryonic Stem Cells

We are exposed to many environmental chemicals in our daily life. Certain chemicals threaten our health, especially that of embryos and can cause serious developmental problems. To prevent abnormal development and diseases caused by chemicals, it is important to clarify the mechanisms of chemical toxicity in embryonic cells. The gene regulatory network is one of the useful methods for clarifying functional mechanisms in living cells, so we applied a statistical method to infer the gene regulatory network in human embryonic stem cells. In this study, we improved our previously developed SEM approach for inferring a network model from 9 gene expression profiles in human embryonic stem cells, which were exposed to various chemicals. The estimated regulatory models clarified the differences between chemicals, and the shapes of the inferred models reflected the features of the chemical toxicities. The toxicity of acrylamide affected neuronal cell-related genes, while that of diethylnitrosamine disturbed cell differentiation-related genes. On the other hand, the TCDD network reflected feedback regulation, and finally disturbed neuronal cell-related genes. In the Thalidomide network, cell differentiation genes related to axis formation in embyronic cells were affected by thalidomide toxicity.


Introduction of the Problem
Environmental pollution is a byproduct of our usual life activities.Vehicle exhaust contains gases, including many noxious chemicals.Factories discharge industrial waste in the air, ground, and water.Many rivers are polluted by domestic sewage and wastewater.The emitted chemicals are sometimes trapped in clouds and then contaminate the ground in rainfall.Thus, we are exposed to many chemicals in our daily life, and some environmental chemicals can cause serious developmental toxicity effects.Developmental toxicity is either a structural or functional alteration, and these alterations interfere with the normal developmental programming in early embryos.These interferences can cause abnormal development and diseases (Baccarelli & Bollati, 2009;Hou et al., 2012).One of the most infamous environmental chemicals is methylmercury, which is known to affect fetal development (Yuan, 2012;Tatsuta et al., 2012).Furthermore, other chemicals are also considered to be toxic, since they can cause abnormal cell differentiation in embryos (Rappolee et al., 2012;He et al., 2012;Harrill et al., 2011).
To prevent chemically-induced developmental abnormalities and diseases, it is important to clarify the mechanisms of chemical toxicity in embryonic cells (Gündel et al., 2007;Thompson & Bannigan, 2008).The gene regulatory network is one of the useful methods to clarify the regulatory mechanisms.To infer the networks among the genes from the mRNA levels, various algorithms, including Boolean and Bayesian networks, have been developed (Akutsu et al., 2000;Friedman et al., 2000).In our previous investigation, we developed an approach based on graphical Gaussian modeling (GGM) in combination with hierarchical clustering, and we could infer the huge network among all of the genes by this approach.(Aburatani et al., 2003;Aburatani & Horimoto, 2005).However, GGM infers only the undirected graph, whereas the Boolean and Bayesian models infer the directed graph, which shows causality.Although all of these approaches are suitable for establishing the relationships among the genes, they cannot reveal the relationships between un-observed factors and genes, because of insufficient information in the gene expression profiles.To clarify the mechanisms of biological processes in living cells, un-observed factors, which affect the target gene's expression, should also be considered.Thus, an alternative approach that includes un-observed factors should be applied.
Recently, we developed a new statistical approach based on Structural Equation Modeling (SEM), to infer the protein-DNA interactions for gene transcriptional control from only the gene expression profiles, in the absence of protein information (Aburatani, 2011;2012).We applied this approach to reveal the causalities within the well-studied transcriptional regulation system in yeast (Aburatani, 2011).The significant features of SEM are the inclusion of latent variables within the constructed model and the ability to infer the network, including the cycle structure.Furthermore, the SEM approach allows us to strictly evaluate the inferred model, by using fitting scores.The linear relationships between variables are assumed to minimize the differences between the model's covariance matrix and the calculated sample covariance matrix.Some fitting indices are defined for evaluating the model adaptability, and thus the most suitable model can be selected by SEM (Bollen, 1989;Duncan, 1975;Pearl, 2001).
Here, we applied the SEM approach to infer the regulatory relationships among 9 neurodevelopmentally-related genes.The expression profiles of these genes were measured in human embryonic stem cells exposed to four environmental chemicals.The chemicals are known to have harmful toxicities that affect the developmental process in human embryos.Thus, inferring the regulatory network among the developmentally-related genes will help us to reveal the mechanisms of toxicity-dependent responses in the embryo.Furthermore, we improved our SEM approach for assuming preliminary initial models from the time-series data.By using this new approach, we can construct an initial model for the SEM calculation in the absence of known regulatory interactions.The resulting gene expression data clarified the chemical-specific interactions among the developmentally-related genes.

Expression Data
We utilized the expression data that were measured to clarify the effects of environmental chemical exposure on neuronal differentiation (He et al., 2012;Fujibuchi et al., 2011).In these expression data, nine genes considered to be affected by chemicals were measured in human embryonic stem cells: GATA2, Lmx1A, MAP2, Nanog, Nestin, Nodal, Oct3/4, Pax6 and Tuj1 (He et al., 2012;Fujibuchi et al., 2011).The expression of beta-actin was also measured, as an internal control.The expression levels of these 10 genes were measured in human embryonic stem cells exposed to four chemicals: acrylamide, diethylnitrosamine, TCDD and thalidomide (He et al., 2012;Fujibuchi et al., 2011).The toxicities of these chemicals are different: acrylamide is neurotoxic, diethylnitrosamine is genotoxic, TCDD is carcinogenic, and thalidomide has other toxicity.The human embryonic cells were exposed to each chemical for several time periods: 24 hours, 48 hours, 72 hours and 96 hours.Each chemical was also tested at 5 concentrations: very low, low, medium, high and very high.The expression of the selected genes was measured twice under each condition by RT-PCR, and thus 160 (4 time periods x 5 concentrations x 2 repeats x 4 chemicals) expression patterns per gene were measured (Fujibuchi et al., 2011).
First, the expression level of each gene was normalized to the internal beta-actin control and averaged, as follows: Here, N is the number of repeated experiments, i e g is the measured expression level of gene g under one set of conditions, and i bActin e is the beta-actin expression level measured under the same conditions.By dividing by the expression level of beta-actin, the intracellular expression level of each gene was normalized.To minimize the experimental error, the logarithms of the normalized expression data were obtained and averaged.

Extraction of Causalities from Expression Data
Usually, we assume an initial model from previous knowledge for the SEM calculation, but there are no defined regulations among the selected genes in this study.Thus, we had to construct an initial model of each chemical from the regulatory relationships between the gene pairs.To detect the regulatory relationships from the measured time series expression data, cross correlation coefficients were applied to the expression profiles.These cross correlation coefficients were calculated for each chemical and each concentration.Cross correlation is utilized as a measure of similarity between two waves in signal processing by a time-lag application, and it is also applicable to pattern recognition (Li & Caldwell, 1999).In a time series analysis, the cross correlation between two time series describes the normalized cross covariance function.Therefore, the range of cross correlation values is from -1 to +1.If we let represent two time series datasets including N time points, then the cross correlation is given by where d is the time-lag between variables X and Y.In this case, the expression profiles were measured at four time points, and thus three cross correlations of each gene pair were calculated with d=-1, 0, and 1.

Construction of the Initial Models
To infer the chemical-dependent regulatory networks, the differences between times and concentrations should be merged.In this study, we developed a new method for constructing an initial model of each chemical, with the merging of time and concentration conditions.Figure 1 shows the newly developed method.First, we constructed lag matrices to simplify the information from the time series data.The elements of the lag matrices were the time lags, which were defined for the calculation of the cross correlation.In this study, cross correlations were calculated with three lags, -1, 0, and +1.The absolute values of these three cross correlations were compared, and the lag value d with the highest absolute value was arranged as a matrix element.Lag matrices were constructed for each concentration, and thus five lag matrices were obtained for each chemical (Figure 1a).
In the next step, we merged the difference in the concentrations of each chemical.Binomial relationships were extracted from each lag matrix.For each chemical, there are five lag matrices according to the chemical concentration, and we considered that the chemical-specific relationships among the genes will be conserved in several lag matrices.If the same relationships existed in several lag matrices, then the binomial relationships were duplicated (Figure 1b).
We subsequently constructed one frequency matrix of binary relationships for each chemical.We counted the frequency of the appearance of relationships in binomial relationships.The number representing the frequency of each gene pair was arranged in this matrix, and thus the range was from 0 to 5 (Figure 1c).In the frequency matrix, we can merge the differences in the concentrations, since the elements of the frequency matrix indicate the information for the different concentrations.We selected the possible relationships from the frequency matrix.It is considered that a possible relationship would be indicated by its frequency of appearance.Thus, we selected the relationships with two or more values in the frequency matrix (Figure 1d).At the final step, an initial model was constructed with the selected possible relationships.By this approach, an initial model can include cyclic structures (Figure 1e).The procedure for constructing an initial model from the time-lag information of the cross correlation coefficients.(a) Time-lag matrices for each chemical.In this study, three time-lags were selected for the calculation of the cross correlation coefficients.Thus, three cross correlation coefficient values were obtained between all gene pairs.The time-lag value with the highest absolute value among the cross correlation coefficients was selected.Time-lag matrices were constructed for each concentration, so five time-lag matrices were obtained for each chemical.(b) Binomial relationships.These relationships were extracted from the five time-lag matrices.If the same relationships exist in several concentration matrices, then the extracted binomial relationships are duplicated in this step.(c) Frequency matrix of causal relationships between all gene pairs.From the binomial relationship, we can count the frequency of relationships between gene pairs.(d) Selection of possible causal relationships from the frequency matrix.The possible relationships between genes are considered to persist at several chemical concentrations.Thus, we selected the relationships with two or more values in the frequency matrix.(e) Construction of an initial model with selected causal relationships.By this approach, an initial model can include cyclic structures.

Structural Equation Modeling without Latent Variables (SEM without LV)
After the construction of an initial model for each chemical, we applied the SEM calculation to infer the network model that fit the measured expression data.Usually, two types of variables can be included in the SEM model: observed and latent.These variables constitute the structural models that consider the relationships between the latent variables and the measurement models that consider the relationships between the observed variables and the latent variables.These relationships can be presented both algebraically, as a system of equations, and graphically, as path diagrams.
In this study, the nine developmentally-related genes (GATA2, Lmx1A, MAP2, Nanog, Nestin, Nodal, Oct3/4, Pax6 and Tuj1) were defined as the observed variables.Meanwhile, none were defined as latent variables, which were common regulators of several genes.The un-observed factor, which affected each gene's expression, was displayed as an error.The observed variables were classified as one of two types: exogenous variables and endogenous variables.Exogenous variables are not regulated by other variables in the system, as opposed to endogenous variables, which are regulated by other variables in the system.In the initial model, the starting genes are defined as exogenous variables without errors, while all other genes are defined as endogenous variables with errors.We inferred the regulatory relationships that exist between the observed variables in the network model.The model is defined as follows: Here, y is a vector of p observed variables (measured gene expression patterns), and  is a p p  matrix representing the regulatory relationships between the observed variables.Errors that affect the observed endogenous variables are denoted by  .The above equation can be represented in the SEM matrix format as: In this study, we did not define the latent variables, and thus Os were arranged as zero partial matrices, which denote no relationships with q latent variables.The SEM is based on a covariance analysis defined as , where S is the covariance matrix calculated from the observed data and   Each element of the covariance matrix model is expressed as a function of the parameters that appear in the model.The unknown parameters were estimated, in order to minimize the difference between the model covariance matrix and the sample covariance.
The SEM software package SPSS AMOS 17.0 (IBM, USA) was used to fit the model to the data.The quality of the fit was estimated by the goodness-of-fit index (GFI), which measures the relative discrepancy between the empirical data and the inferred model, and the adjusted GFI (AGFI), which is the GFI modified according to the degrees of freedom.Furthermore, we used CFI and RMSEA as fitting scores, to evaluate the model fitting.Since these indices have threshold values, as criteria to decide whether the model is suitable to obtain data independent of a huge sample number, they are considered to be useful to clarify the degree of model fitting in this study.

Parameter Estimation
Parameter estimation was performed by comparing the actual covariance matrix S, calculated from the measured data, and the estimated covariance matrices     of the constructed model.To minimize the difference between S and     , the Maximum Likelihood (ML) method is commonly used as a fitting function to estimate the SEM parameters: is the estimated covariance matrix, S is the sample covariance matrix,  is the determinant of matrix  , ) ( tr is the trace of matrix  , and p is the number of observed variables.The principal objective of SEM is to minimize , which is the objective function and is used to obtain the maximum likelihood.Generally, and to find the solutions (Joreskog & Sorbom, 1984).

Iteration for Optimal Model
In the SEM analysis, both the parameters and network structures are fitted to the measured data.The parameters are estimated by maximum likelihood, and the network structures are evaluated by the scores of goodness of fit indices.The goodness of fit scores indicate the similarity between the constructed model and the measured data.Through the acceptance or rejection of the models, the optimal model that describes the measured data can be selected.
By using the estimated parameters, the variance-covariance matrix between the variables could be calculated in the network model.This model's variance-covariance matrix is compared with the actual variance-covariance matrix between observed variables, which is calculated from the measured data.The similarity between a constructed model and the actual data is defined in a quantitative manner by the fitting scores.In this study, four different fitting scores were utilized: GFI, AGFI, CFI and RMSEA.Values of GFI, AGFI and CFI above 0.90 are required for a good model fit.RMSEA is one of the most popular parsimony indexes displayed in the table, and RMSEA values below 0.05 represent a good model fit (Spirtes et al., 2001).Furthermore, RMSEA values of 0.10 or more are considered to indicate that the constructed model is far from the actual data.To optimize the model, we developed an iteration algorithm as follows: Step 1: Deletion of a non-significant edge from the constructed network model Use 0.05 as the significance level for the determination of the significant regulation among the variables.After the parameters are estimated, the inverse matrix of the Fisher information matrix of parameters is calculated.The inverse matrix of Fisher information represents the asymptotic parameters' covariance matrix.The probability of each parameter is calculated by using this asymptotic parameters' matrix, since all of the parameters are usually normally distributed.
Step 2: Reconstruction of the network model The structure of the network model without the non-significant edge is completely different from that of the former model.Thus, all parameters should be re-calculated from the reconstructed model, and the similarity of the network structure should also be re-calculated.
Step 3: Iteration of Steps 1 and 2 until all edges become significant Since the probabilities of all of the edges in the reconstructed models have also changed, the deletion of the non-significant edges is executed step-by-step.
Step 4: Addition of a possible causal edge to the reconstructed model According to the Modification Index (MI), we add a new causal edge between the observed variables.The MI measures how much the chi-square statistic is expected to decrease if a particular parameter setting is constrained (Joreskog & Sorbom, 1984).The MI value indicates the possibility of new causality between the variables, and thus we add a new edge according to the highest MI score.
Step 5: Iteration from Steps 1 to 3 The addition of a new edge to a constructed model changes the structure of the network model again.In other words, all parameters, including the probabilities of all edges, have also changed.Thus, we execute the iteration from Step 1 to Step 3 again.
Step 6: Determination of significant relationships among error terms After all of the edges are significant and all of the MI scores are lower than 10.0 in the constructed model, the significant relationships between the error terms are estimated by the MI scores.The relationships among the error terms have no direction, and thus they are a correlation between error terms.The relationships between the error terms were considered to be other regulatory systems in the living cell.Thus, these relationships among the error terms were used for the calculations, but were not incorporated into the network, and thus they have been excluded from the figures.

Initial Model Assumption
To construct the initial network model of each chemical, we utilized our newly developed method.One of the distinguishing features of our new method is its ability to include the cyclic structure in the network model.Cyclic regulation, such as feedback regulation, is considered to be important for living cells to control normal gene expression, and the new method is useful to detect the cyclic regulation from the gene expression data.The initially constructed models are shown in Figure 2. The initial model of TCDD was the most complex structure.The components of the constructed models were 9 genes with 19 relationships in Acrylamide, 8 genes with 12 relationships in Diethylnitrosamine, 9 genes with 23 relationships in TCDD, and 8 genes with 10 relationships in Thalidomide.
There are some obvious features in the network diagram of each initial model.The numbers of exogenous and endogenous genes are different from each other.In the initial Acrylamide model, four genes were arranged as exogenous variables, but only Oct3/4 was arranged as the last endogenous variable.Thus, it is considered that acrylamide quickly affected the expression of many genes, and only one gene was affected later.In contrast, only one gene was arranged as an exogenous variable and many genes were arranged as the last endogenous variables in the initial Thalidomide model.These differences between the initial chemical models summarized the distinctive gene expression profiles for each chemical.The initial TCDD model involved some cyclic regulation, even though the other models had only hierarchical regulation.Before the calculation of SEM, all of the initial models were simplified, since the initial models included some duplicated interactions among the genes, such as direct interactions between two genes and indirect interactions between them.In the simplification process for the initial models, the longest path between two genes was retained, since the arrows indicated only time precedence, not causalities in the initial model.Therefore, the difference between direct and indirect interactions is not important.By retaining the longest paths, all of the preceding information was included, as the simplest diagram.

Inferred Networks by SEM
The final inferred networks for each chemical and the goodness of fit scores are depicted in Figure 3, and the estimated regression weights of the edges are displayed in Table 1.The inferred networks of the chemicals revealed distinct structures.The differences between the gene regulation by chemicals were clarified by the shapes of the inferred network models.The Acrylamide network was a centralized model, the Diethylnitrosamine network was a ladder-like model, the TCDD network was a closed circular structure, and the Thalidomide network was a diffusion type.One of the unique features of the inferred Acrylamide network was that many genes were arranged at the top phase in the regulatory network, and only one gene was arranged as the final result of all regulation in the network.On the other hand, the shape of the Diethylnitrosamine network looked like a ladder, and two serial regulations interacted with each other.One serial regulation started from Lmx1A, and the other started from Tuj1.These top phase genes were considered as signal input genes, and they were different from those in the Acrylamide and Thalidomide networks.For example, Tuj1 was arranged as a signal input gene in the Diethylnitrosamine network, but it was arranged as an output object in the Acrylamide and Thalidomide networks.The unique feature of the TCDD network is the involvement of some closed circular structures in the inferred model.Among the parts of the circular structure, the regulatory direction from GATA2 to Nodal was different from the other relationships.Furthermore, the regression weight between GATA2 and Nodal was estimated as a negative value.Thus, it was considered that the inferred regulation from GATA2 to Nodal reflected feedback control by GATA2.In the Thalidomide network, the shape of the network model was reversed, as compared to that of the Acrylamide network.Only two genes were arranged at the top phase in the regulatory network, but many genes were arranged at the middle phase in the model.This means that only a few genes are directly affected by thalidomide, but finally many genes are affected throughout the gene regulatory network.Table 1.Regression weight and probability of each edge

Discussion
Our inferred model revealed the differences between the gene regulation by environmental chemicals.Furthermore, the shapes of the network models reflected the different features of the chemical toxicities well.In the Acrylamide network, the effects of acrylamide toxicity finally aggregated to Tuj1, which is known to contribute to microtubule stability in neuronal cells (Rosenstein et al., 2003).Acrylamide is neurotoxic, and thus it is reasonable that its effect finally aggregated to a neuronal cell-related gene.
In the Diethylnitrosamine network, the cell differentiation genes were arranged from the middle to lower steps.This means that diethylnitrosamine disturbed normal cell differentiation in the embryonic stem cell.These harmful effects were considered to be caused by the carcinogenic genotoxicity of diethylnitrosamine (Ito et al., 1992;Puatanachokchai et al., 2006;Iatropoulos et al., 2006).On the other hand, the neuronal-related genes were arranged at a later phase in the TCDD network model.Although both diethylnitrosamine and TCDD have the same carcinogenic toxicities, their regulatory mechanisms were different.
From the Thalidomide network, it was considered that the receptors of thalidomide toxicity may be rarer than those of other chemicals.However, several types of genes are finally affected by thalidomide chemical toxicity.Among the cell differentiation genes, Nodal and Nanog are important for normal early embryonic development.Nodal is related to the development of the left-right axial structure (Hamada et al., 2002;Grandel & Patel, 2009), and its signaling pathway is important very early in development, for cell fate determination and many other developmental processes (Grandel & Patel, 2009).Nanog is a key factor for maintaining pluripotency in embryonic stem cells (Mitsui, 2003;Chambers et al., 2003).According to the abnormal expression of these cell differentiation-related genes, the thalidomide phenotype, with its harmful side effects, may occur in the human embryo.
We applied an improved SEM approach to reconstruct a gene regulatory model from the gene expression data in human embryonic stem cells, and we have shown that SEM is a powerful approach to estimate the gene regulation caused by chemical toxicity.The inferred networks clarified the differences between the gene regulation by chemicals, and the features of the chemical toxicities were well reflected in the network structures.Thus, the network construction by SEM is one of the useful approaches for inferring the regulatory relationships among genes.Furthermore, the inferred network among genes can be utilized for the estimation of a chemical's effect, from experimentally obtained expression profiles.The ability to identify expression profiles and the corresponding biological functions is expected to provide further possibilities for SEM in the inference of the effects of chemical toxicity on regulatory mechanisms.

Figure 1 .
Figure 1.Developed procedure for initial model construction function of the parameter  .Let  denote the covariance matrix of the error terms  , and G

Figure 2 .
Figure 2. Initial network models The initial models of the selected chemicals were constructed by the developed approach.One initial model was constructed for each chemical, since the initial model included summarized time-series information and concentration information.(a) Initial model constructed from all gene expression profiles with all Acrylamide exposure.(b) Initial model of Diethylnitrosamine.(c) Initial model of TCDD.(d) Initial model of Thalidomide.The numbers of genes in the initial models were 9 in Acrylamide, 8 in Diethylnitrosamine, 9 TCDD, and 8 in Thalidomide.

Figure 3 .
Figure 3. Inferred Toxic-dependent Networks The optimal model for each chemical, obtained by the developed SEM iteration procedure.A positive relationship between genes is displayed with a solid arrow.A negative relationship between genes is displayed with a dashed arrow.Gene names with blue characters indicate "neurodevelopment related genes", genes with red characters indicate "cell differentiation-related genes" and genes with black characters indicate "related to transcription of insulin".(a) Acrylamide model; (b) Diethylnitrosamine model; (c) TCDD model and (d) Thalidomide model.The fitting scores are displayed under each model.