Inferring Transcriptional Regulatory Relationships Among Genes in Breast Cancer : An Application of Bayes ’ Theorem

The introduction of Deoxyribonucleic acid (DNA) microarray technologies provides a means of measuring the expression of thousands of genes simultaneously. It has generally sought to revolutionalize biological research by significantly elucidating biological processes. Gene networks may be inferred from such microarray data. Bayes’ theorem, in this work is applied to the problem of inferring new transcriptional regulatory relationships among gene products in Breast Cancer. A compendium of human breast epithelial cell probe level microarray data from the Gene Expression Omnibus (GEO) repository was subjected to the Robust Multiarray Average (RMA) procedure for normalization and background correction. A subset of the resulting expression matrix consisting of the expression values of only relevant probe-set identifiers (IDs) representing the genes of interest in the data were extracted with a LISP code. This subset was supplied to a Bayesian Network inference learning algorithm to unearth new regulatory relationships from the data. Variations in parameters of the learning algorithms resulted in the prediction of at least 10 new relationships among genes in breast cancer. Among these were the direct regulatory signaling relationship between S-phase kinase associated protein 2 (SKP2) and the Cell division cycle 25A (CDC25A) and that between the cyclin-dependent kinases regulatory subunit 1 (CKS1B / CDC28) and E2F transcription factor 3 (E2F3). The identified causal networks are potentially useful for understanding complex drug actions and dysfunctional signaling in breast cancer.


Introduction
The process of inferring gene regulatory networks from gene expression data has brought on the need for experts and analytical scientists to develop several algorithms that seek to model cellular networks, protein signaling pathways and genetic data analysis for biological research (Friedman, 2004;Sachs, Perez, Pe'er, Lauffenburger, & Nolan, 2005;Beaumont & Rannala, 2004).The biomedical literature presents a plethora of algorithms that have been proposed to solve the problem of inferring networks.Clustering algorithms are used to visualize and analyze expression data whereby genes with similar expression profiles are grouped into clusters (Eisen, Spellman, Brown, & Botstein, 1998).Co-expressed genes are viewed as being functionally linked to each other at a higher chance.
At the other end of the spectrum of algorithms are the information theoretic approaches.This approach sought to correct the failure of clustering approach to capture more complex statistical dependencies between expression patterns by proposing the Mutual Information (MI) measure.The MI among all pairs of genes are computed, and inference of interaction is established by comparing it to a set threshold value (Butte & Kohane, 2000).However, the MI and the Pearson correlation may yield almost similar results (Stuer et al., 2002).The networks inferred from information theoretic approaches represent statistical dependencies but not necessarily direct causal interactions among genes.The Context Likelihood of Relatedness (CLR) algorithm, Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE) and Minimum Redundancy NETwork (MRNET) algorithms provide refinements to the MI and can distinguish between direct and indirect interaction networks (Faith, Hayete, Thaden, Mogno, & Wierzbowski, 2007;Margolin, Wang, Lim, Kustagi, & Nemenman, 2006;Meyer, Kontos, Lafitte, & Bontempi, 2007).
A non-probabilistic inferring algorithm is the Ordinary Differential Equation (ODE) approach (Gardner, di Bernardo, Lorenz, & Collins, 2003).This deterministic approach models a system of differential equations with each equation describing a gene as a function of other genes.Networks derived from the ODE-based technique show causal interactions.However a large number of experimental data are necessary for good performance.ODEbased algorithms such as Network identification by multiple regression and microarray network identification have been proposed in the literature (Gardner, di Bernardo, Lorenz, & Collins, 2003;Di Bernardo et al., 2005).Probabilistic graphical models such as Bayesian Networks and Gaussian graphical models have also been proposed to solve the problem of inferring gene regulatory networks from microarray data (Friedman, 2004;Pe'er, Nachman, Linial, & Friedman, 2008).The Gaussian graphical model assumes that gene expression values are normal variates within undirected networks.A robust approach to estimate the Gaussian graph for a high dimensional data is further presented in the literature (Ambroise, Chiquet, & Matias, 2009).The undirected nature of Gaussian graphical model makes it undesirable when it competes with other approaches such as Gene Network Inference with Ensemble of trees (Genie3) that predicts direct causal interaction but assumes availability of high dimensional data (Huynh-Thu, Irrthum, Wehenkel, & Geurts, 2010).
In this work, a network inference algorithm based on the Bayes' theorem, called Bayesian Network, is used to infer transcriptional regulatory relationships among genes in breast cancer.A Bayesian Network (BN) is a probabilistic graph model describing the multivariate probability distribution for a set of variables.Due to their probabilistic nature, BNs can cope with the noise present in microarray measurements, while their graphical nature makes it easy to convey linear and higher order dependencies between genes.The BN models direct causal relationships as a direct acyclic graph (DAG) when the Causal Markov Assumption holds.This assumption can be stated as: a variable X is independent of every other variable conditional on all its causes (Bansal, Belcastro, Ambesi-Impiobato, & Bernardo, 2007).This work further seeks to establish optimal settings for building a Bayesian Network using actual breast cancer data.BN algorithms are furthermore suited for the inferring since they can be extended to dynamic Bayesian networks (DBN) to model time series and data that incorporates feedback (Friedman, Murphy, & Russell, 1998).

Bayesian Networks
A BN, also called Belief Network, is an augmented directed acyclic graph (DAG) represented by the nodes or variables, X i for i = 1, 2, . . ., n connected by directed edges which denotes relationships among the variables.The nodes denote the genes in a gene network.Figure 1 is an example of a BN describing a gene regulatory network.T is regulated by M and L is regulated by S and M. T is conditionally independent of L whereas M and S are independent.Furthermore, M is said to be a parent of T and both S and M are parents of L.
The relationships between variables are described by a joint probability distribution, P(X 1 , X 2 , . . ., X n ), expressed in terms of the conditional probabilities.Given that the Markov assumption holds, then the resulting P(X 1 , X 2 , . . ., X n ) is given by Equation (1).
where parents(X i ) denotes the set of parents for each variable X i .This rule is based on the Bayes' theorem: (A|B) = P(B|A)P(A) P(B) . (2)

Learning BN Structure
The BN procedure involves finding a DAG, G, which best fits the dataset, D. Algorithms for learning BN consist of two components namely a scoring metric and a search technique.The Scoring metric computes how well a proposed network fits the data whereas a search procedure tries to identify network structures with high scores traversing networks.

Scoring Metric
In general, the Bayesian scoring metric that evaluates the network is described as where D is an assumed multinomial sample.Conditioning the network parameters with Q, then P(D|G) is expressed as With these bases, the Bayesian Dirichlet equivalence metric (BDe) was developed to avoid overfitting the data (see Buntine, 1991;Heckerman, Geiger, & Chickering, 1995;Cooper & Herskovits, 1992 for details).The BDe assumes event equivalence property which implies that the scores of two isomorphic belief networks are equal (score equivalence).The sum up of the likelihood equivalence assumption is: the data does not discriminate equivalent structures.Furthermore, the BDe metric requires the prior over parameters P(Q|G) to have Dirichlet prior distributions and a uniform prior P(G).The BDe is used in this work since it discriminates the simple and complex structures just as the Occam's Razor at work.

Searcher Methods
The search space for Bayesian network structures is the set of all possible DAGs that fits the data set.The number of all possible static DAGs given the number of nodes as determined by Robinson (1977) is given in Equation ( 5): for n > 2 where f (0) = 1 and f (1) = 1.
Searching for all possible networks and generating high scoring networks is an NP-hard problem and therefore heuristic methods have been proposed (Chickering, 1996).Generally, heuristic methods perform any of the following at each iteration: an existing link between variables can be reversed or removed, or new links may be introduced to connect variables that were not connected previously while ensuring an acyclic network.In this work two searchers, namely: Greedy Hill Climbing and Simulated Annealing are used to learn the BN of the breast cancer data.

Greedy Hill Climbing Search
This method initiates with an initial structure also called prior network, considers all possible changes to it and the resulting network (neighbour) with the highest score is then used to initiate another search.If no neighbour has a score higher than current structure, then the algorithm stops with the current structure as optimal.The greedy search may fail to reach the global optimal network as it always stops on a local optimum.To circumvent this limitation of the method, it is advisable to repeat the process several times with different random initial networks and choosing the best local optimum.In this case, the maximum number of restarts or specified period may be set for the search to terminate.

Simulated Annealing Search
This method starts with an initial network, accepts changes to existing network and improves the current network if its score is higher or if its score is lower but with a probability based on a system parameter, T, known as "the temperature".At the start of the process, a lot of changes are accepted but reduces as the value for T is lowered gradually.This allows enough exploration of the search space so that the problem of terminating at a local optimum is solved.The mechanism of the simulated annealing search is further explored by Heckerman (1995) and Janzura and Nielson (2006).

Inferring Networks From Data
The process of learning the Bayesian structure from the breast cancer data is shown in Figure 2. A compendium of microarray data was generated from the Gene Expression Omnibus (GEO) record numbers GDS3716, GDS3139 and GSE21947.The GDS3716 data sets consist of data obtained from samples of reduction mammoplasty breast epithelium of patients, samples of histologically normal breast epithelium from prophylactic mastectomy patients, samples of histologically normal breast epithelium from Estrogen Receptor -Negative (ER-) breast cancer patients, and samples of histologically normal breast epithelium from Estrogen receptor -positive (ER+) breast cancer patients.The GSE21947 data sets consist of data obtained from gene expression patterns in histologically normal breast epithelium of breasts containing estrogen receptor positive (ER+) and estrogen receptor negative (ER-) in cancer patients.The GDS3139 data set included samples obtained from epithelium adjacent to a breast tumor and patients undergoing reduction mammoplasty without apparent breast cancer.The data set on breast cancer will enable the results to have direct interpretation in relation to cancer.
The resulting compendium contains the gene expression microarray experiment data from human breast epithelial cells under a variety of conditions.These arrays are based on the Affymetrix Human Genome U133A platforms.Probe level data derived from microarray experiments may harbour obscuring variations (Hartemink, Gifford, Jaakola, & Young, 2001).In all there were 22,283 probe sets and 89 arrays from the human genome U133A after the .CEL files format were downloaded and subjected to Robust Multi-array Average for normalization and background correction (Irizarry et al., 2003).This eliminates the effects of all obscuring variations in the data derived under different conditions.Expression data associated with the relevant probe set IDs representing the genes of interest were extracted with a LISP code.The previously known interactions of these genes in cell cycle were derived using reactome by Joshi-Tope et al. ( 2005) to set up an initial structure (prior) of 36 relationships (edges) in learning the static BN structures (See Table 1).The genes are also referred to as the variables or nodes of the network.Each variable at source node (second column) regulates its corresponding target variable (in the third column).The variables are denoted by gene symbols of genes or transcription factor.

Results
The results herein are presented in two parts based on the search techniques: Simulated Annealing and Greedy Search techniques.In implementing any of these methods, two different edge change methods namely, All-Local-Moves and the Random-Local-Moves were varied.All-Local-Moves involves composing a list of all available local moves (adding or subtracting of a parent, or the reversing of a parent relationship), given the current state of the network, and then selects the move that yields the highest scoring network.The disadvantage in this approach is that it is computationally expensive compared with experiments with the Random-Local-Move.On the other hand, Random-Local-Move selects a move at random from all possible local moves.In all these operations, the acyclic requirements of networks are satisfied.The computations to derive the networks are manually infeasible and so we implement these concepts of BN structure learning with Banjo (Sladeczek, Hartemink, & Robinson, 2008).All algorithms were made to search for best BN structures over 2 hours since Janzura and Nielson (2006) showed that increasing time for a simulated annealing search does not guarantee an improvement in its outcome.However the visualization of the optimal networks of predictor relationships was developed with Cytoscape (Shannon et al., 2003).

Simulated Annealing With All-Local-Moves (SAA)
The optimal network showed 69 predictory relationships indicated by the edges as shown in details in Figure 3.The predicted relationships by SAA included the initial relationships.See Table 2 for details of other observations of results.This method predicted ATM regulated TP53 when initial structure with five dropped edges (including ATM regulates TP53) was used to learn BN structure from the data.The gene symbols are the nodes or variables of the network and the regulatory relationships are denoted by the directed edges.All the genes of interest were found to be present in the optimal network with 69 established regulatory relationships (directed edges).For instance, directed edge from CDC34 to E2F3 indicates that CDC34 regulates E2F3.The observations are derived from the Bayesian Network inference from a microarray dataset of 24 Variables/genes and 89 arrays.The Edges recovered represent the number of 5 excluded edges prior to Bayesian Network inference from data.SAA is Simulated Annealing with all-local-moves, SAR is Simulated Annealing with random-localmoves, GA is Greedy all-local-moves and GR is Greedy random-local-moves.

Simulated Annealing With Random-Local-Moves (SAR)
The optimal network comprised of 85 edges including the 36 initial relationships (edges) which is shown in details in Figure 4. See Table 2 for details of other results.SAR method predicted that ATM regulates TP53 and CDK4 regulates E2F3 when initial structure with four dropped edges (including ATM regulates TP53 and CDK4 regulates E2F3) was used to learn BN structure from the data.The gene symbols are the nodes or variables of interest and the regulatory relationships denoted by the directed edges.For example, the gene with symbol TP53 regulates the gene with the symbol CDKN1A.

Greedy With All-Local-Moves (GA)
The BN structure learning with the Greedy search predicted 46 relationships (edges) in the revealed optimal network (see Figure 5).See Table 2 for details of other GA results.However, GA failed to predict any dropped relationship when five edges of the initial structure were dropped and the remaining structure used for the learning BN with GA. various approaches may be subject to validation by literature and laboratory experimentation.
Results from all four methods above predicted that SKP2 regulates CDC25A and CKS1B regulates E2F3.These discoveries are supported by report in the biomedical literature (Hartman et al., 2009;Zimmermann et al., 2011).
Results from SAR and GR showed 50 similar predicted relationships.Prominent among these is SKP2 regulates CDK2.This is supported by literature in Patsoukis et al. (2012).The fact that some of the consensus predicted relationships have been established in existing literature makes for greater confidence in all the other predictions.
Figure 7 shows several other equally predicted relationships.The new predicted regulatory relationships (Figure 7) provide further insights into underlying biological processes leading to the breast cancer.Cancer is failure of controls over cellular births (through cell cycle) and deaths.Since the predicted relationships among the genes and transcription factors are involved in cell cycle, the suggested regulatory relationships are key components in elucidating the etiology of breast cancer and consequently identifying new therapies for the disease.In particular, CDC25A is required for progression from G1 to S-phase of the cell cycle as it activates the CDK2.Therefore the inferred novel indirect regulatory relationship between TP53 and CDC25A through CDKN1A is particularly interesting, suggesting the added role of TP53 in cell cycle progression in breast cancer.The predicted regulatory relationships between SKP2 and CDC34, and CDC25A further suggest a possible cause for the dysregulation of the cell cycle genes leading to tumorigenesis.Furthermore, observation of DNAs from different cancer patients generally contains the mutation or deletion of TP53 which identifies its role in tumor suppression.Therefore, the predicted relationships involving TP53, CKS1B and MDM2 are of particular importance since they affirm earlier work which suggested that inhibiting TP53-MDM2 interaction will be fascinating new anticancer agents necessary for activating some TP53 in tumors (Hamzehloie, Mojarrad, Hasanzadeh Nazarabadi, & Shekouhi, 2012).

Conclusion
Inferring BN with Simulated Annealing has the best performance based on scores compared to that with greedy search which is biased towards selecting only high scoring networks.In particular, Simulated Annealing with Random-Local-Moves and Greedy with Random-Local-Moves have equal true recovery rates as both recovered two of the dropped or excluded initial relationships (edges).Therefore Simulated Annealing with Random-Local-Moves and Greedy with Random-Local-Moves are useful for effective inference with BN.Consensus predictions among four different search strategies make for greater confidence in the inference.A regulatory signaling relationship is predicted by consensus between S-phase kinase associated protein 2 and the Cell division cycle 25A.This interaction prediction occurs directly.Another regulatory signaling relationship is predicted by consensus between the CDC28 protein kinase regulatory subunit 1B (CDKN1B) and E2F transcription factor 3. This interaction prediction occurs directly.There is also a consensus predictions from Simulated Annealing with random local moves and greedy with random local moves: a regulatory signaling relationship is predicted by consensus between S-phase kinase associated protein 2 and the cyclin-dependent kinase 2. This interaction prediction occurs directly.That is BN structure learning methods can be valuable in harnessing signaling relationship studies in genes given a sample data.

Figure 1 .
Figure 1.Example of BN

Figure 2 .
Figure 2. The procedure of inferring relationships from data Probe level data is subjected to Robust Multi-array average (RMA) to make the data set ready for the Bayesian Network Inferring algorithms.The relevant data set of the genes of interest is extracted from the entire results of the RMA with a LISP code.

Figure 3 .
Figure 3. Network generated via simulated annealing with all-local-moves

Figure 4 .
Figure 4. Network generated via simulated annealing with random-local-moves

Figure 5 .
Figure 5. Network generated via greedy search with all-local-movesThe gene symbols are the nodes or variables of interest in the network and the directed edges represent the regulatory relationships.For example gene with the symbol CDKN1B regulates the gene with the symbol CDC25A.

Figure 7 .
Figure 7. Consensus predictions of greedy and simulated annealing with random-local-moves The gene symbols are nodes or variables of interest and the regulatory relationships are indicated by the directed edges as predicted by both the Greedy and the Simulated Annealing with random local moves.Common predictions by Bayesian Network structure learning with Greedy and the Simulated Annealing indicate the greater confidence in these prediction.In particular, MDM2 and CKS1B all directly regulate TP53.

Table 1 .
Initial regulatory relationship

Table 2 .
Performance of BN with simulated annealing and greedy search methods