A Label Gene Extraction Algorithm Based on Bipartite Network Structure

We first preprocess the original gene data and then analyze the sample gene information in two steps. The first step removes the unrelated genes. The second step applies a label gene extraction algorithm based on bipartite network structure to the candidate genes and yields a gene interaction network. Finally, label genes are extracted from this gene interaction network.


Introduction
The occurrence and development of cancer is a complex process involving the regulation of gene expression. With the development of large-scale gene expression profiling technology, the expression levels of thousands of genes can be measured simultaneously from tissue samples using DNA chips. The difficulty is to find, among the thousands of genes measured by a DNA chip, the set of genes that determines the character of a sample, that is, the "information genes" or "label genes" (informative genes). Finding this set is the key to establishing an effective classification model, correctly identifying tumor types, giving reliable diagnoses and simplifying experimental analysis, and it is of great significance for the study of cancer pathogenesis, diagnosis and treatment.
Given the importance of selecting classification information for tumors, a substantial research literature on this problem has appeared. Golub et al. (1999) used the signal-to-noise ratio as the indicator for feature extraction and studied the classification of two subtypes of leukemia with a voting method. Guyon (2000) put forward a new feature extraction method that builds on Golub's study and uses the support vector machine as the classification tool; with only eight genes as classification features it achieved 100% classification accuracy. Singh and others (2002) used the same feature extraction indicator as Golub together with the k-nearest neighbor method to classify prostate cancer gene data. Since then, Ruan Xiaogang and others have used pattern recognition methods, computational techniques, a floating sequential search algorithm and support vector machines to identify more precisely 5 label genes that distinguish the subtypes of acute leukemia. For tumors with several subtypes, Khan (2001) applied a neural network method to the diagnosis of the four subtypes of small round blue cell tumor (SRBCT), extracted 96 characteristic genes and obtained good results. Tibshirani (2002) extracted 43 characteristic genes with the nearest shrunken centroid algorithm, which was able to identify the subtypes of 20 blind samples. Yeo and Poggio (2001) decomposed the four-class problem into multiple two-class problems, using the k-nearest neighbor algorithm (KNN), the weighted voting method and the support vector machine. Other researchers have also proposed their own statistical methods (Fu K, Iqbal J, Chan W C, 2005; Van't Veer L J, Dai H, van de Vijver M J, et al., 2002; Zhang H, Yu C Y, Singer B, 2003).
Gene correlation measures are used to assess the relevance among genes, and the quality of the measure determines to some extent the success or failure of gene selection. Such measures are common in statistics, machine learning and data mining, and include both parametric and non-parametric estimates. Parametric measures such as the t-test (Baldi P, Long AD, 2001) and non-parametric measures such as TNOM (Ben-Dor A, Bruhn L, Friedman N, et al., 2000), MDMR (Park P J, Pagano M, 2001), WEPO (Chuang H Y, Tsai H K, Tsai Y F, et al., 2003) and SAM (Tusher VG, Tibshirani R, Chu G, 2001) are widely used on microarray expression data. Jeffery (2006) applied 10 different statistical measures to 9 microarray data sets and computed the intersection of the top 50, top 100 and top 200 genes respectively; the gene rankings produced by the 10 measures shared only 8%-21% of their genes.
Over the last ten years, complex networks, an emerging interdisciplinary field closely tied to physical, biological and social problems, have attracted more and more attention. Understanding the topology, dynamics and function of complex networks is the main objective of current research. By research area, complex networks can be divided into social networks (such as the film actor collaboration network), information networks (such as the World Wide Web), technological networks (such as power networks) and biological networks (such as protein networks) (Li Ze, Bao Lei, Huang Y.W., 2002).
Depending on the nature of their nodes, complex networks can be divided into single-mode, dual-mode and multi-mode networks. A single-mode network has only one kind of node. A dual-mode network contains two distinct kinds of nodes; edges exist only between nodes of different kinds and never between nodes of the same kind, so such a network is also known as a bipartite network. A multi-mode network contains more kinds of node sets. Bipartite networks are common in real life (the film actor network is one example) and have recently received a lot of attention; for example, Zhou Tao (2007) considered the topology of bipartite networks and designed a personalized recommendation algorithm based on network structure. Recognizing the importance of bipartite structure in networks, this article attempts to use a bipartite network to search for label genes.

Classification independent gene filtering
Different genes show different degrees of variation in their expression levels. For some genes, the distribution of expression levels shows no obvious difference in value or variance between the ALL and AML classes. These genes increase the computational complexity of searching for information genes and provide no useful information for distinguishing sample types; they are unrelated to the sample class, so it is necessary to remove these independent genes. The t-test is a representative parametric statistical measure. If the gene expression data contain two classes of samples, the t value $t(g_i)$ of gene $g_i$ is given by the following formula:

$$t(g_i) = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}}$$

In the formula, $\mu_r$ and $\sigma_r$ are respectively the mean and standard deviation of gene $g_i$ in class $r$, and $n_r$ is the number of samples in class $r$, $r = 1, 2$. The higher the value of $t(g_i)$, the stronger the relevance of gene $g_i$ to the classification. The t-test can only be used when there are two classes of samples. Similar measures include the Fisher criterion, the F-test and S2N (signal-to-noise ratio).
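The t value described above can be sketched in a few lines of Python. This is a minimal illustration only: it assumes the usual unequal-variance form (difference of class means divided by the square root of the summed per-class variance over sample count), and the function name `t_statistic` and the expression values are hypothetical, not taken from the data set used in this article.

```python
import math
from statistics import mean, variance


def t_statistic(class1, class2):
    """Two-sample t value for one gene (unequal-variance form, assumed)."""
    n1, n2 = len(class1), len(class2)
    mu1, mu2 = mean(class1), mean(class2)
    # statistics.variance returns the sample variance (divides by n - 1)
    var1, var2 = variance(class1), variance(class2)
    return (mu1 - mu2) / math.sqrt(var1 / n1 + var2 / n2)


# Hypothetical expression levels of two genes in two sample classes
gene_a_class1 = [2.0, 2.2, 1.8, 2.1]
gene_a_class2 = [5.0, 5.3, 4.9, 5.2]
gene_b_class1 = [3.0, 3.4, 2.6, 3.1]
gene_b_class2 = [3.1, 2.9, 3.3, 2.8]

t_a = t_statistic(gene_a_class1, gene_a_class2)  # large |t|: classes differ
t_b = t_statistic(gene_b_class1, gene_b_class2)  # |t| near 0: same mean
```

Gene A separates the two classes clearly and receives a large absolute t value, while gene B has the same mean in both classes and would be a candidate for removal.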
Microarray gene expression data are large, usually covering tens of thousands of genes with a very limited sample size. Methods such as the t-test are too simple for this setting, and their parameter estimates are unstable during selection, so relying on the mean and standard deviation is likely to lose statistical information, which is not conducive to label gene extraction. To address these problems of the t-test, Tusher (2001) put forward the SAM algorithm, which defines the statistic

$$d(g_i) = \frac{\mu_1 - \mu_2}{s_i + c}$$

where $s_i$ is the standard error of the difference between the class means of gene $g_i$. A small constant $c$ is added to the denominator to avoid the small-FC problem. Because the statistic $d$ does not follow a t distribution, Tusher used a permutation test to control the FDR directly. Efron discussed the selection of the constant $c$ and confirmed that in most tests, choosing the 90% quantile of the standard deviations of all gene expression values achieves good results. In our tests, for simplicity, we set $c = 1$.
We calculate the statistic d for each gene and plot the results as a histogram. As Figure 1 shows, the classification weights of most genes are less than 0.4. In this article we select the 127 genes whose weights are greater than 0.4; the other genes are deemed useless and removed from the original sample.
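The filtering step can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this article: the denominator uses the pooled standard error from Tusher's SAM formulation, the weight of a gene is taken to be the absolute value of d, and the function name `sam_statistic`, the gene names and the expression values are all hypothetical. Only the constant c = 1 and the cutoff 0.4 come from the text.

```python
import math
from statistics import mean


def sam_statistic(class1, class2, c=1.0):
    """SAM-style statistic d: mean difference over (standard error + c)."""
    n1, n2 = len(class1), len(class2)
    mu1, mu2 = mean(class1), mean(class2)
    # Pooled sum of squared deviations over both classes (assumed form)
    ss = sum((x - mu1) ** 2 for x in class1) + sum((x - mu2) ** 2 for x in class2)
    s = math.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
    return (mu1 - mu2) / (s + c)


# Hypothetical expression levels: gene -> (class 1 samples, class 2 samples)
expression = {
    "gene_A": ([1.0, 1.2, 0.8, 1.1], [4.0, 4.3, 3.9, 4.2]),
    "gene_B": ([1.00, 1.01, 0.99, 1.00], [1.05, 1.06, 1.04, 1.05]),
    "gene_C": ([2.0, 2.5, 1.5, 2.2], [2.1, 2.4, 1.6, 2.0]),
}

# Weight of each gene, taken here as |d| with c = 1 as in the text
weights = {g: abs(sam_statistic(a, b, c=1.0)) for g, (a, b) in expression.items()}

# Keep only the genes whose weight exceeds the 0.4 cutoff
selected = [g for g, w in weights.items() if w > 0.4]

# Without the constant, the tiny variance of gene_B inflates its score
unregularized_b = abs(sam_statistic(*expression["gene_B"], c=0.0))
```

In this toy data, gene_B has a negligible difference between classes but an even smaller variance, so it scores highly when c = 0 and is correctly filtered out once the constant is added.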

Label gene extraction and result analysis
According to the nature of its nodes, the network can be divided into two different sets of nodes: a gene set and a sample set. Since the unrelated genes have been filtered out in the first step, we assume that all remaining genes are associated with the disease and that the interactions among these genes only reinforce, and never offset, one another. Disease is always the result of the interaction of multiple genes. To better identify the label genes, we first need to determine the strength of the interaction between genes and then look for particular genes that interact with many others, that is, whether the gene interaction network contains core nodes. Such core nodes are essential to the robustness of the gene interaction network (Xiaofeng Liu, Chen Guohua, 2007).
A gene interaction network can be obtained through bipartite network projection. Since in this article we only need to study interactions between genes, only the projection from the sample set onto the gene set is required. If two genes have a common neighboring node, the two genes are joined by an edge; otherwise they are not joined. See Figure 3.
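The projection rule above (two genes are joined when they share at least one neighboring sample) can be sketched as follows. The dictionary representation, the function name `project_onto_genes` and the sample and gene names are hypothetical choices made for illustration.

```python
from itertools import combinations


def project_onto_genes(sample_to_genes):
    """Project a sample-gene bipartite network onto the gene set.

    Two genes are joined by an edge if at least one sample is a
    neighbor of both. Edges are returned as sorted (gene, gene) pairs.
    """
    edges = set()
    for genes in sample_to_genes.values():
        # Every pair of genes attached to the same sample shares a neighbor
        for g1, g2 in combinations(sorted(genes), 2):
            edges.add((g1, g2))
    return edges


# Hypothetical bipartite network: each sample lists its neighboring genes
bipartite = {
    "sample_1": {"g1", "g2"},
    "sample_2": {"g2", "g3"},
    "sample_3": {"g4"},
}

gene_edges = project_onto_genes(bipartite)
```

Here g1 and g3 are not joined because no sample is adjacent to both, and g4 remains isolated in the projected network.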
Whether or not disease occurs is closely related to the level of gene expression. For genes i and j, the interaction between gene i and gene j contributed by a sample is determined by their respective expression levels in that sample. The interaction strengths between all pairs of genes in the gene interaction network therefore form an $n \times n$ interaction strength matrix $I$, which is symmetric. In the upper triangle of the gene interaction matrix, some gene pairs have very large strengths, which indicates very high levels of gene expression; these are exactly the genes we need to find.
Using the SAM algorithm to remove unrelated genes leaves 127 related genes. We compute the gene interaction strength matrix $I_{sam}$ from (3) and (4), extract its upper triangle and flatten it in row order into a one-dimensional sequence of gene pairs, shown in Figure 4. Figure 4 shows that some gene pairs have very strong interaction strength, and for these the probability of disease occurrence is very high. We set a threshold to identify these notable gene pairs. With a threshold of 0.6 we obtain the gene pairs (2 3), (2 4), (2 94), (76 105) and others, 27 pairs in total (the numbers are serial numbers only, not the labels used in the source data). From the relationships among these 27 gene pairs and the strength of the interactions between them, we build the network shown in Figure 5. Figure 5 clearly shows that the network constructed from the filtered genes contains some nodes of very large degree. Under our assumption, these high-degree nodes may have the greatest influence in inducing disease, so the label genes can be found by visual inspection of the network topology. In Figure 5, H05899 and L11706 are the label genes we need. After these nodes are removed, the robustness of the network changes markedly (see Figure 6): many isolated nodes appear in the network, which gives reason to believe that the interactions of gene H05899 with other genes induce the disease.
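The thresholding and hub-inspection steps can be sketched as follows. The strength values and gene names are hypothetical and the helper names `build_network` and `isolated_after_removal` are introduced only for illustration; the threshold 0.6 is the one used in the text. Picking the single highest-degree node with `max` is a simple stand-in for the visual inspection described above.

```python
from collections import defaultdict


def build_network(strength, threshold):
    """Keep gene pairs whose strength exceeds the threshold.

    Returns an adjacency mapping: gene -> set of neighboring genes.
    """
    adjacency = defaultdict(set)
    for (g1, g2), s in strength.items():
        if s > threshold:
            adjacency[g1].add(g2)
            adjacency[g2].add(g1)
    return dict(adjacency)


def isolated_after_removal(adjacency, removed):
    """Genes left without any neighbor once `removed` is deleted."""
    return sorted(
        node for node, neighbors in adjacency.items()
        if node != removed and not (neighbors - {removed})
    )


# Hypothetical upper-triangle entries of an interaction strength matrix
strength = {
    ("A", "B"): 0.9, ("A", "C"): 0.8, ("A", "D"): 0.7,
    ("B", "C"): 0.65, ("C", "D"): 0.3, ("D", "E"): 0.2,
}

network = build_network(strength, threshold=0.6)
degrees = {node: len(neighbors) for node, neighbors in network.items()}

# The highest-degree node is the candidate label gene
hub = max(degrees, key=degrees.get)

# Removing the hub shows how much the network depends on it
isolated = isolated_after_removal(network, hub)
```

In this toy network, gene A has the largest degree, and removing it leaves gene D isolated, which mirrors on a small scale the loss of robustness described for Figure 6.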
Applying bipartite network projection to the 127 filtered candidate genes yielded 16 characteristic genes (Table 1).
In the existing research literature on colon cancer (Gennadi V. Glinsky, Yelena A. Ivanova, Anna B. Glinskii, 2003; Li Jiangeng, Gao Zhikun, Ruan Xiaogang, Yan Chi, 2009; Ian W. Taylor, Rune Linding, et al., 2009; Han-Yu Chuang, Eunjung Lee, et al., 2007), many scholars have identified label genes. Comparison with this literature shows that many of the colon label genes obtained by the label gene extraction method based on bipartite network structure (SIGABN) coincide with the label genes reported in those studies, which indicates that the method is effective. The remaining genes await further validation by medical researchers.

Analysis and discussion
In this article, the selection of classification information genes is divided into two steps: independent gene filtering and removal of redundant genes (label gene extraction). Removing redundancy does not increase the classification information contained in the features. Therefore, independent gene filtering should be based on a comprehensive analysis of the classification information contained in the gene samples, so that genes carrying important information are not filtered out. Bipartite network projection of the filtered candidate label genes yields a gene correlation network. In this network, the impact of a gene on the network reflects its importance and can be used to identify label genes. Although the algorithm in this article is simple, the test results are good; whether it is useful in other settings awaits further validation by medical researchers.

Here n represents the number of genes and m represents the number of samples ... of gene i in sample j. Figure 2 is a sample bipartite network.

Figure 1. Histogram of the gene SAM statistic

Table 1. Label genes selected in this article that can be verified in the literature