A Novel Approach of Multiple Submodel Integration Based on Decision Forest Construction

Limin Wang
College of Computer Science and Technology, JiLin University
Changchun 130012, China
Tel: 86-431-8517 2081 E-mail: wanglim@jlu.edu.cn

Xiaolin Li (Corresponding author)
School of Business, Nanjing University
Nanjing 210093, China
Tel: 86-431-8517 0836 E-mail: lixl_126@126.com

Yuting Mao
Information dissemination Academy of Engineering, ChangChun University of Technology
Changchun 130012, China
Tel: 86-4318571 6001 E-mail: valeriazuo@126.com

Abstract
An analytical general solution is derived for reasoning about uncertain knowledge by multiple submodel integration. By choosing a decision rule for each specific instance, a decision forest rather than a single tree is constructed, so that all relatively independent attribute sets can be determined automatically without any human intervention. Any necessary discretization of a mixed-mode subset is performed with a post-discretization strategy to minimize information loss.

The volume of data available for discovering decision rules and recognizing patterns is growing at an exponential rate, both in the number of attributes (features) and in the number of objects (instances). One way to reduce the computational complexity of knowledge discovery is dimensionality reduction, which includes projection pursuit, factor analysis, and principal components analysis. In artificial intelligence, decomposition methodology is a major tactic both for ensuring a transparent end product and for avoiding combinatorial explosion. For this purpose the conditional independence assumption has been widely used, e.g. in Bayesian network structure learning. Despite its popularity, however, the independence assumption always supposes that all attributes are discrete, so continuous attributes have to be discretized before learning, even at the cost of information loss. Wang et al. reported an unsteady special solution, which supposed that the continuous feature subset and the discrete subset are independent. This paper presents an analytical general solution that further handles mixed-mode subsets, based on decision forest construction, to divide the original feature space into several parts automatically.
Suppose the instance space $D$ with mixed-mode data has two types of attributes: the first $k$ attributes are continuous and the others are discrete. After pre-discretization, the conditional independence assumption can be expressed as:
where lower-case letters denote specific values taken by the corresponding attributes (for instance, $x_i$ represents the event that $X_i = x_i$), $\Delta_i$ is an arbitrary interval of the values of attribute $X_i$, and $P(\cdot)$ refers to probability. The following result can be formulated based on Bayes' theorem and the differential theorem:
where the remaining factor is a constant irrelevant to the class value.
$p(\cdot)$ refers to the probability density function. Maximum likelihood estimation is used to estimate the probabilities and joint probabilities, and kernel-based density estimation is used to estimate the conditional probability density function: (3)
where the estimate is built from the values that attribute $X_i$ takes when $C = c$, $h_i$ is the corresponding kernel width, and $m$ is the number of training instances with $C = c$.
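The two estimators named above can be illustrated with a minimal Python sketch. Eq. (3) is not reproduced in this text, so the Gaussian kernel, the function names, and the argument layout below are assumptions chosen for illustration rather than the authors' exact formulation.

```python
import math

def gaussian_kernel(u):
    # Standard normal density; the kernel choice is an assumption,
    # the text only says "kernel-based density estimation".
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def conditional_density(x, class_values, h):
    """Kernel estimate of p(x | c).

    class_values: the m values of attribute X_i observed when C = c.
    h: kernel width h_i for this attribute.
    """
    m = len(class_values)
    if m == 0 or h <= 0:
        raise ValueError("need at least one value and a positive kernel width")
    return sum(gaussian_kernel((x - v) / h) for v in class_values) / (m * h)

def mle_probability(value, observations):
    """Maximum likelihood estimate of P(X = value): the relative frequency."""
    if not observations:
        raise ValueError("need at least one observation")
    return observations.count(value) / len(observations)
```

With a single training value the density estimate reduces to one scaled kernel centered on that value, which is a quick sanity check on the normalization by $m h_i$.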
Let $\{T_1, \ldots, T_p\}$ denote a decomposition of the attribute set $A$ into $p$ mutually independent subsets, each containing either discrete and pre-discretized attributes or continuous input attributes. The aim of classification is to choose the class that maximizes the posterior probability. An analytical general solution based on the conditional independence assumption can be obtained as follows: (4)
where $t_i$ denotes any reasonable combination of attribute values in subset $T_i$, and $P(c|t_i)$ denotes the classification accuracy of the submodel constructed from $t_i$. Both $t_i$ and $P(c|t_i)$ can be determined flexibly during the learning procedure of the decision forest, which is composed of a set of decision trees. Attributes in the same tree should be dependent, while independent classification rules should lie in different trees, as the independence assumption suggests.
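Eq. (4) is not reproduced in this text. As a hedged illustration, if the subsets $T_1, \ldots, T_p$ are conditionally independent given the class, Bayes' theorem gives $P(c|t_1,\ldots,t_p) \propto P(c)^{1-p} \prod_i P(c|t_i)$. The Python sketch below combines submodel outputs under that assumption; the function name and the dictionary layout are hypothetical.

```python
import math

def integrate_submodels(priors, submodel_posteriors):
    """Combine per-submodel posteriors P(c | t_i) into one posterior.

    priors: dict mapping class -> P(c).
    submodel_posteriors: list of dicts, one per subset T_i, mapping class -> P(c | t_i).
    Assumes the subsets are conditionally independent given the class, so that
    P(c | t_1..t_p) is proportional to P(c)^(1-p) * prod_i P(c | t_i).
    """
    p = len(submodel_posteriors)
    log_scores = {}
    for c, prior in priors.items():
        probs = [post.get(c, 0.0) for post in submodel_posteriors]
        if prior <= 0 or any(q <= 0 for q in probs):
            continue  # a zero factor makes the combined score zero
        log_scores[c] = (1 - p) * math.log(prior) + sum(math.log(q) for q in probs)
    if not log_scores:
        raise ValueError("every class received a zero score")
    top = max(log_scores.values())  # subtract the maximum for numerical stability
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnorm.values())
    return {c: unnorm.get(c, 0.0) / total for c in priors}

def classify(priors, submodel_posteriors):
    """Return the class that maximizes the combined posterior."""
    posterior = integrate_submodels(priors, submodel_posteriors)
    return max(posterior, key=posterior.get)
```

With a single submodel the combined posterior equals that submodel's own posterior, which matches the intuition that integration only matters when several independent rules apply to the same instance.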
The original information entropy of the class attribute $C$ for the instance space $D$ is: (5)
The information entropy of $C$ for the subspace $D'$ that satisfies $X_i = x_i$ is:
The Gini index defined above considers only the information that each attribute value, rather than a specific attribute, provides about the class label. Since it is applicable to both continuous and discrete attributes, the information loss caused by pre-discretization can be effectively avoided. The first part of Eq. (6), which describes the information entropy of the class label itself, is the same for all attribute values. Thus only the second part of Eq. (6) needs to be considered during the test selection procedure, that is, the conditional information entropy is to be maximized. The construction procedure of the decision forest can be described as follows:
Input: Training set $D$ with $n$ predictive attributes and $N$ instances.
Output: Decision forest composed of at most $n$ decision trees.
1. For any given instance, sort all attribute values according to the Gini index defined in Eq. (6) and select the value $x_i$ that maximizes the Gini index as the root node.
2. For continuous values, determine the discretized interval $\Delta_i$ and the scope of the next subspace so as to minimize information loss.

3. Search for the next attribute value to serve as the branch node in the instance subspace that satisfies $X_i = x_i$ if $X_i$ is discrete, or in which the value of $X_i$ falls within the interval $\Delta_i$ from step 2 if $X_i$ is continuous, until the class label is the same for all instances or the height of the decision tree is $n$. Then a leaf node is generated. Each path from the root node to a leaf node corresponds to a classification rule whose precondition is the combination of all attribute values on that path.
4. Apply the learning procedure described above recursively; after $N$ iterations each instance has been assigned a classification rule.
5. Combine the rules that share the same root node, so that subtrees, or relatively independent classification rules, are determined automatically without any human intervention.
6. Prune the rule sets in each subtree repeatedly until pruning no longer improves classification accuracy.
7. Eliminate the rules that result in a high misclassification rate.
The result is a decision forest with greater power to express uncertain knowledge.
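The rule-growing and grouping steps of the procedure above can be sketched in Python for the discrete case. This is only an illustration under stated assumptions: Eq. (6) is not reproduced in the text, so the score used here is the negated class entropy of the subspace selected by an attribute value (larger means purer); continuous attributes and the pruning steps are omitted; and all function names are hypothetical.

```python
import math
from collections import Counter

def class_entropy(labels):
    """Shannon entropy (base 2) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def grow_rule(instance, rows, labels):
    """Greedily grow one rule for `instance`, which is assumed to be a training row.

    Returns (path, label), where path is a tuple of (attribute index, value) pairs.
    """
    idx = list(range(len(rows)))
    unused = set(range(len(instance)))
    path = []
    # Stop when the subspace is pure or every attribute has been used.
    while unused and len({labels[i] for i in idx}) > 1:
        best = None
        for a in sorted(unused):
            sub = [i for i in idx if rows[i][a] == instance[a]]
            if not sub:
                continue
            # Assumed score: negated entropy of the class label in the subspace.
            score = -class_entropy([labels[i] for i in sub])
            if best is None or score > best[0]:
                best = (score, a, sub)
        if best is None:
            break
        _, attr, idx = best
        path.append((attr, instance[attr]))
        unused.remove(attr)
    majority = Counter(labels[i] for i in idx).most_common(1)[0][0]
    return tuple(path), majority

def build_forest(rows, labels):
    """Group the per-instance rules by their root node; each group forms one tree."""
    forest = {}
    for instance in rows:
        path, label = grow_rule(instance, rows, labels)
        root = path[0] if path else None
        forest.setdefault(root, set()).add((path, label))
    return forest
```

On a toy training set in which the first attribute alone determines the class, every rule has length one and the forest contains one group per value of that attribute.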
The continuous attributes in a mixed-mode subset have to be discretized in step 3. According to the post-discretization strategy [8], the boundary of continuous attribute $X_i$ can be decided based on information gain:
where $S$ is the sorted sequence of the attribute values, $N$ is the number of instances in set $S$, $\Delta(X_i, B; S) = \log_2(3^k - 2) - [k\,\mathrm{Ent}(S) - k_1\,\mathrm{Ent}(S_1) - k_2\,\mathrm{Ent}(S_2)]$, and $\{S_1, S_2\}$ are any given adjacent partitions. $k_i$ is the number of class labels represented in set $S_i$.
To evaluate the performance of submodel integration of the decision forest, we conducted an empirical study on 12 data sets from the UCI machine learning repository and compared it with C4.5 release 8. Each data set consists of a set of classified instances described in terms of continuous or discrete attributes. Since the essence of submodel integration can be considered as partial leave-one-out validation, we also applied it to C4.5 release 8. Figure 1 summarizes the experimental results, which clearly show the superior generalization accuracy of submodel integration.
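The boundary test used for post-discretization can be sketched in Python. The acceptance inequality itself is not reproduced in the text, so this sketch assumes the MDL criterion of Fayyad and Irani, $\mathrm{Gain}(X_i, B; S) > \log_2(N-1)/N + \Delta(X_i, B; S)/N$, which uses the same $N$ and $\Delta$ as defined above. The function names and the choice of the midpoint as the reported boundary are hypothetical.

```python
import math
from collections import Counter

def ent(labels):
    """Class entropy Ent(S) in bits for a non-empty list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mdl_accepts(left, right):
    """Assumed MDL test: accept the split of S into S1 = left and S2 = right."""
    s = left + right
    n = len(s)
    k, k1, k2 = len(set(s)), len(set(left)), len(set(right))
    gain = ent(s) - (len(left) / n) * ent(left) - (len(right) / n) * ent(right)
    delta = math.log2(3 ** k - 2) - (k * ent(s) - k1 * ent(left) - k2 * ent(right))
    return gain > (math.log2(n - 1) + delta) / n

def best_boundary(values, labels):
    """Return the accepted cut point with minimal weighted entropy, or None."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # a boundary cannot separate equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        cost = (len(left) * ent(left) + len(right) * ent(right)) / n
        if best is None or cost < best[0]:
            best = (cost, (pairs[i - 1][0] + pairs[i][0]) / 2, left, right)
    if best is None or not mdl_accepts(best[2], best[3]):
        return None
    return best[1]
```

A boundary that cleanly separates two classes is accepted, whereas a boundary between two equally mixed halves yields zero gain and is rejected, so the attribute would be left undivided at that point.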