Performance of Robust Linear Classifier with Multivariate Binary Variables

This paper focuses on robust classification procedures in two-group discriminant analysis with multivariate binary variables. A normal-distribution-based data set is generated using the R statistical software, version 2.15.3. Using Bartlett's approximation to chi-square, the data set was found to be homogeneous and was subjected to five linear classifiers, namely: the maximum likelihood discriminant function, Fisher's linear discriminant function, the likelihood ratio function, the full multinomial function and the nearest neighbour function rule. To judge the performance of these procedures, the apparent error rates for each procedure are obtained for different sample sizes. The results obtained ranked the procedures as follows: Fisher's linear discriminant function, maximum likelihood, full multinomial, likelihood function and nearest neighbour function.


Introduction
Over the years, a considerable body of research has accumulated on classification analysis, with its usefulness demonstrated in various fields, including engineering, medical and social sciences, economics, marketing, finance and management (Anderson 1972, McLachlan 1992, Joachimsthaler and Stam 1988, 1990, Ragsdale and Stam 1992, Huberty 1994, Onyeagu 2003, Okonkwo 2011, Ekezie 2012, Egbo, Onyeagu and Ekezie 2014). Most of the research in classification analysis is based on statistical methods (Dillon and Goldstein 1978, Hand 1981, McLachlan 1992, Onyeagu 2003). However, the classification performance of existing parametric and nonparametric statistical methods has not been fully satisfactory. For instance, it is well documented that parametric statistical methods such as Fisher's linear discriminant function (LDF) (Fisher 1936) and Smith's quadratic discriminant function (QDF) (Smith 1947) may yield poor classification results if the assumption of multivariate normally distributed attributes is violated to a significant extent (McLachlan 1992, Huberty 1994).
A number of the statistical classification methods are based on distance measures. Some involve probability density functions and variance-covariance matrices and have a Bayes decision-theoretic probabilistic interpretation, while others have a geometric interpretation only. An example of a distance-based measure is the Euclidean distance measure, which has a geometric interpretation. If the attribute variables are independent with equal variances, the Euclidean distance measure is equivalent, up to a scale factor, to the Mahalanobis distance, with the usual probabilistic interpretation. However, if the variables are correlated, the Euclidean measure does not have a probabilistic justification, as it does not involve any function of the probability density functions.
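The relationship between the two distance measures can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the two points and both covariance matrices are made-up illustrations, not values from this study. With an identity covariance matrix (uncorrelated attributes with unit variance) the two distances coincide, whereas with correlated attributes they differ.

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance between two attribute vectors."""
    diff = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(diff @ diff))

def mahalanobis(x, y, cov):
    """Mahalanobis distance between x and y under covariance matrix cov."""
    diff = np.asarray(x, float) - np.asarray(y, float)
    # solve(cov, diff) computes cov^{-1} diff without forming the inverse.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# Hypothetical points and covariance matrices, for illustration only.
x, y = [1.0, 0.0], [0.0, 1.0]
identity_cov = np.eye(2)
correlated_cov = np.array([[1.0, 0.8], [0.8, 1.0]])

d_euclid = euclidean(x, y)
d_mahal_indep = mahalanobis(x, y, identity_cov)
d_mahal_corr = mahalanobis(x, y, correlated_cov)
print(d_euclid, d_mahal_indep, d_mahal_corr)
```

In this example the correlated case gives a much larger Mahalanobis distance because the two points differ along the direction in which the attributes vary least.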
In this paper, we focus on two-group classification problems with binary attribute variables. There are numerous real-life binary variable classification problems, e.g. in the field of medical disease diagnosis, where the medical condition of a patient is evaluated on the basis of the presence or absence of relevant symptoms. The multivariate distribution of binary attributes is clearly non-normal, and it appears promising to analyze such problems using statistical discriminant approaches. Statistical classification methods either minimize some function of the undesirable distances of the training sample observations from the separating surface or minimize the number of misclassified observations in the training data directly. Many applications, such as character recognition, decision-making and disease diagnosis, can be viewed as extensions of the classification problem (Hen and Kamber 2001). A classification instrument can be modeled using different structures such as decision graphs, decision trees, neural networks and rules. Reducing the processing time and increasing the classification rate are the two main issues in the classification problem.
We consider a classical problem of discriminant analysis: an individual is to be allocated to one of $k$ distinct classes $w_1, \ldots, w_k$, whose members are described by an $r$-component vector of binary variables $X = (x_1, x_2, \ldots, x_r)$. These binary variables can be viewed equivalently as a single multinomial variable having $S = 2^r$ states. The problem of classification is that of assigning item(s) into one of $k$, $k \ge 2$, known populations, assuming that the items actually belong to one of the populations.
Suppose only two populations, $\pi_1$ and $\pi_2$, are admitted, each with an infinite number of individual objects. Let there be $r$ characteristics of interest with corresponding measurement variables $X_1, X_2, \ldots, X_r$, $r \ge 1$. Let the response vector of individual objects in $\pi_1$ be $X_1 = (X_{11}, X_{12}, \ldots, X_{1r})'$ and in $\pi_2$ be $X_2 = (X_{21}, X_{22}, \ldots, X_{2r})'$. Suppose we find an object 0 with measurement vector $X_0 = (X_{01}, X_{02}, \ldots, X_{0r})'$ outside $\pi_1$ and $\pi_2$. The problem is how to classify 0 into $\pi_1$ or $\pi_2$ in an optimum fashion. The measurement vector $X$ can be discrete, continuous, or a mixture of discrete and continuous variables. In this study, our interest is in $X$ whose arguments are discrete.
In this inferential setting, the researcher can commit one of the following errors: an object from $\pi_1$ may be misclassified into $\pi_2$, or an object from $\pi_2$ may be misclassified into $\pi_1$. If misclassification occurs, a loss is incurred. Let $c(i \mid j)$ be the cost of misclassifying an object from $\pi_j$ into $\pi_i$. The objective of the study is to find the "best" classification rule, where "best" means the rule that minimizes the expected cost of misclassification (ECM). Such a rule is referred to as the optimal classification rule (OCR). In this study we want to find the OCR where $X$ is discrete and, to be more precise, Bernoulli.
Whereas classification rules with optimal properties for discriminant problems with multivariate normally distributed attribute variables are well known (Wald 1944, 1949; Smith 1947; Adebanji, Adeyemi and Iyaniwura 2008; Oludare 2011), alternative rules may be more appropriate if some of the attributes are skewed. Most of the studies that compared non-normal classification methods with normality-based methods under various data conditions have assumed equal misclassification costs across groups. Hence, it is not clear to what extent the conclusions in these studies can be generalized to typical problems with skewed distributions and unequal misclassification costs across groups. The purpose of the current study is to establish guidelines for choosing an appropriate classification method if the problem at hand is characterized by Bernoulli multivariate data. To achieve this objective, several Monte Carlo simulation experiments are conducted to compare the performance of some traditional classification methods designed specifically to handle problems with Bernoulli multivariate data. This study is limited to the two-group classification problem.
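For reference, the expected cost of misclassification for the two-group problem is commonly written in the following textbook form. The notation is assumed here rather than taken from the text: $q_1$ and $q_2$ are the prior probabilities of the two populations, $f_1$ and $f_2$ are their probability mass functions, and $P(i \mid j)$ is the probability that an item from population $j$ is assigned to population $i$.

```latex
\mathrm{ECM} = c(2 \mid 1)\, P(2 \mid 1)\, q_1 + c(1 \mid 2)\, P(1 \mid 2)\, q_2 ,
\qquad
\text{assign } x \text{ to population 1 if }
\frac{f_1(x)}{f_2(x)} \ge \frac{c(1 \mid 2)}{c(2 \mid 1)} \cdot \frac{q_2}{q_1},
\text{ and to population 2 otherwise.}
```

With equal costs and equal priors the threshold on the right-hand side is 1, so the rule reduces to comparing the two likelihoods directly.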

Maximum Likelihood Rule (ML rule)
The maximum likelihood discriminant rule for allocating an observation $x$ to one of the populations $\pi_1, \ldots, \pi_n$ is to allocate $x$ to the population which gives the largest likelihood to $x$. That is, with $L_i(x)$ denoting the likelihood of $x$ under $\pi_i$, the maximum likelihood rule for two populations says one should allocate $x$ to $\pi_1$ if $L_1(x) > L_2(x)$ and to $\pi_2$ otherwise.
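As an illustration of allocating an observation to the population with the larger likelihood, here is a minimal sketch for binary data. It assumes the binary variables are independent within each population, which the text does not state, and the marginal probabilities `p1` and `p2` are hypothetical values strictly between 0 and 1.

```python
import math

def bernoulli_log_likelihood(x, p):
    """Log-likelihood of binary vector x under independent Bernoulli(p_j) variables.

    Assumes every p_j lies strictly between 0 and 1 so the logarithms are finite.
    """
    return sum(math.log(pj if xj == 1 else 1.0 - pj) for xj, pj in zip(x, p))

def ml_classify(x, p1, p2):
    """Return 1 if x is more likely under population 1, otherwise 2."""
    if bernoulli_log_likelihood(x, p1) > bernoulli_log_likelihood(x, p2):
        return 1
    return 2

# Hypothetical marginal probabilities for three binary variables.
p1 = [0.7, 0.7, 0.7]
p2 = [0.3, 0.3, 0.3]

label_a = ml_classify([1, 1, 0], p1, p2)
label_b = ml_classify([0, 0, 1], p1, p2)
print(label_a, label_b)
```

Working with log-likelihoods rather than raw products avoids numerical underflow when the number of variables is large.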

The Fisher's Linear Discriminant Function (FLDF rule)
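The linear discriminant function for discrete variables is given by

$$\sum_{j=1}^{r} \sum_{k=1}^{r} s^{kj} (\hat{p}_{1k} - \hat{p}_{2k})\, x_j \;-\; \frac{1}{2} \sum_{j=1}^{r} \sum_{k=1}^{r} s^{kj} (\hat{p}_{1k} - \hat{p}_{2k}) (\hat{p}_{1j} + \hat{p}_{2j}) \qquad (2.1)$$

where $s^{kj}$ are the elements of the inverse of the pooled sample covariance matrix, and $\hat{p}_{1j}$ and $\hat{p}_{2j}$ are the elements of the sample means in $\pi_1$ and $\pi_2$ respectively. The classification rule obtained using this estimation is: classify an item with response pattern $x$ into $\pi_1$ if (2.1) is nonnegative and to $\pi_2$ otherwise.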
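The sample linear discriminant function can be sketched in matrix form as follows. This is an illustrative implementation assuming NumPy, not the code used in the study; the training samples are made up, and the use of a pseudo-inverse is an added safeguard for the case where the pooled covariance matrix of binary data is singular.

```python
import numpy as np

def fit_fisher_ldf(sample1, sample2):
    """Return (weights, threshold) of the sample linear discriminant function."""
    x1 = np.asarray(sample1, float)
    x2 = np.asarray(sample2, float)
    n1, n2 = len(x1), len(x2)
    mean1, mean2 = x1.mean(axis=0), x2.mean(axis=0)
    # Pooled sample covariance matrix with n1 + n2 - 2 degrees of freedom.
    scatter = (x1 - mean1).T @ (x1 - mean1) + (x2 - mean2).T @ (x2 - mean2)
    pooled = scatter / (n1 + n2 - 2)
    # pinv is an assumption of this sketch: it tolerates a singular pooled matrix.
    weights = np.linalg.pinv(pooled) @ (mean1 - mean2)
    threshold = 0.5 * float(weights @ (mean1 + mean2))
    return weights, threshold

def fisher_classify(x, weights, threshold):
    """Assign x to population 1 if the discriminant score reaches the threshold."""
    return 1 if float(np.asarray(x, float) @ weights) >= threshold else 2

# Hypothetical training samples of binary response patterns.
sample1 = [[1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 1, 1]]
sample2 = [[0, 0, 1], [0, 1, 0], [0, 0, 0], [1, 0, 0]]
weights, threshold = fit_fisher_ldf(sample1, sample2)
print(fisher_classify([1, 1, 1], weights, threshold))
print(fisher_classify([0, 0, 0], weights, threshold))
```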

The Likelihood Function Rule (LF rule)
Consider the generalized likelihood ratio test for the hypothesis $H_0$: $x, x_{11}, \ldots, x_{1n_1} \sim f_1(x)$ and $x_{21}, \ldots, x_{2n_2} \sim f_2(x)$ against $H_1$: $x_{11}, \ldots, x_{1n_1} \sim f_1(x)$ and $x, x_{21}, \ldots, x_{2n_2} \sim f_2(x)$. This approach was proposed by Anderson (1982), and Pires and Branco (2004) and Onyeagu et al. (2013) found that the likelihood ratio criterion also handles the problem of zero frequency. For the multinomial model, they proposed a test statistic $L(x)$ that is a function of $x$. This rule fails to take account of several factors that may be important in practice, namely the differential prior probabilities of observing individuals from the two populations and the differential costs incurred by misclassification. Ignoring costs and a priori probabilities, and including the cases $n_1(x) = 0$ and $n_2(x) = 0$, the classification rule becomes: classify an item with response pattern $x$ into $\pi_1$ if $L(x) > 1$ and into $\pi_2$ if $L(x) < 1$. For $n_1 = n_2$, this rule falls back to the full multinomial rule. The LF rule also solves the zero frequency problem. A new observation $x$ with $n_1(x) = 0$ will be classified in $\pi_1$ ...
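The likelihood ratio for the multinomial model can be sketched as follows. The expression used here is derived for this illustration by maximizing the multinomial likelihood under each hypothesis, with the new observation added to the first sample under the null hypothesis and to the second sample under the alternative; it is an assumption of this sketch that it matches the statistic used by the cited authors. Writing $g(m) = (m+1)\log(m+1) - m\log m$ with the convention $0 \log 0 = 0$, the log ratio is $g(n_1(x)) - g(n_2(x)) - g(n_1) + g(n_2)$.

```python
import math

def _g(m):
    """(m + 1) log(m + 1) - m log(m), using the convention 0 log 0 = 0."""
    return (m + 1) * math.log(m + 1) - (m * math.log(m) if m > 0 else 0.0)

def log_likelihood_ratio(n1x, n2x, n1, n2):
    """Log of the derived ratio for a pattern seen n1x and n2x times in samples of size n1 and n2."""
    return _g(n1x) - _g(n2x) - _g(n1) + _g(n2)

def lf_classify(n1x, n2x, n1, n2):
    """Return 1 if L(x) > 1, 2 if L(x) < 1, and 0 when the ratio equals 1."""
    log_l = log_likelihood_ratio(n1x, n2x, n1, n2)
    if log_l > 0:
        return 1
    if log_l < 0:
        return 2
    return 0

# Hypothetical cell counts, for illustration only.
print(lf_classify(5, 3, 20, 20))    # equal sample sizes: the larger cell count wins
print(lf_classify(0, 1, 10, 100))   # a zero frequency with unequal sample sizes
```

With equal sample sizes the last two terms cancel and the comparison depends only on the two cell counts, which agrees with the statement that the rule falls back to the full multinomial rule when $n_1 = n_2$. The second call shows that a zero cell count does not by itself force the allocation, because the sample sizes enter the ratio.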

Full Multinomial Function Rule (FMF rule)
Suppose we have a $d$-dimensional random vector $X = (X_1, \ldots, X_d)'$ of binary variables, observed in two populations $\pi_1$ and $\pi_2$ with a priori probabilities $q_1$ and $q_2$. The two-group problem attempts to find an optimal classification rule that assigns a new observation $x$ to $\pi_1$ or $\pi_2$. Let $n_i(x)$ be the number of individuals in a sample of size $n_i$ from population $\pi_i$ having response pattern $x$. The classification rule is: classify an item with response pattern $x$ into $\pi_1$ if $q_1 n_1(x)/n_1 > q_2 n_2(x)/n_2$, to $\pi_2$ if $q_1 n_1(x)/n_1 < q_2 n_2(x)/n_2$, and with probability 1/2 to either population if the two quantities are equal. The full multinomial rule is simple to apply, and the computation of the apparent error does not require a rigorous computational formula. However, Pires and Branco (2004) noted, as pointed out by Dillon and Goldstein (1978), that one of the undesirable properties of the full multinomial rule is the way it treats zero frequencies: if $n_1(x) = 0$ and $n_2(x) \ne 0$, a new observation with vector $x$ will be allocated to $\pi_2$, irrespective of the sample sizes $n_1$ and $n_2$.
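The cell-counting nature of the full multinomial rule can be sketched in a few lines. The function names, the prior arguments `q1` and `q2`, and the small samples below are hypothetical choices for this illustration, and a tie is reported as 0 rather than resolved at random.

```python
from collections import Counter

def fit_full_multinomial(sample1, sample2):
    """Count how often each response pattern occurs in each training sample."""
    counts1 = Counter(tuple(x) for x in sample1)
    counts2 = Counter(tuple(x) for x in sample2)
    return counts1, len(sample1), counts2, len(sample2)

def fmf_classify(x, model, q1=0.5, q2=0.5):
    """Compare prior-weighted relative cell frequencies; return 0 on a tie."""
    counts1, n1, counts2, n2 = model
    score1 = q1 * counts1[tuple(x)] / n1   # a Counter returns 0 for unseen patterns
    score2 = q2 * counts2[tuple(x)] / n2
    if score1 > score2:
        return 1
    if score1 < score2:
        return 2
    return 0

# Hypothetical training samples: pattern (1, 0) never occurs in sample 1.
sample1 = [(1, 1)] * 6 + [(0, 1)] * 4
sample2 = [(0, 0)] * 60 + [(1, 1)] * 39 + [(1, 0)] * 1
model = fit_full_multinomial(sample1, sample2)

print(fmf_classify((1, 1), model))
print(fmf_classify((1, 0), model))
```

The second call illustrates the zero-frequency behaviour: a single occurrence of the pattern among 100 items of the second sample is enough to allocate it to the second population, even though the first sample has only 10 items.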

Nearest Neighbour Function Rule (NNF rule)
Hills (1967) introduced perhaps the simplest nearest neighbour estimator for binary data, which classifies a particular response vector $x$ based on the number of cells in response vectors $y$ that differ from $x$. Specifically, let $k$ be the number of cells in which $x$ and $y$ differ. Then define $R_x = \{y : (x - y)'(x - y) \le k\}$ to be the set of response vectors that differ from $x$ in no more than $k$ components. That is, classify $x$ into $\pi_1$ if $\sum_{y \in R_x} n_1(y)/n_1 > \sum_{y \in R_x} n_2(y)/n_2$ and into $\pi_2$ otherwise.
For example, with $d = 3$ and $x = (111)$, the neighbours of order $k = 1$ are $R_{111} = \{110, 101, 011\}$. Note that $k = 0$ reduces to the full multinomial model. In practice, one simply needs to construct the table of frequencies for all possible patterns of $x$ and use a counting procedure over the set $R_j$ to form the sample-based likelihood ratio for classification purposes. If the cell count for the $j$th cell is $n_{ij}$, then the nearest neighbour procedure assigns the observation to $\pi_1$ if $\sum_{l \in A} n_{1l}/n_1 > \sum_{l \in A} n_{2l}/n_2$, where $A$ is the set of neighbours of state $j$. Hills comments that the estimate of the likelihood ratio has less sampling variability than the simple method using cell frequencies.
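The counting procedure over a neighbourhood can be sketched as follows. This illustration assumes that the neighbourhood of order $k$ contains the pattern itself together with all patterns differing from it in at most $k$ components, so that $k = 0$ reduces to the full multinomial rule; the function names and samples are hypothetical.

```python
from collections import Counter
from itertools import product

def hamming(x, y):
    """Number of components in which two binary patterns differ."""
    return sum(a != b for a, b in zip(x, y))

def neighbourhood(x, k):
    """All binary patterns within Hamming distance k of x, including x itself."""
    x = tuple(x)
    return [y for y in product((0, 1), repeat=len(x)) if hamming(x, y) <= k]

def nn_classify(x, sample1, sample2, k=1):
    """Compare pooled relative frequencies over the neighbourhood of x."""
    counts1 = Counter(tuple(s) for s in sample1)
    counts2 = Counter(tuple(s) for s in sample2)
    region = neighbourhood(x, k)
    score1 = sum(counts1[y] for y in region) / len(sample1)
    score2 = sum(counts2[y] for y in region) / len(sample2)
    return 1 if score1 > score2 else 2

# Hypothetical samples in which the pattern (1, 1, 1) itself was never observed.
sample1 = [(1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 0)]
sample2 = [(0, 0, 0), (0, 0, 1), (1, 0, 0), (0, 1, 1)]

print(neighbourhood((1, 1, 1), 1))
print(nn_classify((1, 1, 1), sample1, sample2, k=1))
```

Pooling counts over neighbouring cells lets the rule classify patterns that were never observed, which a rule based on single cell frequencies cannot do.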

Testing Adequacy of Discriminant Coefficient
Consider the discriminant problem between two multinomial populations with means $\mu_1$ and $\mu_2$. A test on the coefficients $a$ of the sample MLDF has been proposed by Rao (1965). Under the null hypothesis $H_0$, the test statistic follows a known distribution, and we reject $H_0$ for large values of this statistic.

Evaluation of Classification Functions
One important way of judging the performance of any classification procedure is to calculate the error rates or misclassification probabilities (Richard and Dean, 1988). When the forms of the parent populations are known completely, misclassification probabilities can be calculated with relative ease. Because parent populations are rarely known, we shall concentrate on the error rates associated with the sample classification functions. Once this classification function is constructed, a measure of its performance in future samples is of interest. The total probability of misclassification (TPM) is given as

$$\mathrm{TPM} = q_1 \sum_{x \in R_2} f_1(x) + q_2 \sum_{x \in R_1} f_2(x),$$

where $q_i$ is the a priori probability of $\pi_i$, $f_i(x)$ is its probability mass function, and $R_1$ and $R_2$ are the regions in which an item is assigned to $\pi_1$ and $\pi_2$ respectively. The smallest value of this quantity obtained by a judicious choice of $R_1$ and $R_2$ is called the optimum error rate (OER).
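When the populations are fully specified, the total probability of misclassification of any rule can be computed by enumerating all binary response patterns. The sketch below assumes independent Bernoulli variables within each population and uses hypothetical marginal probabilities and equal priors; the optimum error rate is obtained by assigning each pattern to the population with the larger prior-weighted probability.

```python
import math
from itertools import product

def bernoulli_pmf(x, p):
    """Probability of binary pattern x under independent Bernoulli(p_j) variables."""
    return math.prod(pj if xj == 1 else 1.0 - pj for xj, pj in zip(x, p))

def total_probability_of_misclassification(rule, p1, p2, q1=0.5, q2=0.5):
    """Sum the prior-weighted probability of every pattern the rule gets wrong."""
    tpm = 0.0
    for x in product((0, 1), repeat=len(p1)):
        if rule(x) == 1:
            tpm += q2 * bernoulli_pmf(x, p2)   # items of population 2 sent to 1
        else:
            tpm += q1 * bernoulli_pmf(x, p1)   # items of population 1 sent to 2
    return tpm

def optimal_rule(p1, p2, q1=0.5, q2=0.5):
    """Rule that assigns each pattern to the larger prior-weighted probability."""
    return lambda x: 1 if q1 * bernoulli_pmf(x, p1) >= q2 * bernoulli_pmf(x, p2) else 2

# Hypothetical marginal probabilities for two binary variables.
p1, p2 = [0.7, 0.7], [0.3, 0.3]
oer = total_probability_of_misclassification(optimal_rule(p1, p2), p1, p2)
always_one = total_probability_of_misclassification(lambda x: 1, p1, p2)
print(oer, always_one)
```

Enumeration is feasible only for a small number of variables, since the number of patterns doubles with each added variable.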

Probability of Misclassification
In constructing a procedure of classification, it is desired to minimize on the average the bad effects of misclassification (Onyeagu 2003, Richard and Dean 1988, Oludare 2011). Suppose we have an item with response pattern $x$ from either $\pi_1$ or $\pi_2$. We think of an item as a point in an $r$-dimensional space. We partition the space $R$ into two regions $R_1$ and $R_2$, and classify an item into $\pi_1$ if $x$ falls in $R_1$ and into $\pi_2$ if $x$ falls in $R_2$. The researcher can classify an item from $\pi_1$ as coming from $\pi_2$. Also the researcher can classify an item from $\pi_2$ as coming from $\pi_1$. We need to know the relative undesirability of these two kinds of errors in classification. Let the a priori probability that an observation comes from $\pi_1$ be $q_1$, and from $\pi_2$ be $q_2$. Let the probability mass function of $\pi_1$ be $f_1(x)$ and that of $\pi_2$ be $f_2(x)$. Then the probability of correctly classifying an observation that is actually from $\pi_1$ is $P(1 \mid 1) = \sum_{x \in R_1} f_1(x)$, and the probability of misclassifying such an observation into $\pi_2$ is $P(2 \mid 1) = \sum_{x \in R_2} f_1(x)$. Similarly, the probability of correctly classifying an observation from $\pi_2$ is $P(2 \mid 2) = \sum_{x \in R_2} f_2(x)$, and the probability of misclassifying it into $\pi_1$ is $P(1 \mid 2) = \sum_{x \in R_1} f_2(x)$. The total probability of misclassification using the rule is $q_1 P(2 \mid 1) + q_2 P(1 \mid 2)$.
In order to determine the performance of a classification rule $R$ in the classification of future items, we compute the total probability of misclassification, known as the error rate. Lachenbruch (1975) defined the following types of error rates.
(i) Error rate for the optimum classification rule, $R_{opt}$: When the parameters of the distributions are known, the error rate is $q_1 \sum_{x \in R_2} f_1(x) + q_2 \sum_{x \in R_1} f_2(x)$, evaluated at the regions of the optimum rule, which is optimum for this distribution.
(ii) Actual error rate: The error rate for the classification rule as it will perform in future samples.
(iii) Expected actual error rate: The expected error rate for classification rules based on samples of size $n_1$ from $\pi_1$ and $n_2$ from $\pi_2$.
(iv) The plug-in estimate of error rate, obtained by using the estimated parameters for $\pi_1$ and $\pi_2$.
(v) The apparent error rate: This is defined as the fraction of items in the initial sample which is misclassified by the classification rule. If $n_{1M}$ of the $n_1$ items from $\pi_1$ and $n_{2M}$ of the $n_2$ items from $\pi_2$ are misclassified, the table of actual against predicted group membership is called the confusion matrix and the apparent error rate is given by $(n_{1M} + n_{2M})/(n_1 + n_2)$.
Hills (1967) called the second error rate the actual error rate and the third the expected actual error rate. Hills showed that the actual error rate is greater than the optimum error rate, which in turn is greater than the expectation of the plug-in estimate of the error rate. Martin and Bradley (1972) proved a similar inequality. An algebraic expression for the exact bias of the apparent error rate of the sample multinomial discriminant rule was obtained by Goldstein and Wolf (1977), who tabulated it under various combinations of the sample sizes $n_1$ and $n_2$, the number of multinomial cells and the cell probabilities. Their results demonstrated that the bound described above is generally loose.
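The apparent error rate can be computed directly from a two-by-two table of actual against predicted group labels. The following is a minimal sketch; the label vectors are made up for illustration.

```python
def confusion_matrix(true_labels, predicted_labels):
    """Count items by (actual group, predicted group) for groups 1 and 2."""
    matrix = {(actual, predicted): 0 for actual in (1, 2) for predicted in (1, 2)}
    for actual, predicted in zip(true_labels, predicted_labels):
        matrix[(actual, predicted)] += 1
    return matrix

def apparent_error_rate(matrix):
    """Fraction of training items that fall in the off-diagonal cells."""
    total = sum(matrix.values())
    return (matrix[(1, 2)] + matrix[(2, 1)]) / total

# Hypothetical actual and predicted group labels for eight training items.
true_labels = [1, 1, 1, 1, 2, 2, 2, 2]
predicted_labels = [1, 1, 2, 1, 2, 1, 2, 2]

matrix = confusion_matrix(true_labels, predicted_labels)
aper = apparent_error_rate(matrix)
print(matrix, aper)
```

Because the same items are used to build and to evaluate the rule, this estimate tends to be optimistic relative to the error rate on future samples.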


(i) Samples of sizes $n_1$ and $n_2$ are generated from $\pi_1$ and $\pi_2$. These samples are used to construct the rule for each procedure, and an estimate of the probability of misclassification for each procedure is obtained by the plug-in rule or the confusion matrix in the sense of the full multinomial.
(ii) The likelihood ratios are used to define classification rules. The plug-in estimates of error rates are determined for each of the classification rules.
(iii) Steps (i) and (ii) are repeated 1000 times and the mean plug-in errors and variances for the 1000 trials are recorded. The method of estimation used here is called the resubstitution method.
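The repeated-sampling procedure in the steps above can be sketched as follows. This is an illustration only, not the code used in the study: it assumes NumPy, generates independent Bernoulli variables with hypothetical marginal probabilities, uses the full multinomial rule as the single example procedure, and evaluates it by resubstitution on the same samples used to build it.

```python
from collections import Counter
import numpy as np

def apparent_error_fmf(sample1, sample2):
    """Resubstitution error of the full multinomial rule with equal priors."""
    counts1 = Counter(map(tuple, sample1))
    counts2 = Counter(map(tuple, sample2))
    n1, n2 = len(sample1), len(sample2)
    errors = 0.0
    for pattern in set(counts1) | set(counts2):
        rel1, rel2 = counts1[pattern] / n1, counts2[pattern] / n2
        if rel1 > rel2:
            errors += counts2[pattern]        # items of group 2 assigned to 1
        elif rel1 < rel2:
            errors += counts1[pattern]        # items of group 1 assigned to 2
        else:
            # Tie: expected errors under a random assignment with probability 1/2.
            errors += 0.5 * (counts1[pattern] + counts2[pattern])
    return errors / (n1 + n2)

def simulate(p1, p2, n1, n2, reps=1000, seed=0):
    """Repeat sampling and evaluation; return mean and variance of the error."""
    rng = np.random.default_rng(seed)
    rates = np.empty(reps)
    for i in range(reps):
        sample1 = rng.binomial(1, p1, size=(n1, len(p1))).tolist()
        sample2 = rng.binomial(1, p2, size=(n2, len(p2))).tolist()
        rates[i] = apparent_error_fmf(sample1, sample2)
    return float(rates.mean()), float(rates.var(ddof=1))

# Hypothetical marginal probabilities for four binary variables.
mean_rate, var_rate = simulate([0.7] * 4, [0.3] * 4, n1=20, n2=20, reps=200, seed=1)
print(mean_rate, var_rate)
```

Fixing the seed makes a run reproducible, and the other classification rules could be evaluated on the same generated samples so that their error rates are directly comparable.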
The following table displays one of the results obtained. In Tables 3.1(a) and (b), the mean apparent error rates increase with the sample size for all the classification rules except the nearest neighbour rule, where the mean apparent error rates decrease as the sample size increases. The actual error rates decrease as the sample sizes increase. In terms of performance, Fisher's linear discriminant function ranked first, followed by the maximum likelihood function rule, the full multinomial function rule and the nearest neighbour rule, with the likelihood ratio function last.

Classification Rule                            Performance/Rank
Fisher linear discriminant function rule       1
Maximum likelihood                             2
Full Multinomial function rule                 3
Nearest Neighbour function rule                4
Likelihood function rule                       5

Conclusion/Recommendation
We considered eight population pairs for the case of four variables. On average, Fisher's linear discriminant function rule was the best in terms of estimating the probability of misclassification because it gives values closer to the actual probability of misclassification. Next is the maximum likelihood function rule, which was better than the full multinomial function rule. The fourth is the nearest neighbour rule, while the likelihood ratio rule occupied the last position. In this study, in addition to the mean structures characterized by the marginal probabilities $P_1$ and $P_2$, we considered structures determined by the difference $d$. It was observed that as $d$ increases from 0.1 to 0.4, the accuracy of the procedures also increases. It is important to note that Fisher's linear discriminant function (FLDF), the maximum likelihood function rule and the full multinomial function rule also performed very well in situations where $d = 0.2$ in the three variables. It was also observed that the greater the number of variables, the lower the probability of misclassification; that is, accuracy increases with an increasing number of variables.
Fisher's linear discriminant function outperformed the other classification rules. From the analysis carried out, the procedures can be ranked as follows: Fisher's linear discriminant function rule, maximum likelihood function rule, full multinomial function rule, nearest neighbour rule and likelihood function rule. Secondly, we conclude that it is better to increase the number of variables because accuracy increases with an increasing number of variables. We recommend that the work be extended to the area of multiple-group discrimination and classification.
In equation (2.1), $s^{kj}$ are the elements of the inverse of the pooled sample covariance matrix. Each variable takes one of the two distinct values 0 or 1. The sample space then has a multinomial distribution consisting of the $2^d$ possible states. Given two disjoint populations $\pi_1$ and $\pi_2$, $n = n_1 + n_2$ represents the total number of sample observations. The full multinomial model estimates the class-conditional densities by $\hat{f}_i(x) = n_i(x)/n_i$. A simulation experiment which generates the data and evaluates the procedures is now described.

Table 3.1(a). Effect of input parameters $P_1$ and $P_2$ on classification rules at various values of sample size and replications (mean apparent error rates)