Strategies for Non-Parametric Smoothing of the Location Model in Mixed-Variable Discriminant Analysis

The non-parametric smoothing of the location model proposed by Asparoukhov and Krzanowski (2000) for allocating objects with mixtures of variables into two groups is studied. The strategy for selecting the smoothing parameter through the maximisation of the pseudo-likelihood function is reviewed. Problems with previous methods are highlighted, and two alternative strategies are proposed. Some investigations into other possible smoothing procedures for estimating cell probabilities are discussed. A leave-one-out method is proposed for constructing the allocation rule and evaluating its performance by estimating the true error rate. Results of a numerical study on simulated data highlight the feasibility of the proposed allocation rule as well as its advantages over previous methods, and an example using real data is presented.


Introduction
Various methods for constructing allocation rules in discriminant analysis with mixtures of variables have been proposed and discussed by researchers. Broadly speaking, there are three possible strategies: (i) transform the variables so that they are all of the same type, and then apply an allocation rule appropriate to this type; (ii) apply separate allocation rules to each type, and then combine the results for an overall classification; (iii) develop a model that pays regard to the separate types, and then derive an allocation rule from this model. Strategy (i) entails possible loss of information (Krzanowski, 1993; Hand, 1997) and strategy (ii) has had limited study (Wernecke, 1992; Xu et al., 1992), but strategy (iii) has received much more widespread attention. Methods available so far include non-parametric kernel and nearest neighbour approaches (reviewed by Silverman and Jones (1989)), semi-parametric methods such as logistic discriminant analysis (Anderson, 1972) and fully parametric methods based on the location model (Chang and Afifi, 1974; Krzanowski, 1975).
The use of the location model in discriminant analysis has been discussed by many researchers (Chang and Afifi, 1974; Krzanowski, 1975, 1980; Daudin, 1986; Knoke, 1982; Vlachonikolis and Marriott, 1982; Titterington et al., 1981; Kiang, 2003). In a recent development, Asparoukhov and Krzanowski (2000) have shown how non-parametric smoothing can be used to estimate the classifier's parameters. This approach is particularly useful for situations with sparse data, where traditional maximum likelihood methods run into problems.
The aim of this paper is to carry out further investigation of non-parametric smoothing of the location model. Attention is focused on classifying objects with continuous and binary variables to one of two groups, but the idea extends without difficulty to more than two groups and to more general categorical variables. Existing methodology is summarised in the second section; problems with previous methods are highlighted and some new ideas are presented in the third section; a Monte Carlo study to investigate the methods is described in the fourth section; and results are presented in the fifth section. We also compare the proposed methods with standard maximum likelihood ones in situations where the latter are possible. An example using real data is then presented, and brief conclusions are given in the final section.

Non-parametric smoothing of the location model
Suppose there are two groups, $\pi_1$ and $\pi_2$, both of which consist of objects with binary and continuous variables. We denote the vector of binary variables as $\mathbf{x}$ and the vector of continuous variables as $\mathbf{y}$, and let the former have q components while the latter has p components, $\mathbf{x}^T = (x_1, \ldots, x_q)$ and $\mathbf{y}^T = (y_1, \ldots, y_p)$. Hence, we may present the vector of variables observed on each object in both groups as $\mathbf{z}^T = (\mathbf{x}^T, \mathbf{y}^T)$.
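Let each binary variable take a value of 0 or 1. Then we can express the q binary variables as a multinomial $\mathbf{s} = \{s_1, \ldots, s_m\}$, where $m = 2^q$, and each distinct pattern of $\mathbf{x}$ defines a multinomial cell uniquely, with $\mathbf{x}$ falling in cell $s = 1 + \sum_{j=1}^{q} x_j 2^{j-1}$. We denote the probability of obtaining an object in cell s of $\pi_i$ (i = 1, 2) by $p_{is}$. Next, we assume the vector of continuous variables to have a multivariate normal distribution with mean $\boldsymbol{\mu}_{is}$ in cell s of $\pi_i$ and a homogeneous dispersion matrix across cells and populations, $\boldsymbol{\Sigma}$, thus $\mathbf{y} \sim N(\boldsymbol{\mu}_{is}, \boldsymbol{\Sigma})$. The joint probability of observing an object in group $\pi_i$ with associated $\mathbf{x}$ and $\mathbf{y}$ is

$$p(\mathbf{x}, \mathbf{y} \mid \pi_i) = \frac{p_{is}}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\tfrac{1}{2} (\mathbf{y} - \boldsymbol{\mu}_{is})^T \boldsymbol{\Sigma}^{-1} (\mathbf{y} - \boldsymbol{\mu}_{is}) \right\}. \qquad (1)$$

The application of this joint probability to the problem of allocating a future object with mixed variables to one of two groups was first studied by Chang and Afifi (1974) and generalised by Krzanowski (1975). By assuming that the costs due to misallocating future objects in both groups are equal and that the covariance structures in both groups are homogeneous, we allocate a future object with the vector of observed variables $\mathbf{z}^T = (\mathbf{x}^T, \mathbf{y}^T)$ to $\pi_1$ if its $\mathbf{x}$ falls in cell s of the multinomial variable and

$$(\boldsymbol{\mu}_{1s} - \boldsymbol{\mu}_{2s})^T \boldsymbol{\Sigma}^{-1} \left\{ \mathbf{y} - \tfrac{1}{2} (\boldsymbol{\mu}_{1s} + \boldsymbol{\mu}_{2s}) \right\} \ge \log\left( \frac{p_{2s}}{p_{1s}} \right), \qquad (2)$$

otherwise to $\pi_2$.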
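As an illustration of the allocation rule, the following Python sketch maps a binary pattern to its multinomial cell and applies the inequality for known parameters. It assumes the standard location-model forms: the cell index s = 1 + sum of x_j 2^(j-1), and allocation to the first group when the linear discriminant score is at least log(p_2s / p_1s). The function names `cell_index` and `allocate`, and the choice to pass a precomputed inverse dispersion matrix, are illustrative assumptions rather than part of the method description.

```python
import math


def cell_index(x):
    """Map a binary vector (x_1, ..., x_q) to its cell number s in 1..2**q."""
    return 1 + sum(xj * 2 ** j for j, xj in enumerate(x))


def allocate(y, mu1, mu2, sigma_inv, p1, p2):
    """Return 1 if the object is allocated to group 1, otherwise 2.

    y, mu1, mu2 : continuous values and the two cell means (length p)
    sigma_inv   : inverse of the common dispersion matrix (p x p nested lists)
    p1, p2      : probabilities of the object's cell in groups 1 and 2
    """
    diff = [a - b for a, b in zip(mu1, mu2)]
    centred = [yi - 0.5 * (a + b) for yi, a, b in zip(y, mu1, mu2)]
    # Score (mu1 - mu2)^T Sigma^{-1} {y - (mu1 + mu2) / 2}
    score = sum(diff[i] * sigma_inv[i][j] * centred[j]
                for i in range(len(y)) for j in range(len(y)))
    return 1 if score >= math.log(p2 / p1) else 2
```

With equal cell probabilities the threshold is zero, so the rule reduces to the usual linear discriminant within the cell; unequal cell probabilities shift the boundary towards the less likely group.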
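Usually, the parameters $\boldsymbol{\mu}_{is}$, $\boldsymbol{\Sigma}$ and $p_{is}$ are unknown and have to be estimated from initial samples, known as training sets. We focus here on estimating these parameters using non-parametric smoothing methods. The mean of the jth continuous variable for cell s of group $\pi_i$ is estimated through

$$\hat{\mu}_{isj} = \left\{ \sum_{k=1}^{m} n_{ik}\, w_{ij}(s,k) \right\}^{-1} \sum_{k=1}^{m} \left\{ w_{ij}(s,k) \sum_{r=1}^{n_{ik}} y_{rijk} \right\} \qquad (3)$$

under conditions $0 \le w_{ij}(s,k) \le 1$ and $\sum_{k=1}^{m} n_{ik}\, w_{ij}(s,k) > 0$, where s, k = 1, . . ., m; i = 1, 2 and j = 1, . . ., p. In this form, $n_{ik}$ is the number of objects of $\pi_i$ that fall in cell k, $y_{rijk}$ is the jth continuous variable value of the rth object falling in cell k of $\pi_i$ and $w_{ij}(s,k)$ is the weight with respect to variable j and cell s of all objects of $\pi_i$ that fall in cell k.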
One may consider any suitable function for the weights, $w_{ij}(s,k)$, but we prefer the exponential function because of its simple form,

$$w_{ij}(s,k) = \lambda_{ij}^{d(s,k)}, \qquad (4)$$

where $0 < \lambda_{ij} < 1$. Here, d(s, k) is the dissimilarity coefficient between the sth cell and the kth cell of binary vectors, measured using the distance function $d(\mathbf{x}_s, \mathbf{x}_k) = (\mathbf{x}_s - \mathbf{x}_k)^T (\mathbf{x}_s - \mathbf{x}_k)$. All cells that have equal dissimilarity with respect to cell s will thus have equal weight in the estimation of cell means.
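A minimal Python sketch of the smoothed cell mean may help fix ideas. It assumes exponential weights of the form lam ** d(s, k), with d(s, k) counting the binary variables on which two cell patterns differ, and it uses one smoothing parameter for a single continuous variable in a single group. The function names and the dictionary-based data layout are hypothetical choices made for the example.

```python
def dissimilarity(xs, xk):
    """Number of binary variables on which two cell patterns differ."""
    return sum((a - b) ** 2 for a, b in zip(xs, xk))


def smoothed_cell_mean(s, patterns, cell_values, lam):
    """Smoothed mean of one continuous variable for cell s in one group.

    patterns    : dict cell -> binary pattern (tuple of 0/1)
    cell_values : dict cell -> list of observed values in that cell
    lam         : smoothing parameter in (0, 1)
    """
    numerator = 0.0
    denominator = 0.0
    for k, values in cell_values.items():
        weight = lam ** dissimilarity(patterns[s], patterns[k])
        numerator += weight * sum(values)
        denominator += weight * len(values)
    if denominator <= 0:
        raise ValueError("no training objects available for smoothing")
    return numerator / denominator
```

Because every non-empty cell contributes, the estimate is defined even for an empty cell s, which is the situation where the unsmoothed cell mean cannot be computed.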
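In any practical application, the degree of smoothing represented by $\lambda_{ij}$ needs to be determined. One possible way of doing this is to select the set of smoothing parameters for continuous variables, $\Lambda = \{\lambda_{ij}\}$, as the values in the interval [0, 1] that maximise the leave-one-out pseudo-likelihood function (Asparoukhov and Krzanowski, 2000)

$$PL(\Lambda \mid D) = \prod_{i=1}^{2} \prod_{s=1}^{m} \prod_{r=1}^{n_{is}} p(\mathbf{y}_{ris} \mid D - \mathbf{z}_{ris}, \Lambda), \qquad (5)$$

where $p(\mathbf{y}_{ris} \mid D - \mathbf{z}_{ris}, \Lambda)$ is the probability density of $\mathbf{y}_{ris}$ if object r falls in cell s of $\pi_i$ and $D - \mathbf{z}_{ris}$ is the training set of $\pi_1$ and $\pi_2$ objects with object r excluded.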
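Next, the smoothed cell means (3) are used in the estimation of the smoothed covariance matrix

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n_1 + n_2 - g_1 - g_2} \sum_{i=1}^{2} \sum_{s=1}^{m} \sum_{r=1}^{n_{is}} (\mathbf{y}_{ris} - \hat{\boldsymbol{\mu}}_{is}) (\mathbf{y}_{ris} - \hat{\boldsymbol{\mu}}_{is})^T, \qquad (6)$$

where $n_i$ is the number of objects of $\pi_i$ and $g_i$ is the number of non-empty cells in the training set from $\pi_i$.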
Finally, Asparoukhov and Krzanowski (2000) employed the adaptive weighted near-neighbour estimators, as originally proposed by Hall (1981), for estimating the cell probabilities. These estimators have the form

$$\hat{p}_{is} = \frac{1}{n_i} \sum_{j} w_{ij}\, n_{ij}(s), \qquad (7)$$

where s, k = 1, . . ., m; i = 1, 2 and $n_{ij}(s)$ is the number of training objects from $\pi_i$ that fall in a cell k that is j binary variables distant from cell s, $d(s,k) = j$. The weights, $w_{ij}$, are chosen to minimise the mean squared error

$$\sum_{s=1}^{m} E(\hat{p}_{is} - p_{is})^2. \qquad (8)$$
The proposed strategies for estimating the parameters are satisfactory, but potential problems may arise in certain situations. The first problem concerns the choice of smoothing parameters by maximising the leave-one-out pseudo-likelihood function (5). This function may be satisfactory for data from normal populations, but less so if the normality assumption is not appropriate. Secondly, the estimation of cell probabilities using the adaptive weighted nearest neighbour method determines the amount of smoothing by computing the weights automatically. It is an easy technique, but Hall (1981) warned that these weights sometimes take negative or zero values. This usually happens when the probabilities are small and the sample is not sufficiently large. These problems are addressed in the rest of this paper, and some alternative methods are investigated.

Parameter estimation
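The sample-based version of allocation rule (2), which assumes equal costs of misclassifying objects in both groups and a homogeneous covariance matrix in the two groups, will allocate a future object, $\mathbf{z}^T = (\mathbf{x}^T, \mathbf{y}^T)$, to $\pi_1$ if it satisfies

$$(\hat{\boldsymbol{\mu}}_{1s} - \hat{\boldsymbol{\mu}}_{2s})^T \hat{\boldsymbol{\Sigma}}^{-1} \left\{ \mathbf{y} - \tfrac{1}{2} (\hat{\boldsymbol{\mu}}_{1s} + \hat{\boldsymbol{\mu}}_{2s}) \right\} \ge \log\left( \frac{\hat{p}_{2s}}{\hat{p}_{1s}} \right) \qquad (9)$$

and otherwise to $\pi_2$, where $\hat{\boldsymbol{\mu}}_{is}$, $\hat{\boldsymbol{\Sigma}}$ and $\hat{p}_{is}$ are the smoothed parameter estimates obtained from the training sets.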
We keep both the smoothed estimators $\hat{\boldsymbol{\mu}}_{is}$ and $\hat{\boldsymbol{\Sigma}}$ as given in equations (3) and (6) respectively. However, we apply a restriction by having a single smoothing parameter, $\lambda$, across all continuous variables and groups. The weights therefore become

$$w(s,k) = \lambda^{d(s,k)}, \qquad (10)$$

and we use this quantity for estimating the cell means (3). The reason for imposing a single smoothing parameter is to reduce the complexity of the model and to make the selection of the smoothing parameter easier to handle.
We overcome the problem of obtaining negative or zero cell probabilities when using the adaptive weighted nearest neighbour (7) by imposing the following restrictions.
(i) All weights, $w_{ij}$, should be between 0 and 1, so that any weight that is less than or equal to 0 is replaced by 0.00001 while any weight that is greater than or equal to 1 is replaced by 0.99999.
(ii) If restriction (i) fails to avoid obtaining a zero probability, then any cell probability that has zero value is replaced by 0.00001.
In addition, we consider two alternative non-parametric smoothing methods for estimating $p_{is}$. These methods are:

3.1.1 The kernel method (Aitchison and Aitken, 1976), which estimates the probability of observing cell s of $\pi_i$ as

$$\hat{p}_{is} = \frac{1}{n_i} \sum_{j=0}^{q} n_{ij}(s)\, \lambda^{q-j} (1 - \lambda)^{j}, \qquad (11)$$

where $n_{ij}(s)$ has the same interpretation as in the adaptive weighted near-neighbour estimator discussed earlier and $\lambda$ is a smoothing parameter.

3.1.2 The modified maximum likelihood estimator in the weighted form, with weights $w(s,k)$:

$$\tilde{p}_{is} = \frac{1}{n_i} \sum_{k=1}^{m} w(s,k)\, n_{ik}. \qquad (12)$$

In order to have the simplest allocation rule, we suggest that the weights follow the exponential function (10), standardised to probabilities in each group by

$$\hat{p}_{is} = \frac{\tilde{p}_{is}}{\sum_{s'=1}^{m} \tilde{p}_{is'}}. \qquad (13)$$

We will term this the exponential method.
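The two alternative cell-probability estimators can be sketched in Python as follows. The kernel function assumes the Aitchison and Aitken form for binary data, in which a training object whose cell differs from s on j of the q binary variables contributes lam ** (q - j) * (1 - lam) ** j. The exponential function assumes weights lam ** d(s, k) applied to the cell counts and then rescaled to sum to one within the group. The names and the dictionary layout are illustrative assumptions.

```python
def differing(xs, xk):
    """Number of binary variables on which two cell patterns differ."""
    return sum(a != b for a, b in zip(xs, xk))


def kernel_probability(s, patterns, counts, lam):
    """Kernel estimate of the probability of cell s in one group.

    patterns : dict cell -> binary pattern (tuple of 0/1)
    counts   : dict cell -> number of training objects in that cell
    lam      : kernel smoothing parameter, taken here in [0.5, 1]
    """
    q = len(patterns[s])
    n = sum(counts.values())
    total = 0.0
    for k, n_k in counts.items():
        j = differing(patterns[s], patterns[k])
        total += n_k * lam ** (q - j) * (1 - lam) ** j
    return total / n


def exponential_probabilities(patterns, counts, lam):
    """Exponentially weighted cell counts, standardised to sum to one."""
    raw = {s: sum(n_k * lam ** differing(patterns[s], patterns[k])
                  for k, n_k in counts.items())
           for s in patterns}
    total = sum(raw.values())
    return {s: value / total for s, value in raw.items()}
```

In this sketch the kernel estimates already sum to one over all cells, whereas the exponentially weighted counts do not, which is why only the latter are rescaled. Both give a positive probability to empty cells whenever 0 < lam < 1.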

Selecting optimised smoothing parameters
Both the kernel and the exponential methods, (11) and (13), as well as the estimation of cell means (3), need a specified value of the smoothing parameter, $\lambda$. We therefore investigate three different strategies for obtaining its value. The $\lambda$ that satisfies the given criterion in each strategy is termed $\lambda_{\mathrm{opt}}$.
The first strategy is to select $\lambda_{\mathrm{opt}}$ through the maximisation of the leave-one-out pseudo-likelihood function. This was successfully applied by Asparoukhov and Krzanowski (2000) and we are interested in studying its feasibility. The second and third strategies choose $\lambda_{\mathrm{opt}}$ as the value that provides the best allocation performance as measured by the error rate and the Brier score respectively. The choice of $\lambda_{\mathrm{opt}}$ based on the error rate was used by Raudys and Jain (1991), who selected the smoothing parameter for the kernel window classifier as the value that provides the smallest classification error. The error rate is computed as the proportion of objects in the sample that are misclassified by the rule.
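The error rate takes discrete values, so an alternative criterion that takes values from a continuum is also considered. This criterion is known as the Brier score. Suppose $c(\mathbf{z}_r)$ denotes the true group of object $\mathbf{z}_r$ in the training set, being equal to 1 if $\mathbf{z}_r$ is from $\pi_1$ and 0 otherwise, and $p(\pi_1 \mid \mathbf{z}_r)$ is the probability that the object with measurements $\mathbf{x}$ and $\mathbf{y}$ belongs to $\pi_1$. Then for the two-group case, with n objects in the training set, the Brier score is defined as (Hand, 1997)

$$BS = \frac{1}{n} \sum_{r=1}^{n} \left\{ c(\mathbf{z}_r) - p(\pi_1 \mid \mathbf{z}_r) \right\}^2. \qquad (14)$$

To have a good estimate of the Brier score, an appropriate method that gives the best estimate of $p(\pi_1 \mid \mathbf{z}_r)$ has to be used. If the group densities are known to follow the location model, we may obtain it easily through Bayes' theorem:

$$p(\pi_1 \mid \mathbf{z}) = \frac{P(\pi_1)\, p(\mathbf{z} \mid \pi_1)}{P(\pi_1)\, p(\mathbf{z} \mid \pi_1) + P(\pi_2)\, p(\mathbf{z} \mid \pi_2)}, \qquad (15)$$

where $P(\pi_i)$ is the prior probability of obtaining an object from $\pi_i$ and

$$p(\mathbf{z} \mid \pi_i) = \frac{p_{is}}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\tfrac{1}{2} (\mathbf{y} - \boldsymbol{\mu}_{is})^T \boldsymbol{\Sigma}^{-1} (\mathbf{y} - \boldsymbol{\mu}_{is}) \right\}. \qquad (16)$$

We estimate both (15) and (16) by replacing all the parameters ($\boldsymbol{\mu}_{is}$, $\boldsymbol{\Sigma}$, $p_{is}$) with the values of the corresponding smoothed estimators ($\hat{\boldsymbol{\mu}}_{is}$, $\hat{\boldsymbol{\Sigma}}$, $\hat{p}_{is}$) obtained from the training set, choosing prior probabilities appropriate to the substantive application. As with the error rate, we choose the $\lambda$ that provides the smallest value of the Brier score. Both the error rate and the Brier score are computed through the leave-one-out process, so that biased estimation of these criteria is avoided and the three strategies for selecting $\lambda_{\mathrm{opt}}$ are compared without bias.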
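The contrast between the discrete error rate and the continuous Brier score can be seen in a short Python sketch. It assumes the Brier score is taken as the mean squared difference between the group-1 indicator and the estimated posterior probability of group 1 (other normalising constants are sometimes used), and that the posterior comes from Bayes' theorem applied to the two group densities. The function names and the 0.5 allocation threshold are assumptions made for illustration.

```python
def posterior_group1(f1, f2, prior1=0.5):
    """Posterior probability of group 1 from the two group densities."""
    numerator = prior1 * f1
    return numerator / (numerator + (1.0 - prior1) * f2)


def brier_score(c, post):
    """Mean squared difference between indicators c (1 = group 1) and posteriors."""
    return sum((ci - pi) ** 2 for ci, pi in zip(c, post)) / len(c)


def error_rate(c, post, threshold=0.5):
    """Proportion of objects whose allocated group differs from the true group."""
    wrong = sum((pi >= threshold) != bool(ci) for ci, pi in zip(c, post))
    return wrong / len(c)
```

A small change in the smoothing parameter typically moves the posteriors slightly, which changes the Brier score smoothly but leaves the error rate unchanged until some posterior crosses the threshold.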

Assessment of smoothed allocation rule
There are nine possible allocation rules altogether, obtained from the combination of the three different methods for smoothing cell probabilities and the three different strategies for choosing $\lambda_{\mathrm{opt}}$. In each of these rules the exponential smoothing method is used for estimating the cell means and the covariance matrix. We compare these rules by measuring their performance using misclassification of future objects as the criterion. For this purpose, the leave-one-out error rate is preferred because it removes the bias in the apparent error rate and directly represents a classifier's inaccuracy as the percentage of objects it misclassifies.
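The methods for selecting $\lambda_{\mathrm{opt}}$ and evaluating rule performance both need the leave-one-out process. We conduct these leave-one-out processes in nested fashion: the inner leave-one-out is used to select the optimised smoothing parameter and the outer leave-one-out is used to measure rule performance. The details of the proposed design are outlined as follows:
1. Omit each object r in turn from the sample (r = 1, . . ., n).
1.1 Find $\lambda_{\mathrm{opt}}$ from the sample without object r by (1.1.1) maximising the leave-one-out pseudo-likelihood, (1.1.2) minimising the leave-one-out error rate or (1.1.3) minimising the leave-one-out Brier score.
1.2 Compute $\hat{\boldsymbol{\mu}}_{is}$, $\hat{\boldsymbol{\Sigma}}$ and $\hat{p}_{is}$ using the obtained value of $\lambda_{\mathrm{opt}}$ and the sample without object r.
1.3 Construct an allocation rule using the estimates obtained in (1.2).
1.4 Predict the group of object r. If correct, then $e_r = 0$, otherwise $e_r = 1$.
2. Compute the leave-one-out error rate, $\sum_{r=1}^{n} e_r / n$.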
For option (1.1.2) in step 1.1, we evaluate the allocation rule performance over a set of values of $\lambda$ in (0, 1) and choose the value that provides the smallest leave-one-out error rate as $\lambda_{\mathrm{opt}}$. For options (1.1.1) and (1.1.3), however, we obtain $\lambda_{\mathrm{opt}}$ using the built-in functions 'optimize' and 'nlminb' provided in S-Plus. It is worth emphasising that the optimised value obtained from any of these strategies is not necessarily the global optimum.
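The nested leave-one-out design can be sketched generically in Python. This is a simplified illustration rather than the S-Plus implementation: it selects the smoothing parameter by grid search for whichever criterion is supplied, whereas the text uses a grid only for the error-rate criterion. The callables `fit_rule` and `criterion`, and the representation of the sample as (features, group) pairs, are hypothetical.

```python
def nested_loo_error(sample, fit_rule, criterion, grid):
    """Outer leave-one-out error rate with inner selection of lambda.

    sample    : list of (features, group) pairs
    fit_rule  : fit_rule(train, lam) -> function mapping features to a group
    criterion : criterion(train, lam) -> value to minimise on the reduced sample
    grid      : candidate values of the smoothing parameter in (0, 1)
    """
    errors = 0
    for r, (features, group) in enumerate(sample):
        train = sample[:r] + sample[r + 1:]          # step 1: omit object r
        lam_opt = min(grid, key=lambda lam: criterion(train, lam))  # step 1.1
        predict = fit_rule(train, lam_opt)           # steps 1.2 and 1.3
        errors += predict(features) != group         # step 1.4
    return errors / len(sample)                      # step 2


def inner_loo_error(fit_rule):
    """Build a criterion that returns the inner leave-one-out error rate."""
    def criterion(train, lam):
        wrong = 0
        for r, (features, group) in enumerate(train):
            rest = train[:r] + train[r + 1:]
            wrong += fit_rule(rest, lam)(features) != group
        return wrong / len(train)
    return criterion
```

The omitted object never influences the choice of the smoothing parameter used to classify it, which is what keeps the outer error estimate from being optimistically biased. The cost is roughly n times n rule fits per grid value, which is why the simulations are described as extensive.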
Since there are nine rules, we distinguish them by labelling each rule with the method used to smooth the cell probabilities; thus we have (i) Nearest neighbour$_{\mathrm{LL}}$, Nearest neighbour$_{\mathrm{ER}}$ and Nearest neighbour$_{\mathrm{BS}}$, (ii) Kernel$_{\mathrm{LL}}$, Kernel$_{\mathrm{ER}}$ and Kernel$_{\mathrm{BS}}$ and (iii) Exponential$_{\mathrm{LL}}$, Exponential$_{\mathrm{ER}}$ and Exponential$_{\mathrm{BS}}$. The subscripts refer to the strategies (log-likelihood, error rate and Brier score) for obtaining the optimised smoothing parameter, $\lambda_{\mathrm{opt}}$.

Monte Carlo simulation
The rules discussed in the previous section were evaluated through simulations. We generated data having a mixture of continuous and binary variables by following Everitt and Merette (1990). The two dispersion matrices considered differ in the variances of the continuous variables, which are small (unity) in the first matrix and widely dispersed in the second. Finally, the largest proportions of the binary variable values are given in Table 1.
SET 1 in Table 1 gives large separation of cells, so we will have more objects in one category than in the other category for the same binary variable. We therefore expect to obtain many empty cells with this set, especially when the sample size is very small. SET 2 and SET 3 give moderate and small separation of cells respectively. In SET 3, full cells are expected. There are 18 combinations of (i) group size, (ii) dispersion matrix and (iii) proportions in the categories of the binary variables (see Table 1). Since the simulation process is computationally extensive, we generated only one artificial data set for each combination and restricted our investigation to equal sample sizes in both groups for these 18 data sets.

Selecting optimised smoothing parameters
Figure 1 shows the variability of $\lambda_{\mathrm{opt}}$ across one data set for the three allocation rules when it is chosen to minimise the leave-one-out Brier score. These examples are taken from data set 4, one of the 18 data sets generated in the previous section. The patterns for the other data sets, and those corresponding to the other two optimisation criteria, were similar. They show some differences in the amount of smoothing from one sample of size (n − 1) to other samples of the same size. This variability occurs because the samples have slightly different distributions when a different object r is omitted.
We have labelled several points that lie far from the majority by their omitted objects (Figure 1(a) and Figure 1(c)) and by circled numbers (Figure 1(b)), to highlight potential outliers. It is notable that, across the plots, different methods identify different potential outliers.

Performance of the smoothed allocation rules for normal populations
The performances of the smoothed allocation rules are depicted in Figure 2. Each plot displays the performance of each smoothed allocation rule, estimated through the leave-one-out error rate (Y-axis), in relation to the distance between the two groups measured through the Kullback-Leibler divergence (X-axis). The three lines in each plot represent the three different strategies for selecting $\lambda_{\mathrm{opt}}$. As can clearly be seen, all plots show the same decreasing pattern as the distance between the two groups increases. It is difficult to tell from Figures 2(a), 2(b) and 2(c) whether any single rule always gives the smallest estimated error rate, so Table 2 shows the frequencies with which each allocation rule was best.
In this table, the best allocation rule is the one that gives the lowest error rate among the three for each of the 18 artificial data sets. The scoring system is as follows: we assign 1 point to a single best rule, 1/2 point to each of two rules if they tie for best, and 1/3 point to each rule if all three tie. Among the smoothing methods, the nearest neighbour rule is clearly best when its $\lambda_{\mathrm{opt}}$ is chosen through the minimisation of the leave-one-out Brier score or the maximisation of the leave-one-out log-likelihood function, while the kernel rule is the winner when its $\lambda_{\mathrm{opt}}$ is chosen through the minimisation of the leave-one-out error rate. However, in terms of this criterion all three methods perform very similarly.
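The tie-splitting scoring system can be written as a few lines of Python. The function name, the tolerance used to detect ties and the use of exact fractions are choices made for this sketch.

```python
from fractions import Fraction


def winner_points(error_rates, tol=1e-12):
    """Split one point equally among the rules with the lowest error rate.

    error_rates : dict rule name -> leave-one-out error rate on one data set
    """
    best = min(error_rates.values())
    winners = [rule for rule, err in error_rates.items() if abs(err - best) <= tol]
    share = Fraction(1, len(winners))
    return {rule: (share if rule in winners else Fraction(0))
            for rule in error_rates}
```

Summing these points over the data sets gives the frequencies of being best, and the total over all rules equals the number of data sets.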

Investigation with non-normal populations
The rules where $\lambda_{\mathrm{opt}}$ is obtained through the maximisation of the normal log-likelihood function should show some deterioration when the data are in fact non-normal. In such a case, their performance should differ from that of the other rules because the other two criteria (i.e. optimising the error rate and the Brier score) do not rely on distributional assumptions.
To investigate the performance of the proposed rules under non-normal conditions, we randomly chose five of the 18 generated data sets and transformed all continuous variables in these data sets using the inverse of Johnson's systems (Lachenbruch et al., 1973) so that they no longer had a multivariate normal distribution. The chosen transformations were: log-normal (data sets 1 and 8), logit (data sets 2 and 14) and hyperbolic sine normal (data set 17). Details of these transformations can be found in Lachenbruch et al. (1973). To avoid complexity, the same transformation function was applied to the two groups. The same investigations as performed for normal populations were repeated, and results are shown in Table 3.
Our results do not show the expected effect, possibly because the estimated parameters have been smoothed for density estimation purposes. Nevertheless, caution is advisable when dealing with non-normal data, and we recommend choosing the smoothing parameter through inaccuracy measures (error rate or Brier score) in this situation.

Comparison of maximum likelihood and smoothed classification rules
Smoothing is essential whenever some of the multinomial cells are either empty or have very few observations.In the former case it is impossible to construct the maximum likelihood classification rule in such cells, while in the latter case the rule will be very poorly estimated.However, it is also of interest to enquire how the smoothed rules will compare with maximum likelihood ones when all cells have sufficient numbers of objects in them.If the smoothed rules show good performance in this comparison, they can be applied very generally rather than only in cases where cell membership is sparse.
To investigate this question, we extracted from the foregoing data sets those that had observations in all the multinomial cells; there were 7 such sets among the normal ones and 3 among the non-normal ones.For each data set, the error rate and the Brier score were found for the maximum likelihood (ML) rule and for each of the three smoothed rules.
Results are shown in Table 4, where we give the number of sets in which the ML rule performed better than all the smoothed rules ("ML best"), the number in which all the smoothed rules performed better than ML ("ML worst"), and the number in which either one or two smoothed rules performed better than ML ("ML intermediate"). ML was beaten by at least one smoothed rule on both assessment criteria in nine of the ten data sets, so these results suggest that the smoothed rules constitute good classifiers whether the multinomial data are sparse or not.

Practical example
An example of using the proposed rules for a real problem is provided by data set 2 reported by Krzanowski (1975). These data concern patients suffering from jaundice. These patients comprise two groups: $\pi_1$ denotes patients who required medical treatment and $\pi_2$ denotes patients who required surgical treatment. There are 30 patients in $\pi_1$ and 63 patients in $\pi_2$. Four continuous and three binary variables are available for distinguishing these two groups. Since the costs of misclassifying patients to the wrong group are unavailable, we assume equal costs and prior probabilities.
The performance of the proposed allocation rules is shown in Table 5. Since there are many cells without patients, no allocation rule can be derived through maximum likelihood estimation. Among the smoothed allocation rules, Nearest neighbour$_{\mathrm{LL}}$ is best. In fact, the numbers of misclassifications recorded by the nearest neighbour rules are smaller than those for the kernel and exponential rules. These findings are consistent with those in Table 2, where the nearest neighbour rules are best, followed by the kernel rules and finally the exponential rules.

Discussion
The results that we have obtained through the simulation study and the numerical example show that the choice of the optimised smoothing parameter depends on both the criterion used for the optimisation and the smoothing method. In respect of criterion, we recommend selecting $\lambda_{\mathrm{opt}}$ through minimisation of the leave-one-out Brier score because of its continuous nature and because it does not make any distributional assumptions. The continuous nature makes it more amenable to standard function optimisation routines, and avoids the problem of having many local optima that arises with the discrete error rate. The lack of distributional assumptions circumvents potential objections to the use of the log-likelihood ratio.
As regards smoothing method, our findings show that the nearest neighbour rule is the best and is followed by the kernel rule.This situation may relate to the lower restriction of the smoothing applied by them.Despite always being the best, however, nearest neighbour needs to be handled carefully because it is sensitive to the occurrence of cells without objects.On the other hand, the kernel method is heavy in computational time relative to its competitors.Alternatively, one may choose the exponential method if using a single smoothing parameter is preferred.It is possible to use this method because its results are not much poorer than the others.As a final comment, the use of smoothing methods in the location model should be based on the problem in hand, rather than being a mere tool of the estimation method.

Figure 1
Figure 1(a) illustrates the variability of $\lambda_{\mathrm{opt}}$ for smoothing cell means using the nearest neighbour rule, while Figure 1(c) illustrates the variability of $\lambda_{\mathrm{opt}}$ for smoothing cell means and cell probabilities using the exponential rule. In both plots, the Y-axis represents the value of $\lambda_{\mathrm{opt}}$ and the X-axis represents the object r. Each point in these plots indicates the smoothing parameter value for the allocation rule constructed from the sample on omitting object r. The kernel rule applies two different smoothing parameters, one each for smoothing cell means and cell probabilities, so we plot the two corresponding values of $\lambda_{\mathrm{opt}}$ against each other on the Y-axis and X-axis (Figure 1(b)). Each point in this plot represents the sample on omitting object r. All these plots show a clear scatter of points, especially Figure 1(c).

Figure 1. Example: $\lambda_{\mathrm{opt}}$ that minimises the leave-one-out Brier score

Table 1. Proportions in categories of binary variables.

Table 2. Frequency of being a winner among smoothing methods based on leave-one-out criteria.

Table 3. Performance of the rules for non-normal populations.

Table 4. Comparison of the performance of the ML rule with those of the smoothed rules.

Table 5. Location model assessment for jaundice patients.