Consumer Credit Customers' Financial Distress Prediction by Using Two-Group Discriminant Analysis: A Case Study

This study estimates a two-group discriminant function to determine the expected financial health of the consumer credit customers of a bank in Bangladesh by using thirteen demographic, socio-economic, and loan characteristics of the sample borrowers. The estimated function is significant at the one per cent level, and the model predicts financial health (group membership) with an average accuracy of seventy-five per cent. As in developed countries, it is expected that using the estimated discriminant function in consumer credit decisions will decrease bad debts, help to set risk-based credit pricing for clients, and make credit granting faster and more accurate.


Introduction
The idea of consumer credit is extensive. In general, consumer credit refers to express loan facilities extended to the general public that must be repaid with interest in equal monthly installments and that are not used for any commercial purpose. In the US, in 1979, 20-30 per cent of all consumer credit decisions were made on the basis of discriminant analysis, and most of the large institutions in the following sectors used discriminant analysis for their credit granting decisions: banks, finance companies, oil companies, retail merchants, and travel and entertainment cards (Credit Card Redlining, 1979). Unlike in the US, banks and other financial institutions in Bangladesh sanction loans to their clients by the traditional credit approval method, which is based on human assessment and the experience of previous decisions. In this approach, the various aspects of a consumer credit application are evaluated manually, and the decision whether to grant credit is made on that basis. This conventional process cannot generate a concrete score for a new applicant. It is therefore substantially better to use discriminant analysis to determine the expected position, or a score, for the borrower when making the credit granting decision.
In this study, an effort is made to model the consumer credit of a bank in Bangladesh by applying discriminant analysis to socio-economic, demographic, and loan characteristics, with the aim of reliable and efficient loan operations and lower consumer credit risk. In other words, a quantitative effort is made to forecast the expected position of the consumer credit applicant via discriminant analysis. Discriminant analysis resembles regression analysis in the number of dependent variables (one for both), the number of independent variables (multiple for both), and the nature of the independent variables (metric for both). The two differ, however, in the nature of the dependent variable: in regression analysis the dependent variable is metric, whereas in discriminant analysis it is categorical (binary in the two-group case). The nature of the dependent variable is the same in the binary logit model and in two-group discriminant analysis. The linear discriminant analysis model involves linear combinations of the form of equation 1:
$$Z = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k \qquad (1)$$
where $Z$ is the discriminant score, the $b$'s are the discriminant coefficients, and the $X$'s are the predictor variables. The coefficients are estimated so that the function separates the two groups substantially. This happens when the ratio of the between-group sum of squares to the within-group sum of squares of the discriminant scores is at its maximum. For any other combination, the ratio will be smaller.
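The maximization of the between-group to within-group sum of squares ratio can be illustrated numerically. The following is a minimal sketch, assuming only two predictors and small made-up data; the data values and the function names (`fisher_direction`, `ss_ratio`) are hypothetical and are not taken from the study. It uses the standard two-group result that the maximizing coefficients are proportional to the inverse pooled within-group scatter matrix times the difference of the group mean vectors.

```python
# Toy data (hypothetical): each row is a case with predictors (X1, X2).
group1 = [(2.0, 3.0), (3.0, 3.5), (2.5, 4.0), (3.5, 4.5)]
group2 = [(5.0, 1.0), (6.0, 2.0), (5.5, 1.5), (6.5, 2.5)]


def mean_vector(rows):
    """Column means of a list of (x1, x2) tuples."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def within_scatter(rows, mean):
    """2x2 sums of squares and cross-products around the group mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(g1, g2):
    """Coefficients b = W^-1 (m1 - m2), W being the pooled within scatter."""
    m1, m2 = mean_vector(g1), mean_vector(g2)
    s1, s2 = within_scatter(g1, m1), within_scatter(g2, m2)
    w = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]
    inv = [[w[1][1] / det, -w[0][1] / det], [-w[1][0] / det, w[0][0] / det]]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    return [inv[i][0] * diff[0] + inv[i][1] * diff[1] for i in range(2)]


def ss_ratio(b, g1, g2):
    """Between-group SS divided by within-group SS of the scores Z = b'X."""
    z1 = [b[0] * r[0] + b[1] * r[1] for r in g1]
    z2 = [b[0] * r[0] + b[1] * r[1] for r in g2]
    m1, m2 = sum(z1) / len(z1), sum(z2) / len(z2)
    grand = (sum(z1) + sum(z2)) / (len(z1) + len(z2))
    between = len(z1) * (m1 - grand) ** 2 + len(z2) * (m2 - grand) ** 2
    within = sum((z - m1) ** 2 for z in z1) + sum((z - m2) ** 2 for z in z2)
    return between / within


b_best = fisher_direction(group1, group2)
ratio_best = ss_ratio(b_best, group1, group2)
ratio_x1_only = ss_ratio([1.0, 0.0], group1, group2)  # use X1 alone
ratio_equal = ss_ratio([1.0, 1.0], group1, group2)    # equal weights
print(b_best, ratio_best, ratio_x1_only, ratio_equal)
```

On this toy data the ratio for the estimated coefficients is larger than the ratio for either alternative weighting, which is the property the discriminant function is chosen to have.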
Figure 1 gives a pictorial presentation of data collected on two variables, X1 and X2, for the cases of two groups, G1 and G2. The X1 axis represents the X1 variable and the X2 axis represents the X2 variable. Discriminant analysis tries to separate the two groups by drawing a line, as shown in the figure. If data are collected on more than two variables, then it is not possible to draw such a scatter diagram, because a graph has only two axes. Regardless of the number of variables, however, discriminant analysis can generate positive and negative Z scores for the cases of the groups, which makes it possible to draw a diagram like the lower part of figure 1. The lower part represents group membership by using the estimated discriminant scores (Z) of the cases in each group. The shaded portion represents the misclassification of group membership. The smaller the shaded portion, the higher the estimation accuracy is assumed to be (Malhotra & Das, 2011; Boyd, Westfall, & Stasch, 2005).
The first section of this research report is the introduction to the study, which comprises the prologue, objectives, and methodology. The second section contains the literature review and the selection of variables. The findings and their analysis are in the third section. The fourth section consists of recommendations for policy makers and the conclusion of the study.

Literature Review
Wiginton (1980) conducted a discriminant analysis to model consumer credit behavior by using demographic and economic variables. The demographic variables are number of dependents, living status, moved during the last year, business use of vehicle, and pleasure use of vehicle. The economic variables are industry class of employment, class of occupation, and years in present employment. The predictive power of the estimated model is not encouraging, and predicting group membership with a logit model gives better forecasting accuracy. It is concluded that years in present employment, living status, and occupation type are significantly related to the credit risk rating.
Grablowsky (1975) conducted a two-group stepwise discriminant analysis to model risk in consumer credit by using behavioral, financial, and demographic variables. The behavioral data were collected from two hundred borrowers through a summated ratings scale questionnaire, and the financial and demographic data were collected from the loan application forms of the same two hundred borrowers. The researcher started the analysis with thirty-six variables and, after a comprehensive sensitivity analysis, found that thirteen variables are enough to model consumer credit risk. Although both sets of data, the analysis sample and the holdout sample, violated the equal variance-covariance assumption, the estimated model classified 94 per cent of the validation sample correctly.
Awh & Waters (1974) conducted a study to distinguish a bank's active and inactive credit card holders by using two types of variables: quantitative (economic and demographic) and attitudinal. The quantitative variables are (a) income, (b) age, (c) education, and (d) socio-economic standing. The socio-economic index is based on the respondent's particular position as suggested by Reiss (1961). The attitudinal variables are (a) use or non-use of other credit cards, (b) attitude toward credit, and (c) attitude toward bank charge-cards. For each respondent, the quantitative data are collected from the loan application forms and the attitudinal data by questionnaire. The discriminant function they estimated is significant at the 0.01 level and predicts group membership with 78 per cent accuracy.
Hand & Henley (1997) reviewed the available credit scoring techniques in their article "Statistical Classification Methods in Consumer Credit Scoring: A Review." In addition to the judgmental method, the available quantitative methods are logistic regression, mathematical programming, discriminant analysis, regression, recursive partitioning, expert systems, neural networks, smoothing nonparametric methods, and time varying models. They conclude that there is no single best method: which method is best depends on the structure and characteristics of the data. For one data set, one method may outperform another, while for a different data set the reverse may hold. In addition, Davis, Edelman & Gammerman (1992) conducted a comparative study of various methods and concluded that all of the methods perform at the same level of accuracy, but that neural network algorithms take much longer to train.
According to Hand & Henley (1997) [...]. Dinh & Kleimeier (2007) conducted a study of Vietnam's retail banking market by using logistic regression analysis. The variables they used include age, education, occupation, total time in employment, time in current job, residential status, number of dependents, applicant's annual income, family income, short-term performance history with the bank, long-term performance history with the bank, total outstanding loan amount, other services used, and cash in hand and at bank. They argue that quantitative credit scoring can reduce the default rate from 3.3 per cent to 2.0 per cent. They also argue that quantifying credit risk makes it possible to set up risk-based pricing in the retail banking market, so that the bank can become more efficient and competitive. The most important predictors they found are time with the bank, followed by gender, number of loans, and loan duration.
Based on the above literature review, the experience of the researcher, and the availability of data, thirteen demographic, socio-economic, and loan variables are selected for this study. The variables are loan amount, number of dependents, years of experience in the present job (Y-P-J), salary per month, living status, savings per month, cash in hand and at bank, net worth, ACT, N-EMI, EMI, interest rate (%), and Guar. The data on these variables are collected from the application forms of the consumer credit customers by filling in a pre-designed questionnaire.

Data
Both primary and secondary data are used in this study. The primary data are collected with a pre-determined questionnaire from the loan application forms of a private bank in Bangladesh, and the secondary data are collected from published journal articles, books, the web, and the SPSS manual. The primary data cover 15 default cases and 15 regular cases. An analysis sample is formed by combining 10 regular and 10 default cases, and a holdout (validation) sample is formed from the remaining 5 regular and 5 default cases. The analysis sample is used to estimate the discriminant function, and the holdout sample is used to check the validity of the model. Where possible, it is wise to collect data for a large sample, split it into an analysis sample and a holdout sample, estimate the function on the analysis sample, and check its validity on the holdout sample. The roles of the two data sets are then reversed: the function is estimated on the holdout sample and validated on the analysis sample. This process is known as double cross-validation.
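The sample split and the role reversal of double cross-validation can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the case identifiers are hypothetical, and random assignment with a fixed seed is an assumption, since the text does not say how the cases were allocated to the two samples.

```python
import random


def split_by_group(cases, n_analysis_per_group, seed=0):
    """Split labelled cases into an analysis and a holdout sample, taking
    the same number of analysis cases from each group."""
    rng = random.Random(seed)
    analysis, holdout = [], []
    for group in sorted({label for _, label in cases}):
        members = [c for c in cases if c[1] == group]
        rng.shuffle(members)  # assumption: random allocation
        analysis.extend(members[:n_analysis_per_group])
        holdout.extend(members[n_analysis_per_group:])
    return analysis, holdout


# Hypothetical identifiers: 15 default (status 1) and 15 regular (status 2).
cases = [(f"D{i}", 1) for i in range(15)] + [(f"R{i}", 2) for i in range(15)]
analysis, holdout = split_by_group(cases, n_analysis_per_group=10)

# Double cross-validation: estimate on one part, validate on the other,
# then swap the roles of the two samples.
rounds = [
    {"estimate_on": analysis, "validate_on": holdout},
    {"estimate_on": holdout, "validate_on": analysis},
]
for r in rounds:
    print(len(r["estimate_on"]), len(r["validate_on"]))
```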

Data Analysis Technique, Software Used and Cautions
To analyze the collected data and to answer the research questions, direct-method discriminant analysis is used as the data analysis technique. In the direct method, all of the variables are included simultaneously, without regard to their discriminating power. This method is used when, on the basis of previous research or a theoretical model, the researcher wants the discrimination to be based on all of the variables. The alternative is stepwise discriminant analysis, in which variables enter the model according to their discriminating power. The software used to analyze the data is SPSS and MS Excel. As in regression analysis, the sample size should be large enough to estimate a discriminant function; an inadequate sample size may produce a misleading function. A substantially larger sample than the one used in this study is needed before a discriminant function is used in real-life decision making. Multicollinearity is checked through the pooled within-group correlations between the predictors. The author was careful in selecting the independent variables and ensured that no unnecessary independent variable is included in the study. The quality of the dependent variable is also ensured. The quality of a dependent variable can sometimes be poor: for instance, if the dependent variable distinguishes successful from unsuccessful salesmen and the target for being successful is set unrealistically high, the quality is poor.

Description of the Variables
The variables used in this study are of two types: the dependent variable and the independent variables. The only dependent variable is the status of the borrower, which is a categorical variable. Based on the historical data, a borrower whose position is default is coded 1 and a borrower whose position is regular is coded 2. There are two types of independent (predictor) variables in this study: some relate to the loan, and the others relate to the demographic and socio-economic conditions of the borrower. The loan-related variables are Loan, N-EMI, EMI, Interest, and Guar.; the demographic and socio-economic variables are Dependents, Y-P-J, Salary, Living, Savings, Cash, Net worth, and ACT.

Group Means
Group means and standard deviations are calculated for each variable of the default and the regular groups. By examining the differences between the group means and the standard deviations, it is possible to see whether the variables can differentiate between default customers and regular customers. The group statistics can be used as a characteristics profile of the two groups. Table 1 shows that the group means differ between the groups for loan amount, dependents, monthly salary, savings, cash, net worth, EMI, and interest rate, so these variables can differentiate group membership. The other variables (Y-P-J, living, ACT, N-EMI, and Guar.) have means of similar magnitude in the two groups, which suggests that they do not play a significant role in determining group membership. The pooled within-group correlation matrix, not reported here for reasons of space, shows very low correlations between the variables, which indicates that there is no multicollinearity problem in the data.

Tests of Equality of Group Means
In order to test the equality of the group means, the Wilks' lambdas and the F ratios are estimated and reported in table 2. The Wilks' lambda (λ) for each predictor is the ratio of the within-group sum of squares to the total sum of squares. Its value varies between 0 and 1. A large value of λ (near 1) indicates that the group means are not different, whereas a small value indicates that the group means are different. Wilks' λ is sometimes known as the U statistic. Table 2 shows that the values of Wilks' λ are equal to 1 for the variables Y-P-J, living, and ACT. Consequently, these variables are insignificant in determining group membership. In general, Wilks' λ is acceptable when its value is less than or equal to 0.95, so eliminating the variables with Wilks' λ greater than or equal to 0.95 should not change the result of the analysis. The tests also show that some predictors (interest rate, savings, EMI, net worth, loan, and dependents) play a significant role in distinguishing default from regular borrowers. The F values are calculated from a one-way ANOVA in which the grouping variable serves as the categorical independent variable and each predictor serves as the metric dependent variable. A low significance value (p-value) for an F ratio means that the variable is highly significant in determining group membership. Conversely, a very high significance value means that the variable is insignificant in predicting group membership.
The pooled within-groups matrix is composed by averaging the corresponding values of the two 13×13 covariance matrices of the two groups. Box's M tests the equality of the two groups' covariance matrices and is based on the similarity of their log determinants; the test is also sensitive to departures from multivariate normality. A transformed value of Box's M is an F ratio that tests the equality of the log determinants of the two covariance matrices. This F is conceptually the same as the F ratio in ANOVA, which is the ratio of between-group variability to within-group variability. A significance value of .000 indicates that the assumption of equal covariance matrices and multivariate normality is violated. However, a value less than 0.05 does not automatically disqualify the estimation of the discriminant function. Although the assumption is violated, the estimation is worthwhile, as confirmed in the section on assessing the validity of the model. This is surprisingly often the case. However, since the significance value is very low, it is justifiable to check the univariate normality of the variables.
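The univariate Wilks' λ and the one-way ANOVA F ratio described in this section can be computed for a single predictor as in the sketch below. The interest-rate values are made up for illustration and are not the study's data; the function name `wilks_lambda_and_f` is likewise hypothetical.

```python
def wilks_lambda_and_f(group_a, group_b):
    """Univariate Wilks' lambda (within SS / total SS) and the one-way
    ANOVA F ratio for one predictor measured in two groups."""
    values = group_a + group_b
    n = len(values)
    grand_mean = sum(values) / n
    total_ss = sum((x - grand_mean) ** 2 for x in values)
    within_ss = 0.0
    for g in (group_a, group_b):
        g_mean = sum(g) / len(g)
        within_ss += sum((x - g_mean) ** 2 for x in g)
    between_ss = total_ss - within_ss
    lam = within_ss / total_ss
    df_between, df_within = 1, n - 2  # two groups
    f_ratio = (between_ss / df_between) / (within_ss / df_within)
    return lam, f_ratio


# Hypothetical interest rates (%) for default and regular borrowers.
default_rates = [16.0, 17.0, 18.0, 17.5, 16.5]
regular_rates = [13.0, 14.0, 15.0, 14.5, 13.5]
lam, f_ratio = wilks_lambda_and_f(default_rates, regular_rates)

# A predictor with identical group means gives lambda = 1 and F = 0.
lam_same, f_same = wilks_lambda_and_f([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
print(lam, f_ratio, lam_same, f_same)
```

In the two-group case the two statistics carry the same information, since F = ((1 - λ) / λ) × (n - 2); a λ close to 1 therefore always goes with a small F.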

Determine the Significance of the Discriminant Function
Function 1 in table 4 indicates that one discriminant function is estimated, because the dependent variable has two groups. The eigenvalue is the ratio of the between-group sum of squares to the within-group sum of squares. The higher the value, the better the estimated function; the minimum acceptable eigenvalue is more than one. The eigenvalue of the estimated function is 21.8, which accounts for 100 per cent of the explained variance; the cumulative per cent is therefore also 100. The canonical correlation measures the association between the discriminant scores and the groups. The canonical correlation of the estimated function is 0.978. The coefficient of determination is the square of the correlation coefficient, (0.978)² = 0.9565, which means that 95.65 per cent of the variance in the dependent variable is explained by the estimated discriminant function. The Wilks' λ of the estimated function is 0.044, and it is used to check the significance of the function. The transformed chi-square (χ²) is 35.92 with 13 degrees of freedom. The associated p-value (Sig.) is 0.00, which means that the null hypothesis is rejected at the 1 per cent level of significance. The discriminant function is therefore significant, and estimating and interpreting it are justified.
Note: The first canonical discriminant function is used in the analysis.
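The reported statistics are linked by standard identities for a single discriminant function, which can be checked with a short calculation. The sketch below takes the eigenvalue (21.8), the rounded Wilks' λ (0.044), and the number of predictors (13) from the text, and the analysis sample size (20) from the Data section. That the chi-square was obtained with Bartlett's approximation is an assumption, although it reproduces the reported value.

```python
import math

eigenvalue = 21.8      # reported eigenvalue of the discriminant function
n_cases = 20           # analysis sample: 10 default + 10 regular cases
n_predictors = 13
n_groups = 2

# Canonical correlation and Wilks' lambda implied by the eigenvalue.
canonical_corr = math.sqrt(eigenvalue / (1 + eigenvalue))
wilks_lambda = 1 / (1 + eigenvalue)

# Bartlett's chi-square approximation (assumed), evaluated at the rounded
# Wilks' lambda of 0.044 reported in the text.
reported_lambda = 0.044
chi_square = -(n_cases - 1 - (n_predictors + n_groups) / 2) * math.log(reported_lambda)
degrees_of_freedom = n_predictors * (n_groups - 1)

print(round(canonical_corr, 3))  # 0.978
print(round(wilks_lambda, 3))    # 0.044
print(round(chi_square, 2))      # 35.92
print(degrees_of_freedom)        # 13
```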

Structure Matrix
The structure correlations are also referred to as discriminant loadings. They represent the simple correlations between the predictors and the discriminant function, and they are used to determine the relative importance of the variables in predicting group membership. In table 5, the variables are ordered by the absolute size of the correlation between each discriminating variable and the unstandardized canonical discriminant function, that is, from the most important to the least important variable in determining group membership. According to the table, the most important variables are interest rate, followed by savings, EMI, net worth, loan, dependents, and cash. The least important variables are ACT, followed by living, Y-P-J, N-EMI, and Guar.
To assess a new loan applicant, the values of the variables are taken from the loan application form and substituted into equation 2. If the estimated Z score of the applicant is positive, then the expected position of the applicant is default, because the centroid of the default group is positive, and the application should be rejected. The larger the distance between a positive Z and 0, the higher the default risk of the borrower, so management should ask for a higher risk premium. If the estimated Z score is negative, then the expected position is regular, because the centroid of the regular group is negative, and the loan should be granted. The larger the distance between a negative Z and 0, the lower the default risk of the borrower, so management should ask for a lower risk premium. Thus, management can use Z scores to set risk-based interest rates.
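The decision rule for a new applicant can be sketched in a few lines. The group centroids (4.422 and -4.422) are those reported in this study; everything else is hypothetical. In particular, the two coefficients, the constant, and the applicant values below are invented for illustration only and are NOT the coefficients of the estimated function, which uses thirteen predictors.

```python
CENTROID_DEFAULT = 4.422    # reported centroid of the default group
CENTROID_REGULAR = -4.422   # reported centroid of the regular group
# With equal group sizes, the cutoff is the midpoint of the two centroids.
CUTOFF = (CENTROID_DEFAULT + CENTROID_REGULAR) / 2


def discriminant_score(constant, coefficients, applicant):
    """Z = constant + sum of (coefficient x raw value) over the predictors."""
    return constant + sum(coefficients[name] * applicant[name] for name in coefficients)


def expected_status(z):
    """'default' for scores above the cutoff, otherwise 'regular'."""
    return "default" if z > CUTOFF else "regular"


# Hypothetical coefficients and applicant values, for illustration only.
coefficients = {"interest": 0.9, "savings": -0.0004}
constant = -11.0
applicant = {"interest": 17.0, "savings": 2000.0}

z = discriminant_score(constant, coefficients, applicant)
# The distance from the cutoff indicates how far the applicant is from
# the boundary between the two groups.
print(round(z, 2), expected_status(z), round(abs(z - CUTOFF), 2))
```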

Group Centroids
The group centroids are the averages of the Z values calculated by the estimated model and reported in the last column of table 8 for the default and regular groups. Equivalently, if the average values of the variables are substituted into the estimated discriminant function, the function generates the centroids. There are as many centroids as there are groups, so a two-group discriminant analysis has two centroids, one for each group.
In this study, the centroid of the default group is 4.422 and the centroid of the regular group is -4.422. The group centroids are used to evaluate the expected position of consumer credit customers. When a customer applies for a loan, his or her raw (unstandardized) values for the variables are substituted into the estimated discriminant function, and the function generates a positive or a negative value. The larger the absolute value, the clearer the forecast. If the estimated Z value of a case is positive, then the expected status of the case is default, because the centroid is positive for the default group; if the estimated value is negative, then the expected status is regular, because the centroid is negative for the regular group. The centroids are reported in table 7:

Casewise Statistics
Table 8 provides an excellent summary of the analysis. In the casewise statistics, the actual group is the actual position of the consumer credit customer on whom the data are collected, and the predicted group is the position predicted for that customer by the estimated discriminant model. The highest group is the group to which the case most probably belongs according to the estimated model. The second highest group is the alternative to the highest group, since this is a two-group discriminant analysis. The last column contains the estimated Z values of the analysis sample cases.
Note: Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.

Histogram of Z Values of Status-1 (Default) & Status-2 (Regular)
The Z values estimated for the analysis sample, given in the last column of table 8, are presented as bar diagrams in figure 2. The first bar diagram is for the default group. The diagram and table 8 show that the minimum Z value is 2.74, the maximum is 5.69, the average is 4.42, and the standard deviation is 0.952. The estimated Z values are substantially higher than 0, which indicates that the model predicts the group membership of the default cases in the analysis sample very accurately. The bar diagram on the right-hand side shows the Z values of the regular group. The diagram and table 8 show that the minimum value is -5.62, the maximum is -2.37, the average is -4.42, and the standard deviation is 1.05. The Z values are substantially negative, which indicates that the accuracy of the model for the regular group is also very high.
The classification matrix, also known as the confusion or prediction matrix, is used to check the validity of the model. The principal diagonal shows the correctly predicted cases, and the off-diagonal elements show the wrongly predicted group memberships. The sum of the principal diagonal elements divided by the total number of cases is the rate of correct prediction, also known as the hit ratio.
The classification matrix of the original sample (table 9) shows that 100 per cent of the cases are predicted correctly by the model. Because each case being predicted is itself included in the sample used to estimate the function, this classification matrix may be biased. A cross-validated classification matrix is therefore constructed: the case for which the prediction is made is kept out of the analysis sample, the model is estimated, and the model is then used to predict the membership of the omitted case. The process is repeated once for every case in the analysis sample, and finally the classification matrix is compiled. The lower part of table 9 shows that 85 per cent of the cross-validated grouped cases are classified correctly. In assessing the validity of the model, the cross-validated hit ratio should be given priority over the original hit ratio.
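The leave-one-out procedure described above can be sketched as follows. To keep the example short, a one-predictor nearest-group-mean rule stands in for the full thirteen-variable discriminant function; this substitution, the data values, and the function names are all assumptions made for illustration.

```python
def nearest_mean_classifier(train):
    """Fit a one-predictor rule that assigns a case to the group whose
    mean is closer (a stand-in for the full discriminant function)."""
    means = {}
    for label in {lab for _, lab in train}:
        xs = [x for x, lab in train if lab == label]
        means[label] = sum(xs) / len(xs)
    return lambda x: min(means, key=lambda lab: abs(x - means[lab]))


def hit_ratio(pairs):
    """Share of (actual, predicted) pairs that agree."""
    return sum(1 for actual, predicted in pairs if actual == predicted) / len(pairs)


def original_and_cross_validated(cases):
    """Hit ratios when each case is kept in, and left out of, estimation."""
    full_rule = nearest_mean_classifier(cases)
    original = [(lab, full_rule(x)) for x, lab in cases]
    cross = []
    for i, (x, lab) in enumerate(cases):
        # Re-estimate the rule without case i, then classify case i.
        rule = nearest_mean_classifier(cases[:i] + cases[i + 1:])
        cross.append((lab, rule(x)))
    return hit_ratio(original), hit_ratio(cross)


# Hypothetical (interest rate, status) pairs; 1 = default, 2 = regular.
cases = [(17.0, 1), (16.5, 1), (18.0, 1), (15.4, 1),
         (13.0, 2), (14.0, 2), (13.5, 2), (15.2, 2)]
original_hit, cross_validated_hit = original_and_cross_validated(cases)
print(original_hit, cross_validated_hit)
```

On this toy data the original hit ratio is 1.0 while the cross-validated hit ratio is 0.75, which illustrates why the original classification matrix tends to be optimistic.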

Casewise Statistics of the Holdout Sample
By substituting the values of the holdout sample into the estimated discriminant function, table 11 of casewise Z values is constructed. In the holdout sample, 5 out of 5 default customers are classified correctly, and 3 out of 5 regular customers are classified incorrectly. In total, 7 out of 10 cases are classified correctly and 3 out of 10 incorrectly, that is, 70 per cent of the cases are classified correctly.
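The holdout hit ratio follows directly from the classification matrix implied by these counts. In the sketch below, the matrix entries come from the text (with two groups, the three misclassified regular customers must have been predicted as default); the dictionary layout and function names are only one possible arrangement.

```python
# Holdout classification matrix implied by the text:
# outer key = actual group, inner key = predicted group.
holdout_matrix = {
    "default": {"default": 5, "regular": 0},
    "regular": {"default": 3, "regular": 2},
}


def hit_ratio(matrix):
    """Correctly classified cases (diagonal) divided by all cases."""
    correct = sum(matrix[g][g] for g in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total


def group_accuracy(matrix):
    """Share of each actual group that is classified correctly."""
    return {g: matrix[g][g] / sum(matrix[g].values()) for g in matrix}


print(hit_ratio(holdout_matrix))       # 0.7
print(group_accuracy(holdout_matrix))  # {'default': 1.0, 'regular': 0.4}
```

The per-group figures show that all of the holdout errors fall on regular customers who are predicted to default.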

Classification Matrix Using Total Sample as Analysis Sample
In this section, the analysis sample and the holdout sample are combined and used together as the analysis sample, and the confusion matrix is constructed (table 13). It shows that around 87 per cent of the original grouped cases and around 77 per cent of the cross-validated grouped cases are classified correctly. It is also wise to compare the hit ratio of the discriminant analysis with the hit ratio that would be obtained if the decision were made by chance. If the groups are equal in size, the chance hit ratio is 1 divided by the number of groups. In this study there are two groups, so a random decision gives a hit ratio of 50 per cent. There is no specific rule for when the result of a discriminant analysis is acceptable. However, some researchers argue that the hit ratio of the discriminant analysis should be at least 25 per cent higher than the hit ratio obtained by chance (Joseph, William, Barry, & Ralph, 2010; Glen, 2001). In addition, Boyd et al. (2005) mention that an accuracy of more than 70 per cent justifies the use of discriminant analysis. For this study, the average hit ratio is more than 75 per cent, and hence the validity of the model is satisfactory.
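The comparison with the chance hit ratio can be written out as a small check. The sketch reads "at least 25 per cent higher than chance" as 1.25 times the chance ratio, which is an interpretation of the criterion; the two hit ratios are those reported for the holdout sample and for the cross-validated total sample.

```python
def chance_hit_ratio(n_groups):
    """Hit ratio expected from random assignment with equal-sized groups."""
    return 1 / n_groups


def exceeds_chance_by(hit_ratio, n_groups, margin=0.25):
    """True if the hit ratio is at least `margin` above the chance ratio,
    measured as a proportion of the chance ratio."""
    return hit_ratio >= chance_hit_ratio(n_groups) * (1 + margin)


threshold = chance_hit_ratio(2) * 1.25  # 0.625 for two equal groups
reported = {
    "holdout sample": 0.70,
    "total sample, cross-validated": 0.767,
}
for name, ratio in reported.items():
    print(name, ratio, exceeds_chance_by(ratio, n_groups=2))
```

Under this reading the threshold is 62.5 per cent, and both reported hit ratios exceed it.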

Conclusion
This study estimates a two-group discriminant function in order to determine the expected status of the consumer credit customers of a bank in Bangladesh. The estimated function is significant at the 1 per cent level and forecasts financial health with an average accuracy of 75 per cent. The study therefore proposes that demographic, socio-economic, and loan-related variables can be used to determine the expected group membership of borrowers in Bangladesh. A discriminant function estimated for one bank or institution cannot be used for another, because the coefficients depend on the institution's own data set. Hence each bank or institution should use its own database to estimate its own discriminant function. By using the estimated function, the consumer credit disbursement decision can be faster, more

Figure 2. Histogram of Z values of status-1 (default) & status-2 (regular)

Assessing the Validity of the Model

Classification Matrix of the Analysis Sample
a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
b. 100.0% of original grouped cases correctly classified.
c. 90.0% of cross-validated grouped cases correctly classified.
Loan: The amount of the loan borrowed by the borrower.
N-EMI: The number of equal monthly installments.
EMI: The amount of the equal monthly installment paid by the borrower per month.
Interest: The interest rate determined by the bank for the loan.
Guar.: The personal guarantor of the borrower. If the borrower provided a personal guarantor, the variable is coded 1; otherwise it is coded 0.
The variables related to the demographic and socio-economic conditions of the borrower are as follows.
Dependents: The number of persons who are dependent on the borrower.
Y-P-J: Years of experience in the present job.
Salary: The salary drawn by the borrower per month.
Living: The status of the residence where the borrower lives, which may be rented or owned. If owned, it is coded 1; if rented, it is coded 0.
Savings: The amount of money saved per month.
Cash: The amount of money the borrower holds in hand and at bank.
Net worth: The personal net worth of the borrower, calculated by subtracting total liabilities from total assets.
ACT: The total number of bank accounts belonging to the borrower in other banks.
Designation: The designation of the borrower's present job. Although data are collected on the designation, the variable is not included in the study because of the extreme diversity in designations.

Table 1. Group statistics
In table 1 and paragraph 4.2, we have statistically tested whether the group means are different or the same.

Table 2. Tests of equality of group means

Test of Equality of Covariance Matrices by Using Box's M
To estimate a valid discriminant function, an important assumption is that each group is a sample from a multivariate normal population and that the two groups have equal covariance matrices, although they have different mean values. The rank in table 3 is the size of the covariance matrices: 13 means a 13×13 matrix, 13 being the number of variables in the discriminant function. The log determinants are the natural logs of the determinants of the covariance matrices.

Table 3. Test of equality of covariance matrices by using Box's M
Note: Tests null hypothesis of equal population covariance matrices.

Table 4. Determine the significance of the discriminant function

Table 5. Structure matrix

Estimating the discriminant function coefficients is the main concern of this study. The unstandardized discriminant function coefficients are the multipliers of the variables when the variables are in their original units of measurement. By using the estimated coefficients, the required discriminant function, often called "the discriminator", is given as equation 2:

Table 7. Functions at Group Centroids

Table 9. Classification results (b,c)
a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
b. 100.0% of original grouped cases correctly classified.
c. 85.0% of cross-validated grouped cases correctly classified.

Classification Matrix of the Holdout Sample
The holdout sample is also used to check the validity of the model. After the values of the holdout sample are substituted into the estimated discriminant function, the Z values are computed for the cases. By using the Z values and the centroids, group membership is predicted. Table 10 shows that 70 per cent of the cases are correctly classified.

Table 11. Casewise statistics: holdout sample

Classification Matrix Using Holdout Sample as Analysis Sample
When the holdout sample is used as the analysis sample, the prediction matrix in table 12 is obtained. The matrix shows that 100 per cent of the original grouped cases and 90 per cent of the cross-validated grouped cases are classified correctly.

Table 12. Classification results (b,c): holdout sample as analysis sample

Table 13. Classification results (b,c): total sample as analysis sample
a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
b. 86.7% of original grouped cases correctly classified.
c. 76.7% of cross-validated grouped cases correctly classified.