Borrower Level Models for Stress Testing Corporate Probability of Default and the Quantification of Model Risk

This paper addresses the building of obligor level hazard rate corporate probability-of-default ("PD") models for stress testing, departing from the predominant practice in wholesale credit modeling of constructing segment level models for this purpose. We build models based upon a variety of financial, credit rating, equity market and macroeconomic factors using an extensive history of large corporate firms sourced from Moody's. We develop distance-to-default ("DTD") risk factors and design hybrid structural Merton/reduced form models as challengers to versions of the models containing only the other variables. We measure the model risk attributed to various modeling assumptions according to the principle of relative entropy and observe that the omitted variable bias with respect to the DTD risk factor, neglect of interaction effects and incorrect link function specification have the greatest, intermediate and least impacts, respectively. Our conclusion is that validation methods chosen in the stress testing context should be capable of testing model assumptions, given the sensitive regulatory uses of these models and concerns raised in the industry about the effect of model misspecification on capital and reserves. Our research is accretive to the literature by offering state of the art techniques as viable options in the arsenal of model validators, developers and supervisors seeking to manage model risk.


Introduction and Summary
The importance of stress testing in assessing the credit risk of bank loan portfolios has grown over time. Currently these exercises are accepted as the primary means of supporting capital planning, business strategy and portfolio management decision making (U. K. Financial Services Authority, 2008). Such analysis gives us insight into the likely magnitude of losses in an extreme but plausible economic environment conditional on varied drivers of loss. It follows that such activity enables the computation of unexpected losses that can inform regulatory or economic capital according to Basel III guidance (Basel Committee for Banking Supervision, 2011).
The standard manner in which credit models have been adapted for stress testing is through modification of probability of default ("PD") models at the disposal of financial market participants. PD models are meant to accurately measure an obligor's ability and willingness to meet future debt obligations over some horizon and are typically associated with a credit score or rating. The majority of PD risk rating methodologies or models currently used in the industry are characterized by a dichotomy of outcomes deemed point-in-time ("PIT") vs. through-the-cycle ("TTC"). In the so-called PIT rating philosophy such PD models should incorporate a complete set of borrower specific and macroeconomic risk factors that will measure default risk at any point in the economic cycle. In contrast, according to the TTC rating philosophy the model should abstract from the state of the economy or cyclical effects and measure default risk over a more extended time horizon that incorporates a variety of macroeconomic states. This TTC orientation implies that ratings derived from the model should show the feature of "stability", wherein material changes in ratings can be ascribed to fundamental as opposed to transient factors. While PIT PD models are typically deployed in loan pricing and early warning systems, TTC PD models feature prominently in credit underwriting and portfolio management applications.
More recently in this domain, in satisfaction of the Current Expected Credit Loss ("CECL"; Financial Accounting Standards Board, 2016 (Note 1)) accounting standards or compliance with the Federal Reserve's Dodd-Frank Stress Testing ("DFAST"; Board of Governors of the Federal Reserve System - "FRB", 2016 (Note 2)) program, we observe that the predominant types of models used in the industry differ slightly from the aforementioned applications. This is especially the situation prevailing for wholesale credit asset classes (e.g., C&I Large Corporate or Middle Market), where instead of using PIT PD models directly for stress testing, the practice is to add sensitivity to macroeconomic variables to TTC PD models. Such TTC PD models are commonly found in the revised Basel framework or in credit underwriting, as previously mentioned. A predominant manner in which TTC PD models are used in stress testing is through a rating transition model construct (Cihak et al., 2020), in which modeling occurs at the rating level and credit ratings are aggregated for different modeling segments across a bank's portfolio.
This research is distinguished from the previous literature by utilization of an obligor level and dynamic modeling framework that considers financial, credit rating and macroeconomic variables that are time varying (Note 3). We estimate these models over a history that contains several economic cycles and apply them to a CECL stress testing exercise. We implement this exercise through construction of discrete time survival models of default, a class of dynamic PD models, utilizing a dataset of corporate ratings and defaults sourced from Moody's. This methodology has the benefit of accommodating the discrete character of our data (being quarterly snapshots) while also featuring a relatively less complex estimation algorithm as compared with other models in this class. Such discrete time survival models have featured in the prediction of corporate defaults, albeit not featuring macroeconomic risk factors or applied to stress testing (Shumway, 2001; Cheng et al., 2010).
In this study we employ a large corporate ($1 Billion in sales or greater and U.S. domiciled) agency rated obligor level historical dataset provided by Moody's that covers the period 1990-2015. We build a modeling dataset featuring an exhaustive array of covariates spanning the categories of equity market (i.e., a structural Merton model style distance-to-default ("DTD") measure), financial, duration (i.e., time from observation point to horizon), credit quality (i.e., TTC agency PD ratings) and macroeconomic. The target variable is a 1-quarter default indicator, and we construct discrete time hazard rate models of PD, featuring a modeling data design that allows for a computationally efficient estimation technique. Following the extant literature, the DTD risk factor is derived from equity prices and accounting measures of leverage, from which we design hybrid structural-reduced form models as challengers, which are compared to the versions of these models containing only the other variables. It is demonstrated that the challenger models result in improved model performance (i.e., measures of discriminatory power and predictive accuracy of PD levels), comparable quality of CECL scenario forecasts, and that introduction of the structural DTD risk factor does not result in the other variables being rendered statistically insignificant.
In the second major part of this study we quantify the degree of model risk due to several forms of model misspecification or violations of model assumptions utilizing the principle of relative entropy. This methodology studies the distance of an alternative to a reference model according to some suitable loss metric (Hansen & Sargent, 2007;or Glasserman & Xu, 2013) and can capture dimensions of model uncertainty error beyond parameter estimation error. We observe that omitted variable bias due to leaving out DTD is maximally impactful, that incorrect specification of the logistic regression link function is of minimal importance, and that the model risk attributable to neglected interaction terms amongst the explanatory variables is of intermediate influence.
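The relative entropy technique can be illustrated with a minimal sketch, following the exponential-tilting characterization of worst-case model error in Glasserman and Xu (2013). The loss sample and the entropy budget below are illustrative assumptions, not values from this study:

```python
import numpy as np

def worst_case_loss(losses, eta, theta_hi=50.0, tol=1e-10):
    """Worst-case expected loss over alternative models within a relative
    entropy (KL) budget eta of the reference model.  The worst-case
    alternative has likelihood ratio m proportional to exp(theta * L);
    we bisect for the theta whose tilted measure exhausts the budget."""
    losses = np.asarray(losses, dtype=float)

    def kl_at(theta):
        w = np.exp(theta * (losses - losses.max()))   # stabilized weights
        m = w / w.mean()                              # change of measure
        return np.mean(np.where(m > 0, m * np.log(m), 0.0))  # KL(Q||P)

    lo, hi = 0.0, theta_hi                            # KL rises with theta >= 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_at(mid) < eta:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    w = np.exp(theta * (losses - losses.max()))
    m = w / w.mean()
    return np.mean(m * losses)                        # E_Q[L] at the boundary
```

With a zero entropy budget the routine recovers the reference expected loss; as the budget grows, probability mass is tilted toward the adverse outcomes, bounding the impact of misspecification without committing to a particular alternative model.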
The remainder of this study will proceed as follows. The subsequent Section 2 constitutes a review of the literature relevant to the subject matter of this paper, namely studies on hazard rate modeling to predict binary outcomes, either the probability of occurrence or else the duration to the time of the event, and we specify the discrete time version within this class of models that we employ in the PD context. Section 3 is an outline of the methodology for our modeling exercise, where we discuss the general framework and then different sub-classes of this, leading to the particular technique employed in this research. In Section 4 we present the empirical results of this study, including descriptive statistics of the modeling dataset, estimation results and model performance metrics, and the exercise in which we quantify model risk. Finally, in Section 5 we summarize our study and discuss future directions of extending this line of inquiry.

Review of the Literature
Rating or scorecard models historically have focused on estimating the PD as opposed to the severity of losses in the event of default or loss-given-default ("LGD"). Default is usually defined as a "failure" such as bankruptcy, liquidation, failure to pay, being deemed unlikely to pay, etc. This construct does not consider downgrades or upgrades in credit ratings as considered in mark-to-market ("MTM") models of credit risk. These default mode ("DM") credit risk models project credit losses only due to events of default, as opposed to MTM models that consider all credit quality changes as credit events. Amongst such DM models we can identify three broad categories: expert-based systems (e.g., artificial neural networks), risk rating methodologies (e.g., agency credit ratings from S&P or Moody's) and credit scoring models (e.g., scorecards developed by banks or FICO scores).
The PD scoring model is the most prevalent amongst credit measurement methodologies used historically. One of the first models in this class is multiple discriminant analysis ("MDA"), as illustrated by the classic paper of Altman (1968). These types of models have the advantage of being cost efficient to deploy and are less subject to the subjectivity and inconsistency observed in expert systems. Altman and Narayanan (1997) documented how these models became prevalent across the industry and academia and conclude that the similarities across applications are more pronounced than the differences. Another class of credit scoring models that is now most widespread is the logistic regression model ("LRM"; Hosmer et al., 1997), a prime example of which being the RiskCalc TM model of the vendor Moody's Analytics (Dwyer et al., 2004), used for commercial credit risk and considered an industry standard.
In the seminal paper of the structural approach to credit risk, Merton (1974) models equity in a firm having leverage as a call option on assets where the strike price coincides with the face value of the debt. The PD is derived by solving for the option value numerically with the unobserved asset value and its volatility, given the quantity of debt and a valuation horizon. The product of this process is a firm's distance-to-default ("DTD"), which represents the number of standard deviations separating the asset value from the debt repayment value, and which is inversely related to the PD. The CreditEdge TM public firm model developed by the vendor Moody's Analytics is a well-established implementation of this framework that also uses historical default rates to empirically calibrate the output and produces the expected default frequency ("EDF"). Since the EDFs are ultimately based upon equity prices, a consequence is a heightened sensitivity to the changing financial state of the obligor, in contrast to agency credit ratings that are more reliant on static data available at the time of underwriting or periodic reassessment of the borrower.
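The DTD calculation can be illustrated with a minimal sketch. The asset value, asset volatility and drift below are illustrative assumptions; in practice they are unobserved and must be backed out from equity prices and equity volatility via the option pricing relationship:

```python
from math import log, sqrt, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def merton_dtd(asset_value, debt_face, asset_vol, mu, horizon):
    """Distance-to-default in the Merton (1974) framework: the number of
    standard deviations by which the log asset value is expected to
    exceed the default point (face value of debt) at the horizon.
    The PD is the probability of the asset value falling below it."""
    dtd = (log(asset_value / debt_face)
           + (mu - 0.5 * asset_vol ** 2) * horizon) / (asset_vol * sqrt(horizon))
    pd = norm_cdf(-dtd)          # PD is inversely related to DTD
    return dtd, pd
```

For example, a firm with assets worth 1.5 times the face value of its debt, 25% asset volatility and a 5% drift has a one-year DTD of about 1.70 and a PD of roughly 4.5%, and a larger asset cushion raises DTD and lowers the PD.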
Currently used credit risk modelling frameworks arise from the Merton-structural approach just discussed and an alternative reduced form framework originated by Jarrow and Turnbull (1995) and Duffie and Singleton (1999), which utilizes intensity-based models to estimate stochastic hazard rates. This school of thought differs in the methodology employed in estimating PDs. While the structural-Merton approach considers an economic process that produces defaults, the reduced form approach extracts a random intensity process that generates defaults from the prices of defaultable debt. A prominent example of a model in this class is the proprietary Kamakura Risk Manager ("KRM"), which incorporates an econometric methodology based upon Chava and Jarrow (2004). This so-called Jarrow-Chava model ("JCM") is sometimes called a hybrid approach, in that it combines the direct modeling of default as in the LRM with the use of either equity or debt market data, and in the case of traded debt instruments this construct has the potential to control for the distorting effects of illiquidity on the measurement of default risk. Note that a critique of the JCM is that the presence of anomalies such as embedded options in the debt markets can adversely impact the accuracy of these models. In this study, we circumvent this limitation by combining the use of fundamental factors as in the LRM with equity market information as in the Merton-structural model in our version of a hybrid hazard rate model, as will be detailed in the model methodology section of this paper.
Stress testing based upon hypothetical scenarios, usually a blend of macroeconomic projections and the application of judgmental elements, has become a prevalent tool in the supervision of financial institutions (FSA, 2008). The qualitative aspect of the stress testing process is considered by some as a deficiency, as illustrated by the critique of the exercise conducted by the U.S. supervisors during the financial crisis of the late 2000s, where the projection of unemployment in the 2009 adverse scenario fell short of the realization of this factor in under a year and therefore was deemed to be insufficiently severe (FRB, 2009). In the analysis of this incident Baker (2009) claims that the supervisors may have under-predicted loan losses on the order of $120 billion, placing him among those who conclude that this surprise on the part of the regulators is evidence that these tests are failures. In the retail credit risk context, Haldane (2009) attributes this weakness to either the omission of, or the underprediction of the impact of, risk factors in the dynamic macroeconomic model that was used (so-called disaster myopia). The author also points to the phenomenon termed misaligned incentives, meaning that institutions had no intention of designing realistic stress tests. Jacobs (2019) shows that in addition to these downward biases, the prevalent econometric methodologies used by many supervisors and banks may be subject to heightened inaccuracies, which are attributed to a misspecified dependence structure between risk factors.

Model Methodology and Conceptual Framework
Let us denote by $t$ calendar time, decomposed as $t_i = a_i + \tau_i$, where $a_i$ is the time of origination (i.e., the date at which the snapshot is measured) and $\tau_i$ the duration time (i.e., the time from origination to the measurement of the snapshot) for an obligor $i$, $i = 1, \ldots, N$. We may record time at various granularities, in the case of C&I borrowers either quarterly or annually, and while the spacing may be irregular due to the exact timing of when financial statements are spread, in general these will be in multiples of a quarter, so that in reality we are dealing with discrete sampling.
A variable subscripted by $i$ differs amongst borrowers but not temporally (e.g., a segmentation characteristic of a borrower such as a scorecard or industry group), a variable subscripted by $t$ differs through calendar time but is common amongst obligors (e.g., a macroeconomic variable), and a term subscripted by $it$ may be permitted to vary both temporally and cross-sectionally (e.g., a financial ratio). The corresponding risk factors or covariates are given by $w_i$, $z_t$ and $x_{it}$, while the respective parameter vectors are given by $\beta_1$, $\beta_2$ and $\beta_3$. The terms $\gamma_{12}$, $\gamma_{13}$ and $\gamma_{23}$ denote matrices of interaction term parameters between these three sets of risk factors to be estimated.
The following describes a rather general and stylized econometric model of PD with respect to obligor $i$ at a discrete time period $t$, upon which we will later impose restrictions to arrive at a representation of the credit risk models used in practice. Let us denote by $y_{it}^{*}$ a continuous latent variable representing the "utility" gained by the default of borrower $i$ in period $t$. We define the event of default as $y_{it} = 1$ if $y_{it}^{*} > 0$ and of non-default as $y_{it} = 0$ if $y_{it}^{*} \leq 0$. Suppose that this latent variable is a function linear in the risk factors and their interaction terms plus a residual term $\varepsilon_{it}$ and a borrower specific intercept term $\beta_{0i}$:

$$y_{it}^{*} = \beta_{0i} + \beta_1^{T} w_i + \beta_2^{T} z_t + \beta_3^{T} x_{it} + w_i^{T} \gamma_{12} z_t + w_i^{T} \gamma_{13} x_{it} + z_t^{T} \gamma_{23} x_{it} + \varepsilon_{it} \qquad (1)$$

and then define the conditional probability of default as

$$\Pr\left(y_{it} = 1 \mid w_i, z_t, x_{it}\right) = F\left(\beta_{0i} + \beta_1^{T} w_i + \beta_2^{T} z_t + \beta_3^{T} x_{it} + w_i^{T} \gamma_{12} z_t + w_i^{T} \gamma_{13} x_{it} + z_t^{T} \gamma_{23} x_{it}\right) \qquad (2)$$

where the distribution function of $\varepsilon_{it}$ is given by $F(\cdot)$. The variables subscripted by time may include lags of variable lengths.
Through the imposition of restrictions or assumptions governing various modeling aspects it may be demonstrated how this construct subsumes many PD models currently employed by practitioners. A canonical case is obtained through restriction of all the interaction terms to be zero, which gives rise to the typical PIT PD model used in early warning or credit portfolio management:

$$\Pr\left(y_{it} = 1 \mid w_i, z_t, x_{it}\right) = F\left(\beta_{0i} + \beta_1^{T} w_i + \beta_2^{T} z_t + \beta_3^{T} x_{it}\right) \qquad (3)$$

where in this setting $t$ is usually a horizon spanning from 1 month up to 1 year and the function $F(\cdot)$ is typically the logistic link function of the LRM (Hosmer et al., 1997):

$$\mathrm{logit}\left(\Pr\left(y_{it} = 1 \mid w_i, z_t, x_{it}\right)\right) = \log\left(\frac{\Pr\left(y_{it} = 1 \mid w_i, z_t, x_{it}\right)}{1 - \Pr\left(y_{it} = 1 \mid w_i, z_t, x_{it}\right)}\right) = \beta_{0i} + \beta_1^{T} w_i + \beta_2^{T} z_t + \beta_3^{T} x_{it} \qquad (4)$$

It is common amongst practitioners to apply a linear transformation to the left-hand-side of (4), deriving a quantity interpreted as a "score", where an alternative means of deriving this is through scaling the logit estimate.
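A minimal sketch of this restricted model under the logistic link, with hypothetical coefficient values and an assumed score scaling (the offset and factor below are illustrative, not calibrated values):

```python
import numpy as np

def pit_pd(w_i, z_t, x_it, b0, b1, b2, b3):
    """PIT PD of the restricted model (no interaction terms): the
    logistic link applied to the linear index in the static (w),
    macroeconomic (z) and dynamic obligor (x) risk factors."""
    index = b0 + np.dot(b1, w_i) + np.dot(b2, z_t) + np.dot(b3, x_it)
    return 1.0 / (1.0 + np.exp(-index))        # logistic link F

def score(pd, offset=600.0, factor=-20.0):
    """Practitioner-style linear transformation of the logit into a
    'score'; offset and factor are hypothetical scaling choices.  With
    factor < 0, a higher score corresponds to lower default risk."""
    return offset + factor * np.log(pd / (1.0 - pd))
```

With one variable in each category, `pit_pd([1.0], [0.05], [0.3], -4.0, [0.5], [2.0], [1.0])` evaluates the linear index and maps it through the logistic CDF; applying `score` then rescales the log-odds onto a familiar scorecard range.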
An important observation is that the model in equations (3) and (4) is suitable for an unbalanced panel dataset, which is exactly the format of data typical of the C&I asset classes. What this data design means is that there will be defaulted obligors over the prediction horizon whose subsequent performance is unobservable or otherwise recorded at alternative non-contiguous calendar times in the modeling dataset. Among the various means of model estimation in this setting is the survival analysis (Kalbfleisch & Prentice, 2002; Cox & Oakes, 1984) framework prevalent in retail credit risk modelling, where the approximation of discrete time as continuous is more likely to be realistic, as opposed to the wholesale setting where this assumption is more likely to be tenuous.
We first review the case of the continuous time survival model, where the target quantity is the instantaneous probability of transitioning from one state (e.g., an obligor with performing loans) to another (e.g., default). Denoting by $T_i$ the duration time to obligor default, this implies that the conditional PD in the next instant, given that the obligor is currently not in default, may be represented by the hazard function of the duration time $\tau$:

$$\lambda_i(\tau) = \lim_{\Delta \tau \to 0} \frac{\Pr\left(\tau \leq T_i < \tau + \Delta \tau \mid T_i \geq \tau\right)}{\Delta \tau} \qquad (5)$$

It follows that the survival probability (i.e., the probability of not being in default over some time interval) may be expressed as the following integral representation of the hazard function:

$$S_i(\tau) = \Pr\left(T_i \geq \tau\right) = \exp\left(-\int_0^{\tau} \lambda_i(s)\,ds\right) \qquad (6)$$

A popular approach in both the academic literature and consumer credit practice is the Cox proportional hazard model (Cox, 1972; "CPHM") to estimate the hazard function as well as the related survival probabilities. The CPHM further admits the inclusion of dynamic covariates and can be expressed as

$$\lambda_i(\tau) = \lambda_0(\tau) \exp\left(\beta_3^{T} x_{i\tau} + \beta_2^{T} z_{a_i + \tau}\right) \qquad (7)$$

where the risk factors $x_{i\tau}$ are obligor specific and dynamic across the time of duration (e.g., financial ratios), the $z_{a_i+\tau}$ are risk factors varying over absolute time but constant over the cross-section (e.g., macroeconomic factors) and $\lambda_0(\tau)$ is a baseline hazard function of time which models the evolution of default risk independently of the other risk factors (i.e., the intuition being that this is a time-dependent residual of sorts). In this construct, forecasts of obligor financial or macroeconomic conditions after the beginning of a forecast period propagate through all subsequent time periods and influence the hazard function and survival probabilities through the entire forecast horizon.
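The hazard-to-survival relationship can be illustrated numerically. The sketch below assumes a constant baseline hazard and a single static covariate (all parameter values hypothetical), so the trapezoidal integration of the hazard recovers the exponential survival curve exactly:

```python
import numpy as np

def survival_from_hazard(hazard, t_grid):
    """Survival probability S(t) = exp(-integral_0^t lambda(s) ds),
    evaluated on a grid by trapezoidal integration of the hazard."""
    lam = np.array([hazard(s) for s in t_grid])
    cum = np.concatenate([[0.0],
                          np.cumsum(0.5 * (lam[1:] + lam[:-1]) * np.diff(t_grid))])
    return np.exp(-cum)

# Cox proportional hazards form with a constant baseline (assumed values):
# lambda_i(tau) = lambda_0 * exp(beta * x_i)
lam0, beta, x_i = 0.02, 1.5, 0.4
haz = lambda s: lam0 * np.exp(beta * x_i)
t = np.linspace(0.0, 10.0, 101)
S = survival_from_hazard(haz, t)   # monotonically decreasing from 1.0
```

In the time-varying covariate case the same routine applies with `haz` depending on `s`, which is how forecasts of macroeconomic conditions propagate through the survival probabilities over the projection horizon.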
There are various ways in which standard LRMs are inferior to survival models. First, survival models admit PD prediction over arbitrary forecast horizons apart from the window of the default flag used in developing the LRM, such as the 1-year horizon of most PIT PD models, and as such are inherently suitable for applications such as CECL or CCAR loss forecasting. In addition, unlike in LRMs, the PD predictions in survival models are conditional on not previously having been in default, the absence of which makes LRMs less suitable for exercises such as loss forecasting. Finally, as survival probabilities are available across the entire projection horizon, there is the potential for an application to profitability forecasting and/or as a challenger to an economic capital construct.
In Equation (3) we represent a discrete time panel model of binary choice. We previously alluded to the fact that it is most common for financial institutions to use panel data sample designs that could be used to model dynamic risk factors. Essentially, this involves estimating LRMs for a set of fixed horizons but with a different model for each forecast window, for example for quarters 1 through 12 (i.e., 3 years of 4 quarters each) for CECL/CCAR applications. In fact, this is the approach taken by the vendor Kamakura in their corporate PD model, with the obvious disadvantage that this is extremely resource intensive and requires a specialized software infrastructure. Another issue with this approach is the degradation of performance as the default horizon lengthens, which may be undesirable in an application like CECL where there is a premium on accuracy in the sense of accurately predicting the level of default rates, apart from discriminating defaults from non-defaults.
An alternative to estimating an LRM for each forecast horizon as just described, which in some cases gives identical results as seen in the academic literature and in retail credit risk modeling, and which can be implemented in standard software packages (e.g., Python scikit-survival), is as follows. As we have a discrete time panel data sample design, then given a modeling dataset design matrix of appropriate form, a discrete survival model may be estimated with a hazard function of the following form:

$$h_{i\tau} = \Pr\left(T_i = \tau \mid T_i \geq \tau\right) = 1 - \frac{S_i(\tau)}{S_i(\tau - 1)} \qquad (8)$$

where we denote by $h_{i\tau}$ the discrete hazard rate for the $i$th obligor and by $S_i(\cdot)$ the associated survival probability. Cox (1972) has proposed the following specification of this relationship:

$$\mathrm{logit}\left(h_{i\tau}\right) = \log\left(\frac{h_{i\tau}}{1 - h_{i\tau}}\right) = \alpha_{\tau} + \beta_3^{T} x_{i\tau} \qquad (9)$$

where the logit function (or log-odds ratio function) is the inverse of the logistic link function $F(u) = \left(1 + \exp(-u)\right)^{-1}$ and the discrete baseline hazard function is given by $\alpha_{\tau} = \mathrm{logit}\left(h_{0\tau}\right)$. Jenkins (1995) proposes estimating the model (9) through specifying indicator variables corresponding to each time interval, while Singer and Willett (1993) suggest using functions of the duration time index to capture this effect as a legitimate approach. In either case, we define a default indicator that is zero in all intervals where default is not observed and unity during the period where a default occurs; subsequent to default the obligor does not appear in the dataset so long as the default is not cured, in which case the obligor reappears in the dataset as a performing entity and the indicator is reset to zero.
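The person-period data construction and the resulting discrete hazard estimates can be sketched as follows. The obligor records are hypothetical, and the nonparametric hazard shown is the saturated, covariate-free version of the interval-dummy specification, before risk factors are added:

```python
def person_period(obligors):
    """Expand duration data into obligor-period records carrying the
    default indicator described in the text: zero in every interval
    without default, one in the default interval, with post-default
    periods dropped.

    obligors : list of (obligor_id, duration_in_quarters, defaulted)
    """
    rows = []
    for oid, dur, dflt in obligors:
        for tau in range(1, dur + 1):
            y = 1 if (dflt and tau == dur) else 0
            rows.append({"id": oid, "tau": tau, "default": y})
    return rows

def discrete_hazard(rows):
    """Nonparametric discrete hazard h_tau = defaults / at-risk in each
    interval -- the estimate implied by fitting interval indicator
    variables alone, as in the Jenkins (1995) setup."""
    at_risk, events = {}, {}
    for r in rows:
        at_risk[r["tau"]] = at_risk.get(r["tau"], 0) + 1
        events[r["tau"]] = events.get(r["tau"], 0) + r["default"]
    return {tau: events[tau] / at_risk[tau] for tau in sorted(at_risk)}
```

Fitting an LRM of the default indicator on interval dummies plus covariates over the `person_period` rows is then exactly the estimation of specification (9).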
The specifications discussed in this section prior to the models in (8) and (9) have all assumed that time is continuous, but as we have pointed out this is not the case in reality, and moreover in the C&I asset class with quarterly observations this is likely to not be a realistic setup. Stepanova and Thomas (2002) find, in spite of this argument, that estimation results are not far off between assuming continuous or discrete time. This may not be surprising, as Kalbfleisch and Prentice (2002) show that as the observation intervals tend to zero the discrete and continuous time models do indeed converge. Nevertheless, we believe that the observation that the continuous time model is an adequate approximation could be highly dependent on the particular dataset, and in our case we find the differences to be material in terms of both the coefficient estimates and measures of model performance, which leads us to favor employing a discrete time model.
Finally, we come to the approach used to estimate the hazard models in this research, which is based upon a paper by Houwelingen and Putter (2008) in the biostatistics literature. The authors model survival probabilities for acute lymphocytic leukemia patients post transplantation of bone marrow at a 5-year horizon. This research proposes a landmark methodology and compares it to an established multi-state modeling methodology in biostatistics. The authors show that this technique greatly simplifies the modeling methodology, as it reduces to LRM estimation on the so-called snapshotted dataset and leads to easy to interpret prediction rules. In the snapshot data sample design, for variables that have a different frequency (e.g., quarterly macroeconomic variables but annual or semi-annual financial variables), for each instance of the former we create a time series that evolves while the latter is frozen.
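The snapshot construction just described can be sketched as follows, assuming quarterly macroeconomic variables and less frequently reported financial statements (quarter labels and variable names are illustrative):

```python
def snapshot_dataset(quarters, macro_by_q, fin_by_q):
    """Build a snapshotted design in which macroeconomic variables
    evolve quarterly while financials, reported only in some quarters,
    are frozen (carried forward) from the most recent statement.

    quarters   : ordered quarter labels
    macro_by_q : dict quarter -> dict of macro variables
    fin_by_q   : dict quarter -> dict of financials (sparse)
    """
    rows, last_fin = [], None
    for q in quarters:
        if q in fin_by_q:           # a new statement was spread
            last_fin = fin_by_q[q]
        if last_fin is None:
            continue                # no financials yet: drop the record
        rows.append({"quarter": q, **macro_by_q[q], **last_fin})
    return rows
```

Each retained row pairs the evolving macroeconomic state with the frozen obligor financials, which is the restructured dataset on which the LRM is then estimated.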
At each landmark (or snapshot, in our terminology) point a simple Cox constant baseline hazard model is fit on the interval, which is mathematically equivalent to estimating an LRM on the restructured dataset. This is computationally convenient, in line with the methodologies used in the industry for PD scorecard development, does not assume continuous time, is less computationally intensive than the panel logistic approach of Kamakura and allows for simple implementation within standard software such as Python. Our approach can be expressed mathematically as a modification of the LRM of Equation (4), estimated over the snapshotted dataset in which, at each snapshot date, the most recently available obligor level variables are carried forward and paired with the evolving macroeconomic variables over the forecast interval. However, a downside of this approach is that the model fit statistics available in standard software cannot be used directly to test the statistical significance of the parameter estimates. The correct standard errors can be obtained by taking into account the clustering of the data (i.e., each snapshot is effectively a separate case, several such exist for each obligor, and there is correlation over time in the former and cross-sectionally in the latter) using the so-called sandwich estimators of Lin and Wei (1989). This approach is incorporated in software packages such as SAS (the GENMOD or SURVEYLOGISTIC procedures) or Python (the generalized estimating equations in the statsmodels library). However, such approaches are very computationally intensive and exhibit stability issues in the case of highly unbalanced panels, where the latter means nothing more than defaults being relatively rare and concentrated over time. In this research we instead obtain standard errors through bootstrapping, where we impose the proper stratification to preserve the correlation structure of the data, which is straightforward to implement in Python due to vectorized operations.
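The stratified bootstrap can be sketched as follows: whole obligors (clusters) are resampled with replacement so that the within-obligor correlation of the snapshotted records is preserved. The statistic and data layout below are illustrative assumptions:

```python
import numpy as np

def cluster_bootstrap_se(stat, data_by_obligor, n_boot=500, seed=0):
    """Bootstrap standard error that resamples whole obligors with
    replacement, keeping each obligor's records together so the
    within-cluster correlation structure is preserved.

    stat            : function mapping a list of records to a scalar
    data_by_obligor : dict obligor_id -> list of records
    """
    rng = np.random.default_rng(seed)
    ids = list(data_by_obligor)
    stats = []
    for _ in range(n_boot):
        draw = rng.choice(len(ids), size=len(ids), replace=True)
        sample = [rec for k in draw for rec in data_by_obligor[ids[k]]]
        stats.append(stat(sample))
    return float(np.std(stats, ddof=1))
```

In the application, `stat` would refit the snapshotted LRM and return a coefficient of interest; a naive record-level bootstrap would instead break the clusters apart and understate the standard errors.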

Description of Modeling Data
The following data are also used for the development of the models in this study:

 The Center for Research in Security Prices TM ("CRSP") U.S. Stock Databases
This product comprises a database of historical daily and monthly market and corporate action data for over 32,000 active and inactive securities with primary listings on the NYSE, NYSE American, NASDAQ, NYSE Arca and Bats exchanges, and includes CRSP broad market indexes.
A series of filters is applied to this Moody's population to construct a population that is closely aligned with the U.S. large corporate segment of companies that are publicly rated and have publicly traded equity. In order to achieve this using Moody's data, a combination of NAICS and GICS industry codes, regional codes and a historical yearly Net Sales threshold is used: 1) Non-C&I obligors defined by the NAICS codes below are excluded (see Table 1). 7) Records that are too close to a default event are not included in the development dataset, which is an industry standard approach, the rationale being that the records of an obligor in this time window do not provide information about future defaults of the obligor, but are more likely to reflect the existing problems that the obligor is experiencing. This restriction corrects a range of timing issues between when statements are issued and when ratings are updated.
8) In general, the defaulted obligors' financial statements after default date are not included in the modeling dataset. However, in some cases obligors may exit a default state or "cure" (e.g., emerge from bankruptcy), in which case the statements between default date and cured date are not included.
In our opinion, these data exclusions are reasonable, in line with industry standards, sufficiently documented and do not compromise the integrity of the modeling dataset. The model development time period considered for the Moody's data is 1Q91-4Q15. Shown in Table 1 above is the comparison of the modeling population by GICS industry sectors, where for each sector the defaulted obligors columns represent the percent of defaulted obligors in the sector out of the entire population. The data are concentrated in Consumer Discretionary (20%), Industrials (17%), Tech Hardware and Communications (12%), and Energy except E&P (11%). A similar industry composition is shown above in Table 2 according to the NAICS classification system. The model development dataset contains financial ratios and default information that are based upon the most recent data available from DRS TM, Compustat TM and bankruptcydata.com, so that the data are timely and a priori should be given the benefit of the doubt with respect to favorable quality. Furthermore, the model development time period of 1Q91-4Q15 spans two economic downturn periods and a complete business cycle, the length of which is another factor supporting a verdict of good quality. Related to this point, we plot the yearly and quarterly default rates in the model development dataset, as shown above in Figure 1.
In Table 3 above we present the summary statistics for the variables that appear in our final models. These final models were selected using an exhaustive search algorithm in conjunction with 5-fold cross-validation, and we retain the three leading models incorporating, and the three omitting, the DTD risk factor. The following are the categories and names of the explanatory variables appearing in the final candidate models (Note 13):
 Macroeconomic variables: S&P, DOW, NFP, UNP and SPR
 Duration: TSS
 Credit rating: the PD rating
 Financial ratios: measures of borrower liquidity, including their interactions with time
 Equity market: the Merton DTD risk factor
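The model selection procedure referenced above can be sketched in a few lines. The following is a minimal Python illustration on simulated data; the variable names, the Newton-Raphson fitting routine and the choice of out-of-fold AUC as the selection criterion are our assumptions for exposition, not the authors' exact procedure.

```python
import itertools
import numpy as np

def fit_logit(X, y, iters=25):
    """Fit a logistic regression by Newton-Raphson; returns coefficients."""
    Xb = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))
        grad = Xb.T @ (y - p)
        hess = (Xb * (p * (1 - p))[:, None]).T @ Xb
        beta += np.linalg.solve(hess + 1e-6 * np.eye(len(beta)), grad)
    return beta

def auc(y, score):
    """Area under the ROC curve via the Mann-Whitney rank statistic."""
    order = np.argsort(score)
    ranks = np.empty(len(score))
    ranks[order] = np.arange(1, len(score) + 1)
    n1 = y.sum()
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2.0) / (n0 * n1)

def exhaustive_cv_search(X, y, names, k=5, max_vars=2, seed=0):
    """Score every candidate variable subset by mean out-of-fold AUC."""
    folds = np.random.default_rng(seed).integers(0, k, size=len(y))
    results = []
    for r in range(1, max_vars + 1):
        for cols in itertools.combinations(range(X.shape[1]), r):
            oof = []
            for f in range(k):
                tr, te = folds != f, folds == f
                beta = fit_logit(X[tr][:, list(cols)], y[tr])
                eta = np.column_stack([np.ones(te.sum()), X[te][:, list(cols)]]) @ beta
                oof.append(auc(y[te], eta))
            results.append((float(np.mean(oof)), [names[c] for c in cols]))
    return sorted(results, reverse=True)

# Simulated example: default driven by two of three candidate risk factors
rng = np.random.default_rng(7)
X = rng.standard_normal((1000, 3))
pd_true = 1.0 / (1.0 + np.exp(-(-2.0 + 1.5 * X[:, 0] + 1.0 * X[:, 1])))
y = (rng.random(1000) < pd_true).astype(float)
ranked = exhaustive_cv_search(X, y, names=["DTD", "LIQ", "NOISE"])
best_auc, best_vars = ranked[0]
```

In this sketch the search correctly prefers subsets containing the true drivers over those containing only the noise variable; in practice the candidate set and folds would be the paper's obligor-level panel.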
Econometric Specifications and Model Validation
In the subsequent tables we present the estimation results and in-sample performance statistics for our final models. As tabulated in Tables 4, 6 and 8, we show the three leading models having the DTD risk factors included with the other explanatory variables; whereas in Tables 5, 7 and 9 we show the three best models that omit the DTD variable.
Across models, signs of coefficient estimates are in line with economic intuition, and significance levels are indicative of very precisely estimated parameters. The macroeconomic variables that are associated with improving economic conditions (S&P, NFP and DOW) have negative signs, while those that indicate deteriorating conditions (UNP and SPR) have positive signs. The duration variable TSS has a positive sign, which is consistent with the intuition that on an unconditional basis default risk increases over time, given that the preponderance of the obligors in the sample are rated better than speculative grade. The sign on the PD rating is positive, which makes sense in that worse-rated obligors have higher default risk. The financial ratios measuring borrower liquidity all have negative signs, as greater levels of such resources diminish the chances of a default, while the interaction terms with time are positive, the latter indicating sensibly that the efficacy of this factor decays over time. Finally, the negative signs on DTD indicate that firms further away from their default points have lower default risk, as expected.
AUC statistics indicate that the models have strong ability to rank order default risk, where the associated ROC curves are shown in Figures 2, 6 and 10 (4, 8 and 12) for the three leading models having the DTD risk factors included with (excluded from) the other explanatory variables. Regarding measures of predictive accuracy, in all cases the pseudo R-squared ("PR2") indicates that all models exhibit good fit, which is confirmed by the plots of the predicted PD versus the default rates over time as shown in Figures 3, 7 and 11 (5, 9 and 13) for the three leading models having the DTD risk factors included with (excluded from) the other explanatory variables. As expected, the AIC and PR2 predictive accuracy measures deteriorate when the DTD risk factors are omitted, but this rank ordering does not carry over to the AUC discriminatory power measure, except in the case of the first leading model of each type.
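The pseudo R-squared is not defined in this excerpt; a common choice for binary-outcome models of this kind is McFadden's measure, sketched below in Python. The use of the McFadden definition here is our assumption for illustration.

```python
import numpy as np

def log_likelihood(y, p):
    """Bernoulli log-likelihood of observed defaults y under predicted PDs p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def mcfadden_pr2(y, p_model):
    """McFadden pseudo R-squared: 1 - LL(model) / LL(intercept-only model)."""
    p_null = np.full(len(y), y.mean())  # null model predicts the raw default rate
    return 1.0 - log_likelihood(y, p_model) / log_likelihood(y, p_null)

# Toy check: informative PDs score above the uninformative base rate
y = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], dtype=float)
good = np.where(y == 1, 0.6, 0.1)   # PDs that track realized defaults
flat = np.full(len(y), y.mean())    # base-rate PDs
```

By construction, `mcfadden_pr2(y, flat)` is zero, while PDs that discriminate between defaulters and non-defaulters yield a value strictly between zero and one.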
In Figures 14 through 19 we show the 12-quarter baseline and adverse scenario macroeconomic forecasts for the models, the average PDs in the top panels, and the average model log-odds scores in the lower panels. These scenarios are sourced from Moody's Analytics as of 4Q21. We observe that while all the models show a reasonable pattern of stress in the adverse relative to the baseline scenarios, the patterns exhibit significant variation across models, so that final model selection would depend upon which of these patterns are deemed preferable by business and risk management experts, apart from statistical performance.
Figure 10. Hazard rate regression receiver operating curve - Moody's large corporate financial, macroeconomic, credit quality, duration and Merton distance-to-default explanatory variables 1-quarter default model 3
Figure 11. Hazard rate regression accuracy plot - Moody's large corporate financial, macroeconomic, credit quality, duration and Merton distance-to-default explanatory variables 1-quarter default model 3

The Quantification of Model Risk According to the Principle of Relative Entropy
In building risk models we are subject to errors from model risk, one source of which is the violation of modeling assumptions. In this section we apply a methodology for the quantification of model risk that is a tool for building models robust to such errors. A key objective of model risk management is to assess the likelihood, exposure and severity of model error, given that all models rely upon simplifying assumptions. It follows that a critical component of an effective model risk framework is the development of bounds upon model error resulting from the violation of modeling assumptions. This measurement is based upon a reference or nominal risk model and is capable of rank ordering the various model risks, as well as indicating which perturbation of the model has the maximal effect upon some risk measure.
In line with the objective of managing model risk in the context of obligor-level PD stress testing, we calculate confidence bounds around forecasted PDs spanning model errors in a vicinity of a nominal or reference model, defined by a set of alternative models. These bounds can be likened to confidence intervals that quantify sampling error in parameter estimation; however, they are instead a measure of model robustness, capturing model error due to the violation of modeling assumptions. In contrast, a standard error estimate conventionally employed in managing credit portfolios does not achieve this objective, as this construct relies on an assumed joint distribution of asset returns or default correlations.
We meet the objective referenced previously in the context of stressed PD modeling by bounding a measure of loss, in this case the scenario PD forecasts, reflecting a plausible level of model error. We have observed that while amongst practitioners one alternative means of measuring model risk is to consider challenger models, an assessment of estimation error, or of sensitivity to parameter perturbations, is in fact more prevalent, and it captures only a very narrow dimension of model risk. In contrast, our methodology transcends the latter aspect to quantify potential model errors such as incorrect specification of the probability law governing the model (e.g., the distribution of error terms, or the specification of a link function in generalized linear regression, of which logistic regression is a sub-class), the variables belonging in the model (e.g., omitted variable bias with respect to the DTD) or the functional form of the model equations (e.g., neglected transformations or interaction terms).
As the commonality of the types of model errors under consideration is that they all relate to the likelihood of such error, which in turn is connected to perturbations of the probability laws governing the entire modeling construct, we apply the principle of relative entropy (Hansen & Sargent, 2007; Glasserman & Xu, 2013). Relative entropy between a posterior and a prior distribution is a measure of the information gained when incorporating incremental data in Bayesian statistical inference. In the context of quantifying model error, relative entropy has the interpretation of a measure of the additional information required for a perturbed model to be considered superior to a champion or null model. Said differently, relative entropy may be interpreted as measuring the credibility of a challenger model. Another useful feature of this construct is that, within a relative entropy constraint, the so-called worst-case alternative (e.g., in our case the upper bounds on the scenario forecasts due to ignoring some feature of the alternative model) can be expressed as an exponential change of measure.
Model risk with respect to a champion model \(f(y \mid x)\) is quantified by the Kullback-Leibler relative entropy divergence to a challenger model \(g(y \mid x)\):

\[
D(g \,\|\, f) = \int \log\!\left(\frac{g(y \mid x)}{f(y \mid x)}\right) g(y \mid x)\, dy
\]

Subject to a constraint on this divergence, the worst-case alternative is obtained through the exponential change of measure

\[
m_{\theta}(y) = \frac{e^{\theta V(y)}}{E_f\!\left[e^{\theta V(y)}\right]}, \tag{14}
\]

where \(V\) denotes the risk measure of interest (here the scenario PD forecast) and the parameter \(\theta \geq 0\) indexes the size of the relative entropy budget, with \(\theta \to \infty\) representing the worst case of model risk in extremis.
The change of measure of Equation (14) has the important property of being model-free, i.e., not dependent upon the specification of the challenger model \(g(y \mid x)\). As mentioned previously, this reflects the robustness to misspecification of the alternative model that is a key feature of this construct, which from a model validation perspective is a desirable property: we do not have to assume that either the champion or the alternative model is correct, and only have to quantify the distance of the alternative from the base model in order to assess the impact of violating modeling assumptions.
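The exponential change of measure described above can be illustrated numerically. The following Python sketch (our illustration on simulated forecasts, not the authors' implementation) tilts a sample of nominal scenario PD forecasts by exp(theta * v) and reports both the worst-case mean and the relative entropy that tilt consumes; theta = 0 recovers the nominal forecast.

```python
import numpy as np

def worst_case_bound(v, theta):
    """Worst-case mean of the risk measure v under the exponential change of
    measure proportional to exp(theta * v), together with the relative
    entropy (KL divergence from the empirical law) the tilt consumes."""
    w = np.exp(theta * (v - v.max()))   # numerically stable exponential tilt
    w = w / w.sum()                     # normalized change of measure
    worst_mean = float(np.sum(w * v))
    rel_entropy = float(np.sum(w * np.log(w * len(v))))
    return worst_mean, rel_entropy

# Toy scenario PD forecasts from a nominal model (around 2%)
rng = np.random.default_rng(3)
v = np.clip(rng.normal(0.02, 0.005, size=5000), 1e-4, None)
nominal, _ = worst_case_bound(v, 0.0)          # theta = 0: nominal mean, zero budget
stressed, budget = worst_case_bound(v, 200.0)  # larger theta, larger entropy budget
```

Sweeping theta traces out the upper model-risk bound as a function of the relative-entropy budget, which is how the bounds reported later in the paper can be indexed.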
We study the quantification of model risk with respect to the following modeling assumptions:

 Omitted Variable Bias with Respect to the DTD Risk Factor
 Neglect of Interaction Effects Amongst the Explanatory Variables
 Misspecification According to an Incorrect Link Function
Omitted variable bias is analyzed by consideration of the DTD risk factor, as discussed in the main estimation results of this paper, where we saw that including this variable in the model specification did not result in other financial or macroeconomic variables falling out of the model, and improved model performance. The second assumption is assessed through estimation of alternative specifications that include interaction effects amongst the explanatory variables. Finally, we analyze the third assumption through estimation of these specifications with the complementary log-log ("CLL") as opposed to the logit link function (Note 14).
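To make the link-function comparison concrete, the following is a minimal sketch of the inverse logit and inverse CLL links (our illustration, not the authors' estimation code), showing how the two map the same linear predictor to materially different PDs away from the far left tail.

```python
import numpy as np

def logit_inv(eta):
    """Inverse logit link: PD = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + np.exp(-eta))

def cloglog_inv(eta):
    """Inverse complementary log-log link: PD = 1 - exp(-exp(eta))."""
    return 1.0 - np.exp(-np.exp(eta))

# The links nearly agree for very negative linear predictors (low PDs) but
# diverge elsewhere: the CLL link is asymmetric and approaches PD = 1 faster.
eta = np.linspace(-6, 2, 9)
gap = cloglog_inv(eta) - logit_inv(eta)
```

Because most obligors sit in the low-PD region where the links roughly coincide, one would expect link misspecification to have a modest impact, consistent with the relative entropy results reported below.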
We implement this procedure in a bootstrap simulation exercise, where we develop a distribution of the baseline and adverse macroeconomic forecasts at each horizon, and study the 95th and 5th percentiles of these distributions as upper and lower bounds on model risk, respectively. In each iteration, we resample the data with replacement (stratified so that the history of each obligor is preserved) and re-estimate the models considered in the main body of the paper, as well as three variants that include the DTD risk factor, interaction effects or a CLL link function. In the case of the DTD risk factor, we compare the variants considered in the main results, which have already been estimated, except that in each run the results are perturbed according to the different bootstrap samples of the dataset; in the other two cases there are alternative estimations (Note 15).
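The stratified bootstrap just described can be sketched as follows. This is a Python illustration under stated assumptions: `forecast_fn` is a hypothetical placeholder for the model re-estimation and 12-quarter forecast step, and the toy stand-in below simply returns a flat path at the resampled default rate.

```python
import numpy as np

def bootstrap_forecast_bounds(histories, forecast_fn, n_boot=200, seed=1):
    """Stratified bootstrap: resample obligors with replacement, keeping each
    obligor's full history intact, then refit/forecast on each resample and
    return the 5th/95th percentile bands of the PD forecast path."""
    rng = np.random.default_rng(seed)
    ids = np.array(list(histories))
    paths = []
    for _ in range(n_boot):
        draw = rng.choice(ids, size=len(ids), replace=True)
        sample = [histories[i] for i in draw]
        paths.append(forecast_fn(sample))  # re-estimate the model and forecast
    paths = np.asarray(paths)
    return np.percentile(paths, 5, axis=0), np.percentile(paths, 95, axis=0)

# Toy data: 100 obligors, each with an 8-quarter default-indicator history
histories = {i: (np.random.default_rng(i).random(8) < 0.05).astype(float)
             for i in range(100)}

def toy_forecast(sample):
    # Stand-in for refit + forecast: a flat 12-quarter path at the
    # resampled average default rate (hypothetical, for illustration only)
    return np.full(12, np.concatenate(sample).mean())

lower, upper = bootstrap_forecast_bounds(histories, toy_forecast)
```

Resampling whole obligors, rather than individual quarterly records, preserves the within-obligor serial dependence that the hazard model exploits.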

Conclusions and Directions for Future Research
In this study we have considered an obligor-level hazard rate methodology for corporate PD modeling that features macroeconomic, financial, equity market, duration and credit rating variables. This methodology has been applied to stress testing for CECL, where we have departed from the common practice for wholesale portfolios of adapting rating transition models in which the ratings are stressed for this purpose, and to our knowledge this is one of the first studies in the literature to have done so. We have further innovated by developing explicitly discrete time hazard models with a specialized data sample design (the landmark methodology) that is particularly tractable in computation, which allows for the inclusion of a large number of variables and the utilization of a long historical dataset.
Our base data are a lengthy borrower-level history of corporate ratings and defaults sourced from Moody's over the period 1990-2015. These data are enhanced by attaching an extensive set of financial, macroeconomic and equity market variables to form the basis of the candidate explanatory variables. The obligor-level hazard rate models have a 1-quarter default horizon and further feature time decay and duration effects. Based upon the relevant literature, we also considered an alternative structural risk factor, the Merton DTD measure, constructed from the market value of equity and accounting leverage measures. We then compared these hybrid structural-reduced form models to the models containing only financial ratios and macroeconomic variables. It has been shown that adding the DTD measures to our leading models does not invalidate the other variables chosen, significantly augments model performance and results in comparable scenario forecasts.
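As a hedged sketch of the Merton quantity underlying the DTD risk factor, the following Python snippet computes a distance-to-default from equity and leverage inputs. The asset-value proxy (market equity plus debt) and the parameter names `mu_a` and `sigma_a` are simplifying assumptions for illustration; the full KMV-style construction instead solves for asset value and volatility iteratively from the equity option value.

```python
import numpy as np

def merton_dtd(market_equity, debt, mu_a, sigma_a, horizon=1.0):
    """Illustrative Merton distance-to-default: standardized distance between
    the (log) firm asset value and the default point (debt) at the horizon.
    Asset value is proxied crudely as market equity plus debt."""
    asset_value = market_equity + debt
    num = np.log(asset_value / debt) + (mu_a - 0.5 * sigma_a ** 2) * horizon
    return num / (sigma_a * np.sqrt(horizon))

# Higher leverage mechanically moves a firm closer to its default point
low_leverage = merton_dtd(market_equity=100.0, debt=50.0, mu_a=0.05, sigma_a=0.3)
high_leverage = merton_dtd(market_equity=100.0, debt=400.0, mu_a=0.05, sigma_a=0.3)
```

The negative estimated coefficients on DTD reported earlier are consistent with this construction: larger DTD means more standard deviations of asset-value cushion before default.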
Finally we measured the model risk attributable to various model assumptions according to the principle of relative entropy. It has been observed that the omitted variable bias (with respect to the DTD) has the greatest, the incorrect specification of the link function has the least, and the neglect of interaction effects amongst risk factors has an intermediate impact upon measured model risk.
Our conclusion is that validation methods chosen in the stress testing context should be capable of testing model assumptions, given the sensitive regulatory uses of these models and concerns raised in the industry about the effect of model misspecification on capital and reserves. Our research is accretive to the literature by offering state of the art techniques as viable options in the arsenal of model validators, developers and supervisors seeking to manage model risk.
Given the wide relevance and scope of the topics addressed in this study, there is no shortage of fruitful avenues along which we could extend this research. Some proposals include, but are not limited to:
 alternative econometric techniques, such as various classes of machine learning models, including non-parametric alternatives;
 asset classes beyond the large corporate segment, such as small business, real estate or even retail;
 the consideration of industry specificity in model specification; and
 datasets in jurisdictions apart from the U.S., or else pooled data encompassing different countries with a consideration of geographical effects.