Data Quality Improvement, Data Linkage and Multiple Imputation in the UK National Vascular Database

The National Vascular Database (NVD) is a prospective audit database collecting information on the quality of care and outcomes of patients admitted to acute hospitals in England, Wales, Scotland and Northern Ireland with several vascular disorders. The NVD has proved to be an important resource for clinical audit, but its potential as a research tool remains under-exploited. We demonstrate proof-of-principle linkage of the NVD to Hospital Episode Statistics (HES) and UK Statistics Authority data. We present and validate Multiple Imputation (MI) methods to address problems with missingness in the linked dataset, focusing on a specific risk model. MI is then applied to these linked data to extend the chosen risk model to long-term mortality outcomes.


Introduction
The National Vascular Database (NVD) is a prospective audit database collecting information about the quality of care and outcomes of patients admitted to hospitals in England, Wales, Scotland and Northern Ireland with 1) Abdominal Aortic Aneurysms (AAA), the focus of this paper; 2) Lower limb ischaemia requiring bypass; 3) Carotid Endarterectomy; 4) Amputation.
The NVD has proved to be an important resource for clinical audit (Prytherch et al., 2001; McCollum et al., 1997), but its potential as a research tool remains under-exploited. Use of audit data for research is dependent on the ability to adjust for case-mix, which in turn is dependent on the completeness and quality of the data collected.
Hospital Episode Statistics (HES) is a data warehouse containing details of all admissions to National Health Service hospitals in England. HES information is stored in separate records: one for each period of care.
Single imputation of missing values may cause standard errors to be too small since it fails to account for uncertainty about the imputed values. For some time multiple imputation (MI) has been suggested as a promising approach for dealing with missing data (Little & Rubin, 1987), although it is only relatively recently that coherent guidelines for its use have been suggested in the medical literature (Sterne et al., 2009).
MI allows for the uncertainty about the imputed values by creating several plausible imputed datasets and combining the results from each of them. Because we do not know the true values of the missing data, the multiple imputation procedure creates multiple copies of the data, with missing values drawn from the empirical predictive distributions of the observed values. Standard statistical methods such as regression are then used to fit the model of interest in each dataset, with different results because of the variation introduced in the imputation of the missing values. The results are only meaningful when combined to give overall estimated associations. Point estimates and standard errors are calculated using Rubin's rules (Rubin, 1987), taking into account the between-imputation variation. Recent developments in statistical software permit some degree of automation of the process of multiple imputation; see ICE in Stata (Royston, 2004) or MICE in R (van Buuren & Groothuis-Oudshoorn, 2011).
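The pooling step can be illustrated with a minimal sketch of Rubin's rules for a single coefficient. The function name and the five estimates below are hypothetical and chosen only for illustration; they are not results from the NVD.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets using Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching estimates and variances from at least two imputations")
    q_bar = mean(estimates)         # pooled point estimate
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, sqrt(total)


# Hypothetical log-odds estimates and squared standard errors from m = 5 imputations
est = [0.50, 0.54, 0.47, 0.52, 0.49]
var = [0.010, 0.011, 0.009, 0.010, 0.010]
pooled, se = pool_rubin(est, var)
print(f"pooled estimate = {pooled:.3f}, pooled SE = {se:.3f}")
```

The pooled standard error exceeds the average within-imputation standard error because the between-imputation term carries the extra uncertainty due to the missing values.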
In this proof-of-principle study we focus on the Abdominal Aortic Aneurysms (AAA) component of the National Vascular Database (the NVD AAA data). As the largest component of the database (approximately 60% of patient episodes relate to AAA admissions), this provides a test bed for developing vascular multiple imputation models. Analysis focuses on the Vascular Biochemistry and Haematology Outcome Model (VBHOM) (Tang et al., 2007). This binary logistic regression model of in-hospital mortality was built using National Vascular Database items that contained complete data on Urea, Sodium, Potassium, Haemoglobin, White Cell Count, Age on Admission and Mode of Admission. As a model originally developed for the NVD, VBHOM provides a useful basis on which to build and validate the concept of multiply imputed vascular case-mix models, whilst recognising that this is only a step towards robust vascular case-mix modelling.
In Section 2 of this paper we describe the degree of missingness present in the NVD and the possibility for HES to help address this missing data. In Section 3 we describe the methods and approach to linking NVD and HES data. In Section 4 we present and validate Multiple Imputation by Chained Equations (MICE) as a technique to address the problems of missing data in the linked HES-NVD data in the context of case-mix adjustment. In Section 5 we apply MICE to the linked HES-NVD data to demonstrate proof-of-principle in modelling long-term patient outcomes. Section 6 gives our conclusions in terms of future utility of the NVD and approaches to vascular case-mix modelling.

Scope of Missingness
Missingness can affect the validity of a database in a number of different ways. Firstly, all of the data can be missing for a patient, either because a particular hospital does not contribute to the database or because certain patients do not appear in the database even where a hospital is contributing data on some patients. Both bias how well the database represents the patient population. We found that HES contains a higher volume of cases than the NVD, but as the NVD is not yet a complete census of all acute hospitals this was expected.
A second possibility for missingness occurs when a database does not record variables on a patient that have utility in adjusting for case-mix as either a predictor or an outcome measure. In Section 5 of this paper we discuss a proof-of-principle study for the use of HES data to augment the range of variables available in the NVD.
Vol. 1, No. 2; 2012
Recent literature from the USA and Canada (Osborne et al., 2010; Dimick & Upchurch Jr, 2008; Dueck et al., 2004; Vogel et al., 2011) suggests a focus on including more patient demographic information in case-mix modelling, such as deprivation/income (Osborne et al., 2010) or distance travelled to treatment (Dueck et al., 2004). Emphasis is also placed on using patient, surgeon and hospital characteristics (Dueck et al., 2004) for case-mix modelling. Co-morbidities also feature strongly in the list of possible predictors (Osborne et al., 2010; Vogel et al., 2011). Although usually based on selective cohorts, this suggests important variables may not be adequately represented in the NVD. Evidence from the UK literature (Hadjianastassiou et al., 2007; Jibawi et al., 2006; Prytherch et al., 2005; Sobocinski et al., 2011) is more varied. Work building on data from the NVD (Prytherch et al., 2005) models case-mix largely using patient clinical characteristics, extending or augmenting the VBHOM model (Hadjianastassiou et al., 2007). Previous use of HES data without linkage (Jibawi et al., 2006) focuses on hospital-level characteristics, without individual-level patient characteristics to model. Long-term survival outcomes have been studied (Prytherch et al., 2005) but only in the context of trial data for a selected patient group in a single hospital.
The third and most straightforward scope for missingness occurs where, for some of the variables recorded in the database, for some of the patients entered into the database, measurements are missing or implausible. It is this scope of missingness that is described and evaluated for the NVD and can be addressed using multiple imputation.

Analysis by Variable
An overview of the NVD AAA dataset shows that there is a range of data missingness across variables (median proportion of missing data = 22%, interquartile range = 10% to 64%). Variables in the NVD may be assigned a status, namely required or preferred. Although many fields are incomplete, often this coincides with variables that are described as neither required nor preferred for entry into the database.
Required variables have much reduced data missingness (typically less than 5%) but constitute only a small proportion (17%) of variables. They are typically related to dates of treatment, admission and discharge.
Around 20% of variables in the database are described as preferred. These typically relate to clinical measurements taken on patients (e.g. creatinine biomarker concentration and sodium levels) or drugs taken by patients (e.g. beta blockers and statins) that would form important potential variables in any risk model of mortality. Data missingness amongst these variables ranges from around 5% to 50%.
The remaining variables in the database (63%) have no required or preferred status. These include important comorbidities, such as previous occurrence of stroke, cardiac failure and impaired renal function, that would inform a risk model. Over 40% of these variables are recorded with missingness over 50%.

Geographical Variation in Missingness
Data missingness by hospital is shown in Figure 1. The circles give locations of hospitals contributing to the NVD AAA database and the radius of each circle is proportional to the percentage of missing entries in the indicated variables. Whilst there is little variation in missingness amongst required fields (median 7%, interquartile range 7% to 8%), there is marked variation across fields not described as required or preferred (median 40%, interquartile range 37% to 48%) and even greater variation amongst preferred fields (median 11%, interquartile range 6% to 22%). Thus any analysis excluding patients with missing data may lead to unacceptable geographical bias.
The variation in data collection and quality between hospitals means that there is a multilevel effect: that is, variations at a hospital level may affect inferences made at the patient level. Consequently the imputation scheme must take some account of this, but it is argued that unless clustering is specifically the focus of the analysis model there is little disadvantage in regarding the clusters as a fixed effect in the imputations (van Buuren & Groothuis-Oudshoorn, 2011). Bona fide methods of multi-level imputation can be found elsewhere (for example Schafer & Yucel, 2002; Yucel, 2008; Goldstein et al., 2009).
To identify patients across admissions, deterministic record linkage is used to assign each patient a unique identifier: the HESID. The HESID can identify patients across years (HES is organised into datasets by financial year) and across record types (inpatient, outpatient, Accident and Emergency, critical care and mortality). The mortality dataset is created by linking mortality data from the UK Statistics Authority (known as ONS because of its former name, the Office for National Statistics) to patient information in HES. The HES database itself captures information on deaths only if they occurred in hospital.
The death record in HES can be analysed using the diagnoses, which provide information on the condition or disease at the time of death but not on the actual cause of death. Linking mortality data from the UK Statistics Authority with HES creates a richer dataset that captures mortality information for people who died both in and outside of hospital. ONS provides additional information not available in HES, such as the 'underlying cause of death', which could be used for a wide range of analysis, medical research and healthcare planning. Note that the linked data contain mortality information only on people who have attended, been admitted to or been treated in hospital. Since our cohort of interest have all undergone a vascular procedure, they should all have a corresponding HES record.
Since the consent given by the patients when they were included on the NVD did not allow for sharing of identifiable data with third parties, we had to link the databases using non-identifiable data on a probabilistic basis, matching the details of the operations rather than the patients themselves.
The HES extract was constructed by first filtering the inpatient dataset for any patient who underwent one of the procedures recorded in the NVD (apart from carotid endarterectomy). This created a cohort of patients who were eligible for inclusion in the NVD. This cohort was then used to return all inpatient episodes (1997 to 2010), outpatient episodes (2003 to 2010), mortality records (to date), Accident and Emergency attendances (2007 to 2010) and critical care episodes (2008 to 2010) for those patients. We also applied for and obtained an anonymised extract of the NVD (1995 to 2011) and received the AAA component of the database.
Probabilistic record linkage using the Fellegi-Sunter model (Fellegi & Sunter, 1969) was used to link the inpatient dataset to the NVD AAA database. In most typical record linkage exercises, the entity being linked is a person and we would use identifiers such as their date of birth and name. In this case, however, since the data are anonymised, the entities we linked are the operations themselves. The variables used for the linkage were: where the operation took place, the month and year of birth of the patient operated on, sex of the patient operated on, the date of the operation, and the identifying code of the operation.

Linkage Methodology
For each piece of information, e.g. sex, we define two numbers:
• m: the chance that the sex agrees between the record in the NVD and the correct matching record in HES. This depends on how well sex has been recorded.
• u: the chance that the sex agrees between the record in the NVD and any other record in HES. This depends on how many operations were done on patients of a given sex.
From these we calculate an agreement weight and a disagreement weight for sex. The agreement weight is positive, and the larger it is the stronger the certainty of the match. Similarly the disagreement weight is negative, and the larger its absolute value the stronger the certainty of a non-match. In comparing records from the NVD and HES we assign either an agreement or a disagreement weight as the match weight for each variable we are comparing. If the variable matches between the two records being compared then the agreement weight is used, otherwise the disagreement weight is used. We then repeat this for each variable and sum the individual match weights to give an overall match weight.
The m and u values were defined either globally for a variable (for any value of the variable) or in an outcome-specific manner (different values of the variable had different m and u values).
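As a minimal sketch of this scoring, the weights can be written in the conventional Fellegi-Sunter form, log2(m/u) for agreement and log2((1 - m)/(1 - u)) for disagreement. The paper does not state the formula or logarithm base used, so this form, the field names and the parameter values below are assumptions for illustration only.

```python
from math import log2


def agreement_weight(m, u):
    """Positive when m > u: agreement is more likely for a true match."""
    return log2(m / u)


def disagreement_weight(m, u):
    """Negative when m > u: disagreement is less likely for a true match."""
    return log2((1 - m) / (1 - u))


def match_weight(nvd_record, hes_record, params):
    """Sum the per-variable weights to give the overall match weight."""
    total = 0.0
    for field, (m, u) in params.items():
        if nvd_record[field] == hes_record[field]:
            total += agreement_weight(m, u)
        else:
            total += disagreement_weight(m, u)
    return total


# Hypothetical (m, u) pairs and records, not the study's actual parameters
params = {"sex": (0.95, 0.5), "birth_month": (0.95, 1 / 12)}
nvd = {"sex": "M", "birth_month": 4}
hes_same = {"sex": "M", "birth_month": 4}
hes_diff = {"sex": "F", "birth_month": 9}

print(match_weight(nvd, hes_same, params))  # positive: both fields agree
print(match_weight(nvd, hes_diff, params))  # negative: both fields disagree
```

Candidate pairs would then be classified as matches, probable matches or non-matches by comparing the overall weight with chosen thresholds, which are not specified in the text.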

Global Linkage Parameters
A flat error rate of 5% (m = 0.95) was assumed for the data quality and the u value was calculated by assuming an even distribution over the range of values of the variable, as shown in Table 1. For the remaining variables we used m = 0.95 as before and calculated u directly from the distribution of the variable in the HES data. For example, we show the values for the variable sex in Table 2.
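To illustrate value-specific parameters for sex, the sketch below assumes m = 0.95, the conventional base-2 log-ratio weights, and u values of 2/3 for males and 1/3 for females, taken from the statement in the text that the HES data contain twice as many males as females. These are illustrative assumptions and do not reproduce the entries of Table 2.

```python
from math import log2

M_SEX = 0.95  # flat 5% error rate, as stated in the text

# Assumed chance agreement by value, from the stated 2:1 male-to-female ratio
u_by_sex = {"male": 2 / 3, "female": 1 / 3}

weights = {
    sex: (log2(M_SEX / u), log2((1 - M_SEX) / (1 - u)))
    for sex, u in u_by_sex.items()
}

for sex, (agree, disagree) in weights.items():
    print(f"{sex}: agreement = {agree:.2f}, disagreement = {disagree:.2f}")
```

Under these assumptions both weights are larger in absolute value for females, the less common value, and for either sex the disagreement weight is larger in magnitude than the agreement weight.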
The values of m and u calculated result in the desired effect: agreement on sex is not particularly discriminatory but disagreement is more so. If two records have the same sex that is just as likely to be random chance, but if the sex is different they are unlikely to represent the same person. Here we can see higher agreement and disagreement weights when the sex is female, due to the fact that there are twice as many males as females in the HES data. We calculated the m and u values in the same way for year of birth, operation code and place of operation. A number of comparators were also defined to refine the accuracy of the matching. A date comparator that returns a partial agreement weight was used for operation dates within one week of each other, and partial dates of birth within one month of each other. The date comparator also returns partial agreement weights for common date transcription errors such as getting the day and month back to front. These partial agreement weights are a penalised version of the full agreement weight. The operation codes and locations are also compared for any procedures in the patient records.
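A date comparator of this kind might look like the following sketch for operation dates. The penalty factor of 0.5 and the function names are hypothetical; the text states only that partial agreement weights are a penalised version of the full agreement weight.

```python
from datetime import date


def swapped_day_month(a, b):
    """True if b equals a with day and month transposed."""
    try:
        return date(a.year, a.day, a.month) == b
    except ValueError:  # a day above 12 cannot be read as a month
        return False


def date_weight(a, b, agree_w, disagree_w, penalty=0.5, window_days=7):
    """Full weight for exact agreement, penalised weight for near misses."""
    if a == b:
        return agree_w
    if abs((a - b).days) <= window_days or swapped_day_month(a, b):
        return penalty * agree_w  # partial agreement
    return disagree_w


# Hypothetical weights and dates
print(date_weight(date(2009, 3, 5), date(2009, 3, 5), 3.0, -4.0))   # exact
print(date_weight(date(2009, 3, 5), date(2009, 3, 9), 3.0, -4.0))   # within a week
print(date_weight(date(2009, 3, 5), date(2009, 5, 3), 3.0, -4.0))   # day/month swapped
print(date_weight(date(2009, 3, 5), date(2009, 11, 20), 3.0, -4.0)) # disagreement
```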

Linkage Results
Following the comparisons, each NVD record with a match was assigned the record identifier of the matching HES record: 14,580 of 22,145 NVD records (66%). Where only a probable match had been identified this was used: 2,685 of 22,145 records (12%), covering 2,772 matches. Where no match or probable match had been found, no HES record identifier could be linked: 4,880 of 22,145 records (22%).
A total of 2,772 probable matches were identified covering 2,685 NVD records. For 82 NVD records (169 matches), more than one matching HES record was identified with equal certainty. No information was available to decide which was the correct record and it was decided to discard these matches. Additionally, 367 HES records (732 matches) matched to more than one NVD record. These are likely to be duplicates in the NVD records but we also decided cautiously to discard these matches. This left 16,451 of the total 24,227 NVD records matched to HES (68% of all NVD records, or 74% of the 22,145 records we attempted to link).
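The reported counts can be reconciled with a short calculation. The sketch below assumes that the final total is obtained by subtracting the 82 ambiguous NVD records and the 732 duplicate matches from the matched and probably matched records; this reading is an inference that reproduces the published figure, not a rule stated in the text.

```python
attempted = 22_145   # NVD records for which linkage was attempted
total_nvd = 24_227   # all NVD records
matched = 14_580
probable = 2_685
unmatched = 4_880

# The three outcome groups account for every attempted record
accounted_for = matched + probable + unmatched

# Assumed discards: ambiguous NVD records and duplicate NVD-to-HES matches
ambiguous_nvd_records = 82
duplicate_matches = 732
linked = matched + probable - ambiguous_nvd_records - duplicate_matches

print(accounted_for == attempted)              # True
print(linked)                                  # 16451
print(round(100 * linked / total_nvd))         # 68
print(round(100 * linked / attempted))         # 74
```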

Linkage Quality
This linkage exercise linked only 74% of the expected NVD records to HES. The resulting linked dataset is sufficient to show that, in principle, probabilistic linkage of these (and other) datasets is both possible and useful. It shows that where privacy concerns outweigh the perceived benefit of any given research, some progress can still be made to link anonymised routine datasets that are of sufficient size to deliver results despite some unlinked records.
Although there may be a systematic bias at work which determines which records cannot be linked (for example if the unlinked records all represent patients with poor outcomes), if sufficient care is taken in the analysis of the linked data, this can be identified and checked to determine if it affects the results.

Missingness Mechanisms
Missingness mechanisms are assumptions about the data describing how we believe the missing (unobserved) data are related to the observed data. The three main mechanisms are as follows:
1) Missing completely at random (MCAR): there are no systematic differences between the observed values and the missing values.
2) Missing at random (MAR): any systematic difference between the missing values and the observed values can be explained by differences in observed data.
3) Missing not at random (MNAR): even after the observed data are taken into account, systematic differences remain between the missing values and the observed values.
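The practical difference between the three mechanisms can be seen in a small simulation. This is a sketch on synthetic data with hypothetical missingness probabilities, unrelated to the NVD: a complete-case mean is compared with the true mean when values of y are deleted under each mechanism.

```python
import random
from statistics import mean

random.seed(1)
n = 20_000
x = [random.gauss(0, 1) for _ in range(n)]     # always observed
y = [xi + random.gauss(0, 1) for xi in x]      # subject to missingness


def complete_case_mean(values, p_missing):
    """Mean of the values that survive deletion with the given probabilities."""
    kept = [v for v, p in zip(values, p_missing) if random.random() >= p]
    return mean(kept)


true_mean = mean(y)
mcar = complete_case_mean(y, [0.3] * n)                              # ignores the data
mar = complete_case_mean(y, [0.8 if xi > 0 else 0.1 for xi in x])    # depends on observed x
mnar = complete_case_mean(y, [0.8 if yi > 0 else 0.1 for yi in y])   # depends on y itself

print(f"true {true_mean:.3f}  MCAR {mcar:.3f}  MAR {mar:.3f}  MNAR {mnar:.3f}")
```

Only the MCAR complete-case mean stays close to the true value. Under MAR the bias is driven by the observed x, which is the information an imputation model can use; under MNAR it is not recoverable from the observed data alone.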
Multiple imputation generally assumes that data are MAR: that is, the observed variables are predictive of the missing values. Analyses based on multiply imputed data avoid bias only if enough variables that predict the missing values are included in the imputation. Failure to do so may render the MAR assumption implausible and analyses based on the data may be biased.
It is straightforward to demonstrate that the NVD AAA data are not MCAR, which precludes a valid complete case analysis. Table 3 summarises the discharge status (dead/alive) comparing missing and complete cases across variables.
The subset of variables to be imputed and their missingness characteristics are summarised in Table 4. Note that for continuous variables any data values that lie outside clinically plausible limits have been declared as missing data.

Choice of Imputation Method
We use predictive mean matching (PMM) to impute continuous variables and polytomous regression for categorical variables. PMM is a general-purpose imputation method (Meng, 1994) that confines the imputations to the observed distribution. PMM preserves non-linear relations between predictors such as that between age and haemoglobin (Figure 2). We used generalised additive models (GAMs) (Little, 1988; Hastie et al., 2008; Wood, 2004) to explore non-linearities. GAMs are a sum of smooth functions able to characterize non-linear regression effects, with the advantage that the fitting of the smooth functions is partly automated (Hastie et al., 2008; Wood, 2004).
Figure 2. Generalised additive model showing that age has a non-linear association with haemoglobin
A disadvantage of PMM is that it may not give sufficient between-imputation variation with only a few predictors (van Buuren & Groothuis-Oudshoorn, 2011). The sample sizes of the NVD AAA data are large with many predictors, so we believe that PMM offers a viable method of imputing continuous variables. To mitigate concerns about insufficient between-imputation variation, we followed the suggestion of Sterne et al. (2009) and used 20 imputations rather than the 'standard' five (Sterne et al., 2009; Jacobusse, 2005). This increases the computational burden for subsequent analyses but represents a balance between computational effort and variation.
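The core idea of predictive mean matching can be sketched with a single predictor. This is a deliberately simplified illustration and not the MICE implementation: it fits one least-squares line and omits the random draw of regression parameters that full PMM uses. The function names, the donor pool size k = 3 and the age and haemoglobin values are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - slope * x_bar, slope


def pmm_impute(xs, ys, k=3, rng=random):
    """Fill each missing y with an observed y from one of the k nearest donors."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in observed], [y for _, y in observed])

    def predict(x):
        return intercept + slope * x

    completed = []
    for x, y in zip(xs, ys):
        if y is None:
            # Donors are the observed cases whose predicted means are closest
            donors = sorted(observed, key=lambda d: abs(predict(d[0]) - predict(x)))[:k]
            y = rng.choice(donors)[1]  # imputed value is always an observed value
        completed.append(y)
    return completed


# Hypothetical ages and haemoglobin values with two missing measurements
age = [55, 60, 65, 70, 75, 80, 85, 62, 78]
hb = [14.8, 14.5, 14.1, 13.6, 13.0, 12.1, 11.5, None, None]
random.seed(0)
filled = pmm_impute(age, hb)
print(filled)
```

Because each imputed value is copied from a donor, the imputations cannot fall outside the range of the observed data, which is the property referred to in the text.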
This study is proof of concept, but in a definitive study it would be possible to determine more precisely the number of imputations needed. For instance, for each variable included in the VBHOM model we might plot coefficients, standardized coefficients (coefficient divided by its standard deviation) and corresponding p-values as a function of the number of imputations. Their stability can then be studied and the choice of the number of imputations discussed, for example by setting a threshold for the difference between consecutive p-values below which the estimates are judged stable.
The imputation scheme assumes normality of the variables being imputed and so we checked that this assumption is satisfied. For variables with a non-normal distribution, transformations to approximate normality (e.g. logarithmic transformations) were used. We used a logarithmic transformation for the variables White Cell Count, Urea, Sodium and Potassium, which are sufficiently non-normal to cause concern about the normality assumption.
We must include the outcome variable (e.g. mortality status at discharge or mortality status at one year) as a predictor in the imputations. Not including the outcome dilutes associations between the outcome and the other variables (Wood, 2006).

Checking and Validation of Imputations
There is no definitive method for checking the imputations or the within-imputation iterations of the chained equations. At each iteration the chain mean and standard deviation can be plotted, and on convergence the different streams should freely intermingle with no definite trends. Having checked this for each of our imputations, we are satisfied that the MICE algorithm has converged.
A good imputed value is one that could have been observed had it not been missing. It is desirable that the imputed values be plausible and we checked that there were no implausible values in our imputations. We verified the fit of our imputation models to the observed data as a precursor to imputation and refined our models where non-linearity or non-normality were observed. After transformation we did not observe heteroscedasticity in our imputation models. The observed and imputed distributions are similar, suggesting that the imputations are plausible. The observed and imputed distributions for log Urea shown in Figure 3 demonstrate that the imputations captured the behaviour of the observed distribution in each of the 20 imputed datasets, adding credibility to the MAR assumption. For continuous variables we compared the observed and imputed distributions conditioned on the propensity score (probability of missingness) (Raghunathan & Bondartenko, 2011). We found that the distributions of observed and imputed values were similar, which provides more evidence that the imputations are reasonable. We looked at the residuals of regression of the continuous variables on their propensity score and found that the observed and imputed residual distributions have large overlap, giving credibility that the spread of the imputations is appropriate.

VBHOM Model
Table 5 shows the performance of the VBHOM model for predicting status at discharge (dead/alive) using only data with complete cases and a full data set with missing values imputed and pooled using the MICE scheme explained above. The magnitudes of the coefficients in the model are broadly similar both with and without imputation. However, notice that, for all of the variables in the VBHOM model, the confidence intervals are narrower for the MICE imputed data. A narrower confidence interval represents reduced uncertainty in the model coefficients.

Survival Analysis Outcome Model
The availability of linked data (as described in Section 3) not only allows the simple extension to longer-term mortality models (such as that in Section 5.2), it also allows full survival models based on date of death (censored to reflect length of follow-up). It is therefore important to demonstrate that multiple imputation can successfully be used for survival modelling.
We modify the imputation scheme by including an event indicator variable (whether the patient died in the period of follow-up or not) and the log of the survival time instead of the status at discharge (dead/alive) variable. This approach is informed by the literature (Clark & Altman, 2003), though there remains some controversy over whether survival time or the log of survival time should be included in the imputation scheme (van Buuren et al., 1999). We favour inclusion of log survival time to mitigate problems with normality assumptions. Comparing multiple imputation with complete case analysis, we note that standard errors have been reduced by imputation and estimates remain largely unchanged (except admission mode, see Section 5.1). This gives confidence in the validity of multiple imputation for full survival modelling. We note that age, urea and white cell count have hazard ratios significantly (p < 0.05) in excess of one, implying positive association with hazard of death, in agreement with the direction of effects in the status on discharge model. Similarly, haemoglobin has a hazard ratio significantly less than one. However, in addition, non-elective admissions have a significantly (p < 0.05) increased hazard of death and sodium is negatively associated with hazard of death.

Discussion of Linked Data Modelling
We have demonstrated the utility of missing data methods in the context of linked data that provide a broader range of potential outcome variables for case-mix adjustment of vascular procedures. We have not, however, yet explored the full potential of these additional variables to improve the predictive power of case-mix adjustment models in vascular surgery. For this reason we do not critically interpret the models in terms of effects on survival and mortality, as considerable bias may remain in these analyses. The methods evaluated in this report can only mitigate the third scope of missingness (Subsection 2.1). To properly evaluate the potential predictive power afforded by data linkage would require addressing both the first and second scope of missingness, i.e. ensuring all patients eligible to contribute to the database are included and all important predictor variables are measured (where importance is related e.g. to the literature and clinical consensus).

Conclusions
There are many ways of handling missing data (Moons et al., 2006; Arnold & Kronmal, 2003; Donders, 2006), although the best solution is to prevent its occurrence. Data imputation techniques allow missing data to be replaced by a value that is predicted using the patient's other known characteristics, and have been validated for up to 40% missingness of data. Such techniques have been widely used to handle missingness problems in large-scale censuses and social surveys (Schafer, 2002; McCleary, 2002; Nur et al., 2005), especially in the US, though little in health research (Rubin, 1996). Imputation is generally beneficial because it allows use of information from the incomplete cases that would otherwise have been lost, and this is reflected in greater precision in estimation.
Therefore an important benefit of our multiply imputed data is an appreciable increase in the number of cases available compared with complete case analysis.
We sought to demonstrate proof-of-principle that missing data imputation methods can be used (in the context of

Figure 1. Missingness by hospital and type of variable in the NVD AAA database. The four maps correspond to: % missingness all variables (top left); % missing required variables (top right); % missing preferred variables (bottom left); % missing other variables (bottom right). All plots on the same scale

Figure 3. Examples of observed and imputed distributions for log Potassium concentrations

Table 1. Global linkage parameters for HES and NVD

Table 2. Outcome specific linkage parameters for HES and NVD

Table 7. Extension of VBHOM model complete case analysis versus MICE imputation to full survival analysis
Table 7 shows the hazard ratios for a Cox proportional hazards model of survival using VBHOM predictors.