Beyond Comorbidity Indices

KEY POINTS

Question. Among adult hospitalizations in a national claims database, does a permutation-invariant model that learns ICD-10-CM representations improve prediction of 30-day outcomes compared with Charlson and Elixhauser comorbidity indices?

Findings. In a temporally held-out national cohort (2021–2022), the ICD-10-CM embedding model achieved higher discrimination than comorbidity-index models for 30-day unplanned readmission (AUC, 0.7496) and 30-day postdischarge in-hospital mortality (AUC, 0.8557). The highest-performing comorbidity-index comparators achieved AUCs of 0.6553 and 0.7844, respectively.

Meaning. Learning representations directly from ICD-10-CM codes may improve claims-based risk adjustment and help prioritize transitional-care resources at discharge.

Introduction

Accurate prediction and risk adjustment for short-term clinical outcomes, such as 30-day hospital mortality and readmission, are critical for health services research and for fair assessment of healthcare outcomes and quality metrics [1]. Most claims-based risk adjustment continues to rely on comorbidity indices such as the Charlson Comorbidity Index (CCI) and Elixhauser Comorbidity Index (ECI), which map diagnosis codes to a limited set of conditions [2,3]. While these indices are interpretable and widely deployed, they inevitably discard granularity and may miss clinically meaningful comorbidity patterns and interactions among diagnoses.

Recent machine-learning approaches increasingly use high-dimensional ICD code inputs and have demonstrated improved prediction for a range of outcomes [4,5,6]. However, many approaches simplify or truncate ICD codes, aggregate diagnosis lists in ways that depend on code order, or are trained and evaluated in settings where coding practices differ across sites—each of which can limit robustness and transportability. In addition, many claims-based studies focus on in-hospital mortality and do not evaluate postdischarge mortality as a discharge-time outcome [6,7,8,9,10].

We developed and temporally validated a claims-based prediction model that uses trainable ICD-10-CM embeddings and permutation-invariant aggregation of diagnosis lists. Using the Nationwide Readmissions Database (2016–2022), we assessed discrimination, calibration, and recall-weighted performance for 30-day unplanned readmission and 30-day postdischarge in-hospital mortality and compared results with Charlson and Elixhauser comorbidity-index models; we also generated diagnosis-level attributions using Integrated Gradients.

Methods

Study Design, Data Source, and Oversight

We conducted a retrospective cohort study using the Healthcare Cost and Utilization Project (HCUP) Nationwide Readmissions Database (NRD), 2016–2022. Adult discharges from 2016–2020 were used for model development, and discharges from 2021–2022 were reserved for temporal external testing. Discharges in December of each year were excluded to allow complete 30-day follow-up within the same calendar year.

Use of the NRD was governed by the HCUP data use agreement. Because the NRD contains deidentified data, the institutional review board determined the study was not human participants research and that informed consent was not required.

Cohort Definition

We included hospitalizations for patients aged 18 years or older with a valid patient linkage identifier within each calendar year. For mortality analyses, index hospitalizations with in-hospital death were excluded from the primary outcome definition (postdischarge in-hospital mortality) and examined in a prespecified secondary analysis.

Outcomes

The coprimary outcomes were (1) 30-day unplanned readmission and (2) 30-day postdischarge in-hospital mortality. Readmissions were classified as unplanned using the HCUP algorithm. Postdischarge mortality was defined as inpatient death occurring during a subsequent hospitalization within 30 days after discharge; deaths outside the hospital are not captured in the NRD.

Predictors

For each index hospitalization, we used up to 40 ICD-10-CM diagnosis codes (principal and secondary diagnoses). Diagnosis codes were label-encoded into integer identifiers for model input. Demographic and socioeconomic covariates included age, sex, primary payer, and ZIP-code median income quartile. Age was standardized; categorical variables (sex, payer, income quartile) were one-hot encoded with explicit handling of missing values. No diagnosis codes were filtered, simplified, or reordered.
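A minimal numpy sketch of the predictor preprocessing described above: label-encoding ICD-10-CM codes into integer identifiers and padding each diagnosis list to a fixed length of 40. The vocabulary construction and padding convention (0 reserved for padding) are illustrative assumptions, not the study's exact implementation.

```python
import numpy as np

MAX_CODES = 40          # up to 40 diagnosis codes per index hospitalization
vocab = {"PAD": 0}      # integer identifier per observed code; 0 = padding

def encode_diagnoses(codes):
    """Map a variable-length list of ICD-10-CM codes to a fixed-length int array."""
    ids = []
    for c in codes:
        if c not in vocab:
            vocab[c] = len(vocab)          # assign the next free integer id
        ids.append(vocab[c])
    ids = ids[:MAX_CODES]                  # keep at most 40 codes
    ids += [0] * (MAX_CODES - len(ids))    # right-pad with the PAD id
    return np.array(ids, dtype=np.int64)

encoded = encode_diagnoses(["I50.9", "E11.9", "N18.3"])
```

Because the downstream aggregation is permutation-invariant, the order in which codes receive identifiers does not affect predictions; only the set of identifiers matters.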

Model Development

Embedding Model Architecture

The embedding model mapped each ICD-10-CM code to a trainable embedding vector and used a Deep Sets architecture [11] for permutation-invariant aggregation. The Deep Sets encoder processed individual diagnosis embeddings independently through shared multilayer perceptrons (MLPs); outputs were summed (permutation-invariant pooling) and passed to a decoder MLP. Demographic and socioeconomic covariates were processed through a separate 2-layer MLP and concatenated with the Deep Sets output before the final predictor MLP. The model was implemented in TensorFlow [12] and trained with binary cross-entropy loss, the Adam optimizer, and early stopping on validation loss.
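The permutation invariance of the Deep Sets encoder can be sketched in a few lines of numpy (dimensions and random weights are illustrative, not the trained TensorFlow model): each diagnosis embedding passes through a shared network (phi), outputs are summed, and a decoder (rho) maps the pooled vector to a score.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM, HID_DIM = 8, 16
embedding_table = rng.normal(size=(100, EMB_DIM))   # one row per ICD code id
W_phi = rng.normal(size=(EMB_DIM, HID_DIM))         # shared encoder weights
W_rho = rng.normal(size=(HID_DIM, 1))               # decoder weights

def deep_sets_score(code_ids):
    e = embedding_table[code_ids]        # (n_codes, EMB_DIM) embeddings
    h = np.maximum(e @ W_phi, 0.0)       # shared phi MLP with ReLU
    pooled = h.sum(axis=0)               # sum pooling: order-independent
    return float(pooled @ W_rho)         # decoder rho

codes = np.array([5, 17, 42])
score_fwd = deep_sets_score(codes)
score_rev = deep_sets_score(codes[::-1])   # same diagnoses, different order
```

Reordering the diagnosis list leaves the score unchanged, which is the property motivating this architecture over sequence models that depend on code order.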

Training Strategy

Models were trained separately for each outcome. To address outcome imbalance, we randomly downsampled majority-class encounters in the training set to a 1:1 case-control ratio (validation and temporal test sets were not downsampled). Predicted probabilities were recalibrated on the validation set using logistic calibration. Hyperparameters were tuned via random search (32 trials per outcome), prioritizing validation AUC with F_2 as a secondary criterion. Hyperparameter search details and final selected configurations are provided in supplementary eMethods 1 and eTable 1.
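The 1:1 majority-class downsampling can be sketched as follows (synthetic labels with a prevalence similar to readmission; not the study data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic binary outcome with ~11% prevalence, mimicking readmission.
y = (rng.random(10_000) < 0.11).astype(int)

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)
# Sample negatives without replacement to match the positive count.
neg_keep = rng.choice(neg_idx, size=pos_idx.size, replace=False)
train_idx = np.concatenate([pos_idx, neg_keep])

balanced_rate = y[train_idx].mean()   # 0.5 by construction
```

Because downsampling distorts the base rate, the subsequent recalibration step on the (undownsampled) validation set is what restores honest predicted probabilities.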

Comparator Models

We implemented logistic regression models based on the Charlson and Elixhauser comorbidity indices. For the Charlson model, we used both the unweighted comorbidity count and the traditional age-adjusted score [13,14,15,16,17,18]. For the Elixhauser model, we used the unweighted comorbidity count; the refined AHRQ algorithm was applied for ICD-10-CM mapping [13]. All comparator models included age, sex, primary payer, and income quartile as covariates.
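As a sketch of the Charlson comparator's score input, the age-adjusted score combines weighted condition flags with an age adjustment; the weight table below is an abbreviated, illustrative subset of the original Charlson weights (the study used full published mappings [13-18]).

```python
# Abbreviated, illustrative subset of Charlson condition weights;
# the full index covers 17+ conditions.
CHARLSON_WEIGHTS = {
    "mi": 1, "chf": 1, "mild_liver": 1, "diabetes": 1,
    "hemiplegia": 2, "renal": 2, "malignancy": 2,
    "moderate_severe_liver": 3, "metastatic": 6, "aids": 6,
}

def age_adjusted_charlson(flags, age):
    score = sum(CHARLSON_WEIGHTS[c] for c in flags)
    # Original age adjustment: +1 per decade starting at 50-59, capped at +4.
    if age >= 50:
        score += min((age - 40) // 10, 4)
    return score
```

The resulting integer score (or the unweighted count) then enters the logistic regression alongside age, sex, payer, and income quartile.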

Statistical Analysis and Performance Evaluation

Primary performance evaluation used a prespecified stratified random subsample of 2021–2022 discharges (n = 3,226,831; sampling fraction 0.10). Discrimination was assessed with AUC-ROC and 95% CIs (DeLong method); pairwise comparisons used DeLong tests [19,20]. Calibration was evaluated with calibration plots and the Brier score. Binary classification thresholds were selected on the validation set by maximizing the Youden index and applied unchanged to the temporal test set. Threshold-dependent metrics included sensitivity, specificity, precision, F_1, and F_2 (which weights recall twice as heavily as precision).
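The Youden-index threshold selection and the recall-weighted F_2 metric can be sketched as follows (a simple grid search over candidate cutoffs; the study's exact search procedure may differ):

```python
import numpy as np

def youden_threshold(y_true, p, grid=None):
    """Pick the cutoff maximizing sensitivity + specificity - 1."""
    grid = np.linspace(0.01, 0.99, 99) if grid is None else grid
    best_t, best_j = 0.5, -1.0
    for t in grid:
        pred = p >= t
        tp = np.sum(pred & (y_true == 1)); fn = np.sum(~pred & (y_true == 1))
        tn = np.sum(~pred & (y_true == 0)); fp = np.sum(pred & (y_true == 0))
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t

def f_beta(y_true, pred, beta=2.0):
    """F_beta; beta=2 weights recall twice as heavily as precision."""
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return (1 + beta**2) * prec * rec / (beta**2 * prec + rec)

y = np.array([0, 0, 0, 1, 1, 1])
p = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
t = youden_threshold(y, p)
score = f_beta(y, p >= t)
```

Selecting the threshold on the validation set and freezing it, as done here, avoids optimistic bias in the temporal test metrics.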

Interpretability Analysis

We used Integrated Gradients [21] to generate code-level attributions, quantifying each ICD-10-CM code’s contribution to predicted risk. Attributions were averaged across occurrences for the 10 most positively and negatively influential codes per outcome (minimum 50 occurrences in the interpretability cohort). Additional details on training, testing, and interpretability methods are provided in supplementary eMethods 1–3.
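Integrated Gradients attributes a prediction by integrating the model gradient along the straight line from a baseline x' to the input x: IG_i = (x_i − x'_i) ∫₀¹ ∂f(x' + a(x − x'))/∂x_i da. A minimal numpy sketch on a toy differentiable function (not the study model) illustrates the method and its completeness property (attributions sum to f(x) − f(x')):

```python
import numpy as np

# Toy differentiable "model" f(x) = sum(w * x^2) with analytic gradient.
w = np.array([1.0, -2.0, 0.5])
f = lambda x: float(np.sum(w * x**2))
grad_f = lambda x: 2.0 * w * x

def integrated_gradients(x, baseline, steps=1000):
    # Midpoint Riemann approximation of the path integral of the gradient.
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
attr = integrated_gradients(x, baseline)
completeness_gap = abs(attr.sum() - (f(x) - f(baseline)))
```

In the study setting, f is the trained network, inputs are embedded diagnosis codes, and per-code attributions are averaged across the interpretability cohort.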

Ablation Studies

We conducted three prespecified ablation studies to evaluate the contribution of key architectural choices: (1) addition of transformer blocks for attention-based contextualization, (2) replacement of Deep Sets with a permutation-variant flattening comparator, and (3) removal of demographic and socioeconomic covariates. Full ablation details and results are in supplementary eMethods 4 and eTable 3.

Results

Cohort Characteristics

The study included 19,120,000 adult hospitalizations from 2016–2020 (development cohort) and 32,268,308 from 2021–2022 (temporal test cohort). The 30-day unplanned readmission rate was 11.0% in the development cohort and 10.7% in the temporal test cohort. The 30-day postdischarge mortality rate was 0.6% in both cohorts. Additional cohort characteristics are reported in Table 1.

Primary Performance Comparison

For 30-day readmission, the embedding model achieved an AUC of 0.7496 (95% CI, 0.7488–0.7504), outperforming the Charlson model (AUC, 0.6553; 95% CI, 0.6544–0.6562; P < .001) and the Elixhauser model (AUC, 0.6363; 95% CI, 0.6353–0.6372; P < .001). For 30-day postdischarge in-hospital mortality, the embedding model achieved an AUC of 0.8557 (95% CI, 0.8532–0.8581) vs the best-performing comparator (age-adjusted Charlson: 0.7844; 95% CI, 0.7813–0.7874; P < .001). Performance metrics are summarized in Table 2.

The embedding model also showed superior recall-weighted performance. For 30-day readmission, F_2 was 0.4848 vs 0.4066 for the best comparator; for postdischarge mortality, F_2 was 0.0530 vs 0.0480. Calibration plots demonstrated good agreement between predicted and observed risks across the probability range. Calibration reliability plots are provided in supplementary eFigure 2.

Figure 1 — [placeholder] Receiver operating characteristic curves for 30-day readmission (left) and 30-day postdischarge in-hospital mortality (right), comparing the embedding model against Charlson and Elixhauser comparators on the temporal test set.

Figure 2 — [placeholder] Calibration reliability plots for the embedding model and comparators on the temporal test set.

Interpretability Analysis

Integrated Gradients attributions identified clinically relevant patterns. For readmission, codes indicating prior readmissions, chronic conditions requiring ongoing management, and social determinants (e.g., housing issues) showed high positive attributions. For postdischarge mortality, codes reflecting severe illness (e.g., sepsis, respiratory failure) and palliative care had high positive attributions, while codes for routine procedures (e.g., colonoscopy) had negative attributions.

Figure 3 — [placeholder] Top 10 positively and negatively attributed ICD-10-CM codes by mean Integrated Gradients attribution for 30-day readmission.

Figure 4 — [placeholder] Top 10 positively and negatively attributed ICD-10-CM codes by mean Integrated Gradients attribution for 30-day postdischarge in-hospital mortality.

Ablation Studies

Addition of transformer blocks did not improve discrimination or F_2 score for either outcome (P < .001 for the AUROC difference for readmission; P = .57 for postdischarge mortality; full results in supplementary eTable 3A). Replacing Deep Sets with a permutation-variant flattening comparator reduced AUC (P < .001 for readmission; P = .014 for mortality) and F_2 score (supplementary eTable 3B). Removing demographic and socioeconomic covariates slightly reduced performance (P = .020 for readmission; P < .001 for mortality; supplementary eTable 3C).

Figure 5 — [placeholder] Ablation study results comparing the base embedding model against transformer-augmented, permutation-variant, and ICD-only variants for both outcomes.

Discussion

In a large national claims database, a permutation-invariant embedding model that learned ICD-10-CM representations achieved higher discrimination than Charlson and Elixhauser comorbidity-index models for predicting 30-day unplanned readmission and 30-day postdischarge in-hospital mortality. The embedding model also showed better calibration and recall-weighted performance, supporting its potential utility for discharge-time risk stratification and claims-based risk adjustment.

Comparison With Prior Work

Prior studies have demonstrated the value of machine learning for readmission and mortality prediction [4,5,6,7,8,9,10]. However, many approaches rely on permutation-variant aggregation (e.g., recurrent networks or attention over ordered sequences), which can be sensitive to code ordering—a dimension that varies across sites and coders. Our permutation-invariant approach addresses this limitation and may improve transportability. In addition, many claims-based studies focus on in-hospital mortality and do not evaluate postdischarge mortality as a discharge-time outcome. Our focus on postdischarge mortality provides a more clinically relevant endpoint for discharge planning and transitional care.

Implications for Practice and Policy

Readmission reduction is a key quality and policy goal, with financial penalties under the Hospital Readmissions Reduction Program [1,22,23,24]. However, current risk adjustment often relies on comorbidity indices that may under-adjust for complexity, leading to unfair comparisons across hospitals serving different populations [25,26,27]. Our findings suggest that learning representations directly from ICD-10-CM codes could improve fairness and precision in risk adjustment.

At the patient level, improved risk stratification at discharge could help prioritize transitional-care resources (e.g., care coordination, medication reconciliation, follow-up calls) to high-risk patients [28,29]. The interpretability analysis also highlights specific diagnoses that drive risk, which may inform discharge planning conversations.

Limitations

Our study has several limitations. First, the NRD captures only readmissions to acute-care hospitals within the same state and does not include deaths outside the hospital; both outcomes are therefore underestimated. Second, the model was trained and evaluated in a single national database; external validation in other claims databases and prospective evaluation are needed to assess generalizability and clinical utility. Third, while Integrated Gradients provides code-level attributions, the embedding model remains less interpretable than simple comorbidity indices [30]. Fourth, we did not incorporate additional predictors (e.g., laboratory values, vital signs, discharge disposition) that may be available in some settings and could further improve performance. Finally, residual confounding by unmeasured factors (e.g., social determinants, functional status) may affect model predictions and limit clinical deployment without further validation.

Conclusions

In a large national claims database, a permutation-invariant model that learned ICD-10-CM representations improved prediction of 30-day readmission and postdischarge in-hospital mortality compared with Charlson and Elixhauser index models. These findings support the use of high-dimensional diagnosis information for claims-based risk adjustment and discharge-time risk stratification, with prospective evaluation needed before clinical deployment.

Tables

Table 1: Cohort Characteristics

Characteristic Development (2016–2020) Temporal Test (2021–2022)
Total hospitalizations 19,120,000 32,268,308
30-day readmission rate (%) 11.0 10.7
30-day postdischarge mortality rate (%) 0.6 0.6

Table 2: Model Performance Comparison (Temporal Test Set)

Outcome Model AUC-ROC 95% CI Precision Recall F_1 F_2
30-day readmission Embedding 0.7496 0.7488–0.7504 0.1881 0.8006 0.3046 0.4848
CCI 0.6553 0.6544–0.6562 0.1493 0.6819 0.2453 0.4066
CCI (age-adj) 0.6483 0.6474–0.6491 0.1469 0.6794 0.2416 0.4016
ECI 0.6363 0.6353–0.6372 0.1426 0.6750 0.2357 0.3925
30-day postdischarge mortality Embedding 0.8557 0.8532–0.8581 0.0111 0.8756 0.0220 0.0530
CCI 0.7217 0.7180–0.7253 0.0075 0.8026 0.0149 0.0371
CCI (age-adj) 0.7844 0.7813–0.7874 0.0093 0.7846 0.0185 0.0480
ECI 0.6686 0.6645–0.6728 0.0068 0.7901 0.0135 0.0336

Full title: Development of an ICD-10-CM Embedding Model for Predicting 30-Day Readmission and Postdischarge In-Hospital Mortality in the Nationwide Readmissions Database

Keywords: ICD-10-CM; readmission; mortality; claims data; risk adjustment; deep learning; model interpretability

Structured Abstract

Importance. Comorbidity indices are widely used for claims-based risk adjustment but compress diagnostic information and may under-adjust for clinical complexity.

Objective. To develop and temporally validate a permutation-invariant model that learns ICD-10-CM representations to predict 30-day unplanned readmission and 30-day postdischarge mortality and to compare performance with Charlson and Elixhauser comorbidity-index models.

Design, Setting, and Participants. Retrospective cohort study of adult hospitalizations in the Healthcare Cost and Utilization Project Nationwide Readmissions Database (NRD), 2016–2022. Models were developed using discharges from 2016–2020 and temporally tested using 2021–2022 discharges. Primary performance evaluation was conducted in a prespecified stratified random subsample of 2021–2022 discharges (n = 3,226,831).

Exposure. Up to 40 discharge diagnosis codes (ICD-10-CM) were mapped to trainable embeddings and aggregated with a permutation-invariant Deep Sets architecture; age, sex, primary payer, and ZIP code income quartile were also included as covariates.

Main Outcomes and Measures. Outcomes were 30-day unplanned readmission and 30-day postdischarge in-hospital mortality. Discrimination was assessed with the area under the receiver operating characteristic curve (AUC) and 95% CIs; calibration and threshold-dependent metrics (including F_2) were evaluated. Performance was compared with optimized logistic regression models based on Charlson and Elixhauser indices.

Results. In temporal testing, the embedding model showed higher discrimination for 30-day readmission (AUC, 0.7496 [95% CI, 0.7488–0.7504]) than Charlson (0.6553 [95% CI, 0.6544–0.6562]) and Elixhauser (0.6363 [95% CI, 0.6353–0.6372]). For 30-day postdischarge in-hospital mortality, the embedding model achieved an AUC of 0.8557 (95% CI, 0.8532–0.8581) vs the best-performing comparator (age-adjusted Charlson: 0.7844 [95% CI, 0.7813–0.7874]); DeLong tests were significant for each comparison (P < .001). Recall-weighted performance similarly favored the embedding model (F_2: 0.4848 vs 0.4066 for readmission; 0.0530 vs 0.0480 for postdischarge mortality).

Conclusions and Relevance. In a large national claims database, a permutation-invariant model that learned ICD-10-CM representations improved prediction of 30-day readmission and postdischarge in-hospital mortality compared with Charlson and Elixhauser index models. These findings support the use of high-dimensional diagnosis information for claims-based risk adjustment and discharge-time risk stratification, with prospective evaluation needed before clinical deployment.