Predictive Risk Modeling
Overview
HealthInfoNet delivers insights that help its participants predict the risk of future conditions, events, and types of utilization for their attributed patient populations. Users can therefore be alerted to high and rising risks among the patients they care for, which helps them drive quality improvements, manage risk and population health, and inform operational decision-making.
The following sections of this document provide an overview of the underlying predictive risk model methodology, including guidance on how to interpret those results to support follow-on action.
Risk Models
The following predictive risk models are currently available within HealthInfoNet’s reporting platform:
| Population Risk | Transition Risk |
Timeframe | Future 12 Months | Within 30 Days of Encounter Discharge |
Condition |
| N/A |
Event |
| N/A |
Utilization |
|
|
Risk Model Methodology
Reviewing Risk Models
HealthInfoNet researches previous publications to understand methods and critical risk drivers in order to refine its definitions of the specific conditions, events, and utilization being predicted. Once each model’s definition has been set, HealthInfoNet leverages its Health Information Exchange (HIE) data source in the modeling process. This process typically starts with tens of thousands of variables and ends with only a few hundred important factors. In risk modeling, these variables are referred to as risk features. They can generally be thought of as the data elements that most influence the predicted risk of the condition, event, or utilization being measured.
Building Risk Models
HealthInfoNet performs custom machine learning to inform its predictive risk models. Instead of deploying a standard model developed on a pristine, engineered data set, HealthInfoNet runs its machine learning process directly on its HIE data source. The process consists of taking two years’ worth of data and constructing two cohorts: (1) a case cohort and (2) a control cohort. The first year's worth of data are used to develop, prospectively test, and calibrate model performance, while the next year's worth of data are used to validate the performance of the model before allowing HealthInfoNet’s customers to leverage it in their work.
Several risk algorithms are applied during this learning process, and whichever one yields the best result for each risk model is deployed. The following algorithms are typically applied and compared: Linear Regression, Random Forest, Deep Learning, XGBoost, and Logistic Regression. For HealthInfoNet’s purposes, the XGBoost algorithm has been selected for each currently available predictive risk model.
Evaluating Risk Models
The two primary ways that HealthInfoNet evaluates its predictive risk models are by looking at (1) the Receiver Operating Characteristic (ROC) Curve and (2) each predictive risk model’s Performance Table.
ROC Curve – This method plots the true positive rate against the false positive rate. The more closely the curve follows the left border and then the upper border, the more accurate the model; the closer the curve comes to the 45-degree diagonal, the less accurate the model. The value given to the ROC Curve, the Area Under the Curve (AUC), typically ranges from 0.5 to 1.0, where 0.5 is essentially a coin flip and 1.0 is a perfect predictor. The AUC is also called the C-Statistic.
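As a rough illustration of what the AUC measures, the sketch below computes it in plain Python using the pairwise (rank-based) formulation: the AUC equals the share of positive/negative pairs in which the positive case received the higher score. The function name `auc_score` and the toy labels and scores are hypothetical and are not taken from HealthInfoNet's evaluation process.

```python
def auc_score(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 outcomes (1 = the outcome occurred)
    scores: predicted risk scores, same length and order as labels
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")

    # Count the positive/negative pairs the model ranks correctly.
    # Tied scores count as half a correct pair.
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy data, made up purely for illustration.
toy_labels = [0, 0, 1, 0, 1, 1]
toy_scores = [3, 10, 8, 1, 40, 12]
print(auc_score(toy_labels, toy_scores))
```

A model that scores every pair correctly returns 1.0, while a model that cannot separate the two groups (all scores tied) returns 0.5, which matches the coin-flip interpretation described above.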
Performance Table – Each risk model is paired with a performance table that demonstrates the sensitivity, positive predictive value (PPV), and relative risk values by risk class. HealthInfoNet closely evaluates each predictive risk model’s performance table to confirm the most accurate results based on the incidence of actual conditions/events/utilization occurring between the evaluated timeframes.
As a rule, HealthInfoNet re-runs the machine learning process for its predictive risk models semi-annually, or more frequently if new data is incorporated or if performance deteriorates. Model performance is monitored continuously to ensure accuracy and satisfaction.
Interpreting Risk Models
A risk score is a numeric value between 0 and 100 that represents the probability of an individual developing a particular condition, experiencing a specific event, or encountering a certain type of utilization within a defined period of time. Individuals are assigned a score for each predictive risk model available, assuming individuals have sufficient data to calculate each model. Individuals without sufficient data to calculate a model will not have a risk score available.
Numeric risk scores can be further translated based on population relative risk in order to segment individuals’ results into risk classes of “low,” “moderate,” “high,” and “very high.” Based on each risk model’s PPV results, the following multiples of the total population PPV% can be used to map risk scores to risk classes:
Low Risk = ≤ total PPV%
Moderate Risk = 1-3x total PPV%
High Risk = 2-5x total PPV%
Very High Risk = >5x total PPV%
Condition-based risk models are represented on a 0-100 risk score scale, whereby a score of 0 implies no traceable risk, a score of 100 implies the condition’s presence, and a score somewhere in between implies varied risk of developing the condition. Event- and utilization-based risk models, on the other hand, are represented on a 0-99 risk score scale: because the event or utilization in question could occur multiple times for an individual, there is no equivalent of a “present” result (i.e., a score of 100). Whereas individuals who develop a measured condition (e.g., Type 2 Diabetes, Hypertension) are excluded from the observed at-risk population, individuals who experience an event or utilization (e.g., Acute Myocardial Infarction, Cerebrovascular Accident) are not excluded but are instead calculated with a higher risk of having the event again.
Furthermore, population risk models can be interpreted as the risk of developing a particular condition, experiencing a specific event, or encountering a certain type of utilization within the next 12 months, based on the previous 12 months’ worth of data. Transition risk models, by contrast, can be interpreted as the risk of either being readmitted to an inpatient setting or returning to an emergency department setting within 30 days of the first encounter’s discharge, based on the previous 30 days’ worth of utilization.
To illustrate the interpretation of risk models in greater detail, an example is given using the Type 2 Diabetes risk model’s performance table.
Type 2 Diabetes Performance Table
X Timeframe: 1/1/2020 – 12/31/2020
Metric | Low (0~2) | Moderate (2~6) | High (6~10) | Very High (10~61) | Total |
Number of Patients | 865,425 | 92,282 | 21,544 | 23,805 | 1,003,056 |
Percentage of Patients | 86.28 | 9.2 | 2.15 | 2.37 | 100 |
True Positives | 5,729 | 3,299 | 1,913 | 5,430 | 16,371 |
PPV (%) | 0.66 | 3.57 | 8.88 | 22.81 | 1.63 |
Sensitivity (%) | 34.99 | 20.15 | 11.69 | 33.17 | 100 |
Mean Relative Risk | 0.41 | 2.19 | 5.44 | 13.98 | 1 |
Metric | Definition | Example |
Number of Patients | The total number of individuals assigned to each risk class by the risk model algorithm based on criteria evaluated during the X timeframe. | There are 1,003,056 total individuals evaluated for their risk of having Type 2 Diabetes in the next 12 months, 21,544 of whom are predicted at High risk. |
Percentage of Patients | The percent of the total population assigned to each risk class. | 2.15% of the total population is predicted at High risk for having Type 2 Diabetes in the next 12 months. |
True Positives | The total number of at-risk individuals identified during the X timeframe who developed the condition during the Y timeframe. | Of the 1,003,056 individuals evaluated as being at-risk for Type 2 Diabetes during the X timeframe, 16,371 individuals actually developed Type 2 Diabetes during the Y timeframe. |
Positive Predictive Value (PPV%) | The incidence rate of true positive cases within each risk class (or, in the Total column, within the total population); individuals’ chances of having the condition/event/utilization in the predictive period. | There are 1,913 individuals in the High risk class (out of a total of 21,544 individuals) who actually developed Type 2 Diabetes during the Y timeframe. Dividing the true positive cases by the number of individuals in the risk class ((1,913/21,544)*100 = 8.88) indicates that individuals in the High risk class have an 8.88% chance of developing Type 2 Diabetes in the next 12 months; put differently, 8.88% of the individuals in the High risk class are expected to develop Type 2 Diabetes in the next 12 months. |
Sensitivity | The percentage of true positive cases within a particular risk class out of the total number of true positive cases within the total population. | There are 1,913 individuals in the High risk class who developed Type 2 Diabetes during the Y timeframe, out of a total of 16,371 actual cases across all risk classes. Dividing the High risk class actual cases by the total actual cases ((1,913/16,371)*100 = 11.69%) gives the sensitivity of the model for that risk class; that is, the share of individuals who will have the condition/event/utilization that the risk class correctly captures. |
Mean Relative Risk | The ratio of the probability of a condition or event occurring within each risk class to the probability of it occurring in the entire population, calculated by dividing the PPV% for each risk class by the PPV% for the entire population. | Dividing the High risk class PPV% by the total population PPV% (8.88/1.63 ≈ 5.44 when unrounded PPV values are used) indicates that individuals in the High risk class are 5.44x as likely to develop Type 2 Diabetes as the average individual in the total population. |
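The definitions in the table can be checked against the Type 2 Diabetes counts with a short Python sketch. The patient and true positive counts are copied from the performance table; the function and key names (`performance_metrics`, `ppv_pct`, and so on) are hypothetical, and rounding to two decimals is an assumption made to match the published figures.

```python
# Counts copied from the Type 2 Diabetes performance table.
patients = {"Low": 865_425, "Moderate": 92_282, "High": 21_544, "Very High": 23_805}
true_positives = {"Low": 5_729, "Moderate": 3_299, "High": 1_913, "Very High": 5_430}


def performance_metrics(patients, true_positives):
    """Return PPV (%), sensitivity (%), and relative risk for each risk class."""
    total_patients = sum(patients.values())
    total_cases = sum(true_positives.values())
    total_ppv = total_cases / total_patients  # incidence in the whole population

    metrics = {}
    for risk_class, class_size in patients.items():
        cases = true_positives[risk_class]
        ppv = cases / class_size  # share of the class that developed the condition
        metrics[risk_class] = {
            "ppv_pct": round(100 * ppv, 2),
            "sensitivity_pct": round(100 * cases / total_cases, 2),
            # Relative risk uses the unrounded PPVs, then rounds the ratio.
            "relative_risk": round(ppv / total_ppv, 2),
        }
    return metrics


results = performance_metrics(patients, true_positives)
for risk_class, values in results.items():
    print(risk_class, values)
```

Because sensitivity is each class's share of all actual cases, the four sensitivity values sum to 100% (up to rounding), which is why the Total column of the table shows 100.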
To summarize, using the example above, we can infer the following about an individual with a Type 2 Diabetes risk score of ‘8’:
The individual is at High risk for developing the condition in the next 12 months
The individual has an ~8.88% chance of developing the condition in the next 12 months
The individual is 5.44 times as likely to develop the condition in the next 12 months as the average individual in the total population
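The same lookup can be sketched in Python. The score ranges, PPV%, and relative risk values come from the example performance table; the function name `interpret_score` is hypothetical, and treating each range as lower-bound inclusive and upper-bound exclusive (so a score of exactly 6 falls in High) is an assumption, since the table does not say how boundary scores are assigned.

```python
# (risk class, lower bound, upper bound, PPV %, mean relative risk)
# taken from the example Type 2 Diabetes performance table.
RISK_CLASSES = [
    ("Low", 0, 2, 0.66, 0.41),
    ("Moderate", 2, 6, 3.57, 2.19),
    ("High", 6, 10, 8.88, 5.44),
    ("Very High", 10, 61, 22.81, 13.98),
]


def interpret_score(score):
    """Map a risk score to its class, PPV %, and relative risk.

    Assumption: ranges are [lower, upper), except the last class,
    which also includes its upper bound.
    """
    last_index = len(RISK_CLASSES) - 1
    for index, (name, lower, upper, ppv_pct, relative_risk) in enumerate(RISK_CLASSES):
        in_range = lower <= score < upper
        at_top_of_last = index == last_index and score == upper
        if in_range or at_top_of_last:
            return {
                "risk_class": name,
                "ppv_pct": ppv_pct,
                "relative_risk": relative_risk,
            }
    raise ValueError(f"score {score} is outside the ranges in the example table")


print(interpret_score(8))
```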
Influencing Risk Models
In risk modeling, variables impacting the risk of a condition, event, or utilization are referred to as risk features. Although the individual in our last example has a Type 2 Diabetes risk score of ‘8’, the factors influencing their risk may differ from the factors influencing the risk of another individual who also has a Type 2 Diabetes risk score of ‘8’. HealthInfoNet groups its risk features into 12 categories for ease of understanding.
Category | Description | Example |
Acute Disease | An acute diagnosis code applied in the last 12 months | Patient diagnosed with acute disease [K20 Esophagitis] in the last 12 months |
Chronic Disease Burden | A chronic diagnosis code applied in the last 24 months | Patient diagnosed with chronic disease [E11 Type 2 diabetes mellitus] in the last 24 months |
Community Social Determinants | A characteristic of the ZIP code where the individual resides | Patient's ZIP code has a Very High level % of residents with Medicaid insurance |
Demographics | Sex and age | Female age group (75-84) |
Disease Events | An inpatient, outpatient, or emergency department event diagnosis in the last 12 months | Patient had 8 outpatient visits [Nausea and vomiting] in the last 12 months |
Factors Influencing | A lifestyle diagnosis [Z-code] applied in the last 12 months | Patient diagnosed with [Z72 Problems related to lifestyle: Z72.0 Tobacco use] in the last 12 months |
Laboratory Test | An abnormal laboratory test result during the encounter and/or in the last 24 hours | High MEAN PLATELET VOLUME during episode |
Medication | A medication filled in the last 12 months | Patient had 2 inpatient medications [methyl xanthine] in the last 12 months |
Utilization | Inpatient, outpatient, or emergency department visits had in the last 12 months | Patient had 3+ (9) emergency department visit(s) in the last 12 months |
Vital Sign | An abnormal vital sign result during the encounter and/or in the last 24 hours | Pulse Oximetry 24h – Low |
Procedure | An inpatient, outpatient, or emergency department procedure in the last 12 months as evidenced from an ICD-10 procedure code | Patient had (B51 Imaging, Veins, Fluoroscopy) procedures in the last 12 months |
CPT / HCPCS | An inpatient, outpatient, or emergency department activity in the last 12 months per CPT or HCPCS code | Patient has 1 (99285 Emergency Department Visit High Severity & Threat) in the last 12 months |
For each risk model’s risk features, the machine learning process calculates odds ratios, a measure of association between a risk feature and the predicted outcome. If the odds ratio is greater than 1, the feature is associated with higher odds of the outcome; if lower than 1, the feature is associated with lower odds of the outcome; and, if equal to 1, the feature is not associated with the outcome. For the Type 2 Diabetes risk model, for example, a risk feature of ‘Patient diagnosed with acute disease R73 Elevated Blood Glucose’ in the last 12 months has an odds ratio over 1 (10.67), meaning an individual with an R73 diagnosis has roughly 11 times the odds of developing Type 2 Diabetes in the next 12 months compared to an individual in the population without the diagnosis.
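To make the odds ratio concrete, the sketch below computes one from a simple 2x2 table of counts in Python. This is the textbook unadjusted definition, shown only for illustration; the odds ratios in HealthInfoNet's models come out of its machine learning process and may be calculated differently. The function names and all counts are hypothetical.

```python
def odds_ratio(feature_cases, feature_noncases, no_feature_cases, no_feature_noncases):
    """Unadjusted odds ratio from a 2x2 table of counts.

    feature_*    : individuals WITH the risk feature
    no_feature_* : individuals WITHOUT the risk feature
    *_cases      : developed the outcome; *_noncases: did not
    """
    if min(feature_noncases, no_feature_cases, no_feature_noncases) <= 0:
        raise ValueError("non-case counts and the comparison case count must be positive")
    odds_with_feature = feature_cases / feature_noncases
    odds_without_feature = no_feature_cases / no_feature_noncases
    return odds_with_feature / odds_without_feature


def risk_ratio(feature_cases, feature_noncases, no_feature_cases, no_feature_noncases):
    """Ratio of outcome probabilities, shown for comparison with the odds ratio."""
    risk_with_feature = feature_cases / (feature_cases + feature_noncases)
    risk_without_feature = no_feature_cases / (no_feature_cases + no_feature_noncases)
    return risk_with_feature / risk_without_feature


# Hypothetical counts, not HealthInfoNet data.
counts = (50, 150, 100, 900)
print("odds ratio:", odds_ratio(*counts))
print("risk ratio:", risk_ratio(*counts))
```

With these made-up counts the odds ratio is larger than the risk ratio, which is why an odds ratio is best read as a statement about odds rather than about probability. The two measures are close only when the outcome is rare.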
Technical Appendix
For more information about HealthInfoNet’s predictive risk model methodology, see the Predictive Risk Methodology Technical Appendix.
Note: To view, please download the file to your desktop instead of opening in your browser.
Included in the technical appendix for each risk model are the following tabs of information:
Definition – Includes the risk model’s description, criteria, and related code value set
Performance – Includes the risk model’s C-Statistic and Performance Table
Risk Features – Includes the risk model’s risk features and odds ratios
HealthInfoNet - Predictive Risk Methodology Technical Appendix.xlsx