# List of Abbreviations

AUC Area Under the Receiver Operating Characteristic Curve
CCAE IBM MarketScan Commercial Claims and Encounters
CDM Common Data Model
CRAN Comprehensive R Archive Network
EHR Electronic Health Record
IRB Institutional Review Board
JMDC Japan Medical Data Center
MDCR IBM MarketScan Medicare Supplemental Database
MDCD IBM MarketScan Multi-State Medicaid Database
OHDSI Observational Health Data Sciences and Informatics
OMOP Observational Medical Outcomes Partnership

# Responsible Parties

## Investigators

| Investigator | Institution/Affiliation |
|---|---|
| Jenna Reps * | Observational Health Data Analytics, Janssen Research and Development, Titusville, NJ, USA |
| Jose Posada * | Universidad del Norte, Colombia |
| Ross Williams * | Erasmus University |
| Aniek Markus * | Erasmus University |

\* Principal Investigator

## Disclosures

This study is undertaken within Observational Health Data Sciences and Informatics (OHDSI), an open collaboration. JMR is an employee of Janssen Research and Development and a shareholder in Johnson & Johnson.

# Abstract

## Background and Significance

Patient-level prediction models in healthcare can help identify a patient's personalized risk of some future medical event. These models can aid medical decision making. Machine learning models are frequently published, but few make any clinical impact. One issue limiting the clinical use of models is the difficulty of integrating a model into an EHR system. This is a complex task, especially when the model requires variables from different parts of the EHR system. Simple and parsimonious patient-level prediction (PLP) models may reduce the complexity of implementing a model clinically.

## Study Aims

We aim to answer the question: Can we find a recommended approach to create parsimonious PLP models for healthcare data?

## Study Description

  • Design: Retrospective cohort with external validation
  • Target Populations:
    1. First random visit in 2017
    2. First influenza vaccine visit in 2017
  • Outcomes:
    1. Acute Myocardial Infarction (AMI)
    2. Anaphylaxis
    3. Appendicitis
    4. DisIntraCoag
    5. Encephalomyelitis
    6. Hemorrhagic Stroke
    7. Non Hemorrhagic Stroke
    8. Pulmonary Embolism
    9. Guillain Barre Syndrome
  • Time-at-risk:
    1. 1 day to 365 days
  • Covariates
    1. Demographics (age in 5 year buckets + gender)
    2. Demographics + Conditions + Drugs, -365 days to -1 day relative to index
  • Models: LASSO Logistic Regression
  • Internal Metrics:
    • Area Under the Receiver Operating Characteristic Curve (AUC).
    • Sensitivity and PPV at a range of thresholds.
    • Calibration Plots
    • Net Benefit
  • External Metrics:
    • Area Under the Receiver Operating Characteristic Curve (AUC).
    • Sensitivity and PPV at a range of thresholds.
    • Calibration Plots (pre and post recalibration)
    • Net Benefit (pre and post recalibration)

# 1 Amendments and Updates

| Number | Date | Section of study protocol | Amendment or update | Reason |
|---|---|---|---|---|
| None | | | | |

# 2 Milestones

| Milestone | Planned / actual date |
|---|---|
| Start of analysis | |
| End of analysis | |
| Results presentation | |

# 3 Rationale and Background

[ADD]

# 4 Study Objectives

Can we find a recommended approach to create parsimonious PLP models for healthcare data?

Specific aims:

To compare different data-driven approaches for simplifying a PLP model.

# 5 Research Methods

## 5.1 Target Populations

Target 1: Patients with a visit in 2017 and at least 365 days of prior observation; the index date is the first valid visit.

Target 2: Patients with an influenza vaccine in 2017 and at least 365 days of prior observation; the index date is the date of the influenza vaccine.
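
As a minimal sketch of how these index rules could be applied to a single patient, the Python below picks the earliest 2017 event that has at least 365 days of prior observation. The function name, the input layout, and the reading of "valid" as "falls inside the observation period with enough look-back" are assumptions made for illustration; the study itself defines its cohorts against the OMOP CDM.

```python
from datetime import date, timedelta

MIN_PRIOR_DAYS = 365  # required look-back before index, per the target definitions


def first_valid_index(event_dates, obs_start, obs_end, year=2017):
    """Return the earliest event date in `year` with enough prior observation.

    event_dates: iterable of datetime.date (visits for Target 1,
                 influenza vaccinations for Target 2).
    obs_start, obs_end: the patient's observation period (hypothetical inputs).
    Returns None when no event qualifies.
    """
    earliest_allowed = obs_start + timedelta(days=MIN_PRIOR_DAYS)
    candidates = [
        d for d in event_dates
        if d.year == year and earliest_allowed <= d <= obs_end
    ]
    return min(candidates) if candidates else None


# Illustrative patient: observation starts 1 March 2016, so the first
# visit (10 January 2017) has too little prior observation.
visits = [date(2017, 1, 10), date(2017, 6, 2), date(2017, 11, 20)]
index_date = first_valid_index(
    visits, obs_start=date(2016, 3, 1), obs_end=date(2018, 12, 31)
)
print(index_date)  # 2017-06-02
```

The same function covers both targets because only the event list changes; a patient with no qualifying event simply does not enter the cohort.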

Table 5.1: Target Population.

| Cohort Name | Cohort Description |
|---|---|
| Set date 1st Jan 2017 | This cohort contains all patients who were in the database on 1st January 2017. Their index date is 1st January 2017. |
| Random visit 2017-2019 | This cohort contains all patients who had one or more visits during 2017. Their index date is a randomly selected visit during 2017. |
| Flu vaccine 2017-2019 | This cohort contains all patients who had a flu vaccine recorded during 2017. Their index date is the earliest flu vaccine date in 2017. |

## 5.2 Outcomes and Time-at-risk

The study will focus on nine outcomes with multiple phenotypes per outcome, as shown in Table 5.2.

Table 5.2: Outcomes of interest.

| Outcome | Cohort Name | Cohort Description | Time At Risk |
|---|---|---|---|
| Acute Myocardial Infarction | AcuteMyocardialInfarction | | index + 0 days to index + 365 days |
| Acute Myocardial Infarction | AMI_IP | | index + 0 days to index + 365 days |
| Acute Myocardial Infarction | AMI_IP_FDA | | index + 0 days to index + 365 days |
| Anaphylaxis | Anaphylaxis | | index + 0 days to index + 365 days |
| Anaphylaxis | Anaphylaxis IPED | | index + 0 days to index + 365 days |
| Anaphylaxis | Anaphylaxis FDA | | index + 0 days to index + 365 days |
| Appendicitis | Appendicitis | | index + 0 days to index + 365 days |
| Appendicitis | Appendicitis IP | | index + 0 days to index + 365 days |
| Appendicitis | Appendicitis FDA | | index + 0 days to index + 365 days |
| DisIntraCoag | DisIntraCoag | | index + 0 days to index + 365 days |
| DisIntraCoag | DisIntraCoag_IP | | index + 0 days to index + 365 days |
| DisIntraCoag | DisIntraCoag_FDA | | index + 0 days to index + 365 days |
| Encephalomyelitis | Encephalomyelitis | | index + 0 days to index + 365 days |
| Encephalomyelitis | Encephalomyelitis IP | | index + 0 days to index + 365 days |
| Encephalomyelitis | Encephalomyelitis IP FDA | | index + 0 days to index + 365 days |
| Hemorrhagic Stroke | HemorrhagicStroke | | index + 0 days to index + 365 days |
| Hemorrhagic Stroke | HemorrhagicStroke IP | | index + 0 days to index + 365 days |
| Hemorrhagic Stroke | HemorrhagicStroke IP FDA | | index + 0 days to index + 365 days |
| Non Hemorrhagic Stroke | NonHemorrhagicStroke | | index + 0 days to index + 365 days |
| Non Hemorrhagic Stroke | NonHemorrhagicStroke IP | | index + 0 days to index + 365 days |
| Non Hemorrhagic Stroke | NonHemorrhagicStroke IP FDA | | index + 0 days to index + 365 days |
| Non Hemorrhagic Stroke | NonHemorrhagicStroke broad | | index + 0 days to index + 365 days |
| Pulmonary Embolism | PulmonaryEmbolism | | index + 0 days to index + 365 days |
| Pulmonary Embolism | PulmonaryEmbolism IP | | index + 0 days to index + 365 days |
| Pulmonary Embolism | PulmonaryEmbolism FDA | | index + 0 days to index + 365 days |
| Guillain Barre Syndrome | GuillainBarreSyndrome | | index + 0 days to index + 365 days |
| Guillain Barre Syndrome | GuillainBarreSyndrome IP | | index + 0 days to index + 365 days |
| Guillain Barre Syndrome | GBS_IP_Primary | | index + 0 days to index + 365 days |
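
To make the time-at-risk column concrete, the sketch below labels a patient as having the outcome when any outcome date falls between index + 0 days and index + 365 days. The function and argument names are hypothetical, and treating both ends of the window as inclusive is an assumption.

```python
from datetime import date, timedelta


def has_outcome_in_tar(index_date, outcome_dates, start_offset=0, end_offset=365):
    """True if any outcome date lies in [index + start_offset, index + end_offset].

    Both window ends are treated as inclusive (an assumption of this sketch).
    """
    window_start = index_date + timedelta(days=start_offset)
    window_end = index_date + timedelta(days=end_offset)
    return any(window_start <= d <= window_end for d in outcome_dates)


index = date(2017, 6, 2)
print(has_outcome_in_tar(index, [date(2017, 6, 2)]))  # True: day 0 is inside the window
print(has_outcome_in_tar(index, [date(2018, 6, 3)]))  # False: day 366 is outside
```

Passing `start_offset=1` gives a window that begins the day after index, so an outcome recorded on the index date itself would no longer count.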

## 5.3 Feature Selection Pipeline

In this study we will compare the following feature selection pipelines:

  • Standard LASSO model
  • Standard LASSO model + automatic reduction by restricting to top N variables based on feature importance metrics
  • Standard LASSO model + automatic reduction with backwards selection afterwards
  • Standard LASSO model + automatic reduction with forward selection using LASSO selected variables afterwards
  • Constrained LASSO, which lets the user specify the number of variables to include
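
As one illustration of the "top N" reduction listed above, the sketch below ranks the variables that a fitted LASSO model kept (non-zero coefficients) by absolute coefficient size and retains the N largest. Using the absolute coefficient as the importance metric is an assumption, since the protocol does not fix the metric, and the covariate names and coefficient values are made up.

```python
def top_n_features(coefficients, n):
    """Keep the n covariates with the largest absolute non-zero coefficients.

    coefficients: dict mapping covariate name -> fitted LASSO coefficient.
    Returns a list of covariate names, most important first.
    """
    # Variables shrunk to exactly zero were already dropped by the L1 penalty.
    selected = [(name, beta) for name, beta in coefficients.items() if beta != 0.0]
    selected.sort(key=lambda item: abs(item[1]), reverse=True)
    return [name for name, _ in selected[:n]]


# Hypothetical coefficients from a fitted model (illustrative values only).
lasso_coefs = {
    "age_65_69": 0.82,
    "gender_male": 0.31,
    "condition_hypertension": 0.54,
    "drug_statin": -0.12,
    "condition_acne": 0.0,
}
print(top_n_features(lasso_coefs, 3))
# ['age_65_69', 'condition_hypertension', 'gender_male']
```

In practice the model would typically be refit on the retained variables, because coefficients estimated alongside the dropped variables no longer apply to the smaller model.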

## 5.4 Covariates to focus on

  • All candidate covariates
  • Covariates selected using PHOEBE as well represented across databases

## 5.5 Candidate Covariates

Gender and age in 5-year buckets, conditions in prior 365 days, drugs in prior 365 days.
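
The two helper functions below sketch how these candidate covariates could be derived for one patient. They are illustrative only: the function names are hypothetical, buckets are assumed to start at age 0 (0-4, 5-9, ...), and the look-back is assumed to exclude the index date itself.

```python
from datetime import date


def age_bucket(age_years, width=5):
    """Return the inclusive (lower, upper) bounds of the bucket containing age_years."""
    lower = (age_years // width) * width
    return lower, lower + width - 1


def in_lookback(event_date, index_date, days=365):
    """True if the event falls 1 to `days` days before the index date."""
    delta = (index_date - event_date).days
    return 1 <= delta <= days


print(age_bucket(67))  # (65, 69)
print(in_lookback(date(2016, 12, 1), date(2017, 6, 2)))  # True: 183 days before index
print(in_lookback(date(2017, 6, 2), date(2017, 6, 2)))   # False: index date is excluded
```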

## 5.6 Metrics

We will evaluate performance by calculating the area under the receiver operating characteristic curve (AUC) for overall discrimination and by using calibration plots for calibration. We will also calculate the area under the precision-recall curve and produce net benefit plots. Threshold-dependent performance will be displayed using the probability threshold plot.

Performance will be plotted as a function of the number of features in the model to observe the relationship between simplicity and performance.
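
For readers unfamiliar with two of these metrics, the sketch below computes the AUC as the fraction of positive/negative pairs ranked correctly, and the net benefit at a probability threshold using the standard decision-curve formula TP/n - (FP/n) * pt / (1 - pt). It is a plain-Python illustration on made-up data, not the implementation used by the study software.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Tied scores count as half a correctly ranked pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def net_benefit(labels, scores, threshold):
    """Net benefit when everyone with score >= threshold is treated as high risk."""
    n = len(labels)
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)


# Made-up outcomes and predicted risks for six patients.
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]
print(auc(y_true, y_prob))               # 8 of the 9 positive/negative pairs are ranked correctly
print(net_benefit(y_true, y_prob, 0.3))  # 3 true positives and 1 false positive at this threshold
```

Evaluating such functions for each model size would give the points needed to plot performance against the number of features.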

## 5.7 Data sources

We will execute the study as an OHDSI network study. All data partners within OHDSI are encouraged to participate voluntarily and can do so conveniently, because of the community’s shared Observational Medical Outcomes Partnership (OMOP) common data model (CDM) and OHDSI tool-stack.

Many OHDSI community data partners have already committed to participate, and we will recruit further data partners through OHDSI's standard recruitment process, which includes protocol publication on OHDSI's GitHub, an announcement in OHDSI's research forum, presentation at the weekly OHDSI all-hands-on meeting and direct requests to data holders.

Table 5.3 lists the 5 already committed data sources; these sources encompass a large variety of practice types and populations. For each data source, we report a brief description and size of the population it represents. All data sources will receive institutional review board approval or exemption for their participation before executing the study.

Table 5.3: Committed data sources and the populations they cover.

| Data source | Population | Patients | History | Data capture process and short description |
|---|---|---|---|---|
| Administrative claims | | | | |
| IBM MarketScan Commercial Claims and Encounters (CCAE) | Commercially insured, < 65 years | 142M | 2000 – | Adjudicated health insurance claims (e.g. inpatient, outpatient, and outpatient pharmacy) from large employers and health plans who provide private healthcare coverage to employees, their spouses and dependents. |
| IBM MarketScan Medicare Supplemental Database (MDCR) | Commercially insured, 65+ years | 10M | 2000 – | Adjudicated health insurance claims of retirees with primary or Medicare supplemental coverage through privately insured fee-for-service, point-of-service or capitated health plans. |
| IBM MarketScan Multi-State Medicaid Database (MDCD) | Medicaid enrollees, racially diverse | 26M | 2006 – | Adjudicated health insurance claims for Medicaid enrollees from multiple states and includes hospital discharge diagnoses, outpatient diagnoses and procedures, and outpatient pharmacy claims. |
| Optum Clinformatics Data Mart (Optum) | Commercially or Medicare insured | 85M | 2000 – | Inpatient and outpatient healthcare insurance claims. |
| Electronic health records (EHRs) | | | | |
| Optum Electronic Health Records (OptumEHR) | US, general | 93M | 2006 – | Clinical information, prescriptions, lab results, vital signs, body measurements, diagnoses and procedures derived from clinical notes using natural language processing. |

# 6 Strengths and Limitations

## 6.1 Strengths

  • We follow the PatientLevelPrediction framework for developing and evaluating models to ensure best practices are applied.
  • Our standardized framework enables us to externally validate our models across a large number of external databases.
  • The fully specified study protocol is being published before analysis begins.
  • All analytic methods have previously been verified on real data.
  • All software is freely available as open source.
  • Use of a common data model allows extension of the experiment to future databases and allows replication of these results on licensable databases that were used in this experiment, while still maintaining patient privacy on patient-level data.

## 6.2 Limitations

  • We are unable to test our models in a clinical setting.
  • In a clinical setting, predictors may be self-reported, and these may differ from our covariate definitions.
  • The electronic health record databases may be missing care episodes for patients due to care outside the respective health systems.

# 7 Protection of Human Subjects

This study does not involve human subjects research. The project does, however, use de-identified human data collected during routine healthcare provision. All data partners executing the study within their data sources will have received institutional review board (IRB) approval or waiver for participation in accordance with their institutional governance prior to execution (see Table 7.1). This study executes across a federated and distributed data network, where analysis code is sent to participating data partners and only aggregate summary statistics are returned, with no sharing of patient-level data between organizations.

Table 7.1: IRB approval or waiver statement from partners.

| Data source | Statement |
|---|---|
| IBM MarketScan Commercial Claims and Encounters (CCAE) | New England Institutional Review Board and was determined to be exempt from broad IRB approval, as this research project did not involve human subject research. |
| IBM MarketScan Medicare Supplemental Database (MDCR) | New England Institutional Review Board and was determined to be exempt from broad IRB approval, as this research project did not involve human subject research. |
| IBM MarketScan Multi-State Medicaid Database (MDCD) | New England Institutional Review Board and was determined to be exempt from broad IRB approval, as this research project did not involve human subject research. |
| Japan Medical Data Center (JMDC) | New England Institutional Review Board and was determined to be exempt from broad IRB approval, as this research project did not involve human subject research. |
| Optum Clinformatics Data Mart (Optum) | New England Institutional Review Board and was determined to be exempt from broad IRB approval, as this research project did not involve human subject research. |
| Optum Electronic Health Records (OptumEHR) | New England Institutional Review Board and was determined to be exempt from broad IRB approval, as this research project did not involve human subject research. |

# 8 Plans for Disseminating and Communicating Study Results

Open science aims to make scientific research, including its data, process and software, and its dissemination through publication and presentation, accessible to all levels of an inquiring society, amateur or professional [Woelfle2011-ss?], and is a governing principle of this study. Open science delivers reproducible, transparent and reliable evidence. All aspects of this study (except private patient data) will be open, and we will actively encourage other interested researchers, clinicians and patients to participate. This differs fundamentally from traditional studies, which rarely open their analytic tools or share all result artifacts, and which inform the community about hard-to-verify conclusions only at completion.

## 8.1 Transparent and re-usable research tools

We will publicly register this protocol and announce its availability for feedback from stakeholders, the OHDSI community and within clinical professional societies. This protocol will link to open source code for all steps to generate and evaluate prediction models, figures and tables. Such transparency is possible because we will construct our studies on top of the OHDSI toolstack of open source software tools that are community developed and rigorously tested [Schuemie2020-wx?]. We will publicly host the source code at https://github.com/ohdsi-studies/FeatureSelectionComparison, allowing public contribution and review, and free re-use for anyone’s future research.

## 8.2 Scientific meetings and publications

We will deliver multiple presentations at scientific venues and will also prepare multiple scientific publications for clinical, informatics and statistical journals.

## 8.3 General public

We believe in sharing our findings that will guide clinical care with the general public. We will use social media (Twitter) to facilitate this.