# List of Abbreviations

AUC Area Under the receiver operating characteristic Curve
CCAE IBM MarketScan Commercial Claims and Encounters
CDM Common Data Model
COVID-19 COronaVIrus Disease 2019
CRAN Comprehensive R Archive Network
EHR Electronic Health Record
H1N1 Hemagglutinin Type 1 and Neuraminidase Type 1 (influenza strain, aka swine flu)
IRB Institutional review board
JMDC Japan Medical Data Center
MDCR IBM MarketScan Medicare Supplemental Database
MDCD IBM MarketScan Multi-State Medicaid Database
OHDSI Observational Health Data Sciences and Informatics
OMOP Observational Medical Outcomes Partnership

1 Responsible Parties

1.1 Investigators

Investigator Institution/Affiliation
Jenna Reps * Observational Health Data Analytics, Janssen Research and Development, Titusville, NJ, USA
Patrick B. Ryan Observational Health Data Analytics, Janssen Research and Development, Titusville, NJ, USA
* Principal Investigator

1.2 Disclosures

This study is undertaken within Observational Health Data Sciences and Informatics (OHDSI), an open collaboration. JMR and PBR are employees of Janssen Research and Development and shareholders in Johnson & Johnson.

2 Abstract

Background and Significance

The ability to predict which patients are at risk of COVID-19 vaccine outcomes of interest could be used to identify which patients should be prioritized for monitoring after vaccination. Patient-level prediction models cannot be used to identify causality, so such models do not tell us whether the vaccine causes the outcome, but they can tell us, at the point a patient is about to receive a vaccine, what their risk of experiencing the outcome of interest is.

Various outcomes of interest have been identified for COVID-19 vaccination, and there are multiple potential phenotypes that can be used to identify patients with each outcome in real-world observational data. It is currently unknown which phenotype is best for each outcome and whether the choice of phenotype impacts model development. To ensure we develop the most suitable models, we need to analyze the impact that the phenotype has on a prediction model. The number of patients with a COVID-19 vaccination in real-world data will be small when the vaccines first start to be rolled out, and the vaccine may initially be given to certain populations (e.g., patients at higher risk of COVID-19 complications). Therefore, large proxy target populations may be required to train complex machine learning models. An investigation into the transportability of models developed using different target populations will help determine which proxy is most suitable and any limitations of using a proxy.

Study Aims

To study the impact of the outcome phenotype on model performance across databases, and to assess the suitability of using large proxy target populations. Answering these two questions will guide the development of quality COVID-19 vaccination patient-level prediction models.

Study Description

  • Design: Retrospective cohort with external validation
    • Target Populations: Random visit 2017, influenza vaccine visit 2017, and set date 1 January 2017
    • Outcomes: selected adverse events of special interest with various phenotypes
    • Time-at-risk:
      1. 0 days to 365 days
    • Covariates
      1. Demographics (age in 5-year buckets + gender)
      2. Demographics + conditions + drugs recorded from 365 days to 1 day before index
    • Models: LASSO Logistic Regression
    • Internal Metrics:
      • Area Under the receiver operating characteristic Curve (AUC).
      • Sensitivity and PPV at a range of thresholds.
      • Calibration Plots
      • Net Benefit
    • External Metrics:
      • Area Under the receiver operating characteristic Curve (AUC).
      • Sensitivity and PPV at a range of thresholds.
      • Calibration Plots (pre and post recalibration)
      • Net Benefit (pre and post recalibration)
    For each outcome, models will be developed for every possible combination of <target population, phenotype, database> and will then be externally validated across all other combinations.

3 Amendments and Updates

Number Date Section of study protocol Amendment or update Reason
None

4 Milestones

Milestone Planned / actual date
Start of analysis
End of analysis
Results presentation

5 Rationale and Background

The ability to predict which patients are at risk of COVID-19 vaccine outcomes of interest could be used to identify which patients to monitor after vaccination. Patient-level prediction models cannot be used to identify causality, so such models do not tell us whether the vaccine causes the outcome, but they can tell us, at the point a patient is about to receive a vaccine, what their risk of each outcome of interest is.

A common criticism of using large observational healthcare datasets for the development of patient-level prediction models is the quality of the phenotypes used to identify the occurrence of the outcome in the data. It is likely that outcome phenotypes miss some patients who have an outcome during the prediction time-at-risk or incorrectly identify some patients who do not have the outcome. This prompts the question: how stable are prediction models across outcome definitions? Answering this question may help us develop improved COVID-19 vaccine models.

Another concern with patient-level prediction models is how models developed in certain populations and at certain time-points transport to new populations and time-points. At the early stages of the COVID-19 vaccine rollout, we are unlikely to have sufficient data to develop quality models, but we may be able to identify suitable large proxy target populations and time-points.

6 Study Objectives

The overarching aim is to identify the best practices for selecting the target population and outcome phenotype when developing patient-level prediction models that can be applied at the point patients are administered a COVID-19 vaccine, using observational, real-world data.

Specific aims:

  • To determine how stable prediction models are across outcome phenotypes and databases
  • To determine how stable prediction models are when the target population changes
  • To use this research to develop quality COVID-19 vaccine patient-level prediction models and to identify potential limitations of using these models in clinical practice

7 Research Methods

7.1 Target Populations

The study will focus on three possible target populations (large potential proxies for the COVID-19 vaccinated population), as shown in Table 7.1.

Table 7.1: Target Population.
Cohort Name Cohort Description
Set date 1st Jan 2017 This cohort contains all patients who were in the database on 1 January 2017. Their index date is 1 January 2017.
Random visit 2017-2019 This cohort contains all patients who had one or more visits during 2017. Their index date is a randomly selected visit during 2017.
Flu vaccine 2017-2019 This cohort contains all patients who had a flu vaccine recorded during 2017. Their index date is the earliest flu vaccine date in 2017.
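
The cohorts themselves will be defined with OHDSI's standard tooling. Purely as an illustration of the random-visit index selection described above, the following R sketch picks one random 2017 visit per patient; `visits` is a hypothetical data frame extracted from an OMOP CDM visit_occurrence table, not part of the study code.

```r
# Minimal sketch (not the actual cohort definition): choose one random 2017 visit
# per person as the index date for the "Random visit" target population.
# `visits` is a hypothetical extract with columns person_id and visit_start_date.
library(dplyr)

set.seed(123)  # make the random index date reproducible

random_visit_cohort <- visits %>%
  filter(visit_start_date >= as.Date("2017-01-01"),
         visit_start_date <= as.Date("2017-12-31")) %>%
  group_by(person_id) %>%
  slice_sample(n = 1) %>%           # one randomly selected visit per patient
  ungroup() %>%
  transmute(subject_id = person_id,
            cohort_start_date = visit_start_date)
```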

7.2 Outcomes and Time-at-risk

The study will focus on nine outcomes with multiple phenotypes per outcome, as shown in Table 7.2.

Table 7.2: Outcome of interest.
Outcome Cohort Name Time At Risk
Acute Myocardial Infarction AcuteMyocardialInfarction index + 0 days to index + 365 days
Acute Myocardial Infarction AMI_IP index + 0 days to index + 365 days
Acute Myocardial Infarction AMI_IP_FDA index + 0 days to index + 365 days
Anaphylaxis Anaphylaxis index + 0 days to index + 365 days
Anaphylaxis Anaphylaxis IPED index + 0 days to index + 365 days
Anaphylaxis Anaphylaxis FDA index + 0 days to index + 365 days
Appendicitis Appendicitis index + 0 days to index + 365 days
Appendicitis Appendicitis IP index + 0 days to index + 365 days
Appendicitis Appendicitis FDA index + 0 days to index + 365 days
Disseminated Intravascular Coagulation DisIntraCoag index + 0 days to index + 365 days
Disseminated Intravascular Coagulation DisIntraCoag_IP index + 0 days to index + 365 days
Disseminated Intravascular Coagulation DisIntraCoag_FDA index + 0 days to index + 365 days
Encephalomyelitis Encephalomyelitis index + 0 days to index + 365 days
Encephalomyelitis Encephalomyelitis IP index + 0 days to index + 365 days
Encephalomyelitis Encephalomyelitis IP FDA index + 0 days to index + 365 days
Hemorrhagic Stroke HemorrhagicStroke index + 0 days to index + 365 days
Hemorrhagic Stroke HemorrhagicStroke IP index + 0 days to index + 365 days
Hemorrhagic Stroke HemorrhagicStroke IP FDA index + 0 days to index + 365 days
Non Hemorrhagic Stroke NonHemorrhagicStroke index + 0 days to index + 365 days
Non Hemorrhagic Stroke NonHemorrhagicStroke IP index + 0 days to index + 365 days
Non Hemorrhagic Stroke NonHemorrhagicStroke IP FDA index + 0 days to index + 365 days
Non Hemorrhagic Stroke NonHemorrhagicStroke broad index + 0 days to index + 365 days
Pulmonary Embolism PulmonaryEmbolism index + 0 days to index + 365 days
Pulmonary Embolism PulmonaryEmbolism IP index + 0 days to index + 365 days
Pulmonary Embolism PulmonaryEmbolism FDA index + 0 days to index + 365 days
Guillain Barre Syndrome GuillainBarreSyndrome index + 0 days to index + 365 days
Guillain Barre Syndrome GuillainBarreSyndrome IP index + 0 days to index + 365 days
Guillain Barre Syndrome GBS_IP_Primary index + 0 days to index + 365 days
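
The time-at-risk of index + 0 days to index + 365 days could be specified in the PatientLevelPrediction package roughly as below. This is a minimal sketch assuming a recent (v4+) style API; argument names may differ across package versions, and the non-time-at-risk settings shown are illustrative assumptions rather than the study's final choices.

```r
# Sketch of the study population / time-at-risk settings (assumption: v4+ API)
library(PatientLevelPrediction)

populationSettings <- createStudyPopulationSettings(
  riskWindowStart = 0,                      # time-at-risk starts on the index date
  startAnchor = "cohort start",
  riskWindowEnd = 365,                      # and ends 365 days after index
  endAnchor = "cohort start",
  requireTimeAtRisk = FALSE,                # assumption; the study code may differ
  removeSubjectsWithPriorOutcome = FALSE    # assumption; the study code may differ
)
```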

7.3 Models

In this study we will focus on using a LASSO logistic regression model. We will use 3-fold cross-validation on a 75% train set to select the optimal variance (the amount of regularization) and a 25% test set to estimate internal validation performance.
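
For illustration, the same scheme (a 75%/25% train/test split with 3-fold cross-validation to choose the regularization strength of an L1-penalized logistic regression) can be expressed with the glmnet package. The actual study will use the OHDSI Cyclops/PatientLevelPrediction implementation; `X` (covariate matrix) and `y` (0/1 outcome vector) below are hypothetical inputs.

```r
# Illustration only, not the OHDSI implementation used in the study.
library(glmnet)

set.seed(42)
trainIdx <- sample(seq_len(nrow(X)), size = floor(0.75 * nrow(X)))  # 75% train split

cvFit <- cv.glmnet(
  x = X[trainIdx, ], y = y[trainIdx],
  family = "binomial",
  alpha = 1,        # alpha = 1 gives LASSO (L1) regularization
  nfolds = 3        # 3-fold cross-validation to pick the regularization strength
)

# Predicted risks on the held-out 25% test set at the cross-validated lambda
testRisk <- predict(cvFit, newx = X[-trainIdx, ], s = "lambda.min", type = "response")
```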

7.4 Candidate Covariates

  • Settings 1 (benchmark): Gender and age in 5-year buckets
  • Setting 2 (full model): Gender and age in 5-year buckets, conditions in prior 365 days, drugs in prior 365 days.

We will train models with two sets of candidate covariates. The first model will use age and gender only; if this model performs well, then we can develop very simple models. The second model will also include any condition or drug recorded from 365 days to 1 day prior to index. Comparing the performance of the full and benchmark models will tell us how useful the covariates beyond age and gender are.
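
A minimal sketch of these two candidate covariate settings, assuming the FeatureExtraction package's createCovariateSettings() interface (exact argument names may vary by package version):

```r
library(FeatureExtraction)

# Setting 1 (benchmark): gender and age in 5-year groups only
covariateSettingsBenchmark <- createCovariateSettings(
  useDemographicsGender = TRUE,
  useDemographicsAgeGroup = TRUE
)

# Setting 2 (full model): demographics plus conditions and drugs recorded
# from 365 days to 1 day before the index date
covariateSettingsFull <- createCovariateSettings(
  useDemographicsGender = TRUE,
  useDemographicsAgeGroup = TRUE,
  useConditionOccurrenceLongTerm = TRUE,
  useDrugExposureLongTerm = TRUE,
  longTermStartDays = -365,
  endDays = -1
)
```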

7.5 Metrics

We will evaluate performance by calculating the area under the receiver operating characteristic curve (AUC) for overall discrimination and calibration plots for calibration. We will also calculate the area under the precision-recall curve and net benefit plots. Threshold-dependent performance will be displayed using the probability threshold plot.
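
As a sketch of how two of these metrics are computed from a vector of predicted risks and observed outcomes (hypothetical `risk` and `outcome` vectors, not study code):

```r
# AUC via the rank-sum (Mann-Whitney) formulation
auc <- function(risk, outcome) {
  r <- rank(risk)
  nPos <- sum(outcome == 1)
  nNeg <- sum(outcome == 0)
  (sum(r[outcome == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

# Net benefit at probability threshold pt: NB = TP/n - FP/n * pt / (1 - pt)
netBenefit <- function(risk, outcome, pt) {
  predictedPositive <- risk >= pt
  tp <- sum(predictedPositive & outcome == 1)
  fp <- sum(predictedPositive & outcome == 0)
  n <- length(outcome)
  tp / n - fp / n * pt / (1 - pt)
}
```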

7.6 Data sources

We will execute the study as an OHDSI network study. All data partners within OHDSI are encouraged to participate voluntarily and can do so conveniently, because of the community’s shared Observational Medical Outcomes Partnership (OMOP) common data model (CDM) and OHDSI tool-stack. Many OHDSI community data partners have already committed to participate and we will recruit further data partners through OHDSI’s standard recruitment process, which includes protocol publication on OHDSI’s GitHub, an announcement in OHDSI’s research forum, presentation at the weekly OHDSI all-hands-on meeting and direct requests to data holders.

Table 7.3 lists the 5 data sources already committed to this study; these sources encompass a large variety of practice types and populations. For each data source, we report a brief description and the size of the population it represents. All data sources will receive institutional review board approval or exemption for their participation before executing the study.

Table 7.3: Committed data sources and the populations they cover.
Data source Population Patients History Data capture process and short description
Administrative claims
IBM MarketScan Commercial Claims and Encounters (CCAE) Commercially insured, < 65 years 142M 2000 – Adjudicated health insurance claims (e.g. inpatient, outpatient, and outpatient pharmacy) from large employers and health plans who provide private healthcare coverage to employees, their spouses and dependents.
IBM MarketScan Medicare Supplemental Database (MDCR) Commercially insured, 65+ years 10M 2000 – Adjudicated health insurance claims of retirees with primary or Medicare supplemental coverage through privately insured fee-for-service, point-of-service or capitated health plans.
IBM MarketScan Multi-State Medicaid Database (MDCD) Medicaid enrollees, racially diverse 26M 2006 – Adjudicated health insurance claims for Medicaid enrollees from multiple states and includes hospital discharge diagnoses, outpatient diagnoses and procedures, and outpatient pharmacy claims.
Optum Clinformatics Data Mart (Optum) Commercially or Medicare insured 85M 2000 – Inpatient and outpatient healthcare insurance claims.
Electronic health records (EHRs)
Optum Electronic Health Records (OptumEHR) US, general 93M 2006 – Clinical information, prescriptions, lab results, vital signs, body measurements, diagnoses and procedures derived from clinical notes using natural language processing.

7.7 Development and Evaluation Overview

For each of the 9 outcomes of interest we will train models for each combination of the 3 target populations x 2 covariate settings x 3 phenotype versions x 5 databases, giving 9 x 3 x 2 x 3 x 5 = 810 models.

To investigate the impact of outcome phenotype, each model will be applied within its development database to the 2 other phenotypes and externally across all 3 phenotypes x 4 external databases, i.e., 2 + 12 = 14 validations per model. This will require 810 x 14 = 11,340 validations.

To investigate the impact of target population, each model will be applied within its development database to the 2 other target populations and externally across all 3 target populations x 4 external databases, i.e., 2 + 12 = 14 validations per model. This will require a further 810 x 14 = 11,340 validations.

In total we will develop and test 810 models and perform 22,680 validations using different settings or databases.
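
As a sanity check of the counts quoted above, the arithmetic can be reproduced directly (a sketch, not study code):

```r
outcomes <- 9; targets <- 3; covSettings <- 2; phenotypes <- 3; databases <- 5

nModels <- outcomes * targets * covSettings * phenotypes * databases
# 9 * 3 * 2 * 3 * 5 = 810 models developed

# Each model is validated on the 2 other phenotypes in the development database plus
# 3 phenotypes x 4 external databases, and likewise for the target populations:
validationsPerModelPhenotype <- (phenotypes - 1) + phenotypes * (databases - 1)  # 14
validationsPerModelTarget    <- (targets - 1) + targets * (databases - 1)        # 14

nValidations <- nModels * (validationsPerModelPhenotype + validationsPerModelTarget)
# 810 * (14 + 14) = 22,680 validations
```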

8 Strengths and Limitations

8.1 Strengths

  • We follow the PatientLevelPrediction framework for developing and evaluating models to ensure best practices are applied
  • Our standardized framework enables us to externally validate our models across a large number of external databases
  • The fully specified study protocol is being published before analysis begins.
  • All analytic methods have previously been verified on real data.
  • All software is freely available as open source.
  • Use of a common data model allows extension of the experiment to future databases and allows replication of these results on licensable databases that were used in this experiment, while still maintaining patient privacy on patient-level data.

8.2 Limitations

  • We do not know the true sensitivity/specificity of each phenotype used in this study
  • We are unable to test our models in a clinical setting
  • In a clinical setting, predictors may be self-reported and these may differ from our covariate definitions
  • The electronic health record databases may be missing care episodes for patients due to care outside the respective health systems.
  • We only investigate LASSO logistic regression and it is unknown whether the results will generalize to other machine learning algorithms

9 Protection of Human Subjects

This study does not involve human subjects research. The project does, however, use de-identified human data collected during routine healthcare provision. All data partners executing the study within their data sources will have received institutional review board (IRB) approval or waiver for participation in accordance with their institutional governance prior to execution (see Table 9.1). This study executes across a federated and distributed data network, where analysis code is sent to participating data partners and only aggregate summary statistics are returned, with no sharing of patient-level data between organizations.

Table 9.1: IRB approval or waiver statement from partners.
Data source Statement
IBM MarketScan Commercial Claims and Encounters (CCAE) Use of this data source was reviewed by the New England Institutional Review Board and determined to be exempt from broad IRB approval, as this research project did not involve human subject research.
IBM MarketScan Medicare Supplemental Database (MDCR) Use of this data source was reviewed by the New England Institutional Review Board and determined to be exempt from broad IRB approval, as this research project did not involve human subject research.
IBM MarketScan Multi-State Medicaid Database (MDCD) Use of this data source was reviewed by the New England Institutional Review Board and determined to be exempt from broad IRB approval, as this research project did not involve human subject research.
Japan Medical Data Center (JMDC) Use of this data source was reviewed by the New England Institutional Review Board and determined to be exempt from broad IRB approval, as this research project did not involve human subject research.
Optum Clinformatics Data Mart (Optum) Use of this data source was reviewed by the New England Institutional Review Board and determined to be exempt from broad IRB approval, as this research project did not involve human subject research.
Optum Electronic Health Records (OptumEHR) Use of this data source was reviewed by the New England Institutional Review Board and determined to be exempt from broad IRB approval, as this research project did not involve human subject research.

10 Plans for Disseminating and Communicating Study Results

Open science aims to make scientific research, including its data, processes and software, and its dissemination through publication and presentation, accessible to all levels of an inquiring society, whether amateur or professional [???], and is a governing principle of this study. Open science delivers reproducible, transparent and reliable evidence. All aspects of this study (except private patient data) will be open, and we will actively encourage other interested researchers, clinicians and patients to participate. This differs fundamentally from traditional studies, which rarely open their analytic tools or share all result artifacts, and which inform the community of hard-to-verify conclusions only at completion.

10.1 Transparent and re-usable research tools

We will publicly register this protocol and announce its availability for feedback from stakeholders, the OHDSI community and within clinical professional societies. This protocol will link to open source code for all steps to generate and evaluate prediction models, figures and tables. Such transparency is possible because we will construct our studies on top of the OHDSI toolstack of open source software tools that are community developed and rigorously tested [???]. We will publicly host the source code at (https://github.com/ohdsi-studies/CovidVaccinePrediction), allowing public contribution and review, and free re-use for anyone’s future research.

10.2 Scientific meetings and publications

We will deliver multiple presentations at scientific venues and will also prepare multiple scientific publications for clinical, informatics and statistical journals.

10.3 General public

We believe that findings which will guide clinical care should be shared with the general public. We will use social media (Twitter) to facilitate this.