In patients who started using NSAIDs for the first time, predict who will develop a gastrointestinal (GI) bleed in the next year.

The NSAID new-user cohort has COHORT_DEFINITION_ID = 4. The GI bleed cohort has COHORT_DEFINITION_ID = 3.

Setup Synthetic Database with Cohorts

Connect to GiBleed Eunomia DB

datasetName <- "GiBleed"
dbms <- "sqlite"

datasetLocation <- Eunomia::getDatabaseFile(
  datasetName = datasetName, 
  dbms = dbms, 
  databaseFile = tempfile(fileext = ".sqlite")
)
attempting to download GiBleed
attempting to extract and load: C:\Users\mccul\AppData\Local\Temp\RtmpG6gSFT/GiBleed_5.3.zip to: C:\Users\mccul\AppData\Local\Temp\RtmpG6gSFT/GiBleed_5.3.sqlite
connectionDetails <- DatabaseConnector::createConnectionDetails(dbms = dbms, server = datasetLocation)
connection <- DatabaseConnector::connect(connectionDetails = connectionDetails)
Connecting using SQLite driver
DatabaseConnector::getTableNames(connection, databaseSchema = 'main')
 [1] "attribute_definition"  "care_site"             "cdm_source"           
 [4] "cohort"                "cohort_attribute"      "cohort_definition"    
 [7] "concept"               "concept_ancestor"      "concept_class"        
[10] "concept_relationship"  "concept_synonym"       "condition_era"        
[13] "condition_occurrence"  "cost"                  "death"                
[16] "device_exposure"       "domain"                "dose_era"             
[19] "drug_era"              "drug_exposure"         "drug_strength"        
[22] "fact_relationship"     "location"              "measurement"          
[25] "metadata"              "note"                  "note_nlp"             
[28] "observation"           "observation_period"    "payer_plan_period"    
[31] "person"                "procedure_occurrence"  "provider"             
[34] "relationship"          "source_to_concept_map" "specimen"             
[37] "visit_detail"          "visit_occurrence"      "vocabulary"           

Create Cohorts

Eunomia provides some default cohorts, so we will use those for this example. Note our cohorts and their cohort IDs:

  • Target Cohort: NSAIDs
  • Outcome Cohort: GiBleed

The NSAIDs cohort is simply the union of the celecoxib and diclofenac new-user cohorts.

Eunomia::createCohorts(connectionDetails)
Cohorts created in table main.cohort
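
To double-check the cohorts, we can query the cohort table directly. This is a sketch, assuming the `connection` object opened above is still available; `querySql` is the standard DatabaseConnector function for ad-hoc queries:

```r
# Count the subjects in each cohort; cohort_definition_id 4 should be the
# NSAID new users and 3 the GI bleed outcomes.
sql <- "SELECT cohort_definition_id, COUNT(*) AS n_subjects
        FROM main.cohort
        GROUP BY cohort_definition_id;"
DatabaseConnector::querySql(connection, sql)
```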

Define Covariates

Exercise 13.1 Using the PatientLevelPrediction R package, define the covariates you want to use for the prediction and extract the PLP data from the CDM. Create the summary of the PLP data.

We specify a set of covariate settings, and use the getPlpData function to extract the data from the database:

dbDetails <- PatientLevelPrediction::createDatabaseDetails(
    connectionDetails = connectionDetails,
    cdmDatabaseSchema = "main",
    cdmDatabaseName = "main",
    cdmDatabaseId = "main",
    outcomeTable = "cohort",
    outcomeIds = 3,
    targetId = 4
)
covSettings <- FeatureExtraction::createCovariateSettings(
  useDemographicsGender = TRUE,
  useDemographicsAge = TRUE,
  useConditionGroupEraLongTerm = TRUE,
  useConditionGroupEraAnyTimePrior = TRUE,
  useDrugGroupEraLongTerm = TRUE,
  useDrugGroupEraAnyTimePrior = TRUE,
  useVisitConceptCountLongTerm = TRUE,
  longTermStartDays = -365,
  endDays = -1)

plpData <- PatientLevelPrediction::getPlpData(
    databaseDetails = dbDetails,
    covariateSettings = covSettings,
    restrictPlpDataSettings = PatientLevelPrediction::createRestrictPlpDataSettings()
)
summary(plpData)
plpData object summary

At risk cohort concept ID: 4
Outcome concept ID(s): 3

People: 2630

Outcome counts:
  Event count Person count
3         479          479

Covariates:
Number of covariates: 245
Number of non-zero covariate values: 54079

Create Study Population

Exercise 13.2 Revisit the design choices you have to make to define the final target population and specify these using the createStudyPopulation function. What will the effect of your choices be on the final size of the target population?

We create a study population for the outcome of interest (in this case the only outcome for which we extracted data), requiring 364 days of prior observation (washout), removing subjects who experienced the outcome before they started the NSAID, and requiring at least 364 days of time-at-risk:

populationSettings <- PatientLevelPrediction::createStudyPopulationSettings(
    washoutPeriod = 364,
    firstExposureOnly = FALSE,
    removeSubjectsWithPriorOutcome = TRUE,
    priorOutcomeLookback = 9999,
    riskWindowStart = 1,
    riskWindowEnd = 365,
    startAnchor = "cohort start",
    endAnchor = "cohort start",
    minTimeAtRisk = 364,
    requireTimeAtRisk = TRUE,
    includeAllOutcomes = TRUE
)
population <- PatientLevelPrediction::createStudyPopulation(
    plpData = plpData, 
    populationSettings = populationSettings,
    outcomeId = 3
)
outcomeId: 3
binary: TRUE
includeAllOutcomes: TRUE
firstExposureOnly: FALSE
washoutPeriod: 364
removeSubjectsWithPriorOutcome: TRUE
priorOutcomeLookback: 9999
requireTimeAtRisk: TRUE
minTimeAtRisk: 364
restrictTarToCohortEnd: FALSE
riskWindowStart: 1
startAnchor: cohort start
riskWindowEnd: 365
endAnchor: cohort start
restrictTarToCohortEnd: FALSE
Requiring 364 days of observation prior index date
Removing subjects with prior outcomes (if any)
Removing non outcome subjects with insufficient time at risk (if any)
Outcome is 0 or 1
nrow(population)
[1] 2578

In this case we have lost 52 subjects (2630 minus 2578) by removing those who had the outcome prior to exposure and by requiring a time-at-risk of at least 364 days.
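
The attrition can be reconstructed from the counts above, a simple arithmetic check using the numbers printed by summary(plpData) and nrow(population):

```r
# Subjects in the at-risk cohort versus the final study population
atRiskCount     <- 2630  # from summary(plpData)
populationCount <- 2578  # from nrow(population)
atRiskCount - populationCount  # subjects excluded by the population settings
```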

LASSO

Exercise 13.3 Build a prediction model using LASSO and evaluate its performance using the Shiny application. How well is your model performing?

We run a LASSO model by first creating a model settings object and then calling the runPlp function. In this case we use a person split, training the model on 75% of the data and evaluating it on the remaining 25%:

lassoModel <- PatientLevelPrediction::setLassoLogisticRegression()
lassoResults <- PatientLevelPrediction::runPlp(
    populationSettings = populationSettings,
    plpData = plpData,
    modelSettings = lassoModel,
    outcomeId = 3,
    splitSettings = PatientLevelPrediction::createDefaultSplitSetting(
        type = "stratified",
        testFraction = 0.25,
        trainFraction = 0.75,
        splitSeed = 0,
        nfold = 3
    ),
    saveDirectory = "./resources/artifacts/patient-prediction"
)
Use timeStamp: TRUE
Creating save directory at: ./resources/artifacts/patient-prediction/2024-08-21-
Currently in a tryCatch or withCallingHandlers block, so unable to add global calling handlers. ParallelLogger will not capture R messages, errors, and warnings, only explicit calls to ParallelLogger. (This message will not be shown again this R session)
Patient-Level Prediction Package version 6.3.8
Study started at: 2024-08-21 12:40:49
AnalysisID:         2024-08-21-
AnalysisName:       Study details
TargetID:           4
OutcomeID:          3
Cohort size:        2630
Covariates:         245
Creating population
Outcome is 0 or 1
seed: 0
Creating a 25% test and 75% train (into 3 folds) random stratified split by class
Data split into 643 test cases and 1935 train cases (645, 645, 645)
Train Set:
Fold 1 645 patients with 120 outcomes - Fold 2 645 patients with 120 outcomes - Fold 3 645 patients with 120 outcomes
240 covariates in train data
Test Set:
643 patients with 119 outcomes
Removing 1 redundant covariates
Removing 0 infrequent covariates
Normalizing covariates
Tidying covariates took 1.76 secs
Train Set:
Fold 1 645 patients with 120 outcomes - Fold 2 645 patients with 120 outcomes - Fold 3 645 patients with 120 outcomes
239 covariates in train data
Test Set:
643 patients with 119 outcomes

Running Cyclops
Done.
GLM fit status:  OK
Creating variable importance data frame
Prediction took 0.263 secs
Removing infrequent and redundant covariates and normalizing
Removing infrequent and redundant covariates covariates and normalizing took 0.354 secs
Prediction took 0.224 secs
Calculating Performance for Test
=============
AUC                 64.83
95% lower AUC:      59.31
95% upper AUC:      70.35
AUPRC:              28.69
Brier:              0.14
Eavg:               0.02
Calibration in large- Mean predicted risk 0.1905 : observed risk 0.1851
Calibration in large- Intercept -0.0386
Weak calibration intercept: -0.2789 - gradient:0.8203
Hosmer-Lemeshow calibration gradient: 0.77 intercept:         0.04
Average Precision:  0.29
Calculating Performance for Train
=============
AUC                 71.84
95% lower AUC:      68.91
95% upper AUC:      74.77
AUPRC:              38.26
Brier:              0.14
Eavg:               0.02
Calibration in large- Mean predicted risk 0.186 : observed risk 0.186
Calibration in large- Intercept 0
Weak calibration intercept: 0.2173 - gradient:1.1627
Hosmer-Lemeshow calibration gradient: 1.14 intercept:         -0.03
Average Precision:  0.38
Calculating Performance for CV
=============
AUC                 67.11
95% lower AUC:      63.98
95% upper AUC:      70.24
AUPRC:              30.54
Brier:              0.14
Eavg:               0.01
Calibration in large- Mean predicted risk 0.1856 : observed risk 0.186
Calibration in large- Intercept 0.0034
Weak calibration intercept: 0.0692 - gradient:1.0483
Hosmer-Lemeshow calibration gradient: 1.06 intercept:         -0.01
Average Precision:  0.31
Calculating covariate summary @ 2024-08-21 12:40:57
This can take a while...
Creating binary labels
Joining with strata
calculating subset of strata 1
calculating subset of strata 2
calculating subset of strata 3
calculating subset of strata 4
Restricting to subgroup
Calculating summary for subgroup TrainWithNoOutcome
Restricting to subgroup
Calculating summary for subgroup TrainWithOutcome
Restricting to subgroup
Calculating summary for subgroup TestWithNoOutcome
Restricting to subgroup
Calculating summary for subgroup TestWithOutcome
Aggregating with labels and strata
Finished covariate summary @ 2024-08-21 12:40:59
Run finished successfully.
Saving PlpResult
Creating directory to save model
plpResult saved to ..\./resources/artifacts/patient-prediction/2024-08-21-\plpResult

Note that for this example we set the random seeds both for the LASSO cross-validation and for the train-test split to make sure the results will be the same across multiple runs.

We can now view the results using the Shiny app:

PatientLevelPrediction::viewPlp(lassoResults)

This will launch the app as shown in Figure E.18. Here we see an AUC on the test set of 0.648, which is better than random guessing, but perhaps not good enough for clinical practice.
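
If you prefer to read the performance metrics without launching Shiny, they can also be pulled from the result object. This is a sketch assuming the PatientLevelPrediction 6.x result structure, where the performanceEvaluation slot holds a long-format evaluationStatistics table; column names may differ across package versions:

```r
# Extract the test-set AUROC from the runPlp result (structure assumed
# from PatientLevelPrediction 6.x)
stats <- as.data.frame(lassoResults$performanceEvaluation$evaluationStatistics)
stats[stats$evaluation == "Test" & stats$metric == "AUROC", ]
```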