Defining Cohorts in SQL

Exercise: Cohort Definitions using SQL

From Exercise 10.2 in the Book of OHDSI

Use SQL and R to create a cohort for acute myocardial infarction (AMI) in the existing COHORT table, following these criteria:

An occurrence of a myocardial infarction diagnose (concept 4329847 “Myocardial infarction” and all of its descendants, excluding concept 314666 “Old myocardial infarction” and any of its descendants).
During an inpatient or ER visit (concepts 9201, 9203, and 262 for “Inpatient visit”, “Emergency Room Visit”, and “Emergency Room and Inpatient Visit”, respectively).

Solution

For readability we here split the SQL into two steps.

We first find all condition occurrences of myocardial infarction, and store these in a temp table called “#diagnoses”. Note we are using OHDSI SQL (a subset of SQL Server) to conduct these queries. Behind thse scenes, SqlRender is translating to our chosen dbms (sqlite in our case).

Connect to Database

First, we need to establish a connection to our database. We will test our connection by getting all the table names.

datasetName = "GiBleed"
dbms = "sqlite"

datasetLocation <- Eunomia::getDatabaseFile(
  datasetName = datasetName, 
  dbms = dbms, 
  databaseFile = tempfile(fileext = ".sqlite")
)

attempting to download GiBleed

attempting to extract and load: C:\Users\mccul\AppData\Local\Temp\Rtmp21zzNI/GiBleed_5.3.zip to: C:\Users\mccul\AppData\Local\Temp\Rtmp21zzNI/GiBleed_5.3.sqlite

connectionDetails <- DatabaseConnector::createConnectionDetails(dbms = dbms, server = datasetLocation)
connection = DatabaseConnector::connect(connectionDetails = connectionDetails)

Connecting using SQLite driver

DatabaseConnector::getTableNames(connection, databaseSchema = 'main')

 [1] "attribute_definition"  "care_site"             "cdm_source"           
 [4] "cohort"                "cohort_attribute"      "cohort_definition"    
 [7] "concept"               "concept_ancestor"      "concept_class"        
[10] "concept_relationship"  "concept_synonym"       "condition_era"        
[13] "condition_occurrence"  "cost"                  "death"                
[16] "device_exposure"       "domain"                "dose_era"             
[19] "drug_era"              "drug_exposure"         "drug_strength"        
[22] "fact_relationship"     "location"              "measurement"          
[25] "metadata"              "note"                  "note_nlp"             
[28] "observation"           "observation_period"    "payer_plan_period"    
[31] "person"                "procedure_occurrence"  "provider"             
[34] "relationship"          "source_to_concept_map" "specimen"             
[37] "visit_detail"          "visit_occurrence"      "vocabulary"

Excellent.

Now, since I know the default eunomia dataset yields no one in this cohort, I’m going to insert some visit rows, just so we get some people in the cohort.

source("util.R")
insert_visits(connection)

[1] "Adding visits..."


insert into concept (
  concept_id,
  concept_name,
  domain_id,
  vocabulary_id,
  concept_class_id,
  standard_concept,
  concept_code,
  valid_start_date,
  valid_end_date,
  invalid_reason
)
values (
  262--concept_id,
  ,'Emergency Room and Inpatient Visit'--concept_name,
  ,'Visit'--domain_id,
  ,'Visit'--vocabulary_id,
  ,'Visit'--concept_class_id,
  ,'S'--standard_concept,
  ,'ERIP'--concept_code,
  ,strftime('%s', '1970-01-01 00:00:00')--valid_start_date,
  ,strftime('%s', '2099-12-31 00:00:00')--valid_end_date,
  ,null--invalid_reason
);

delete from visit_occurrence where visit_occurrence_id >= 1000000;
insert into visit_occurrence(
  visit_occurrence_id
  ,person_id
  ,visit_concept_id
  ,visit_start_date
  ,visit_start_datetime
  ,visit_end_date
  ,visit_end_datetime
  ,visit_type_concept_id
  ,provider_id
  ,care_site_id
  ,visit_source_value
  ,visit_source_concept_id
  ,admitting_source_concept_id
  ,admitting_source_value
  ,discharge_to_concept_id
  ,discharge_to_source_value
  ,preceding_visit_occurrence_id
)
values(
  1000000--visit_occurrence_id
  ,1491--person_id
  ,9201--visit_concept_id
  ,strftime('%s', '2011-07-11 12:34:56')--visit_start_date
  ,strftime('%s', '2011-07-11 12:34:56')--visit_start_datetime
  ,strftime('%s', '2011-07-13 12:34:56')--visit_end_date
  ,strftime('%s', '2011-07-13 12:34:56')--visit_end_datetime
  ,0--visit_type_concept_id
  ,null--provider_id
  ,null--care_site_id
  ,null--visit_source_value
  ,0--visit_source_concept_id
  ,0--admitting_source_concept_id
  ,null--admitting_source_value
  ,0--discharge_to_concept_id
  ,null--discharge_to_source_value
  ,null--preceding_visit_occurrence_id
), 
(
  1000001--visit_occurrence_id
  ,105--person_id
  ,262--visit_concept_id
  ,strftime('%s', '2005-11-10 12:34:56')--visit_start_date
  ,strftime('%s', '2005-11-10 12:34:56')--visit_start_datetime
  ,strftime('%s', '2005-11-13 12:34:56')--visit_end_date
  ,strftime('%s', '2005-11-13 12:34:56')--visit_end_datetime
  ,0--visit_type_concept_id
  ,null--provider_id
  ,null--care_site_id
  ,null--visit_source_value
  ,0--visit_source_concept_id
  ,0--admitting_source_concept_id
  ,null--admitting_source_value
  ,0--discharge_to_concept_id
  ,null--discharge_to_source_value
  ,null--preceding_visit_occurrence_id
),
(
  1000002--visit_occurrence_id
  ,96--person_id
  ,262--visit_concept_id
  ,strftime('%s', '1997-01-24 12:34:56')--visit_start_date
  ,strftime('%s', '1997-01-24 12:34:56')--visit_start_datetime
  ,strftime('%s', '1997-01-26    12:34:56')--visit_end_date
  ,strftime('%s', '1997-01-26 12:34:56')--visit_end_datetime
  ,0--visit_type_concept_id
  ,null--provider_id
  ,null--care_site_id
  ,null--visit_source_value
  ,0--visit_source_concept_id
  ,0--admitting_source_concept_id
  ,null--admitting_source_value
  ,0--discharge_to_concept_id
  ,null--discharge_to_source_value
  ,null--preceding_visit_occurrence_id
),
(
  1000003--visit_occurrence_id
  ,3049--person_id
  ,9203--visit_concept_id
  ,strftime('%s', '1986-12-04 12:34:56')--visit_start_date
  ,strftime('%s', '1986-12-04 12:34:56')--visit_start_datetime
  ,strftime('%s', '1986-12-06    12:34:56')--visit_end_date
  ,strftime('%s', '1986-12-06 12:34:56')--visit_end_datetime
  ,0--visit_type_concept_id
  ,null--provider_id
  ,null--care_site_id
  ,null--visit_source_value
  ,0--visit_source_concept_id
  ,0--admitting_source_concept_id
  ,null--admitting_source_value
  ,0--discharge_to_concept_id
  ,null--discharge_to_source_value
  ,null--preceding_visit_occurrence_id
),
(
  1000004--visit_occurrence_id
  ,2692--person_id
  ,9201--visit_concept_id
  ,strftime('%s', '1998-06-28 12:34:56')--visit_start_date
  ,strftime('%s', '1998-06-28 12:34:56')--visit_start_datetime
  ,strftime('%s', '1998-06-30 12:34:56')--visit_end_date
  ,strftime('%s', '1998-06-30 12:34:56')--visit_end_datetime
  ,0--visit_type_concept_id
  ,null--provider_id
  ,null--care_site_id
  ,null--visit_source_value
  ,0--visit_source_concept_id
  ,0--admitting_source_concept_id
  ,null--admitting_source_value
  ,0--discharge_to_concept_id
  ,null--discharge_to_source_value
  ,null--preceding_visit_occurrence_id
)
;

  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |======================================================================| 100%

Executing SQL took 0.0169 secs

Define Cohort Entry Event

Our cohort entry event are Myocardial Infraction Conditions

We will store all condition occurrences of myocardial infraction in a temp table called #diagnoses

Before executing the sql, I will just print the statement generated after doing a render |> translate sequence in SqlRender. This is good for debugging.

Recall render() will replace the parameters, while translate translates to our target dialect (sqlite).

# Recall, we want to include myocardial infraction, except old myocardial infraction
# We will use variables so we can use this as a template, rather than hard code them into
# our sql string
myocardial_infraction = 4329847 # Myocardial infarction
old_myocardial_infraction = 314666 #  Old myocardial infarction
cdm = "main"

# OHDSI SQL, which will need to be translated to sqlite eventually
sql <- "
drop table if exists #diagnosis;

SELECT 
  person_id AS subject_id,
  condition_start_date AS cohort_start_date
INTO #diagnoses
FROM @cdm.condition_occurrence
WHERE condition_concept_id IN (
    SELECT descendant_concept_id
    FROM @cdm.concept_ancestor
    WHERE ancestor_concept_id IN(@myocardial_infraction)
)
  AND condition_concept_id NOT IN (
    SELECT descendant_concept_id
    FROM @cdm.concept_ancestor
    WHERE ancestor_concept_id IN(@old_myocardial_infraction)
);
"

# Getting rendered and translated sql, just for printing so you can see it
executed_sql <- SqlRender::render(
    sql = sql,
    myocardial_infraction = myocardial_infraction,
    old_myocardial_infraction = old_myocardial_infraction,
    cdm = cdm
  ) |> 
  SqlRender::translate(targetDialect = dbms)
  
cat("Query to be executed on sqlite: \n\n", executed_sql)

Query to be executed on sqlite: 

 drop table if exists temp.diagnosis;
CREATE TEMP TABLE diagnoses
AS
SELECT
person_id AS subject_id,
  condition_start_date AS cohort_start_date
FROM
main.condition_occurrence
WHERE condition_concept_id IN (
    SELECT descendant_concept_id
    FROM main.concept_ancestor
    WHERE ancestor_concept_id IN(4329847)
)
  AND condition_concept_id NOT IN (
    SELECT descendant_concept_id
    FROM main.concept_ancestor
    WHERE ancestor_concept_id IN(314666)
);
ANALYZE diagnoses
;

Now let’s execute it. Note we are doing renderTranslateExecuteSql. As the name suggests, this does a render |> translate as above, but also executes the sql. Note the parameters (cdm and after) are inserted as the last arguments. The first two are connection and then the sql to render/translate.

# Actually executing the query (notice we are render/translate the original sql again)
DatabaseConnector::renderTranslateExecuteSql(
  connection = connection, 
  sql = sql, 
  cdm = cdm, 
  myocardial_infraction = myocardial_infraction,
  old_myocardial_infraction = old_myocardial_infraction
)


  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |======================================================================| 100%

Executing SQL took 0.0308 secs

Let’s view the results real quick.

querySql allows you to execute raw sql, so I can write straight up sqlite here, no need to bother with “OHDSI SQL”.

Note that I am doing querySql rather than executeSql. This is because querySql will return the result set, which is why it only accepts a single statement. On the other hand, executeSql does not return anything, and can take multiple statements. Note that, after rander/translate, a single statement could get converted into multiple, so if you do renderTranslateQuerySql and it turns into multiple statements, you will get an error.

DatabaseConnector::querySql(connection = connection, sql = "select * from temp.diagnoses limit 10;")

Define Inclusion Criteria

Additional inclusion criteria is needed to get only inpatient or ER visits

We then select only those that occur during an inpatient or ER visit, using some unique COHORT_DEFINITION_ID (we selected ‘1’):

#  Inpatient or ER
visit_concepts_to_include <- c(9201, 9203, 262) # inpatient or ER

sql <- "
INSERT INTO @cdm.cohort (
  subject_id,
  cohort_start_date,
  cohort_definition_id
  )
SELECT 
  subject_id,
  cohort_start_date,
  CAST (1 AS INT) AS cohort_definition_id
FROM #diagnoses
INNER JOIN @cdm.visit_occurrence v
  ON subject_id = person_id
    AND cohort_start_date >= v.visit_start_date
    AND cohort_start_date <= v.visit_end_date
WHERE visit_concept_id IN (@visit_concepts_to_include);"

# Printing the render/translate just to show what it is doing...
sql |> SqlRender::render(cdm = cdm, visit_concepts_to_include = visit_concepts_to_include) |> SqlRender::translate(targetDialect = "sqlite") |> cat()

INSERT INTO main.cohort (
  subject_id,
  cohort_start_date,
  cohort_definition_id
  )
SELECT 
  subject_id,
  cohort_start_date,
  CAST (1 AS INT) AS cohort_definition_id
FROM temp.diagnoses
INNER JOIN main.visit_occurrence v
  ON subject_id = person_id
    AND cohort_start_date >= v.visit_start_date
    AND cohort_start_date <= v.visit_end_date
WHERE visit_concept_id IN (9201,9203,262);

# Now executing it...
DatabaseConnector::renderTranslateExecuteSql(
  connection = connection, 
  sql = sql, 
  cdm = cdm,
  visit_concepts_to_include = visit_concepts_to_include
)


  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%

Executing SQL took 0.0081 secs

Let’s view the results

sql <- "
  select * from @cdm.cohort;
"
DatabaseConnector::renderTranslateQuerySql(
  connection = connection,
  sql = sql,
  cdm = cdm
)

Note that an alternative approach would have been to join the conditions to the visits based on the VISIT_OCCURRENCE_ID, instead of requiring the condition date to fall within the visit start and end date. This would likely be more accurate, as it would guarantee that the condition was recorded in relation to the inpatient or ER visit. However, many observational databases do not record the link between visit and diagnose, and we therefore chose to use the dates instead, likely giving us a higher sensitivity but perhaps lower specificity.

Note also that we ignored the cohort end date. Often, when a cohort is used to define an outcome we are only interested in the cohort start date, and there is no point in creating an (ill-defined) cohort end date.

It is recommended to clean up any temp tables when no longer needed:

sql <- "
TRUNCATE TABLE #diagnoses;
DROP TABLE #diagnoses;
"

DatabaseConnector::renderTranslateExecuteSql(connection, sql)


  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |======================================================================| 100%

Executing SQL took 0.0069 secs