Predicates

In ACES, predicates specify how the concepts relevant to your task are expressed in your dataset. These dataset-specific items form the foundation of the cohort extraction algorithm: the more complex, dataset-agnostic windowing logic of your task is defined in terms of your predicates, which makes task configurations easier to share across datasets.

Predicate Columns

A predicate column is simply a column in a dataframe containing numerical counts (often just 0’s and 1’s), representing the number of times a given predicate (concept) occurs at a given timestamp for a given patient.

Suppose you had a simple time-sorted dataframe as follows:

| subject_id | timestamp | code | value |
| --- | --- | --- | --- |
| 1 | null | SEX//male | null |
| 1 | 1989-01-01 00:00:00 | ADMISSION | null |
| 1 | 1989-01-01 01:00:00 | LAB//HR | 90 |
| 1 | 1989-01-01 01:00:00 | PROCEDURE_START | null |
| 1 | 1989-01-01 02:00:00 | DISCHARGE | null |
| 1 | 1989-01-01 02:00:00 | PROCEDURE_END | null |
| 2 | null | SEX//female | null |
| 2 | 1991-05-06 12:00:00 | ADMISSION | null |
| 2 | 1991-05-06 20:00:00 | DEATH | null |
| 3 | null | SEX//male | null |
| 3 | 1980-10-17 22:00:00 | ADMISSION | null |
| 3 | 1980-10-17 22:00:00 | LAB//HR | 120 |
| 3 | 1980-10-18 01:00:00 | LAB//temp | 37 |
| 3 | 1980-10-18 09:00:00 | DISCHARGE | null |
| 3 | 1982-02-02 02:00:00 | ADMISSION | null |
| 3 | 1982-02-02 04:00:00 | DEATH | null |

The code column contains a string identifying the event that occurred at the given timestamp for the given subject_id. Note: Static variables are shown as rows with null timestamps.

You may then create a series of predicate columns depending on what suits your needs. For instance, here are some plausible predicate columns that could be created:

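| subject_id | timestamp | admission | discharge | death | discharge_or_death | lab | procedure_start | HR_over_100 | male |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1989-01-01 00:00:00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1989-01-01 01:00:00 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 1 | 1989-01-01 02:00:00 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 1991-05-06 12:00:00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1991-05-06 20:00:00 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1980-10-17 22:00:00 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| 3 | 1980-10-18 01:00:00 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 1980-10-18 09:00:00 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 1982-02-02 02:00:00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1982-02-02 04:00:00 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |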

Note: All of these predicates are plain predicates (i.e., explicitly expressed as a value in the dataset), with the exception of the derived predicate discharge_or_death, which is obtained by applying boolean logic to the discharge and death predicates (i.e., or(discharge, death)). You may create columns for derived predicates explicitly, as you would for plain predicates, or let ACES create them automatically from plain predicates by providing the boolean logic in the task configuration file. Please see Predicates for more information.

Additionally, you may notice that the tables differ in shape. In the original raw data, (subject_id, timestamp) is not unique. However, a final predicates dataframe must have unique (subject_id, timestamp) pairs. If the MEDS or ESGPT standard is used, ACES will automatically collapse rows down to one row per patient per timestamp (i.e., grouping by these two columns and aggregating by summing predicate counts). However, if creating predicate columns directly, please ensure your dataframe is unique over (subject_id, timestamp).

Sample Predicates DataFrame

A sample predicates dataframe is provided in the repository (sample_data/sample_data.csv). This dataframe holds completely synthetic data and was designed so that the accompanying sample configuration files in the repository (sample_configs/) can be used to extract cohorts from it directly.

[1]:
import polars as pl

pl.read_csv("../../../sample_data/sample_data.csv")
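[1]:
shape: (54, 17)

| subject_id | timestamp | male | female | admission | death | discharge | lab | spo2 | normal_spo2 | abnormally_low_spo2 | abnormally_high_spo2 | procedure_start | procedure_end | ventilation | diagnosis_ICD9CM_41071 | diagnosis_ICD10CM_I214 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
| 1 | null | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | "12/1/1989 12:03" | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | "12/1/1989 13:14" | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | "12/1/1989 15:17" | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | "12/1/1989 16:17" | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 3 | "3/9/1996 11:00" | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | "3/9/1996 19:00" | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | "3/9/1996 22:00" | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | "3/11/1996 21:00" | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 3 | "3/12/1996 0:00" | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |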

Generating the Predicates DataFrame

The predicates dataframe will always have the subject_id and timestamp columns. Rows must be unique over these two columns, since a single row can capture multiple events that occur at the same timestamp.

ACES is able to automatically compute the predicates dataframe from your dataset and the fields defined in your task configuration if you are using the MEDS or ESGPT data standard. Should you choose to not transform your dataset into one of these two currently supported standards, you may also navigate the transformation yourself by creating your own predicates dataframe.

Again, it is acceptable for your own predicates dataframe to contain only plain predicate columns, as ACES can automatically create derived predicate columns from boolean logic in the task configuration file. However, complex predicates that cannot be expressed in the configuration file (that is, anything beyond and/or combinations) must be created manually before using ACES. Support for additional complex predicates is planned for the future, including the ability to use SQL or other expressions (see #66).

Note: When creating plain predicate columns directly, you must still define them in the configuration file (the code field may hold an arbitrary value). ACES will verify their existence after data loading (i.e., by validating that a column with the predicate name exists in your dataframe). You will also need these definitions in order to reference the predicates in your windows.

The following example shows the derived predicate discharge_or_death, expressed as an or() relationship between the plain predicates discharge and death, which have been defined directly in the data (i.e., their code fields hold arbitrary placeholder values):

predicates:
  death:
    code: defined in data
  discharge:
    code: defined in data
  discharge_or_death:
    expr: or(discharge, death)
  ...
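
The existence check mentioned in the note above might look roughly like the sketch below. It is not ACES's validation code: the predicates dictionary simply mirrors the three entries of the YAML example, and missing_plain_predicates is a hypothetical helper that assumes any entry with a code field is a plain predicate needing a matching column.

```python
# Mirrors the three predicates of the YAML example as a plain dictionary.
predicates = {
    "death": {"code": "defined in data"},
    "discharge": {"code": "defined in data"},
    "discharge_or_death": {"expr": "or(discharge, death)"},
}


def missing_plain_predicates(predicates: dict, columns: list[str]) -> list[str]:
    """Hypothetical helper: names of plain predicates with no matching column."""
    # Assumption: entries with a "code" field are plain predicates;
    # entries with "expr" are derived and need no column of their own.
    plain = [name for name, spec in predicates.items() if "code" in spec]
    return [name for name in plain if name not in columns]


# Column names of a user-supplied predicates dataframe (illustrative).
columns = ["subject_id", "timestamp", "discharge"]
missing = missing_plain_predicates(predicates, columns)
```

Here missing contains "death", because the illustrative dataframe has no column with that name, while discharge_or_death is skipped since it is derived.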