Predicates¶
In ACES, predicates specify how the concepts relevant to your task are expressed in your dataset. These dataset-specific definitions are the foundation of the cohort extraction algorithm: the more complex, dataset-agnostic windowing logic of your task is defined in terms of your predicates, which makes task configurations easier to share across datasets.
Predicate Columns¶
A predicate column is simply a column in a dataframe containing numerical counts (often just 0’s and 1’s), representing the number of times a given predicate (concept) occurs at a given timestamp for a given patient.
Suppose you had a simple time-sorted dataframe as follows:
| subject_id | timestamp | code | value |
|---|---|---|---|
| 1 | null | SEX//male | null |
| 1 | 1989-01-01 00:00:00 | ADMISSION | null |
| 1 | 1989-01-01 01:00:00 | LAB//HR | 90 |
| 1 | 1989-01-01 01:00:00 | PROCEDURE_START | null |
| 1 | 1989-01-01 02:00:00 | DISCHARGE | null |
| 1 | 1989-01-01 02:00:00 | PROCEDURE_END | null |
| 2 | null | SEX//female | null |
| 2 | 1991-05-06 12:00:00 | ADMISSION | null |
| 2 | 1991-05-06 20:00:00 | DEATH | null |
| 3 | null | SEX//male | null |
| 3 | 1980-10-17 22:00:00 | ADMISSION | null |
| 3 | 1980-10-17 22:00:00 | LAB//HR | 120 |
| 3 | 1980-10-18 01:00:00 | LAB//temp | 37 |
| 3 | 1980-10-18 09:00:00 | DISCHARGE | null |
| 3 | 1982-02-02 02:00:00 | ADMISSION | null |
| 3 | 1982-02-02 04:00:00 | DEATH | null |
The code column holds a string identifying the event that occurred at the given timestamp for the given subject_id. Note: Static variables are shown as rows with null timestamps.
You may then create a series of predicate columns depending on what suits your needs. For instance, here are some plausible predicate columns that could be created:
| subject_id | timestamp | admission | discharge | death | discharge_or_death | lab | procedure_start | HR_over_100 | male |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1989-01-01 00:00:00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1989-01-01 01:00:00 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 1 | 1989-01-01 02:00:00 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 1991-05-06 12:00:00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1991-05-06 20:00:00 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1980-10-17 22:00:00 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| 3 | 1980-10-18 01:00:00 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 1980-10-18 09:00:00 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 1982-02-02 02:00:00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1982-02-02 04:00:00 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
Note: All of these are plain predicates (i.e., explicitly expressed as a value in the dataset), with the exception of the derived predicate discharge_or_death, which is obtained by applying boolean logic to the discharge and death predicates (i.e., or(discharge, death)). You may create columns for derived predicates explicitly, as you would for plain predicates. Alternatively, ACES can create them automatically from plain predicates if the boolean logic is provided in the task configuration file. Please see Predicates for more information.
Additionally, you may notice that the two tables differ in shape. In the original raw data, (subject_id, timestamp) is not unique, but a final predicates dataframe must have unique (subject_id, timestamp) pairs. If the MEDS or ESGPT standard is used, ACES will automatically collapse rows to one per patient per timestamp (i.e., by grouping on these two columns and summing the predicate counts). If you create predicate columns directly, please ensure that your dataframe is unique over (subject_id, timestamp).
Sample Predicates DataFrame¶
A sample predicates dataframe is provided in the repository (sample_data/sample_data.csv). This dataframe holds completely synthetic data and was designed so that the tasks defined by the accompanying sample configuration files in the repository (sample_configs/) can be extracted from it directly.
[1]:
import polars as pl
pl.read_csv("../../../sample_data/sample_data.csv")
[1]:
| subject_id | timestamp | male | female | admission | death | discharge | lab | spo2 | normal_spo2 | abnormally_low_spo2 | abnormally_high_spo2 | procedure_start | procedure_end | ventilation | diagnosis_ICD9CM_41071 | diagnosis_ICD10CM_I214 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
| 1 | null | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | "12/1/1989 12:03" | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | "12/1/1989 13:14" | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | "12/1/1989 15:17" | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | "12/1/1989 16:17" | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 3 | "3/9/1996 11:00" | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | "3/9/1996 19:00" | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | "3/9/1996 22:00" | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | "3/11/1996 21:00" | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 3 | "3/12/1996 0:00" | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Generating the Predicates DataFrame¶
The predicates dataframe will always have the subject_id and timestamp columns. Rows must be unique over these two columns, since each row can capture multiple events.
ACES is able to automatically compute the predicates dataframe from your dataset and the fields defined in your task configuration if you are using the MEDS or ESGPT data standard. Should you choose to not transform your dataset into one of these two currently supported standards, you may also navigate the transformation yourself by creating your own predicates dataframe.
Again, it is acceptable for your own predicates dataframe to contain only plain predicate columns, as ACES can automatically create derived predicate columns from the boolean logic in the task configuration file. However, complex predicates that cannot be expressed with and/or in the configuration file must be created manually before using ACES. Support for additional complex predicates is planned for the future, including the ability to use SQL or other expressions (see #66).
Note: When creating plain predicate columns directly, you must still define them in the configuration file (the code field may hold an arbitrary value). ACES will verify their existence after data loading (i.e., by validating that your dataframe contains a column with the predicate name). You will also need these definitions to reference the predicates in your windows.
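To make the existence check concrete, here is a rough sketch of the kind of validation described in the note, written in plain Python. The function name, the dictionary layout, and the rule "a predicate without an expr field is plain" are hypothetical choices for illustration and do not mirror ACES's actual implementation.

```python
def missing_predicate_columns(plain_predicates, columns):
    """Return the plain predicate names that have no matching dataframe column."""
    available = set(columns)
    return [name for name in plain_predicates if name not in available]


# Hypothetical in-memory view of the `predicates` section of a task configuration.
config_predicates = {
    "death": {"code": "defined in data"},
    "discharge": {"code": "defined in data"},
    "discharge_or_death": {"expr": "or(discharge, death)"},
}

# Only plain predicates need a pre-existing column; derived ones are computed later.
plain_names = [name for name, spec in config_predicates.items() if "expr" not in spec]

# Columns of a hand-built predicates dataframe that forgot `discharge`.
columns = ["subject_id", "timestamp", "death"]
missing = missing_predicate_columns(plain_names, columns)
```

Here `missing` lists the predicates that would fail the check, which makes it easy to report all absent columns at once.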
Example of the derived predicate discharge_or_death, expressed as an or() relationship between plain predicates discharge and death, which have been directly defined (ie., arbitrary values for their codes, defined in data, are present).
predicates:
  death:
    code: defined in data
  discharge:
    code: defined in data
  discharge_or_death:
    expr: or(discharge, death)
  ...
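For intuition, an expr string like the one above can be read as an operator plus a list of predicate names. The sketch below parses that form and evaluates it for a single row of predicate counts. It assumes that a derived predicate is a 0/1 indicator of whether any (or) or all (and) of its inputs have a positive count; both function names are hypothetical and this is not ACES's parser.

```python
import re


def parse_derived_expr(expr: str) -> tuple[str, list[str]]:
    """Split an expression such as 'or(discharge, death)' into (operator, names)."""
    match = re.fullmatch(r"\s*(and|or)\((.*)\)\s*", expr)
    if match is None:
        raise ValueError(f"Unsupported derived predicate expression: {expr!r}")
    operands = [name.strip() for name in match.group(2).split(",") if name.strip()]
    return match.group(1), operands


def evaluate_derived(expr: str, counts: dict[str, int]) -> int:
    """Evaluate a derived predicate for one row of plain predicate counts."""
    op, operands = parse_derived_expr(expr)
    present = [counts[name] > 0 for name in operands]
    # Assumption: the derived value is a 0/1 indicator, not a summed count.
    return int(any(present)) if op == "or" else int(all(present))


# One row where only `death` occurred.
row = {"discharge": 0, "death": 1}
either = evaluate_derived("or(discharge, death)", row)
both = evaluate_derived("and(discharge, death)", row)
```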