Predicates¶

In ACES, predicates specify how particular concepts relevant to your task of interest is expressed in your dataset of interest. These dataset-specific items form a large foundation of the cohort extraction algorithm as the more complex dataset-agnostic windowing logic of your task is defined based on your predicates, ultimately facilitating ease-of-sharing for your task configurations.

Predicate Columns¶

A predicate column is simply a column in a dataframe containing numerical counts (often just 0’s and 1’s), representing the number of times a given predicate (concept) occurs at a given timestamp for a given patient.

Suppose you had a simple time-sorted dataframe as follows:

subject_id	timestamp	code	value
1	null	SEX//male	null
1	1989-01-01 00:00:00	ADMISSION	null
1	1989-01-01 01:00:00	LAB//HR	90
1	1989-01-01 01:00:00	PROCEDURE_START	null
1	1989-01-01 02:00:00	DISCHARGE	null
1	1989-01-01 02:00:00	PROCEDURE_END	null
2	null	SEX//female	null
2	1991-05-06 12:00:00	ADMISSION	null
2	1991-05-06 20:00:00	DEATH	null
3	null	SEX//male	null
3	1980-10-17 22:00:00	ADMISSION	null
3	1980-10-17 22:00:00	LAB//HR	120
3	1980-10-18 01:00:00	LAB//temp	37
3	1980-10-18 09:00:00	DISCHARGE	null
3	1982-02-02 02:00:00	ADMISSION	null
3	1982-02-02 04:00:00	DEATH	null

The code column contains a string of an event that occurred at the given timestamp for a given subject_id. Note: Static variables are shown as rows with null timestamps.

You may then create a series of predicate columns depending on what suits your needs. For instance, here are some plausible predicate columns that could be created:

subject_id	timestamp	admission	discharge	death	discharge_or_death	lab	procedure_start	HR_over_100	male
1	1989-01-01 00:00:00	1	0	0	0	0	0	0	1
1	1989-01-01 01:00:00	0	0	0	0	1	1	1	1
1	1989-01-01 02:00:00	0	1	0	1	0	0	0	1
2	1991-05-06 12:00:00	1	0	0	0	0	0	0	0
2	1991-05-06 20:00:00	0	0	1	1	0	0	0	0
3	1980-10-17 22:00:00	1	0	0	0	1	0	0	1
3	1980-10-18 01:00:00	0	0	0	0	1	0	0	1
3	1980-10-18 09:00:00	0	1	0	1	0	0	0	1
3	1982-02-02 02:00:00	1	0	0	0	0	0	0	1
3	1982-02-02 04:00:00	0	0	1	1	0	0	0	1

Note: This set of predicates are all plain predicates (ie., explicitly expressed as a value in the dataset), with the exception of the derived predicate discharge_or_death, which can be expressed by applying boolean logic on the discharge and death predicates (ie., or(discharge, death)). You may choose to create these columns for derived predicates explicitly (as you would plain predicates). Or, ACES can automatically create them from plain predicates if the boolean logic is provided in the task configuration file. Please see Predicates for more information.

Additionally, you may notice that the tables differ in shape. In the original raw data, (subject_id, timestamp) is not unique. However, a final predicates dataframe must have unique (subject_id, timestamp) pairs. If the MEDS or ESGPT standard is used, ACES will automatically collapse rows down into unique per-patient per-timestamp levels (ie., grouping by these two columns and aggregating by summing predicate counts). However, if creating predicate columns directly, please ensure your dataframe is unique over (subject_id, timestamp).

Sample Predicates DataFrame¶

A sample predicates dataframe is provided in the repository (sample_data/sample_data.csv). This dataframe holds completely synthetic data and was designed such that the accompanying sample configuration files in the repository (sample_configs/) could be directly extracted.

[1]:

import polars as pl

pl.read_csv("../../../sample_data/sample_data.csv")

[1]:

shape: (54, 17)

subject_id	timestamp	male	female	admission	death	discharge	lab	spo2	normal_spo2	abnormally_low_spo2	abnormally_high_spo2	procedure_start	procedure_end	ventilation	diagnosis_ICD9CM_41071	diagnosis_ICD10CM_I214
i64	str	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64	i64
1	null	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0
1	"12/1/1989 12:03"	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0
1	"12/1/1989 13:14"	0	0	0	0	0	1	1	1	0	0	0	0	0	0	0
1	"12/1/1989 15:17"	0	0	0	0	0	1	1	1	0	0	0	0	0	0	0
1	"12/1/1989 16:17"	0	0	0	0	0	1	1	1	0	0	0	0	0	0	0
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
3	"3/9/1996 11:00"	0	0	0	0	0	1	1	1	0	0	0	0	0	0	0
3	"3/9/1996 19:00"	0	0	0	0	0	1	1	1	0	0	0	0	0	0	0
3	"3/9/1996 22:00"	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3	"3/11/1996 21:00"	0	0	0	0	0	0	0	0	0	0	0	1	1	0	0
3	"3/12/1996 0:00"	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0

Generating the Predicates DataFrame¶

The predicates dataframe will always have the subject_id and timestamp columns. They should be unique between these two columns, as each row can capture multiple events.

ACES is able to automatically compute the predicates dataframe from your dataset and the fields defined in your task configuration if you are using the MEDS or ESGPT data standard. Should you choose to not transform your dataset into one of these two currently supported standards, you may also navigate the transformation yourself by creating your own predicates dataframe.

Again, it is acceptable if your own predicates dataframe only contains plain predicate columns, as ACES can automatically create derived predicate columns from boolean logic in the task configuration file. However, for complex predicates that would be impossible to express (outside of and/or) in the configuration file, we direct you to create them manually prior to using ACES. Support for additional complex predicates is planned for the future, including the ability to use SQL or other expressions (see #66).

Note: When creating plain predicate columns directly, you must still define them in the configuration file (they could be with an arbitrary value in the code field) - ACES will verify their existence after data loading (ie., by validating that a column exists with the predicate name in your dataframe). You will also need them for referencing in your windows.

Example of the derived predicate discharge_or_death, expressed as an or() relationship between plain predicates discharge and death, which have been directly defined (ie., arbitrary values for their codes, defined in data, are present).

predicates:
  death:
    code: defined in data
  discharge:
    code: defined in data
  discharge_or_death:
    expr: or(discharge, death)
  ...