ACES Technical Details

Configuration Language Specification

This document specifies the configuration language for the automatic extraction of task dataframes and cohorts from structured EHR data organized either via the MEDS format (recommended) or the ESGPT format. This extraction system works by defining a configuration object that details the underlying concepts, inclusion/exclusion, and labeling criteria for the cohort/task to be extracted, then using a recursive algorithm to identify all realizations of valid patient time-ranges of data that satisfy those constraints from the raw data. For more details on the recursive algorithm, see Algorithm Design.

As indicated above, these cohorts are specified through a combination of concepts (realized as event predicate functions, aka “predicates”) which are dataset specific and inclusion/exclusion/labeling criteria which, conditioned on a set of predicate definitions, are dataset agnostic.

Predicates are currently limited to “count” predicates, which count the number of times a boolean condition is satisfied over a given time window. That window can either be a single timestamp, in which case the predicate tracks how many observations satisfied the boolean condition in that event (aka at that timestamp), or a 1-dimensional window. In the future, predicates may expand to include other notions of functional characterization, such as tracking the average/min/max value a concept takes on over a time-period, etc.

Constraints are specified in terms of time-points that can be bounded by events that satisfy predicates or temporal relationships on said events. The windows between these time-points can then be constrained to contain events that satisfy certain aggregation functions over predicates for these time frames.


In the machine form used by ACES, the configuration file consists of three parts:

  • predicates, stored as a dictionary from string predicate names (which must be unique) to either aces.config.PlainPredicateConfig objects, which store raw predicates with no dependencies on other predicates, or aces.config.DerivedPredicateConfig objects, which store predicates that build on other predicates.

  • trigger, stored as a string naming a predicate, which is interpreted as an aces.config.EventConfig object.

  • windows, stored as a dictionary from string window names (which must be unique) to aces.config.WindowConfig objects.

Below, we will detail each of these configuration objects.


Predicates: PlainPredicateConfig and DerivedPredicateConfig

aces.config.PlainPredicateConfig: Configuration of Predicates that can be Computed Directly from Raw Data

These configs consist of the following six fields:

  • code: The string expression for the code object that is relevant for this predicate. An observation will only satisfy this predicate if there is an occurrence of this code in the observation. The field can additionally be a dictionary with either a regex key whose value is a regular expression (satisfied if the regular expression evaluates to True), or an any key whose value is a list of strings (satisfied if there is an occurrence of any code in the list).

    [!NOTE] Each individual definition of PlainPredicateConfig and code will generate a separate predicate column. Thus, for memory optimization, it is strongly recommended to match multiple values using either the List of Values or Regular Expression formats whenever possible.

  • value_min: If specified, an observation will only satisfy this predicate if the underlying code occurs with a reported numerical value that is greater than (or greater than or equal to) value_min. The comparison is chosen by value_min_inclusive: value_min_inclusive=True means the value must be greater than or equal to value_min, and value_min_inclusive=False means the value must be strictly greater than value_min.

  • value_max: If specified, an observation will only satisfy this predicate if the underlying code occurs with a reported numerical value that is less than (or less than or equal to) value_max. The comparison is chosen by value_max_inclusive: value_max_inclusive=True means the value must be less than or equal to value_max, and value_max_inclusive=False means the value must be strictly less than value_max.

  • value_min_inclusive: See value_min

  • value_max_inclusive: See value_max

  • other_cols: This optional field accepts a 1-to-1 dictionary of column names to column values, and can be used to specify further constraints on other columns (i.e., not code) for this predicate.

A given observation will be gauged to satisfy or fail to satisfy this predicate in one of two ways, depending on its source format.

  1. If the source data is in MEDS format (recommended), then the code will be checked directly against MEDS’ code field and the value_min and value_max constraints will be compared against MEDS’ numeric_value field.

    [!NOTE] This syntax does not currently support defining predicates that also rely on matching other, optional fields in the MEDS syntax; if this is a desired feature for you, please let us know by filing a GitHub issue or pull request or upvoting any existing issue/PR that requests/implements this feature, and we will add support for this capability.

  2. If the source data is in ESGPT format, then the code will be interpreted in the following manner:

    a. If the code contains a "//", it will be interpreted as being a two element list joined by the "//" character, with the first element specifying the name of the ESGPT measurement under consideration, which should either be of the multi-label classification or multivariate regression type, and the second element being the name of the categorical key corresponding to the code in question within the underlying measurement specified. If either of value_min and value_max are present, then this measurement must be of a multivariate regression type, and the corresponding values_column for extracting numerical observations from ESGPT’s dynamic_measurements_df will be sourced from the ESGPT dataset configuration object.

    b. If the code does not contain a "//", it will be interpreted as a direct measurement name that must be of the univariate regression type and its value, if needed, will be pulled from the corresponding column.
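For the MEDS case, the matching rules described above can be illustrated with a minimal Python sketch. This is not the ACES implementation: the function names, the dictionary-based predicate and observation shapes, the use of re.search for the regex form, and the default of inclusive bounds are all assumptions made for illustration only.

```python
import re
from typing import Optional, Union


def matches_code(code_spec: Union[str, dict], observed_code: str) -> bool:
    """Match a MEDS code against a plain string, {"regex": ...}, or {"any": [...]}."""
    if isinstance(code_spec, str):
        return observed_code == code_spec
    if "regex" in code_spec:
        # Assumption: the regex is satisfied if it is found anywhere in the code.
        return re.search(code_spec["regex"], observed_code) is not None
    if "any" in code_spec:
        return observed_code in code_spec["any"]
    raise ValueError(f"Unsupported code specification: {code_spec!r}")


def in_range(value: Optional[float], predicate: dict) -> bool:
    """Apply value_min / value_max with their inclusive flags to numeric_value."""
    value_min = predicate.get("value_min")
    value_max = predicate.get("value_max")
    if value_min is None and value_max is None:
        return True
    if value is None:
        return False  # a bound was requested but no numeric value was reported
    if value_min is not None:
        # Assumption: bounds default to inclusive in this sketch.
        inclusive = predicate.get("value_min_inclusive", True)
        if not (value >= value_min if inclusive else value > value_min):
            return False
    if value_max is not None:
        inclusive = predicate.get("value_max_inclusive", True)
        if not (value <= value_max if inclusive else value < value_max):
            return False
    return True


def satisfies(observation: dict, predicate: dict) -> bool:
    """Check one observation (code, numeric_value) against one plain predicate."""
    return matches_code(predicate["code"], observation["code"]) and in_range(
        observation.get("numeric_value"), predicate
    )


# Hypothetical observation: a heart-rate measurement of exactly 120.
obs = {"code": "HR", "numeric_value": 120.0}
print(satisfies(obs, {"code": "HR", "value_min": 120, "value_min_inclusive": True}))   # True
print(satisfies(obs, {"code": "HR", "value_min": 120, "value_min_inclusive": False}))  # False
```

The two calls differ only in value_min_inclusive, which is exactly the boundary case the value_min description distinguishes.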

aces.config.DerivedPredicateConfig: Configuration of Predicates that Depend on Other Predicates

These configuration objects consist of only a single string field, expr, which contains a limited grammar of accepted operations that can be applied to other predicates, containing precisely the following:

  • and(pred_1_name, pred_2_name, ...): Asserts that all of the specified predicates must be true.

  • or(pred_1_name, pred_2_name, ...): Asserts that any of the specified predicates must be true.

[!NOTE] Currently, and’s and or’s cannot be nested. Upon user request, we may support further advanced analytic operations over predicates.
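To make the expr grammar concrete, here is a small Python sketch that evaluates an and(...) or or(...) expression against the predicate counts of a single event. The function name, the parsing approach, and the choice to return 1/0 based on whether each referenced count is positive are assumptions for illustration, not a description of how ACES evaluates derived predicates internally.

```python
import re

EXPR_PATTERN = re.compile(r"^(and|or)\((.*)\)$")


def evaluate_derived(expr: str, counts: dict) -> int:
    """Evaluate a non-nested and(...)/or(...) expression over predicate counts."""
    match = EXPR_PATTERN.match(expr.strip())
    if match is None:
        raise ValueError(f"Unsupported expression: {expr!r}")
    operation, arguments = match.groups()
    names = [name.strip() for name in arguments.split(",") if name.strip()]
    if not names:
        raise ValueError("At least one predicate name is required")
    if any("(" in name or ")" in name for name in names):
        raise ValueError("Nested and/or expressions are not supported")
    # A referenced predicate counts as "true" if it was observed at least once.
    flags = [counts[name] > 0 for name in names]
    return int(all(flags) if operation == "and" else any(flags))


event_counts = {"discharge": 0, "death": 1}
print(evaluate_derived("or(discharge, death)", event_counts))   # 1
print(evaluate_derived("and(discharge, death)", event_counts))  # 0
```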


Events: aces.config.EventConfig

The event config consists of only a single field, predicate, which specifies the predicate that must be observed with a value of at least one to satisfy the event. There can only be one defined “event” with an “EventConfig” in a valid configuration, and it will define the “trigger” event of the cohort.

The value of its field can be any defined predicate.


Windows: aces.config.WindowConfig

Windows contain a tracking name field, and otherwise are specified with two parts: (1) a set of four parameters (start, end, start_inclusive, and end_inclusive) that specify the time range of the window, and (2) a dictionary of constraints (the has field) that must be satisfied over the defined predicates for a possible realization of this window to be valid.

Time Range Fields

start and end

Valid windows always progress in time from the start field to the end field. These two fields define, in symbolic form, the relationship between the start and end time of the window. These two fields must obey the following rules:

  1. Linkage to other windows: Firstly, exactly one of these two fields must reference an external event, as specified either through the name of the trigger event or the start or end event of another window. The other field must either be null/None/omitted (which has a very specific meaning, to be explained shortly) or must reference the field that references the external event.

  2. Linkage reference language: Secondly, for both events, regardless of whether they reference an external event or an internal event, that reference must be expressed in one of the following ways.

    1. $REFERENCING = $REFERENCED + $TIME_DELTA, $REFERENCING = $REFERENCED - $TIME_DELTA, etc. In this case, the referencing event (either the start or end of the window) will be defined as occurring exactly $TIME_DELTA either after or before the event being referenced (either the external event or the end or start of the window).

      [!NOTE] If $REFERENCED is the start field, then $TIME_DELTA must be positive, and if $REFERENCED is the end field, then $TIME_DELTA must be negative to preserve the time ordering of the window fields.

    2. $REFERENCING = $REFERENCED -> $PREDICATE, $REFERENCING = $REFERENCED <- $PREDICATE In this case, the referencing event will be defined as the next or previous event satisfying the predicate, $PREDICATE.

      [!NOTE] If the $REFERENCED is the start field, then the “next predicate ordering” ($REFERENCED -> $PREDICATE) must be used, and if the $REFERENCED is the end field, then the “previous predicate ordering” ($REFERENCED <- $PREDICATE) must be used to preserve the time ordering of the window fields. These forms can lead to windows being defined as single point events, if the $REFERENCED event itself satisfies $PREDICATE and the appropriate constraints are satisfied and inclusive values are set.

    3. $REFERENCING = $REFERENCED In this case, the referencing event will be defined as the same event as the referenced event.

  3. null/None/omitted: If start is null/None/omitted, then the window will start at the beginning of the patient’s record. If end is null/None/omitted, then the window will end at the end of the patient’s record. In either of these cases, the other field must reference an external event, per rule 1.
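The three reference forms and the null case can be illustrated with a simplified parser. This is a sketch under stated assumptions: the function name, the returned tuple shape, and the restricted time-delta syntax (an integer followed by m, h, or d) are hypothetical, and the real configuration language may accept richer time expressions.

```python
import re
from datetime import timedelta

# Simplification for this sketch: only minutes, hours, and days are recognized.
UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}
DELTA_PATTERN = re.compile(r"(\S+)\s*([+-])\s*(\d+)\s*([mhd])")


def parse_endpoint(expr):
    """Classify a start/end expression as (kind, referenced, detail)."""
    if expr is None:
        # null/None/omitted: the window extends to the boundary of the record.
        return ("record_boundary", None, None)
    expr = expr.strip()
    for arrow, kind in (("->", "next_event"), ("<-", "previous_event")):
        if arrow in expr:
            referenced, predicate = (part.strip() for part in expr.split(arrow, 1))
            return (kind, referenced, predicate)
    match = DELTA_PATTERN.fullmatch(expr)
    if match:
        referenced, sign, amount, unit = match.groups()
        delta = timedelta(seconds=int(amount) * UNIT_SECONDS[unit])
        return ("time_delta", referenced, delta if sign == "+" else -delta)
    # No operator at all: the endpoint is the referenced event itself.
    return ("same_event", expr, None)


for text in ("trigger + 24h", "start -> discharge_or_death", "gap.end", None):
    print(repr(text), "->", parse_endpoint(text))
```

Checking the arrow forms before the time-delta form matters because "->" and "<-" both contain a minus sign.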

start_inclusive and end_inclusive

These two fields specify whether the start and end of the window are inclusive or exclusive, respectively. This affects both whether the endpoint events are included when computing predicate values over the window and, in the $REFERENCING = $REFERENCED -> $PREDICATE and $REFERENCING = $REFERENCED <- $PREDICATE cases, which events can serve as the next or previous $PREDICATE event. For example, suppose that start_inclusive=False, that the end field is equal to start -> $PREDICATE, and that the start event itself satisfies $PREDICATE. Because start_inclusive=False, we treat the window as truly starting an iota after the timestamp of the start event, so the start event cannot also serve as the $PREDICATE event that marks the window end: its timestamp as the prospective window end would fall before the effective window start.
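The effect of start_inclusive on the "next predicate" search can be sketched as follows. The function name and the (timestamp, count) event representation are hypothetical and chosen only for this example.

```python
from datetime import datetime


def next_predicate_event(events, start, start_inclusive):
    """Return the first timestamp at or after `start` whose predicate count is positive.

    `events` is a time-sorted list of (timestamp, count) pairs for one subject.
    When start_inclusive is False, an event exactly at `start` is skipped.
    """
    for timestamp, count in events:
        if count < 1:
            continue
        if timestamp > start or (start_inclusive and timestamp == start):
            return timestamp
    return None


first = datetime(2020, 1, 1, 12, 3, 31)
second = datetime(2020, 1, 4, 11, 12, 0)
events = [(first, 1), (second, 1)]  # both events satisfy the predicate

print(next_predicate_event(events, first, start_inclusive=True))   # 2020-01-01 12:03:31
print(next_predicate_event(events, first, start_inclusive=False))  # 2020-01-04 11:12:00
```

With start_inclusive=True the window collapses to a single point event; with start_inclusive=False the search moves on to the next satisfying event.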

Constraints Field

The constraints field is a dictionary that maps predicate names to tuples of the form (min_valid, max_valid), which define the valid range for the count of observations of the named predicate that must be found in a window for it to be considered valid. Either the min_valid or the max_valid constraint can be None, in which case that endpoint is left unconstrained. Likewise, unreferenced predicates are also left unconstrained.

[!NOTE] As predicate counts are always integral, this specification does not need an additional inclusive/exclusive endpoint field, as one can simply increment the bound by one in the appropriate direction to achieve the result. Instead, this bound is always interpreted to be inclusive, so a window would satisfy the constraint for predicate name with constraint name: (1, 2) if the count of observations of predicate name in a window was either 1 or 2. All constraints in the dictionary must be satisfied on a window for it to be included.
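The inclusive (min_valid, max_valid) semantics can be sketched in a few lines of Python. The function name and the plain-dictionary inputs are assumptions for illustration.

```python
def window_is_valid(counts: dict, has: dict) -> bool:
    """Check window predicate counts against inclusive (min_valid, max_valid) bounds."""
    for name, (min_valid, max_valid) in has.items():
        count = counts.get(name, 0)
        if min_valid is not None and count < min_valid:
            return False
        if max_valid is not None and count > max_valid:
            return False
    return True  # every listed constraint holds; unlisted predicates are unconstrained


has = {"admission": (1, 2), "covid_dx": (None, 0)}
print(window_is_valid({"admission": 2, "covid_dx": 0, "death": 5}, has))  # True
print(window_is_valid({"admission": 3, "covid_dx": 0}, has))              # False
```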

Algorithm Overview

We will assume that we are given a dataframe df which details events that have happened to subjects. Each row in the dataframe will have a subject_id column which identifies the subject, and a timestamp column which identifies the timestamp at which the event the row is describing happened. df would be constructed to have unique subject_id and timestamp pairs.

We will also assume this dataframe has a collection of columns which describe the event in a variety of ways. These columns can either have a binary value (1/0) representing whether certain properties are True/False for each row’s event, or a count (integer) for the number of times that certain properties hold within each row’s event. We’ll call these additional properties/columns “predicates” over the events, as they can often be interpreted as boolean or count functions over the event.

For example, we may consider a dataframe df_clinical_events that quantifies clinical events happening to patients, with predicates "admission", "discharge", "death", and "covid_dx", like this:

subject_id  timestamp            admission  discharge  death  covid_dx
1           2020-01-01 12:03:31  1          0          0      0
1           2020-01-01 12:33:01  0          0          0      0
1           2020-01-01 13:02:58  0          0          0      0
1           2020-01-01 15:00:00  0          0          0      0
1           2020-01-04 11:12:00  0          1          0      0
1           2022-04-22 07:45:00  0          0          1      0
2           2020-01-01 12:03:31  1          0          0      0
2           2020-01-02 10:18:29  0          0          0      0
2           2020-01-02 16:18:29  0          0          0      1
2           2020-01-03 14:47:31  0          0          1      0
3           2020-01-01 12:03:31  1          0          0      0
3           2020-01-02 12:03:31  0          1          0      0
3           2022-01-01 12:03:31  1          0          0      0
3           2022-01-06 12:03:31  0          0          1      0

In this case, we have 3 subjects (patients), which have the following respective approximate time series of events:

  • Subject 1 is admitted, has 3 events that don’t satisfy any predicates, is discharged, dies, and has no further events.

  • Subject 2 is admitted, has an event that satisfies no predicate, has a COVID diagnosis, dies, and has no further events.

  • Subject 3 is admitted, then is discharged, then is admitted again, and then dies.

Events that don’t satisfy any predicates in this particular case could represent a variety of other events in the medical record, such as a lab test, a procedure, or a non-COVID diagnosis, just to name a few.

Given data like this, our algorithm is designed to extract valid start and end times of “windows” within a subject’s time series that satisfy certain inclusion and exclusion criteria and are defined with temporal and event-bounded constraints. We can use this algorithm to automatically extract windows of interest from the record, including but not limited to data cohorts and downstream task labeled datasets for machine learning applications.

We will specify these windows using a configuration file language that is ultimately interpreted into a tree structure. For example, suppose we wish to extract a dataset for the prediction of in-hospital mortality from the data defined in the above df_clinical_events dataframe, such that we wish to include the first 24 hours of data of each hospitalization as an input to a model, and predict whether the patient will die within the hospital. Suppose we also subject the dataset to constraints where the admission in question must be at least 48 hours in length and that the patient must not have a COVID diagnosis within that admission.

We might then specify these windows using the defined predicates in the configuration file language as follows:

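# Each admission event is a candidate trigger.
trigger: admission

windows:
  # Model input: ends 24 hours after the trigger; the omitted start means
  # the window begins at the start of the patient's record.
  input:
    start:
    end: trigger + 24h
  # Exclusion window: none of these predicates may occur in the 48 hours after the trigger.
  gap:
    start: trigger
    end: start + 48h
    has:
      admission: (None, 0)
      discharge: (None, 0)
      death: (None, 0)
      covid_dx: (None, 0)
  # Label window: from the end of the gap to the next discharge or death.
  # discharge_or_death is assumed to be a derived predicate such as or(discharge, death).
  target:
    start: gap.end
    end: start -> discharge_or_death
    has:
      covid_dx: (None, 0)
    label: death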

Given that our machine learning model seeks to predict in-hospital mortality, our dataset should include both positive and negative samples (patients that died in the hospital and patients that didn’t die). Hence, the target “window” concludes at either a "death" event (patients that died) or a "discharge" event (patients that didn’t die).

We can see that this set of specifications can be realized in a “valid” form for a patient if there exists a set of time points such that, within 48 hours after an admission, there are no further admissions, discharges, deaths, or COVID diagnoses, and that there exists a discharge or death event after the first 48 hours of an admission where there were no COVID diagnoses between the end of that first 48 hours and the subsequent discharge or death event.
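As a worked example, the brute-force Python sketch below applies these criteria directly to the example table. It is not the recursive algorithm described in this document, only a check of which admissions yield valid realizations. It assumes that each window excludes its start timestamp and includes its end timestamp (so the trigger admission does not violate the gap's own admission constraint), and it ignores the input window because that window carries no constraints. All function and variable names are hypothetical.

```python
from datetime import datetime, timedelta

PREDICATES = ("admission", "discharge", "death", "covid_dx")


def ts(text):
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


# (subject_id, timestamp, admission, discharge, death, covid_dx), copied from the table.
ROWS = [
    (1, "2020-01-01 12:03:31", 1, 0, 0, 0), (1, "2020-01-01 12:33:01", 0, 0, 0, 0),
    (1, "2020-01-01 13:02:58", 0, 0, 0, 0), (1, "2020-01-01 15:00:00", 0, 0, 0, 0),
    (1, "2020-01-04 11:12:00", 0, 1, 0, 0), (1, "2022-04-22 07:45:00", 0, 0, 1, 0),
    (2, "2020-01-01 12:03:31", 1, 0, 0, 0), (2, "2020-01-02 10:18:29", 0, 0, 0, 0),
    (2, "2020-01-02 16:18:29", 0, 0, 0, 1), (2, "2020-01-03 14:47:31", 0, 0, 1, 0),
    (3, "2020-01-01 12:03:31", 1, 0, 0, 0), (3, "2020-01-02 12:03:31", 0, 1, 0, 0),
    (3, "2022-01-01 12:03:31", 1, 0, 0, 0), (3, "2022-01-06 12:03:31", 0, 0, 1, 0),
]
EVENTS = [(sid, ts(t), dict(zip(PREDICATES, flags))) for sid, t, *flags in ROWS]


def count_in(subject_id, start, end):
    """Sum predicate counts over the subject's events with start < timestamp <= end."""
    totals = dict.fromkeys(PREDICATES, 0)
    for sid, when, preds in EVENTS:
        if sid == subject_id and start < when <= end:
            for name in PREDICATES:
                totals[name] += preds[name]
    return totals


def extract():
    """Return (subject_id, trigger timestamp, label) for every valid realization."""
    results = []
    for sid, trigger, preds in EVENTS:
        if preds["admission"] < 1:
            continue  # only admissions can be triggers
        gap_end = trigger + timedelta(hours=48)
        if any(count_in(sid, trigger, gap_end).values()):
            continue  # gap: no admission, discharge, death, or covid_dx allowed
        # Target end: the first discharge or death after the end of the gap.
        ends = [when for s, when, p in EVENTS
                if s == sid and when > gap_end and (p["discharge"] or p["death"])]
        if not ends:
            continue
        target = count_in(sid, gap_end, min(ends))
        if target["covid_dx"] > 0:
            continue  # target: no covid_dx allowed
        results.append((sid, trigger, target["death"]))
    return results


for row in extract():
    print(row)
```

Under these assumptions, the only valid realizations are subject 1's admission (label 0, ending in discharge) and subject 3's second admission (label 1, ending in death). Subject 2 is excluded by the COVID diagnosis in the gap, and subject 3's first admission is excluded by the discharge within 48 hours.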

These windows form a naturally hierarchical, tree-based structure based on their relative dependencies on one another. In particular, we can realize the following tree structure constructed by nodes inferred for the above configuration:

- Trigger
  - Gap Start (Trigger)
    - Gap End (Gap Start + 48h)
      - Target Start (Gap End)
        - Target End (subsequent "discharge" or "death")
  - Input End

Our algorithm will naturally rely on this hierarchical structure by performing a set of recursive database search operations to extract the windows that satisfy the constraints of the configuration file by recursing over each subtree to find windows that satisfy the constraints of those subtrees individually.
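The tree above can be represented as a simple parent-to-children mapping, and recursing over subtrees then amounts to a depth-first traversal. The node names and the dictionary representation in this sketch are hypothetical; they mirror the tree shown above rather than any ACES data structure.

```python
# Parent -> children, mirroring the tree shown above.
TREE = {
    "trigger": ["gap.start", "input.end"],
    "gap.start": ["gap.end"],
    "gap.end": ["target.start"],
    "target.start": ["target.end"],
    "target.end": [],
    "input.end": [],
}


def preorder(node, tree=TREE):
    """Visit a node, then recurse into each child's subtree in order."""
    order = [node]
    for child in tree[node]:
        order.extend(preorder(child, tree))
    return order


print(preorder("trigger"))
# ['trigger', 'gap.start', 'gap.end', 'target.start', 'target.end', 'input.end']
print(preorder("gap.end"))
# ['gap.end', 'target.start', 'target.end']
```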

In the rest of this document, we will detail how our algorithm automatically extracts records that meet these criteria and the terminology we use to describe our algorithm (both here and in the raw source code and code comments). There are certain limitations of this algorithm where some kinds of tasks cannot yet be expressed directly (more information available in the FAQs and the Future Roadmap). Details about the true configuration language that is used in practice to specify “windows” can be found in Configuration Language Specification. Some task examples are available in Task Examples.


Algorithm Terminology

Event

An “event” in our dataset is a unique timestamp that occurs for a given subject.

Predicate

A “predicate” is a boolean or count function that can be applied to an event to describe the observations that an underlying dataset included within the timestamp of that event. They will often be boolean functions at the beginning of the process, but become aggregated into count functions when summarizing windows, so will be thought of as count functions to capture this generality throughout the algorithm, as it is rarely, if ever, necessary to distinguish between the two.

Window

A “window” is just a time range capturing some portion of a subject’s record. It can be inclusive or exclusive on either endpoint, and may or may not have endpoints corresponding to an extant event in the dataset, as opposed to a time point at which no event occurred.

Time is treated as non-decreasing in our algorithm (i.e., the start of a “window” will always be before or equal to the end of that “window”).

A “Root” of a Subtree

A subtree in the hierarchy of constraint windows has a “root” node in the tree, which corresponds to the start or end of a “window” in the set of constraints. For example, the “Gap End” node in the tree above is the root of the subtree Gap End -> Target Start -> Target End.

A “Realized” Subtree of Constraint Windows

A subtree in the hierarchy of constraint windows can be realized in a patient dataset by finding a set of timestamps such that the windows of events they bound satisfy the constraints of the subtree. For instance, using our example in-hospital mortality task above, the subtree Gap End -> Target Start -> Target End would be realized if, given the “Gap End” timestamp, we can find:

  • A timestamp for “Target Start”, which is equal to the timestamp of “Gap End” in this example.

  • A timestamp for “Target End”, which should be equal to the timestamp of a "death" or "discharge" event, such that there are no "covid_dx" events between the timestamp of “Target Start” and the timestamp of “Target End”.

An “Anchor” or “Anchor Event” of a Subtree

A subtree in the hierarchy of constraint windows that can be realized in a real patient’s record will have one most recent ancestor node whose timestamp will correspond to the timestamp of a real event in the patient record. This node is called the “anchor” of the subtree. For example, in any realization of the tree above, the admission event matched by the “Trigger” node will be the anchor of the realization of the Gap End -> Target Start -> Target End subtree, as the Gap End is defined via a relative time gap to the admission event and thus cannot be guaranteed to correspond to an extant event in the patient record. However, the admission event of the Trigger node will always correspond to an extant event in the patient record and exist in the dataset proper.

This notion of an anchor will be useful in the algorithm as it will correspond to rows from which we will perform temporal and event-based aggregations to determine whether windows satisfy subtree constraints.


Algorithm Design

I. Initialization

Inputs

During initialization, we will be given the following inputs:

cfg

cfg is an aces.config.TaskExtractorConfig object containing our task definition, including all information about predicates, the trigger event, and windows.

predicates_df

The predicates_df dataframe will contain all events and their predicates.

Computation

During initialization, we will first ensure that the predicates dataframe contains unique (subject_id, timestamp) pairs. This is to ensure that no memory leaks occur over mismatched/extra rows when joining dataframes.
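A uniqueness check of this kind can be sketched with the standard library. The function name and the list-of-dictionaries representation are assumptions; the actual check operates on a dataframe.

```python
from collections import Counter


def duplicated_keys(rows):
    """Return the (subject_id, timestamp) pairs that occur more than once."""
    counts = Counter((row["subject_id"], row["timestamp"]) for row in rows)
    return sorted(key for key, seen in counts.items() if seen > 1)


rows = [
    {"subject_id": 1, "timestamp": "2020-01-01 12:03:31", "admission": 1},
    {"subject_id": 1, "timestamp": "2020-01-01 12:03:31", "admission": 0},
    {"subject_id": 2, "timestamp": "2020-01-01 12:03:31", "admission": 1},
]
print(duplicated_keys(rows))  # [(1, '2020-01-01 12:03:31')]
```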

Identify Prospective Root Anchors

Prior to summarizing the rest of the task tree, we first identify prospective root anchors by checking the constraints of the trigger event. The trigger event represents the root node of the tree we aim to realize, and thus this first step can significantly filter our cohort.

Recurse over Each Subtree

With this dataframe, we can proceed to traverse the tree and recurse over each subtree rooted at each node.


II. Recursive Step

Inputs

In our recursive step, we will be given the following inputs:

predicates_df

The predicates_df dataframe will contain all events and their predicates. This will not be modified across recursive steps.

subtree_anchor_to_subtree_root_df

The subtree_anchor_to_subtree_root_df dataframe will contain rows corresponding to the timestamps of a superset of all possible valid anchor events for realizations of the subtree over which we are recursing. It is a superset because a prospective anchor in this input dataframe is a true valid anchor only if we can find a valid subtree realization for it; if no valid realization exists, the prospective anchor is invalid.

This dataframe will also contain the counts of predicates between the prospective anchor events indexed by the rows of this dataframe and the corresponding possible root timestamps of the subtree over which we are recursing. This information will be necessary to compute the proper counts within a “window” during the recursive step.

offset

In the event that the subtree root timestamp is not the same as the subtree anchor timestamp (there may be a temporal offset between the two), the offset will be the difference between the two timestamps. If the two are not the same, they are guaranteed to be separated by a constant offset because the subtree root will either correspond to a fixed time delta from the subtree anchor or will be an actual event itself, in which case it will be the subtree anchor.
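The relationship between a subtree root, its anchor, and the offset can be sketched by walking up the example tree and accumulating fixed time deltas until a node tied to a real event is reached. The edge table, node names, and function name below are hypothetical and serve only to illustrate the definition.

```python
from datetime import timedelta

# child -> (parent, edge). A timedelta edge is a fixed temporal offset from the
# parent; the string "event" marks a child that is itself a real event.
EDGES = {
    "gap.start": ("trigger", timedelta(0)),
    "gap.end": ("gap.start", timedelta(hours=48)),
    "target.start": ("gap.end", timedelta(0)),
    "target.end": ("target.start", "event"),
    "input.end": ("trigger", timedelta(hours=24)),
}


def anchor_and_offset(node):
    """Return (anchor node, constant offset from the anchor to `node`)."""
    offset = timedelta(0)
    while node in EDGES:
        parent, edge = EDGES[node]
        if edge == "event":
            break  # the node is a real event, so it is its own anchor
        offset += edge
        node = parent
    return node, offset


print(anchor_and_offset("gap.end"))     # ('trigger', datetime.timedelta(days=2))
print(anchor_and_offset("target.end"))  # ('target.end', datetime.timedelta(0))
```

For the example task, the subtree rooted at "gap.end" is anchored at the trigger admission with a constant offset of 48 hours.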

Computation

In the recursive step, we will iterate over all children of the subtree root node. For each child, we will do the following:

Aggregate Predicates over the Relevant “Window”

First, we will aggregate the predicates from predicates_df over the rows corresponding to the “window” spanning the root of the subtree to the root of the selected child. This aggregation step will always return a dataframe keyed by the subject_id column as well as by any possible prospective realizations of anchor events for the subtree rooted at the selected child node. This computation will take one of two forms:

Temporal Aggregation

If the edge linking the subtree root to the child is a temporal relationship (e.g., in our example above, the “Gap End” node is defined as a fixed time delta from the “Gap Start” node), we will aggregate the predicates by using a “rolling” (or “temporal” group-by) operation on the predicates_df dataframe, summarizing time windows of the appropriate size and grouping by the subject_id column. We will perform this aggregation globally over the predicates dataframe, leveraging the determined edge time delta and the passed offset parameter (such that we compute the aggregation over the correct “window” in time from any possible realization of the subtree anchor) and then filter the resulting dataframe to only include rows corresponding to said possible subtree anchors. As a temporal edge means the anchor of the child subtree is the same as the anchor of the passed subtree, this suffices for our intended computation, and we can return it directly.
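The result of a temporal aggregation can be illustrated with a naive loop that, for each prospective anchor, sums a predicate over the window starting at anchor + offset and lasting for the edge's time delta. This is deliberately not a rolling group-by; it only shows what the aggregated value is. The function name, the tuple-based inputs, and the choice of an exclusive start and inclusive end are assumptions.

```python
from datetime import datetime, timedelta


def temporal_counts(events, anchors, offset, delta):
    """Count predicate occurrences in (anchor + offset, anchor + offset + delta].

    `events` holds (subject_id, timestamp, count) rows for a single predicate;
    `anchors` holds (subject_id, timestamp) pairs of prospective subtree anchors.
    """
    result = {}
    for subject_id, anchor in anchors:
        start = anchor + offset
        end = start + delta
        result[(subject_id, anchor)] = sum(
            count for sid, when, count in events
            if sid == subject_id and start < when <= end
        )
    return result


# covid_dx rows for subject 2 in the example table.
covid_events = [
    (2, datetime(2020, 1, 1, 12, 3, 31), 0),
    (2, datetime(2020, 1, 2, 10, 18, 29), 0),
    (2, datetime(2020, 1, 2, 16, 18, 29), 1),
    (2, datetime(2020, 1, 3, 14, 47, 31), 0),
]
admission = (2, datetime(2020, 1, 1, 12, 3, 31))

# Gap window: 48 hours starting at the anchor itself (offset 0).
print(temporal_counts(covid_events, [admission], timedelta(0), timedelta(hours=48)))
```

Here the single anchor maps to a count of 1, which is why subject 2's admission fails the gap constraint covid_dx: (None, 0).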

Event-bound Aggregation

If the edge linking the subtree root to the child is an event-bound relationship (e.g., in our example above, the “Target End” node is defined as the first subsequent "discharge" or "death" event after the “Target Start” node), we will aggregate the predicates by using a custom row-predicate-bound aggregation over the database that will be implemented using differences of cumulative sums within the global predicates_df dataframe. In particular, we will first construct the following three dataframes from our inputs:

  1. A dataframe that contains the cumulative count of all predicates seen up until each event (row) in the predicates_df.

  2. A dataframe that contains nulls in each row that does not correspond to a possible prospective realization of a child anchor event given the edge constraints and the possible prospective subtree anchor events and the specified offset, and contains a True value otherwise.

  3. A dataframe that contains nulls in each row that does not correspond to a possible prospective realization of this subtree’s anchor event and contains a True value otherwise.

From these three dataframes, we can then forward fill (in time) the cumulative counts of each predicate seen at each prospective subtree anchor event up to the next subsequent possible child anchor node, and take the difference between the two to compute the relative counts for each predicate column between each successive pair of subtree anchor events and child anchor events, keyed by child anchor. We must then also subtract the counts seen between the subtree anchor events and the subtree root events to ensure we are only capturing the events between the correct subtree root and child root.
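The cumulative-sum idea can be sketched for a single subject and a single predicate using plain lists in place of dataframes. The function name, the row-index representation, and the convention that the count excludes the anchor row and includes the child row are assumptions made for this illustration.

```python
from itertools import accumulate


def event_bound_counts(counts, is_anchor, is_child, anchor_to_root=None):
    """Return {child_row: (anchor_row, count)} for one subject's time-sorted rows.

    `counts[i]` is the predicate count at row i. Each anchor is paired with the
    next subsequent child row, and the count covers rows after the anchor up to
    and including the child, minus any count already seen between the anchor
    and the subtree root (`anchor_to_root`, keyed by anchor row).
    """
    cumulative = list(accumulate(counts))  # inclusive running total per row
    anchor_to_root = anchor_to_root or {}
    result = {}
    pending = None  # most recent anchor still waiting for a child (forward fill)
    for row in range(len(counts)):
        if is_child[row] and pending is not None:
            between = cumulative[row] - cumulative[pending]
            result[row] = (pending, between - anchor_to_root.get(pending, 0))
            pending = None
        if is_anchor[row]:
            pending = row
    return result


# Subject 3 in the example table: admission, discharge, admission, death.
death = [0, 0, 0, 1]
is_admission = [True, False, True, False]
is_discharge_or_death = [False, True, False, True]
print(event_bound_counts(death, is_admission, is_discharge_or_death))
# {1: (0, 0), 3: (2, 1)}
```

The first admission is closed by a discharge with no deaths in between, and the second by a death, so a single pass over the cumulative sums yields both window counts without rescanning rows.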

Filter on Constraints

Next, with these new “window” counts, we can validate that any inclusion or exclusion criteria are upheld, and if not, remove those subtrees as possible realizations of the “window” before proceeding to the next computational step.

Recurse through Child Subtree

With this filtered set of possible prospective child anchor nodes, we can now recurse through the child subtree.


III. Clean-Up

Inputs

After recursion, we will have a result dataframe:

result

This dataframe contains rows that represent valid realizations of the task tree. Each node of the tree will have a column with a pl.Struct object containing the name of the window the node represents, the start and end times of the window, and counts of all defined predicates.

Computation

With this result, we can then proceed with some clean-up to optimize the output and streamline downstream tasks by doing the following:

Labeling

If a label field is specified in exactly one defined window in the task configuration, a column will be created to serve as the label for the task. The field corresponds to a defined predicate, and as such, that predicate count for that window will be extracted.

Indexing Timestamp

If an index_timestamp field is specified in exactly one defined window in the task configuration, a column will be created to serve as an index for the output cohort. This timestamp can be manually specified to any start or end timestamp of any desired window; however, it should represent the timestamp at which point a prediction can be made (i.e., at the end of the input windows).

Matching Input Schemas

For queries on MEDS-formatted datasets, ACES will automatically typecast columns and filter dataframes appropriately to match the label schema defined in MEDS v0.3.

Re-order & Return

Finally, given this dataframe, the algorithm will sort the columns by placing subject_id, index_timestamp, label, and trigger first, if available and in that order, followed by all other window summary columns in the order of a pre-order traversal of the task tree.
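The column ordering rule can be sketched as follows. The function name and the window column names used in the example are hypothetical; only the four leading column names come from the description above.

```python
LEADING = ("subject_id", "index_timestamp", "label", "trigger")


def order_columns(columns, preorder_windows):
    """Place the leading columns first (if present), then window columns in pre-order."""
    leading = [name for name in LEADING if name in columns]
    windows = [name for name in preorder_windows
               if name in columns and name not in leading]
    # Anything else keeps its original relative order at the end.
    remaining = [name for name in columns
                 if name not in leading and name not in windows]
    return leading + windows + remaining


columns = ["target.end", "label", "gap.end", "subject_id", "trigger", "input.end"]
print(order_columns(columns, ["gap.end", "target.end", "input.end"]))
# ['subject_id', 'label', 'trigger', 'gap.end', 'target.end', 'input.end']
```

Because no index_timestamp column is present in this example, it is simply skipped, matching the "if available" wording above.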