--- title: "Developer API" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Developer API} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") if(requireNamespace("pkgload", quietly = TRUE)) { pkgload::load_all(".", quiet = TRUE) } else if(requireNamespace("Immutables", quietly = TRUE)) { library(Immutables) } else { stop("Need either installed 'Immutables' or the 'pkgload' package to render this vignette.") } ``` Every immutables structure is built on finger trees annotated with *monoids* — cached summaries that support O(log n) search and split. This vignette covers the developer API for defining custom monoids, using predicate-based locate and split operations, and validating tree invariants. These operations work uniformly across `flexseq`, `ordered_sequence`, `priority_queue`, and `interval_index`. ## Measures and monoids A monoid is defined by three pieces: a binary combining function `f`, an identity value `i`, and a `measure` function that maps a single leaf entry to its measure value. The combining function *must* be associative (`f(a, b) == f(b, a)`), and the identity must satisfy `f(i, x) == x` and `f(x, i) == x`. Measure values typically represent something you would want to aggregate over a subset of elements, with different measures and aggregations resulting in unique properties. When operating on a `flexseq`, you can think of a monoid's `f` as being applied left-to-right (a left fold), accumulating measure values at each element. The calculation for the first element also uses `f`, which is where `i` is used, as the other input. ``` elements: x1, x2, x3 measures: m1, m2, m3, ... monoid-accumulated measures (infix notation): v1 = i %f% m1 v2 = (i %f% m1) %f% m2 = v1 %f% m2 v3 = ((i %f% m1) %f% m2) %f% m3 = v2 %f% m3 ... ``` This left-to-rightness is only well defined for `flexseq` (using element order defined by the user in construction), `ordered_sequence` (always ordered by key), and `interval_index` (always ordered by interval start). Priority queue element order is maintained internally with new elements being added on the right, not in priority order. There may still be uses for monoids that aggregate priority values. This is because we can compute an aggregate over any *subsequence*, not just prefixes. Internally elements are stored as leaf nodes in a tree structure, and aggregating measures upward can be done purely recursively, and cached for fast access at any level. This monoid keeps track of the sum of values: ```{r} sum_monoid <- measure_monoid( f = `+`, i = 0, measure = function(el) el ) ``` Next we attach it to a structure with `add_monoids()`, which accepts a named list of monoids. We can pass `show_custom_monoids = TRUE` to `print()` to see this new information: ```{r} set.seed(42) task_times <- runif(20, min = 1, max = 10) |> round(digits = 1) |> as_flexseq() x <- add_monoids(task_times, list(sum = sum_monoid)) print(x, show_custom_monoids = TRUE) ``` The `plot_structure()` functions visualizes the internal tree layout. Passing a function to `node_label` lets us render both leaf values and cached structural measures produced by monoids: ```{r fig.height=4, fig.width=10} plot_structure(x, node_label = function(node) { if(node$type == "Element") sprintf("%.1f\nΣ=%.1f", node$element, node$measures$sum) else sprintf("Σ=%.1f", node$measures$sum) }) ``` The tree now caches cumulative sum annotations at every internal node, enabling $O(log n)$ search by accumulated sum. Setting `overwrite = TRUE` replaces an existing monoid with the same name. Each structure type reserves certain monoid names for internal use (e.g. `.size`, `.named_count`, `.pq_min`). Attempting to add a monoid with a reserved name will error. ## Leaf entry contract The provided `measure` function receives a structure-specific *entry*, not just the stored value. The entry shapes are: - **`flexseq`**: the stored element directly. - **`ordered_sequence`**: `list(value, key)`. - **`priority_queue`**: `list(value, priority)`. - **`interval_index`**: `list(value, start, end)`. This means a monoid that works on one structure type may not work on another. Here is a monoid that sums keys on an `ordered_sequence`: ```{r} key_sum <- measure_monoid(`+`, 0L, function(entry) as.integer(entry$key)) os <- as_ordered_sequence(c("alice", "bob", "carol"), keys = c(10, 20, 30)) os <- add_monoids(os, list(key_sum = key_sum)) print(os, show_custom_monoids = TRUE) ``` ## Locating and splitting by predicate Predicate-based operations scan accumulated monoid values left-to-right and find the first point where a predicate transitions from `FALSE` to `TRUE`. The predicate must be *monotonic*: once it returns `TRUE`, it must stay `TRUE` for all subsequent accumulated values. This is what enables the $O(log n)$ search for a single split point. `locate_by_predicate()` finds the *first* element where the predicate fires, without splitting the tree. With `include_metadata = TRUE` it returns additional scan information. The predicate function is applied to accumulated measure values for a given monoid name; here we find the first element where the accumulated sum exceeds $25$. ```{r} loc <- locate_by_predicate(x, function(v) v > 25, "sum", include_metadata = TRUE) str(loc) ``` The `$found` entry indicates that a node where the predicate fires was found; if this is false the predicate is `FALSE` for all elements through the end. The `$value` entry gives the measure of the node where the predicate turned `TRUE`. The optional metadata includes `left_measure` (accumulated value before the match), `hit_measure` (the matched element's own accumulated measure), `right_measure` (accumulated value of everything after), and `index` (1-based position). In this example index 4 is where the first time where the accumulated sum is greater than 25, which we can verify in the plot above ($9.2 + 9.4 + 3.6 = 22.2$, adding in $8.5$ brings the sum to $30.7$). `split_by_predicate()` splits the structure at the predicate point, returning `$left` and `$right`. The matched element is the first element of `$right`: ```{r} s <- split_by_predicate(x, function(v) v > 25, "sum") s$left s$right ``` `split_around_by_predicate()` is similar but extracts the matched element into a separate `$value` field: ```{r} sa <- split_around_by_predicate(x, function(v) v > 25, "sum") sa$left sa$value sa$right ``` Both split variants accept an optional `accumulator` argument to start scanning from a value other than the monoid identity. `split_at()` is a convenience wrapper that splits by position using the built-in `.size` monoid (which counts nodes; each leaf node's measure is `1` and it uses the `sum` operator): ```{r} split_at(x, at = 5) ``` ## Built-in monoids in action As described by [Hinze and Paterson](https://www.cs.ox.ac.uk/ralf.hinze/publications/FingerTrees.pdf), this monoid + predicate pattern is remarkably flexible. The `priority_queue` and `ordered_sequence` types for example are not actually separate data structures, but monoid-annotated `flexseq` structures plus thin wrappers that invoke `locate_by_predicate()` and friends with the right predicate. The same primitives introduced above power every `peek_*`, `pop_*`, and bound query in the package. (Interval indices also follow these patterns, but are more complex than would be useful to review here.) ### Priority queue: `.pq_min` and `.pq_max` Priority queues store elements in insertion order, not priority order, so a linear scan would be the only way to find the min-priority element without some extra bookkeeping. That bookkeeping is `.pq_min` (and its twin `.pq_max`): a monoid whose cached value at every internal node is the smallest priority anywhere in that subtree, combined by picking the side with the smaller priority (left-most on ties, which preserves first-in first-out for equal priorities). `pop_min(q)` then works in three lines of logic: read the root aggregate as the *target* priority, form the predicate "has the accumulated subtree absorbed a slot with this priority yet?", and hand it to `split_around_by_predicate()` over `.pq_min`. The descent uses cached subtree aggregates to prune branches whose min is a different value, so the whole thing runs in $O(\log n)$ even though the leaves themselves are not in priority order. ### Ordered sequence: `.oms_max_key` An ordered sequence is a `flexseq` whose elements are kept sorted by key. Its built-in `.oms_max_key` monoid caches the largest key in each subtree, combined by picking the larger of its two inputs. Because the sequence is already key-ordered, the accumulated max-key is monotonically non-decreasing left-to-right, so a predicate like "is the running max at least my query?" transitions from `FALSE` to `TRUE` exactly once at the correct location. The `lower_bound()` function wires that predicate into `locate_by_predicate()` on `.oms_max_key` and returns the first element with key at or above the target; `peek_key()`, `pop_key()`, `count_between()`, and the rest all follow the same template with different predicate shapes and splitting/concatenating. In practice, `.pq_min`, `.oms_max_key` and other measures are stored as lists, containing the measure of interest (min priority value, max key value) as well as other helper information such as the key type to ensure consistent comparisons. The identity `i` and associative combining function `f` operate on list-based measures of the right shape. ## Example: DNA prefix search Custom monoids and predicate splits combine naturally for index-style queries. Here we build a persistent subsequence index over a DNA string and use it to find all positions where a query pattern occurs as a prefix. ```{r} sequence <- "ACGCGCTCGCGCATAGTCGCGCCTG" query <- "CGCGC" # goal: find indices where query occurs ``` The idea is to store each position in the string as a value, keyed on a subsequence long enough to contain the query (we'll keep 10-letter subsequences): ``` key: ACGCGC..., value = 1 key: CGCGCT..., value = 2 key: GCGCTC..., value = 3 ... ``` However, ordered sequences actually store their values in key-order: ``` key: ACGCGC..., value = 1 key: AGTCGC..., value = 15 key: ATAGTC..., value = 13 ... ``` Since they are in key-order, matches will be found in a run: all subsequences starting with the query will be next to each other. To extract them we just need to find the first one (a splitting operation over a monoid), and the last (another splitting operation over a monoid). First we define a monoid that tracks the maximum key (subsequence string) seen so far. The built-in `.oms_max_key` monoid already tracks this, but we define our own to illustrate the pattern. Recall from above that the measuring function takes a list with shape `list(value = ..., key = ...)` when used with an `ordered_sequence`; in this case we just need it to return the key as the measure. We can add the monoid to an empty tree, and values for new entries will be automatically computed on addition/removal etc. ```{r} max_seq <- measure_monoid(max, i = -Inf, measure = function(entry) entry$key) subseqs <- ordered_sequence() subseqs <- add_monoids(subseqs, list(max_seq = max_seq)) ``` Next we index every position in the DNA string by its length-10 subsequence, inserting them into the structure: ```{r} for(i in 1:nchar(sequence)) { subseq <- substr(sequence, i, i + 10) subseqs <- insert(subseqs, i, key = subseq) } subseqs ``` Now we can search for all subsequences starting with the query. First, split to find the first entry where the measure is `>=` the query: ```{r} first_split <- split_by_predicate(subseqs, function(m) query <= m, "max_seq") first_split$left first_split$right ``` An empty `$right` indicates no matches (all measures are less than the query). If it's not empty, but the first measure in `$right` does not start with the query, then there again no matches (the key is between the left and right but doesn't match). We can use the `get_measures()` function to return a `flexseq` of measure values by name and index with `[[1]]` to get the first measure. If the first key in `$right` *does* start with the query, then there is at least one match, and all matches would be in `$right`. We can split again to find where matching keys end with a different predicate, one that fires on the first measure that is strictly *greater* than the query *and* no longer starts with it (resulting in the index of the first non-match in the run). ```{r} if(length(first_split$right) > 0) { right_measures = get_measures(first_split$right, "max_seq") first_measure = right_measures[[1]] if(startsWith(first_measure, query)) { second_split <- split_by_predicate( first_split$right, function(m) query < m & !startsWith(m, query), "max_seq" ) matches <- second_split$left } } matches ``` Finally we can extract the original string positions from the matching entries: ```{r} positions <- fapply(matches, function(value, key) value) |> unlist() positions ``` ## Validation helpers `validate_tree()` checks structural invariants (monoid consistency, node structure) across the entire tree. `validate_name_state()` verifies that a tree is either fully unnamed or fully named with unique, non-empty names. Both are $O(n)$ and intended for debugging and tests. ```{r} validate_tree(x) validate_name_state(flexseq(a = 1, b = 2, c = 3)) ```