#' (`STEP 2`) Split `pulse_data` across sequential time windows
#'
#' @description
#' * `step 1` -- [pulse_read()]
#' * **`-->>` step 2 -- [pulse_split()] `<<--`**
#' * `step 3` -- [pulse_optimize()]
#' * `step 4` -- [pulse_heart()]
#' * `step 5` -- [pulse_doublecheck()]
#' * `step 6` -- [pulse_choose_keep()]
#'
#' After all raw PULSE data has been imported, the dataset must be split across sequential time windows.
#'
#' `pulse_split()` takes the output from a call to `pulse_read()` and splits data across user-defined time windows. The output of `pulse_split()` can be immediately passed to `pulse_heart()`, or first optimized with `pulse_optimize()` and only then passed to `pulse_heart()` (highly recommended).
#'
#' @inheritParams pulse_read
#' @param pulse_data the output from a call to [pulse_read()].
#' @param window_width_secs numeric, in seconds, defaults to `30`; the width of the time windows over which heart rate frequency will be computed.
#' @param window_shift_secs numeric, in seconds, defaults to `60`; by how much each subsequent window is shifted from the preceding one.
#' @param min_data_points numeric, defaults to `0.8`; decimal from 0 to 1, used as a threshold to discard incomplete windows where data is missing (e.g., if the sampling frequency is `20` and `window_width_secs = 30`, each window should include `600` data points, and so if `min_data_points = 0.8`, windows with fewer than `600 * 0.8 = 480` data points will be rejected).
#' @param subset numerical, defaults to `0`; the number of time windows to keep from the entire dataset (or the number of entries to reject if set to a negative value); smaller subsets make the entire processing quicker and facilitate the execution of trial runs to optimize parameter selection before processing the entire dataset.
#' @param subset_seed numerical, defaults to `NULL`; only used if `subset` is different from `0`; `subset_seed` controls the seed used when extracting a subset of the available data; if set to `NULL`, a random seed is selected, resulting in rows being selected randomly; alternatively, the user can set a specific seed in order to always select the same rows (important when the goal is to compare the impact of different parameter combinations using the exact same data points).
#' @param subset_reindex logical, defaults to `FALSE`; only used if `subset` is different from `0`; after extracting a subset of the available data, should rows be re-indexed (i.e., `.$i` made fully sequential); re-indexed rows make using `pulse_plot_raw()` easier, but row identity no longer matches row identity before subsetting.
#'
#' @seealso
#'  * [pulse_read()], [pulse_optimize()], [pulse_heart()], [pulse_doublecheck()] and [pulse_choose_keep()] are the other functions needed for the complete PULSE processing workflow
#'  * [PULSE()] is a wrapper function that executes all the steps needed to process PULSE data at once
#'
#' @section Window `width` and `shift`:
#' A good starting point for `window_width_secs` is to set it to between `30` and `60` seconds.
#'
#' As a rule of thumb, use lower values for data collected from animals with naturally faster heart rates and/or that have been subjected to treatments conducive to even faster heart rates (e.g., thermal performance ramps). In such cases, lower values will result in higher temporal resolution, which may be crucial if experimental conditions are changing rapidly. Conversely, experiments using animals with naturally slower heart rates and/or subjected to treatments that may cause heart rates to stabilize or even slow (e.g., control or cold treatments) may require the use of higher values for `window_width_secs`, as the resulting windows should encompass no fewer than 5-7 heartbeat cycles.
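#'
#' As a rough check, the number of heartbeat cycles captured by a window can be estimated from the expected heart rate. The sketch below is only an illustration: the heart rates used are assumed values, not recommendations, and the target of 7 cycles is the upper end of the 5-7 range mentioned above.
#'
#' ```r
#' # illustrative heart rates, in beats per minute (assumed values)
#' bpm <- c(slow = 10, fast = 120)
#' window_width_secs <- 30
#'
#' # expected number of heartbeat cycles within one window
#' cycles <- window_width_secs * bpm / 60
#' cycles
#'
#' # minimum window width (in seconds) needed to capture 7 cycles
#' min_width_secs <- 7 * 60 / bpm
#' min_width_secs
#' ```
#'
#' Under these assumed values, a 30-second window is just enough for the slow heart rate and far wider than needed for the fast one.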
#'
#' As for `window_shift_secs`, set it to a value:
#' * smaller than `window_width_secs` if overlap between windows is desired (not usually recommended) (if `window_width_secs = 30` and `window_shift_secs = 15`, the first 3 windows will go from `[0, 30)`, `[15, 45)` and `[30, 60)`)
#' * equal to `window_width_secs` to process all data available (if `window_width_secs = 30` and `window_shift_secs = 30`, the first 3 windows will go from `[0, 30)`, `[30, 60)` and `[60, 90)`)
#' * larger than `window_width_secs` to skip data (ideal for speeding up the processing of large datasets) (if `window_width_secs = 30` and `window_shift_secs = 60`, the first 3 windows will go from `[0, 30)`, `[60, 90)` and `[120, 150)`)
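#'
#' The three cases above can be reproduced with a short sketch. Here `first_windows()` is a hypothetical helper written only for this illustration (it is not part of the package), and window limits are expressed in seconds from the start of the first window, whereas `pulse_split()` itself works with timestamps.
#'
#' ```r
#' window_width_secs <- 30
#'
#' # start and end (end not included) of the first n windows for a given shift
#' first_windows <- function(window_shift_secs, n = 3) {
#'   start <- seq(0, by = window_shift_secs, length.out = n)
#'   data.frame(start = start, end = start + window_width_secs)
#' }
#'
#' first_windows(15) # overlapping windows
#' first_windows(30) # contiguous windows, all data processed
#' first_windows(60) # data between windows is skipped
#' ```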
#'
#' Also consider that lower values for the `window_...` parameters may lead to oversampling and a cascade of statistical issues, the resolution of which may end up negating any advantage gained.
#'
#' @section Handling gaps in the dataset:
#' `min_data_points` shouldn't be set too low, otherwise only nearly empty windows will be rejected.
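#'
#' The threshold arithmetic can be sketched as follows, using the values from the `min_data_points` argument description. The vector `n_points` holds made-up row counts for four hypothetical windows and is only meant to show which windows would pass.
#'
#' ```r
#' freq <- 20 # sampling frequency, in Hz
#' window_width_secs <- 30
#' min_data_points <- 0.8
#'
#' expected_points <- freq * window_width_secs # 600
#' threshold <- expected_points * min_data_points # 480
#'
#' # made-up number of data points found in 4 windows
#' n_points <- c(600, 590, 470, 12)
#'
#' # windows kept with the default threshold
#' n_points >= threshold
#'
#' # with a very low value (e.g., 0.05), only the nearly empty window is rejected
#' n_points >= expected_points * 0.05
#' ```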
#'
#' @return
#' A tibble with three columns. Column `$i` stores the order of each time window. Column `$smoothed` is a logical vector flagging smoothed data (at this point always `FALSE`; values can later change to `TRUE` if [pulse_optimize()] is used). Column `$data` is a list with all the valid time windows (i.e., complying with `min_data_points`), each window being a subset of `pulse_data` (a tibble with at least 2 columns, time plus one or more channels, containing the PULSE data with timestamps within that time window).
#'
#' @export
#'
#' @examples
#' ## Begin prepare data ----
#' pulse_data_sub <- pulse_data
#' pulse_data_sub$data <- pulse_data_sub$data[,1:5]
#' ## End prepare data ----
#'
#' pulse_split(pulse_data_sub)
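#'
#' ## The calls below are illustrative sketches of the arguments documented
#' ## above; the specific values are arbitrary and not recommendations.
#'
#' ## contiguous windows (no data skipped between windows)
#' pulse_split(pulse_data_sub, window_width_secs = 30, window_shift_secs = 30)
#'
#' ## keep a reproducible random subset of (at most) 5 windows, re-indexed
#' pulse_split(pulse_data_sub, subset = 5, subset_seed = 1, subset_reindex = TRUE)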
pulse_split <- function(pulse_data, window_width_secs = 30, window_shift_secs = 60, min_data_points = 0.8, subset = 0, subset_seed = NULL, subset_reindex = FALSE, msg = TRUE) {
  ## CHECKS INITIATED ## ------------------- ##
  stopifnot(identical(names(pulse_data), c("data", "multi", "vrsn", "freq")))
  stopifnot(is.pulse.tbl(pulse_data$data))
  stopifnot(is.numeric(pulse_data$freq))
  stopifnot(is.numeric(window_width_secs))
  stopifnot(length(window_width_secs) == 1)
  stopifnot(is.numeric(window_shift_secs))
  stopifnot(length(window_shift_secs) == 1)
  stopifnot(is.numeric(min_data_points))
  stopifnot(dplyr::between(min_data_points, 0, 1))
  stopifnot(is.logical(msg))
  ## CHECKS COMPLETED ## ------------------- ##

  freq <- pulse_data$freq
  data <- pulse_data$data

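  # define the target time windows
  #   t0 is the first full minute after the first timestamp
  #   (seconds are truncated to "00", then 60 seconds are added)
  #   t1 is the last timestamp available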
  t0 <- dplyr::first(data$time) %>%
    stringr::str_sub(1, 17) %>%
    stringr::str_c("00") %>%
    as.POSIXct(tz = "UTC") %>%
    magrittr::add(60)
  t1 <- dplyr::last(data$time)

  window_t0 <- seq(
    t0,
    t1,
    by = window_shift_secs)

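  # split data
  #   windows are half-open, [start, start + window_width_secs), as documented,
  #   so that a data point is never assigned to two contiguous windows
  pulse_data_split <- purrr::map(
    window_t0,
    ~dplyr::filter(data, time >= .x, time < .x + window_width_secs)
  )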

  # discard windows that don't have enough data points
  #   used to skip data gaps (such as when the PULSE system is momentarily disconnected but the experiment resumes afterwards)
  #   the proportion supplied by the user is converted to a minimum number of rows
  min_rows <- window_width_secs * freq * min_data_points
  nrows <- purrr::map_dbl(pulse_data_split, nrow)
  pulse_data_split <- pulse_data_split[nrows >= min_rows]

  # clean
  x <- tibble::tibble(
    i        = seq_along(pulse_data_split),
    smoothed = FALSE,
    data     = pulse_data_split
  )

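  # random subset
  #   positive values keep `subset` windows, negative values drop that many windows
  #   the original order of the windows is preserved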
  if (subset != 0) {
    if (!is.null(subset_seed)) set.seed(subset_seed)
    x <- x %>% dplyr::slice_sample(n = subset) %>% dplyr::arrange(i)
    if (subset_reindex) x <- dplyr::select(x, -i) %>% tibble::rowid_to_column("i")
  }

  # return
  x
}
