---
title: "Working with multi-module PUMF surveys"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Working with multi-module PUMF surveys}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = nzchar(Sys.getenv("COMPILE_VIG_CANPUMF"))
)
```

```{r setup}
library(dplyr)
library(canpumf)
options(canpumf.cache_path = Sys.getenv("COMPILE_VIG_CANPUMF"))
```

Some Statistics Canada PUMF surveys ship **several linked data files** rather
than one. Each file covers a different unit of analysis, all files share a
common respondent key, and they are meant to be joined for analysis. Examples
include:

| Survey | Modules | Join key |
|---|---|---|
| GSS cycle 16 — Aging and Social Support (2002) | `MAIN` + `CG4` + `CG6` + `CR` | `RECID` |
| GSS — Time Use (1998, 2010, 2015, 2022) | `Main` + `Episode` | `RECID` / `PUMFID` |
| Survey of Household Spending (2017) | `Interview` + `Diary` | `CASEID` |
| Giving, Volunteering and Participating (1997–2010) | `MAIN` + `GS` / `VD` / `GIVE` / `VOLNTR` | `PUMFID` / `MICRO_ID` / `IDNUM` |

`canpumf` models these as **several tables inside one DuckDB file**, so the
modules can be joined on a single connection. `get_pumf()` always returns the
survey's **primary module** (the respondent-level file that carries the survey
weight), and tells you which sibling modules are available.

## Loading the primary module

`get_pumf()` returns the main file as usual. For a multi-module survey it also
emits a one-time message listing the other modules and how to open one:

```{r}
main <- get_pumf("GSS", "Cycle 16 (2002)")  # primary module (MAIN), carries WGHT_PER
#> GSS/Cycle 16 (2002) is a multi-module survey; you loaded the primary module. Other linked modules: CG4, CG6, CR.
#> Open one on the same connection with pumf_module(), e.g.:
#>   cg4 <- pumf_module(main, "CG4")

main |> select(1:5) |> head()
```

Everything you already know about `get_pumf()` output applies to the primary
module: values come pre-labelled, `label_pumf_columns()` renames columns to
their human-readable labels, and `dplyr::collect()` pulls a local tibble.

## Opening a sibling module

Use `pumf_module()` to open another module. Crucially, it opens on the **same
DuckDB connection** as `main`, so the two tbls are joinable without a second
connection. The first time you open a module for a survey, `canpumf` reminds
you of the key the modules join on:

```{r}
cg4 <- pumf_module(main, "CG4")   # the caregiving module
#> GSS/2002 modules join on 'RECID' (e.g. dplyr::inner_join(main, CG4, by = "RECID")).

cg4 |> select(1:5) |> head()
```

## Joining modules for analysis

Because both tbls share one connection, the join runs entirely inside DuckDB —
nothing is pulled into R until you `collect()`. The respondent-level survey
weight lives only on the primary module, so a typical pattern is to join the
detail module to the columns you need from `main`:

```{r}
joined <- cg4 |>
  inner_join(
    main |> select(RECID, WGHT_PER),
    by = "RECID"
  )

joined |>
  summarise(weighted_n = sum(WGHT_PER, na.rm = TRUE)) |>
  collect()
```
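
The lazy behaviour described above can be sketched without any PUMF data. The
example below is a minimal illustration, assuming the `DBI`, `duckdb` and
`dbplyr` packages are installed; the tables `toy_main` and `toy_cg4` and their
values are made up and are not part of `canpumf`:

```r
library(dplyr)

# In-memory DuckDB database holding two made-up tables (hypothetical data).
con <- DBI::dbConnect(duckdb::duckdb())
toy_main <- copy_to(con, tibble(RECID = 1:3, WGHT_PER = c(100, 250, 175)), "toy_main")
toy_cg4  <- copy_to(con, tibble(RECID = c(1L, 1L, 3L), HOURS = c(2, 5, 1)), "toy_cg4")

# Building the join only records the query; no rows are fetched yet.
lazy_join <- toy_cg4 |>
  inner_join(toy_main |> select(RECID, WGHT_PER), by = "RECID")

show_query(lazy_join)             # prints the SQL that DuckDB will run
local_rows <- collect(lazy_join)  # the query executes here and returns a tibble

DBI::dbDisconnect(con, shutdown = TRUE)
```

`show_query()` is a convenient way to confirm that a pipeline is still being
translated to SQL rather than evaluated in R.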

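The detail modules typically have a different row count than the primary
module: a caregiving or time-use episode module, for example, has one row per
episode rather than one row per respondent. Pick the join that fits your unit
of analysis. An inner join keeps only respondents that appear in both modules,
while a left join from `main` keeps every respondent and fills the detail
columns with missing values where there is no match. When a respondent has
several detail rows, `WGHT_PER` repeats on each of them, so a plain
`sum(WGHT_PER)` after the join totals detail records rather than respondents.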
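
How the join direction changes the result can be seen on a small example. This
is a sketch on made-up local tibbles, not real survey data; `toy_main`,
`toy_detail` and the `HOURS` column are hypothetical:

```r
library(dplyr)

# Hypothetical toy data: three respondents, detail rows for only two of them.
toy_main   <- tibble(RECID = 1:3, WGHT_PER = c(100, 250, 175))
toy_detail <- tibble(RECID = c(1L, 1L, 3L), HOURS = c(2, 5, 1))

# Inner join: one row per detail record; respondent 2 drops out.
by_record <- inner_join(toy_detail, toy_main, by = "RECID")

# Left join from the primary table: respondent 2 is kept, with HOURS = NA.
by_respondent <- left_join(toy_main, toy_detail, by = "RECID")
nrow(by_respondent)  # 4 rows: respondent 1 twice, respondents 2 and 3 once

# Respondent 1's weight sits on two rows, so this sum counts records.
sum(by_record$WGHT_PER)  # 375

# Keeping one row per respondent first gives a respondent-level total.
respondent_total <- by_record |>
  distinct(RECID, WGHT_PER) |>
  summarise(weighted_n = sum(WGHT_PER))  # weighted_n = 275
```

The same verbs work on the lazy tbls returned by `get_pumf()` and
`pumf_module()`, where they are translated to SQL instead of running in R.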

## A second example: the Survey of Household Spending

The 2017 SHS pairs an `Interview` file (one row per household) with a `Diary`
file (one row per recorded purchase), joined on `CASEID`. Each module ships its
own bootstrap-weight set, so replicate weights stay attached to the correct
unit of analysis:

```{r}
shs   <- get_pumf("SHS", "2017")          # Interview (primary)
diary <- pumf_module(shs, "Diary")        # one row per purchase, same connection

diary |>
  inner_join(shs |> select(CASEID), by = "CASEID") |>
  tally() |>
  collect()
```
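
When the household is the unit of analysis, one option is to collapse the
purchase-level file to one row per `CASEID` before joining, so the interview
rows are not duplicated. The sketch below uses made-up tibbles; the `HHSIZE`
and `AMOUNT` columns are hypothetical names, and treating households without
diary rows as zero spending is an assumption made only for this toy data:

```r
library(dplyr)

# Hypothetical stand-ins for the Interview and Diary modules.
toy_interview <- tibble(CASEID = c(101L, 102L, 103L), HHSIZE = c(2L, 4L, 1L))
toy_diary <- tibble(
  CASEID = c(101L, 101L, 101L, 103L),
  AMOUNT = c(12.50, 3.25, 40.00, 7.75)
)

# Collapse purchases to one row per household.
diary_totals <- toy_diary |>
  group_by(CASEID) |>
  summarise(n_purchases = n(), total_spent = sum(AMOUNT), .groups = "drop")

# Join the totals onto the household file; every household is kept.
household_level <- toy_interview |>
  left_join(diary_totals, by = "CASEID") |>
  mutate(
    n_purchases = coalesce(n_purchases, 0L),  # assumption: no diary rows means 0
    total_spent = coalesce(total_spent, 0)
  )
```

Aggregating first keeps the result at one row per household, which is the
level the interview weights refer to.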

## Cleaning up

All modules opened from one `get_pumf()` call share a single connection, so a
single `close_pumf()` on any of the tbls releases it:

```{r}
close_pumf(main)
close_pumf(shs)
```

## Database connections

Alternatively, `get_pumf_connection()` opens a general database connection
without selecting a table, so you can list the available tables and pick the
ones you need yourself:

```{r}
con <- get_pumf_connection("SHS", "2017")

DBI::dbListTables(con)
```


```{r}
close_pumf(con)
```
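
Once the table names are known, individual tables can be opened with
`dplyr::tbl()`. The sketch below shows the pattern on a throwaway in-memory
DuckDB database rather than a real PUMF file; the table names `toy_interview`
and `toy_diary` are made up, and the names `canpumf` actually uses should be
read from `DBI::dbListTables()`. It assumes `DBI`, `duckdb` and `dbplyr` are
installed:

```r
library(dplyr)

# Throwaway in-memory database with two made-up tables (hypothetical names).
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "toy_interview", data.frame(CASEID = 1:2))
DBI::dbWriteTable(
  con, "toy_diary",
  data.frame(CASEID = c(1L, 1L, 2L), AMOUNT = c(5, 8, 3))
)

table_names <- DBI::dbListTables(con)  # inspect what is available

# Open the tables by name; both tbls live on the same connection.
interview <- tbl(con, "toy_interview")
diary     <- tbl(con, "toy_diary")

purchases <- diary |>
  inner_join(interview, by = "CASEID") |>
  count(CASEID) |>
  collect() |>
  arrange(CASEID)

DBI::dbDisconnect(con, shutdown = TRUE)
```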


## Notes

- `get_pumf("GSS", "2002", module = "CG4")` opens a module **standalone** (its
  own connection). Prefer `pumf_module()` when you intend to join, so both tbls
  share one connection.
- The join key is recorded in the survey registry and surfaced in the messages
  above, so you never have to guess it — it varies across surveys
  (`RECID`, `PUMFID`, `MICRO_ID`, `CASEID`, `IDNUM`).
- `label_pumf_columns()` and `pumf_var_labels()` are module-aware: each module
  is labelled from its own metadata even though all modules share one
  connection.
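
Before joining on a key, it can be worth checking that it behaves as expected:
present in both tables, unique in the primary module, and without unmatched
detail rows. A minimal sketch on made-up tibbles follows (`toy_main`,
`toy_detail` and their values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data standing in for a primary and a detail module.
toy_main   <- tibble(RECID = 1:3, WGHT_PER = c(100, 250, 175))
toy_detail <- tibble(RECID = c(1L, 1L, 3L), HOURS = c(2, 5, 1))
key <- "RECID"

# The key must exist in both tables.
stopifnot(key %in% colnames(toy_main), key %in% colnames(toy_detail))

# Keys that occur more than once in the primary table (expected: none).
duplicated_keys <- toy_main |>
  count(across(all_of(key)), name = "n_rows") |>
  filter(n_rows > 1)

# Detail rows whose key has no match in the primary table (expected: none).
orphans <- anti_join(toy_detail, toy_main, by = key)
```

Empty `duplicated_keys` and `orphans` results indicate a clean one-to-many
relationship from the primary table to the detail table.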