--- title: "readepi: Reading data from health information systems" output: rmarkdown::html_vignette: df_print: "kable" vignette: > %\VignetteIndexEntry{readepi: Reading data from health information systems} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk[["set"]](collapse = TRUE, comment = "#>", eval = FALSE, fig.width = 7L, fig.height = 7L, message = FALSE, fig.align = "center", warning = FALSE, dpi = 300L) ``` # Overview Health-related data in general, and epidemiological data in particular, are stored in files, relation database management systems (RDBMS), or health information systems (HIS). Each category includes numerous options, such as various file formats, RDBMS types, and HIS APIs. Importing data from such repositories involve the usage of different format-specific functions or API-specific packages, which is an exhausting task for end users. The main objective of the **{readepi}** package is to simplify the process of reading health-related data from diverse sources, allowing the user to focus more on downstream analysis tasks. The current implementation of **{readepi}** provides functions for reading data from two common HIS: ([SORMAS](https://sormas.org/), [DHIS2](https://dhis2.org/), and RDBMS such as MS SQL, MySQL, PostgreSQL, and SQLite. Other utility functions for accessing relevant files and data are also included in this package. ```{r setup, eval=TRUE} # LOAD readepi library(readepi) ``` ## Need for MS drivers Users of operating systems other than Microsoft need to have the appropriate MS driver installed into their system. The driver installation process is documented in details in the [install drivers vignette](./install_drivers.Rmd) vignette. ## Authentication To read data from RDBMS and HIS, the user is expected to have, at least, read access to the system. To establish the connection to their system, users can call the `login()` function with the following arguments: - `from`: The URL to the system of interest. For APIs, this must be the base URL (required). - `type`: The source name (required). The current version of the package covers the following RDBMS and HIS types: "ms sql", "mysql", "postgresql", "sqlite", "dhis2", "sormas". - `user_name`: The user name (optional). - `password`: The user's password (optional). When the password is not provided (set to NULL), the user will be prompt to enter the password. - `driver_name`: The driver name (optional). This is only needed for connecting to RDBMS only. - `db_name`: The database name (optional). This is only needed for connecting to RDBMS only. - `port`: The port ID (optional). This is only needed for connecting to RDBMS only. ```{r} # CONNECT TO A DHIS2 INSTANCE dhis2_login <- login( type = "dhis2", from = "https://smc.moh.gm/dhis", user_name = "test", password = "Gambia@123" ) # CONNECT TO THE TEST MYSQL SERVER rdbms_login <- login( type = "mysql", from = "mysql-rfam-public.ebi.ac.uk", user_name = "rfamro", password = "", driver_name = "", db_name = "Rfam", port = 4497 ) # CONNECT TO A SORMAS SYSTEM dhis2_login <- login( type = "sormas", from = "https://demo.sormas.org/sormas-rest", user_name = "SurvSup", password = "Lk5R7JXeZSEc" ) ``` ## Reading data from RDBMS Health related research data are usually stored in either relational databases or non-SQL databases. For example, at [MRCG\@LSHTM](https://www.lshtm.ac.uk/research/units/mrc-gambia), projects data are stored in relational databases. A SQL-based database is run under a specific sever. The current version of the **{readepi}** package provides a function (`read_rdbms()`) for reading data from MS SQL, MySQL, PostgreSQL, and SQLite. The `read_rdbms()` function takes the following arguments: - `login`: The connection object obtained from the `login()` function - `query`: An SQL query or a list with the following elements: 1. table: a string with the table name. 2. fields: a vector of column names. When specified, only those columns will be returned. Default is `NULL`. 3. filter: an expression or a vector of values used to filter the rows from the table of interest. This should be of the same length as the value for the 'select'. Default is `NULL`. ```{r eval=TRUE} # CONNECT TO THE TEST MYSQL SERVER rdbms_login <- login( type = "mysql", from = "mysql-rfam-public.ebi.ac.uk", user_name = "rfamro", password = "", driver_name = "", db_name = "Rfam", port = 4497 ) # DISPLAY THE LIST OF TABLES FROM A DATABASE OF INTEREST tables <- show_tables(login = rdbms_login) head(tables) # READING ALL FIELDS AND ALL RECORDS FROM ONE TABLE (`author`) USING AN SQL # QUERY dat <- read_rdbms( login = rdbms_login, query = "select * from author" ) # READING ALL FIELDS AND ALL RECORDS FROM ONE TABLE (`author`) WHERE QUERY # PARAMETERS ARE SPECIFIED AS A LIST dat <- read_rdbms( login = rdbms_login, query = list(table = "author", fields = NULL, filter = NULL) ) # SELECT FEW COLUMNS FROM ONE TABLE AND LEFT JOIN WITH ANOTHER TABLE dat <- read_rdbms( login = rdbms_login, query = "select author.author_id, author.name, family_author.author_id from author left join family_author on author.author_id = family_author.author_id" ) ``` ```{r echo=FALSE, eval=TRUE} dat |> kableExtra::kbl() |> kableExtra::kable_paper("striped", font_size = 14, full_width = TRUE) |> kableExtra::scroll_box(height = "200px", width = "100%", box_css = "border: 1px solid #ddd; padding: 5px; ", extra_css = NULL, fixed_thead = TRUE) ``` # Reading data from HIS The current version of **{readepi}** supports reading data from two common HIS: [DHIS2](https://dhis2.org/), and [SORMAS](https://sormas.org/). ## Importing data from DHIS2 The [District Health Information Software (DHIS2)](https://dhis2.org/about-2/) is an open source software used by many health institutions to store, manage and analyze various types of health data. The `read_dhis2()` function can be used to import data from [DHIS2 Tracker](https://dhis2.org/tracker-in-action/) instances through their API with following arguments: - `login`: A `httr2_response` object returned by the `login()` function - `org_unit`: A character with the organisation unit ID or name - `program`: A character with the program ID or name It is important to note that the request parameters used in the internal functions of the package vary depending on the API version. Currently, {readepi} accounts for versions from **2.22** to **2.42**. The later is the current version of the DHIS2 test instance. Newer versions (**\> 2.42**) might require a different syntax for the request parameters depending on how they are defined by the DHIS2 developers. This can result in the failure of the `read_dhis2()` function when importing data from those newer versions. An issue can be raised [here](https://github.com/epiverse-trace/readepi/issues) to make the developers of the package aware of this problem. In the current version, the `login()` function only supports basic authentication, i.e., using the user name and password. An example of how to establish a connection to a DHIS2 instance is shown below. ```{r eval=TRUE} # CONNECT TO A DHIS2 INSTANCE dhis2_login <- login( type = "dhis2", from = "https://smc.moh.gm/dhis", user_name = "test", password = "Gambia@123" ) ``` Organisation units and programs are the key elements for a successful call of the `read_dhis2()` function. For this reason, we exported the following functions to help users access and identify the correct organisation unit and program names and IDs for a given DHIS2 instance. ```{r eval=TRUE} # GET THE LIST OF ALL ORGANISATION UNITS IN AN HIERARCHICAL ORDER org_units <- get_organisation_units(login = dhis2_login) ``` ```{r echo=FALSE, eval=TRUE} org_units |> kableExtra::kbl() |> kableExtra::kable_paper("striped", font_size = 14, full_width = TRUE) |> kableExtra::scroll_box(height = "200px", width = "100%", box_css = "border: 1px solid #ddd; padding: 5px; ", extra_css = NULL, fixed_thead = TRUE) ``` ```{r eval=TRUE} # GET THE LIST OF ALL PROGRAMS programs <- get_programs(login = dhis2_login) ``` ```{r echo=FALSE, eval=TRUE} programs |> kableExtra::kbl() |> kableExtra::kable_paper("striped", font_size = 14, full_width = TRUE) |> kableExtra::scroll_box(height = "200px", width = "100%", box_css = "border: 1px solid #ddd; padding: 5px; ", extra_css = NULL, fixed_thead = TRUE) ``` Similarly, users can also access the list of all data elements and program stages from a given DHIS2 instance as shown in the code chunks below. ```{r eval=TRUE} # GET THE LIST OF ALL DATA ELEMENTS data_elements <- get_data_elements(login = dhis2_login) ``` ```{r echo=FALSE, eval=TRUE} data_elements |> kableExtra::kbl() |> kableExtra::kable_paper("striped", font_size = 14, full_width = TRUE) |> kableExtra::scroll_box(height = "200px", width = "100%", box_css = "border: 1px solid #ddd; padding: 5px; ", extra_css = NULL, fixed_thead = TRUE) ``` ```{r eval=TRUE} # GET THE LIST OF ALL PROGRAM STAGES FOR A GIVEN PROGRAM ID program_stages <- get_program_stages( login = dhis2_login, program = "E5IUQuHg3Mg", programs = programs ) ``` ```{r echo=FALSE, eval=TRUE} program_stages |> kableExtra::kbl() |> kableExtra::kable_paper("striped", font_size = 14, full_width = TRUE) |> kableExtra::scroll_box(height = "200px", width = "100%", box_css = "border: 1px solid #ddd; padding: 5px; ", extra_css = NULL, fixed_thead = TRUE) ``` It is important to know that not all organisation units are registered for a specific program. To know the organisation units that run a particular program, use the `get_program_org_units()` function as shown in the example below. ```{r eval=TRUE} # GET THE LIST OF ORGANISATION UNITS RUNNING THE SPECIFIED PROGRAM target_org_units <- get_program_org_units( login = dhis2_login, program = "E5IUQuHg3Mg", org_units = org_units ) ``` ```{r echo=FALSE, eval=TRUE} target_org_units |> kableExtra::kbl() |> kableExtra::kable_paper("striped", font_size = 14, full_width = TRUE) |> kableExtra::scroll_box(height = "200px", width = "100%", box_css = "border: 1px solid #ddd; padding: 5px; ", extra_css = NULL, fixed_thead = TRUE) ``` After identifying the correct organisation unit and program IDs or names, users can import the corresponding data using the `read_dhis2()` with the following syntax: ```{r eval=TRUE} # IMPORT DATA FROM DHIS2 FOR THE SPECIFIED ORGANISATION UNIT AND PROGRAM IDs data <- read_dhis2( login = dhis2_login, org_unit = "GcLhRNAFppR", program = "E5IUQuHg3Mg" ) # IMPORT DATA FROM DHIS2 FOR THE SPECIFIED ORGANISATION UNIT AND PROGRAM NAMES data <- read_dhis2( login = dhis2_login, org_unit = "Keneba", program = "Child Registration & Treatment " ) ``` ```{r echo=FALSE, eval=TRUE} data[1:50, 1:15] |> kableExtra::kbl() |> kableExtra::kable_paper("striped", font_size = 14, full_width = TRUE) |> kableExtra::scroll_box(height = "300px", width = "100%", box_css = "border: 1px solid #ddd; padding: 5px; ", extra_css = NULL, fixed_thead = TRUE) ``` ## Importing data from SORMAS The [Surveillance Outbreak Response Management and Analysis System (SORMAS)](https://sormas.org/) is an open source eHealth system that optimises the processes used in monitoring the spread of infectious diseases and responding to outbreak situations. The `read_sormas()` function can be used to import data from SORMAS through its API. In the current version, the `read_sormas()` function returns data for the following columns: **case_id, person_id, sex, date_of_birth, case_origin, country, city, lat, long, case_status, date_onset, date_admission, date_last_contact, date_first_contact, outcome, date_outcome, Ct_values**. It is important to note that SORMAS does not support basic authentication i.e., the authentication details are directly provided to the `read_sormas()` function without a prior call of the `login()` function. Hence, the function takes the following arguments: - `base_url`: A character with the base URL to the target SORMAS system - `user_name`: A character with the user name credential - `password`: A character with the password credential - `disease`: A character vector with the names of the diseases of interest. Users can get the list of all diseases available on their SORMAS system using the `sormas_get_diseases()` function. ```{r eval=TRUE} # ESTABLISH THE CONNECTION TO THE SORMAS SYSTEM sormas_login <- login( type = "sormas", from = "https://demo.sormas.org/sormas-rest", user_name = "SurvSup", password = "Lk5R7JXeZSEc" ) # GET THE LIST OF ALL AVAILABLE DISEASES IN THE TEST SORMAS SYSTEM disease_names <- sormas_get_diseases( login = sormas_login ) ``` When relevant, users can download the data dictionary into a specific folder using the `sormas_get_data_dictionary()`. This provides an idea of the content of each column. ```{r eval=TRUE} # DOWNLOAD AND SAVE THE DATA DICTIONARY IN YOUR CURRENT DIRECTORY path_to_dictionary <- sormas_get_data_dictionary(path = getwd()) path_to_dictionary ``` A call of the `read_sormas()` function looks like below. ```{r eval=TRUE} # FETCH ALL COVID (coronavirus) CASES FROM THE TEST SORMAS INSTANCE FROM THE # BEGINNING OF DATA COLLECTION covid_cases <- read_sormas( login = sormas_login, disease = "coronavirus", since = 0 ) # FETCH ALL COVID (coronavirus) CASES FROM THE TEST SORMAS INSTANCE SINCE # JUNE 01, 2025 covid_cases <- read_sormas( login = sormas_login, disease = "coronavirus", since = as.Date("2025-06-01") ) ``` ```{r echo=FALSE, eval=TRUE} covid_cases[, 1:15] |> kableExtra::kbl() |> kableExtra::kable_paper("striped", font_size = 14, full_width = TRUE) |> kableExtra::scroll_box(height = "200px", width = "100%", box_css = "border: 1px solid #ddd; padding: 5px; ", extra_css = NULL, fixed_thead = TRUE) ```