\documentclass[oneside,english,ngerman]{amsart} \usepackage[T1]{fontenc} \usepackage[latin9]{inputenc} \usepackage[a4paper,footskip=1.5cm]{geometry} \geometry{verbose,tmargin=3cm,bmargin=4cm,lmargin=3cm,rmargin=3cm} \usepackage{babel} \usepackage{textcomp} \usepackage{amsthm} \usepackage[unicode=true]{hyperref} \usepackage{fancyhdr} \usepackage{hyperref} %\VignetteIndexEntry{Use IBM In-Database Analytics with R} %\VignetteEngine{knitr::knitr} %\usepackage[UTF-8]{inputenc} \renewcommand{\headrulewidth}{0pt} \pagestyle{fancy} \lfoot{\fontsize{7}{11}\selectfont \textcopyright Copyright 2014, 2016, 2017, IBM Corporation. IBM, the IBM logo, dashDB, z/OS, Bluemix, and Db2 are trademarks of IBM Corporation, registered in many jurisdictions worldwide.} \cfoot{} \rfoot{} \rhead{} \lhead{} \chead{} \makeatletter %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands. \numberwithin{equation}{section} \numberwithin{figure}{section} \makeatother \begin{document} <>= library(knitr) opts_chunk$set( concordance=TRUE ) @ \title{Use IBM In-Database Analytics with R} \author{M. Wurst, C. Blaha, A. Eckert, IBM Germany Research and Development} \maketitle \thispagestyle{fancy} \section*{Introduction} To process data, most native R functions require that the data first is extracted from a database to working memory. However, If you need to analyze a large amount of data, that is often impractical or even impossible. The ibmdbR package provides methods to make working with databases more efficient by seamlessly pushing operations of R into the underlying database for execution. This not only lifts the memory limit of R, it also allows users to profit from performance-enhancing features of the underlying database management system, such as columnar technology and parallel processing, without having to interact with the database explicitly. Keeping the data in the database also avoids security issues that are associated with extracting data and ensures that the data that is being analyzed is as current as possible. Some in-database functions additionally use lazy loading to load only those parts of the data that are actually required, to further increase efficiency. \section*{Prerequisites} The ibmdbR package is designed to work with the following IBM database products and services: \begin{itemize} \item Products \begin{itemize} \item IBM\textsuperscript{\textregistered} Db2\textsuperscript{\textregistered} Warehouse (formerly IBM dashDB\textsuperscript{\textregistered} Local; in the following Db2) \item IBM Db2 for z/OS\textsuperscript{\textregistered} in conjunction with IBM Db2 Analytics Accelerator for z/OS (in the following Db2 z/OS) \end{itemize} \item Managed services \begin{itemize} \item IBM Db2 Warehouse on Cloud (formerly IBM dashDB for Analytics; in the following Db2) \end{itemize} \end{itemize} Before you start, you must install Db2 or provision an instance of Db2 (see \hyperref[further_reading]{,,Further Reading``}). If you use Db2 managed services from your web browser, everything is pre-configured and ready to be used, so you can skip directly to the instructions in \hyperref[getting_started]{,,Getting Started``}. If you connect to a Db2 server or if you want to use Db2 remotely, you must first install the appropriate client driver packages. For Db2, they can be downloaded from the web console. After you install the driver packages you must configure an ODBC source. Refer to the documentation of your operating system for information on how to perform this task. In the following we assume you created an ODBC data source called ,,BLUDB``. \section*{Getting Started} \label{getting_started} Before you can use the any of the push-down functions of the ibmdbR package, you must connect to a database. This is done by executing the idaConnect function. Specify the name of the ODBC source, the user name and the password as parameters. If you use R from the web console of the Db2 managed services, the following statement will connect you directly to the database BLUDB: \begin{verbatim} > con <- idaConnect('BLUDB','','') \end{verbatim} Next, initialize the in-database functionality by executing the idaInit function. The idaInit function creates a singleton for the connection object such that you do not need to pass it as parameter later on: \begin{verbatim} > idaInit(con) \end{verbatim} Now, we can try to issue a first command. The idaShowTables function will return a \emph{data.frame} that contains a list of all tables in the current schema: \begin{verbatim} > idaShowTables() Schema Name Owner Type 1 USER1 A USER1 T 2 USER1 B USER1 T \end{verbatim} Normally, at the end of each session, you would close the connection to the database: \begin{verbatim} idaClose(con) \end{verbatim} However, to run the samples in the next sections, we will keep it open. \section*{Working with ida.data.frame} Instead of holding data in memory, an object of class \emph{ida.data.frame} contains only a reference to a table or a query. Many operations can be performed without loading the content of this table or query into memory. You can create an object of class \emph{ida.data.frame} either by pointing to an existing table in the database or by uploading a local R \emph{data.frame} to a table. For example, if there already is a table 'IRIS' in the database, you can create an \emph{ida.data.frame} object by executing the following statement: \begin{verbatim} > iris.ida <- ida.data.frame('IRIS') \end{verbatim} If the table 'IRIS' does not yet exist, you can create an \emph{ida.data.frame} object by uploading local data: \begin{verbatim} > iris.ida <- as.ida.data.frame(iris,'IRIS') \end{verbatim} To check the class and print the \emph{ida.data.frame} object, execute the following statements: \begin{verbatim} > class(iris.ida) 'ida.data.frame' > iris.ida SELECT "SepalLength","SepalWidth","PetalLength","PetalWidth","Species" FROM "IRIS" \end{verbatim} Optimally, R users should not need to care about SQL too much. For this reason, the ibmdbR package overwrites many methods and functions defined for regular R \emph{data.frame} objects. It uses SQL to push the execution of these methods down into the database. The \emph{dim} method is a simple example: \begin{verbatim} > dim(iris.ida) [1] 150 5 \end{verbatim} It calculates the dimensions of the \emph{ida.data.frame}, just as the \emph{dim} method would do for a regular \emph{data.frame}. Internally, however, it executes, among other statements, a \emph{SELECT COUNT({*}) FROM IRIS} SQL query. Another example is the \emph{head} method, which retrieves only the first rows from a \emph{ida.data.frame} by executing the appropriate SQL statements in the background: \begin{verbatim} > head(iris.ida) SepalLength SepalWidth PetalLength PetalWidth Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa \end{verbatim} The ibmdbR package overwrites several other methods (defined on the \emph{data.frame} class) in a similar way. Among these are \emph{as.data.frame}, \emph{sd}, \emph{max}, \emph{mean}, \emph{min}, \emph{length}, \emph{print}, \emph{names}, \emph{colnames}, \emph{summary},\emph{ NROW}, \emph{ NCOL}, \emph{var}, \emph{cor} and \emph{cov}. Sometimes, you might not want to work on a full table but only on a selection. You can do this in a way that is similar to the way you would do this for a regular \emph{data.frame}. For example, the following statements, would select only rows for which the column 'Species' equals 'setosa', and only the columns 'PetalLength' and'PetalWidth': \begin{verbatim} > iris.ida2 <- iris.ida[iris.ida$Species == 'setosa', c('PetalLength', 'PetalWidth')] > dim(iris.ida2) [1] 50 2 \end{verbatim} All methods and functions that are applied to an \emph{ida.data.frame} object with selection will reflect it, which is why the \emph{dim} method returns 50 rows instead of 150 rows in the previous example. Selection can also be useful to remove missing values. The following statement keeps all columns but removes rows for which column 'Species' is NULL: \begin{verbatim} iris.ida3 <- iris.ida[!db.is.null(iris.ida$Species),] [1] 150 5 \end{verbatim} As the IRIS table does not have any missing values, the \emph{dim} method returns 150 rows. Instead of selecting columns, you can also add new columns, that are based on existing ones. These columns are kept locally and are only materialized at query time. The following statements show a few examples: \begin{verbatim} > iris.ida$X <- iris.ida$SepalLength + iris.ida$SepalWidth > iris.ida$Y <- ifelse((iris.ida$SepalLength > 4) & (iris.ida$SepalWidth < 3), 'a', 'b') > iris.ida$Z <- as.character(iris.ida$SepalLength) > head(iris.ida) SepalLength SepalWidth PetalLength PetalWidth Species X Y Z 1 5.1 3.5 1.4 0.2 setosa 8.6 a 5.1E0 2 4.9 3.0 1.4 0.2 setosa 7.9 b 4.9E0 3 4.7 3.2 1.3 0.2 setosa 7.9 b 4.7E0 4 4.6 3.1 1.5 0.2 setosa 7.7 b 4.6E0 5 5.0 3.6 1.4 0.2 setosa 8.6 b 5.0E0 6 5.4 3.9 1.7 0.4 setosa 9.3 b 5.4E0 \end{verbatim} If you want to store these columns to the database (for example, to make them accessible to others), you can use the idaCreateView function to create a view of an \emph{ida.data.frame} object. \section*{Preprocess and Analyze Data} The ibmdbR package provides several functions for preprocessing and statistical analysis. The idaSample function draws a (stratified) sample from an \emph{ida.data.frame}. In the following statement, 'Species' is the stratification column. \begin{verbatim} > df <- idaSample(iris.ida, 6, 'Species') > df SepalLength SepalWidth PetalLength PetalWidth Species X Y Z 1 6.4 2.7 5.3 1.9 virginica 9.1 a 6.4E0 2 5.8 2.7 4.1 1.0 versicolor 8.5 a 5.8E0 3 5.0 3.3 1.4 0.2 setosa 8.3 b 5.0E0 4 6.3 2.9 5.6 1.8 virginica 9.2 a 6.3E0 5 6.1 3.0 4.6 1.4 versicolor 9.1 a 6.1E0 6 5.1 3.3 1.7 0.5 setosa 8.4 a 5.1E0 \end{verbatim} The idaSample function can also be applied to \emph{ida.data.frame} objects with column or row selection or with defined columns. The following example would excludes all rows with 'Species' equal to 'setosa': \begin{verbatim} > df <- idaSample(iris.ida[iris.ida$Species!='setosa',], 4, 'Species') > df SepalLength SepalWidth PetalLength PetalWidth Species X Y Z 1 7.1 3.0 5.9 2.1 virginica 10.1 a 7.1E0 2 6.2 2.9 4.3 1.3 versicolor 9.1 a 6.2E0 3 5.5 2.6 4.4 1.2 versicolor 8.1 a 5.5E0 4 6.7 3.0 5.2 2.3 virginica 9.7 a 6.7E0 \end{verbatim} For linear regression, the idaLm function is provided. It is very similar to the lm function for this purpose, but can be easily applied to millions of rows. The following statement calculates a linear regression model on an \emph{ida.data.frame} object: \begin{verbatim} l <- idaLm(SepalLength~SepalWidth+PetalLength,iris.ida) l$coefficients [,1] SepalWidth 0.5955247 PetalLength 0.4719200 Intercept 2.2491402 attr(,"names") [1] "SepalWidth" "PetalLength" "Intercept" \end{verbatim} Other functions include, for instance, idaTable for cross tabulation or idaMerge, for merging two \emph{ida.data.frame} objects. The reference manual contains more details and examples. \section*{Store and Share R Objects} Users often need to store objects in their workspace across two R sessions. The ibmdbR package provides methods that you can use to store R objects in database tables. This is achieved through a class called \emph{ida.list}. An \emph{ida.list} object can be used in a similar way as regular R \emph{list}. However, the objects in an \emph{ida.list} object are stored in database tables instead of being stored locally. There are three possible ways to initialize a \emph{ida.list} object, depending on what information is to be accessed: \begin{itemize} \item Objects that are to be stored in a private container and are not to be readable by other users\\ \texttt{l <- ida.list(type='private')} \item Objects that should be readable by all users of the database \\ \texttt{l <- ida.list(type='public')} \item Public objects of another user \\ \texttt{l <- ida.list(user='user2')} \end{itemize} The first two, when used for the first time, create the tables that are needed to hold the objects. The tables are created under the current schema: Objects can be accessed using the R \emph{list} operators: \begin{verbatim} > l <- ida.list(type='public') > l['a'] <- 1:100 > l['b'] <- 'c' > l['b'] "c" > l['b'] <- NULL \end{verbatim} All object keys can be listed using names \begin{verbatim} > names(l) "a" \end{verbatim} Objects are written to tables in a serialized format, so even though you can see the tables in your schema, you will usually not be able to read their contents directly. \section*{Frequently asked questions} \label{faq} \subsection*{Setting up an ODBC connection on Mac} The ibmdbR package uses ODBC to establish a connection between an R session and a Db2 database. The following instructions describe how to setup an ODBC connection to Db2 using a Mac that has OS X or macOS as its operating system. \begin{enumerate} \item Locale settings \begin{itemize} \item[] This setup was tested with \emph{en\_US.UTF-8} as locale setting. The settings can be checked with the \emph{locale} command. To set the locale variables, issue the following commands: \begin{verbatim} export LC_ALL=en_US.UTF-8 export LANG=en_US.UTF-8 \end{verbatim} The following error message, issued when trying to connect to the database, indicates that the locale settings are set incorrectly: \\ \texttt{Driver's SQLAllocHandle on SQL\_HANDLE\_HENV failed} \end{itemize} \item Install unixODBC for example using homebrew \begin{itemize} \item[] \begin{verbatim} brew install unixodbc \end{verbatim} \end{itemize} \item Download the Db2 drivers for Mac \begin{itemize} \item[] In the Db2 console, open "Connect > Connection information > Driver" and select "Download Db2 driver package". \end{itemize} \item Install Db2 drivers \begin{itemize} \item[] In the terminal enter: \begin{verbatim} cd /Volumes/dsdriver/ sudo ./installDSDriver \end{verbatim} \end{itemize} \item Source the dsdriver scripts to avoid connection validation problems (see \href{http://www-01.ibm.com/support/docview.wss?uid=swg21689403}{SWG21689403}) \begin{itemize} \item[] In the terminal enter: \begin{verbatim} cd /Applications/dsdriver source db2profile \end{verbatim} \end{itemize} \item Create a DSN and a database driver configuration \begin{itemize} \item[] Create or edit the \emph{db2dsdriver.cfg} file with any editor: \begin{verbatim} sudo vi /Applications/dsdriver/cfg/db2dsdriver.cfg \end{verbatim} Paste the following code into the db2dsdriver.cfg file: \begin{verbatim} \end{verbatim} \end{itemize} \item In the Db2 console, open the connection information and note the host name and port. In the db2dsdriver.cfg file, replace the and placeholders with the appropriate values. \item Edit the \emph{odbc.ini} and \emph{odbcinst.ini} files with any editor. Note: If you are using a 64-bit version of R, replace \emph{libdb2o.dylib} in the following entries with \emph{libdb2.dylib}. \begin{itemize} \item[] Edit the odbc.ini: \begin{verbatim} sudo vi /usr/local/etc/odbc.ini \end{verbatim} Ensure that the file contains at least the following entries: \begin{verbatim} [ODBC Data Sources] BLUDB = DASHDB [BLUDB] Driver = /Applications/dsdriver/lib/libdb2o.dylib Description = DashDB connection DATABASE = BLUDB HOSTNAME = PORT = PROTOCOL = TCPIP \end{verbatim} \item[] Edit the odbcinst.ini: \begin{verbatim} sudo vi /usr/local/etc/odbcinst.ini \end{verbatim} Ensure that the file contains at least the following entries: \begin{verbatim} [ODBC] Trace=0 TraceFile=/tmp/odbctrace.log [ODBC Drivers] DASHDB = Installed [DASHDB] Driver = /Applications/dsdriver/lib/libdb2o.dylib \end{verbatim} \end{itemize} \item Restart your R or RStudio instance \item Test the connection in R or R-Studio \begin{itemize} \item[] Exchange the connection information with your Db2 credentials and run the following script in R: \begin{verbatim} library(ibmdbR) con <- idaConnect("BLUDB", uid = "", pwd = "") idaInit(con) ## Create a data frame called mydata based on a given query mydata <- idaQuery("select count(*) from sysibm.sysdummy1", as.is=FALSE) head(mydata, 1) idaClose(con) \end{verbatim} The expected output is: \begin{verbatim} > head(mydata, 1) 1 1 1 \end{verbatim} \end{itemize} \end{enumerate} \section*{Further Reading} \label{further_reading} The ibmdbR package allows you to seamlessly scale from small data sets to larger data sets, especially ones that no longer fit in memory. To learn more about the functionality of the package, refer to the reference documentation. Another good starting point is (\href{https://www.ibm.com/analytics/us/en/technology/cloud-data-services/dashdb/}{IBM Db2 Warehouse Offerings}), which allows you to provision a database online within a few minutes. It contains many samples that can run online. To learn more about Db2 and how to install ODBC on your client machine, please refer to the \emph{\href{https://www.ibm.com/support/knowledgecenter/en/SS6NHC/com.ibm.swg.im.dashdb.kc.doc/welcome.html}{IBM knowledge center}}. \end{document}