\documentclass[english]{scrartcl}

%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{Using the USL Package to Analyze System Scalability}
%\VignetteDepends{usl}
%\VignetteKeyword{usl}
%\VignetteKeyword{scalability}
%\VignettePackage{usl}

\subject{Using the usl package}
\title{Analyze System Scalability in R with the Universal Scalability Law}
\author{Stefan M\"oding}

% Fonts, encoding & language support
\usepackage{fourier}
\usepackage[scaled]{helvet}
\usepackage[scaled=0.76]{beramono}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{babel}
\usepackage{amsmath}
\usepackage{microtype}
\usepackage{nicefrac}
\usepackage[all]{nowidow}
\usepackage[hang,marginal]{footmisc}
\usepackage[super]{nth}

% Cross references
\usepackage[pdfusetitle,hidelinks]{hyperref}
\usepackage[all]{hypcap}
\usepackage{bookmark}
\usepackage{cleveref}

\hypersetup{pdfkeywords={Scalability,Universal Scalability Law,USL,R}}

% Layout
\KOMAoption{DIV}{11}
\KOMAoption{toc}{bib}

% Define shortcut commands
\newcommand{\R}{{\normalfont\textsf{R}}}
\newcommand{\usl}{{\normalfont\textsf{usl}}}
\newcommand{\kbd}[1]{{\normalfont\texttt{#1}}}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

<<echo=FALSE>>=
library(knitr)
opts_knit$set(progress=FALSE, verbose=FALSE)
knit_hooks$set(small.mar=function(before, options, envir) {
  if (before && options$fig.show != 'none')
    par(mar=c(5.1, 4.1, 1.1, 1.1), family='Helvetica', ps=11)
})
opts_chunk$set(prompt=TRUE, comment=NA, tidy=FALSE)
opts_chunk$set(out.width='\\textwidth', small.mar=TRUE)
opts_chunk$set(fig.width=7, fig.height=3.6)
opts_chunk$set(fig.align='center', fig.pos='htbp', fig.path='usl-')
options(prompt='R> ', scipen=4, digits=4, width=78)
options(digits.secs=3, show.signif.stars=TRUE)
options(str=strOptions(strict.width='cut'))
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{document}

\maketitle

\begin{abstract}\noindent
The Universal Scalability Law is used to quantify the scalability of hardware
or software systems. It uses sparse measurements from an existing system to
predict the throughput for different loads and can be used to learn more
about the scalability limitations of the system. This document introduces
the \usl{} package for \R{} and shows how easily it can be used to perform
the relevant calculations.
\end{abstract}

\tableofcontents

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Version}

This document describes version \Sexpr{packageVersion("usl")} of the \usl{}
package.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Introduction}

Every system architect faces the challenge of delivering an application
system that meets the requirements. A critical point during the design is
the scalability of the system. Informally, scalability can be defined as the
ability to support a growing amount of work. A system is said to scale if it
handles the changing demand or hardware environment in a reasonably
efficient and practical way.

Scalability has two facets with respect to a computer system. On the one
hand, there is software scalability, where the focus is on how the system
behaves when the demand increases, i.e., when more users are using it or
more requests need to be handled. On the other hand, there is hardware
scalability, where the behavior of an application system running on larger
hardware configurations is investigated.
The Universal Scalability Law (USL) was developed by Dr.~Neil~J.~Gunther to
allow the quantification of scalability for the purpose of capacity
planning. It provides an analytic model for the scalability of a computer
system. A comprehensive introduction to the Universal Scalability Law
including the mathematical grounding has been published by Dr.~Gunther
(cf. \cite{Gun07}).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Background}

Dr.~Gunther shows how the scalability of every computer system can be
described by a common rational function. This function is \emph{universal}
in the sense that it does not assume any specific type of software, hardware
or system architecture.

\Cref{eqn:usl2} illustrates the original Universal Scalability Law where
$C(N) = X(N) / X(1)$ is the relative capacity given by the ratio of the
measured throughput $X(N)$ for load $N$ to the throughput $X(1)$ for
load $1$.

\begin{equation}\label{eqn:usl2}
C(N) = \frac{N}{1 + \alpha (N - 1) + \beta N (N - 1)}
\end{equation}

The denominator consists of three terms that all have a specific physical
interpretation:

\begin{labeling}{Concurrency:}
\item[Concurrency:] The first term models linear scalability that would
  exist if the different parts of the system (processors, threads \dots)
  could work without any interference caused by their interaction.
\item[Contention:] The second term of the denominator refers to the
  contention between different parts of the system. Most common are issues
  caused by serialization or queueing effects.
\item[Coherency:] The last term represents the delay induced by keeping the
  system in a coherent and consistent state. This is necessary when writable
  data is shared in different parts of the system. Predominant factors for
  such a delay are caches implemented in software and hardware.
\end{labeling}

In other words: $\alpha$ and $\beta$ represent two concrete physical issues
that limit the achievable speedup for parallel execution.

Note that the contention term grows linearly with $N$ while the coherency
term grows quadratically. As a consequence, their influence becomes larger
as $N$ increases. Due to the quadratic characteristic of the coherency term
there will be a point where the throughput of the system starts to go
retrograde, i.e., starts to decrease with further increasing load.

Dr.~Gunther proved that \cref{eqn:usl2} reduces to Amdahl's Law for
$\beta = 0$ (cf. \cite{Gun07}). Therefore the Universal Scalability Law can
be seen as a generalization of Amdahl's Law for speedup in parallel
computing.

We can solve this nonlinear equation to estimate the coefficients $\alpha$
and $\beta$ using a sparse set of measurements for the throughput $X_i$ at
different loads $N_i$.

Originally the Universal Scalability Law normalized the measurements using
the throughput $X(1)$ for load $1$ as the point of reference. This placed an
additional burden on the performance analyst who needed to either measure
the value or find a smart way to estimate it. It was then discovered that a
third coefficient $\gamma$ can be added to the formula to represent the
missing point of reference (cf. \cite{Gun:usl3}). \Cref{eqn:usl3} shows the
updated Universal Scalability Law using three parameters.

\begin{equation}\label{eqn:usl3}
X(N) = \frac{\gamma N}{1 + \alpha (N - 1) + \beta N (N - 1)}
\end{equation}

Nonlinear regression will solve this equation and find best-fit values for
the three parameters $\alpha$, $\beta$ and $\gamma$.
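To make the estimation concrete before turning to the package, the following
sketch shows how \cref{eqn:usl3} could in principle be fitted with the base
\R{} function \kbd{nls()}. The measurements and the starting values used
here are made up purely for illustration; the \usl{} package described below
takes care of these details automatically.

<<'usl-manual-fit', eval=FALSE>>=
# Hypothetical measurements: throughput X observed at load N
# (values are assumptions, only used to illustrate the fitting step)
N <- c(1, 2, 4, 8, 16, 32)
X <- c(95, 180, 330, 560, 830, 1010)

# Fit X(N) = gamma * N / (1 + alpha*(N-1) + beta*N*(N-1)) by nonlinear
# least squares; the "port" algorithm allows lower bounds of zero
fit <- nls(X ~ gamma * N / (1 + alpha * (N - 1) + beta * N * (N - 1)),
           start = list(alpha = 0.05, beta = 0.001, gamma = 100),
           algorithm = "port", lower = c(0, 0, 0))

coef(fit)  # best-fit values for alpha, beta and gamma
@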
The \usl{} package has been created to subsume the computation into one
simple function call. This greatly reduces the manual work that was
previously needed to perform the scalability analysis.

Traditionally the coefficients are called $\sigma$ and $\kappa$ when
hardware scalability is evaluated but $\alpha$ and $\beta$ when software
scalability is analyzed. Up to version 1.8.0 the \usl{} package always used
\texttt{sigma} and \texttt{kappa} as coefficients. This has been changed
starting with the 2.0.0 release of the \usl{} package. Now the package also
uses \texttt{alpha} and \texttt{beta} as coefficient names to follow
Dr.~Gunther's nomenclature. Additionally, the \texttt{gamma} parameter was
added to implement the Universal Scalability Law with three parameters.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Examples of Scalability Analysis}

The following sections present some examples of how the \usl{} package can
be used when performing a scalability analysis. They also explain typical
function calls and their arguments.

\subsection{Case Study: Hardware Scalability}

The \usl{} package contains a demo dataset with benchmark measurements from
a ray tracing software\footnote{\url{http://sourceforge.net/projects/brlcad/}}.
The data was gathered on an SGI~Origin~2000 with 64 R12000 processors
running at 300~MHz. A number of reference images with different levels of
complexity were computed for the benchmark. The measurements contain the
average number of calculated ray-geometry intersections per second for the
different numbers of processors used.

It is important to note that with changing hardware configurations the
relative number of \emph{homogeneous} application processes per processor
must be held constant. So if $k$ application processes were used for the
$N$-processor benchmark, then $2k$ processes must be used to get the result
for $2N$ processors.

Start the analysis by loading the \usl{} package and looking at the supplied
dataset.

<<>>=
library(usl)

data(raytracer)
raytracer
@

The data shows the throughput for different hardware configurations covering
the available range from one to 64 processors. We can easily see that the
benefit of switching from one processor to four processors is much larger
than the gain of upgrading from 48 to 64 processors.

Create a simple scatterplot to visualize the raw data.

<<'rtplot1', fig.show='hide'>>=
plot(throughput ~ processors, data = raytracer)
@

\Cref{fig:rtplot2} shows the throughput of the system for the different
numbers of processors. This plot is a typical example of the effects of
\emph{diminishing returns}\label{dimret}, because it clearly shows how the
benefit of adding more processors to the system gets smaller for higher
numbers of processors.

<<'rtplot2', echo=FALSE, fig.cap='Measured throughput of a ray tracing software in relation to the number of available processors'>>=
<<rtplot1>>
@

Our next step builds the USL model from the dataset. The \kbd{usl()}
function creates an S4 object that encapsulates the computation. The first
argument is a formula with a symbolic description of the model we want to
analyze. In this case we would like to analyze how the ``throughput''
changes with regard to the number of ``processors'' in the system. The
second argument is the dataset with the measured values. Note how this call
matches the syntax of the \kbd{plot()} function.

<<>>=
usl.model <- usl(throughput ~ processors, data = raytracer)
@

The model object can be investigated with the \kbd{summary()} function.
<<>>=
summary(usl.model)
@

The output of the \kbd{summary()} function shows different types of
information.

\begin{itemize}
\item First of all, it includes the call we used to create the model.
\item The efficiency tells us something about the ratio of useful work that
  is performed per processor. Two processors might be able to handle twice
  the work of one processor, but not more, so the workload per processor
  should usually be less than or equal to $1$. Often the maximum of the
  distribution is not exactly $1$; this is caused by the regression model
  when the expected value differs slightly from the measured value.
\item We are performing a regression on the data to calculate the
  coefficients and therefore we determine the residuals for the fitted
  values. The distribution of the residuals is also given as part of the
  summary.
\item The coefficients $\alpha$ and $\beta$ are the results we are primarily
  interested in. They tell us the magnitude of the contention and coherency
  effects within the system.
\item The third coefficient $\gamma$ estimates the throughput for a single
  processor. In this case we actually have a measurement for the
  single-processor case, but for a real-world analysis this value is often
  unknown or cannot be measured. The difference between the value of
  $\gamma$ and the measurement is also caused by the regression; it should
  be small if the regression finds a reasonable model.
\item The residual standard error estimates how well the model fits the
  data. We can see that the difference between the model prediction and the
  measured values is typically within
  $\Sexpr{signif(sigma(usl.model), 3)}$ ray-tracing operations per second.
\item Finally, the scalability bounds are printed. These bounds state
  important limits of the scalability as defined by the model. We will look
  at the meaning of these values shortly.
\end{itemize}

The function \kbd{efficiency()} extracts the efficiency values from the
model and allows us to have a closer look at the specific efficiencies of
the different processor configurations.

<<>>=
efficiency(usl.model)
@

A bar plot is useful to visually compare the decreasing efficiencies of the
configurations with an increasing number of processors. \Cref{fig:rtbarplot}
shows the output diagram.

<<'rtbarplot', fig.cap='Rate of efficiency per processor for different numbers of processors running the ray tracing software'>>=
barplot(efficiency(usl.model),
        ylab = "efficiency / processor", xlab = "processors")
@

The efficiency can be used for a first validation and sanity check of the
measured values. Values larger than $1.0$ usually need closer investigation.
It is also suspicious if the efficiency increases when the load increases.

The model coefficients alone can be retrieved with the \kbd{coef()}
function.

<<>>=
coef(usl.model)
@

We can see that the model estimates the contention parameter $\alpha$ to be
about $\Sexpr{signif(100 * coef(usl.model)[['alpha']], 2)}$ percent and the
coherency parameter $\beta$ to be
$\Sexpr{signif(coef(usl.model)[['beta']], 2)}$.

With $\beta = 0$ the model indicates no coherency penalty for the system. In
this case the throughput will asymptotically approach an upper bound as the
number of processors is increased. The summary output above indicates a
limit of $\Sexpr{signif(summary(usl.model)$limit, 3)}$ as the upper bound
for the maximum achievable throughput.
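For $\beta = 0$ this bound follows directly from the coefficients: as $N$
grows, \cref{eqn:usl3} approaches $\gamma / \alpha$. As a quick sanity check
(a small sketch, not part of the original analysis) we can recompute the
bound from the estimated coefficients; it should essentially match the limit
reported by \kbd{summary()}.

<<>>=
# Asymptotic throughput bound gamma / alpha (valid for beta = 0)
coef(usl.model)[['gamma']] / coef(usl.model)[['alpha']]
@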
There is no way to increase the throughput beyond that number, no matter how
many processors are available.

For $\beta = 0$ the predicted throughput never reaches a maximum:
theoretically it can always be increased by adding more processors. At the
same time it is limited by the bound stated in the previous paragraph, so
the improvement becomes vanishingly small as more and more processors are
added.

To get an impression of the scalability function we can use the \kbd{plot()}
function and create a combined graph with the original data as dots and the
calculated scalability function as a solid line. The scalability bounds are
added as dotted lines to the output. \Cref{fig:rtplot3} shows the result of
that plot.

<<'rtplot3', fig.cap='Throughput of a ray tracing software using different numbers of processors'>>=
plot(throughput ~ processors, data = raytracer, pch = 16, ylim = c(0, 400))

plot(usl.model, add = TRUE, bounds = TRUE)
@

<<'bounds', echo=FALSE>>=
Xroof <- usl.model$limit
Nopt  <- usl.model$optimal[1]
Xopt  <- usl.model$optimal[2]
@

The plot has dotted lines for the upper limit defined by Amdahl (here for a
throughput of $\Sexpr{signif(Xroof, 3)}$) and for the linear scalability we
strive to see in all our applications. Unfortunately the throughput falls
away from the linear path rather quickly. Optimal scalability is reached at
$\Sexpr{signif(Nopt, 3)}$ processors performing $\Sexpr{signif(Xopt, 3)}$
ray-tracing operations per second.

The optimum is defined as the point on the x-axis below the intersection of
the linear scalability bound and the scalability limit defined by Amdahl's
Law. Running a system below this point will underutilize already purchased
capacity. But pushing the system above this point leads to
\emph{diminishing returns}. The point of optimal scalability is also given
in the output of the \kbd{summary()} function as shown before.

The confidence intervals for the model coefficients are returned by calling
the \kbd{confint()} function.

<<>>=
confint(usl.model, level = 0.95)
@

For a given confidence level, the confidence interval shows the range in
which the true value is expected to lie. We can see that the true $\beta$
might in fact not really be zero. One reason is that all measurements
contain measurement and observation errors. The other reason is that a very
small $\beta$ might only cause a measurable effect when a much larger number
of processors is used.

SGI marketed the Origin~2000 with up to 128 processors. Let's assume that
going from 64 to 128 processors does not introduce any additional
limitations to the system architecture. Then we can use the existing model
and forecast the system throughput for other numbers like 96 and 128
processors using the \kbd{predict()} function.

<<>>=
predict(usl.model, data.frame(processors = c(96, 128)))
@

We can see from the prediction that the throughput will reach
$\Sexpr{signif(predict(usl.model, data.frame(processors = c(128)))[[1]], 4)}$
for 128 processors.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Case Study: Software Scalability}

In this section we will perform an analysis of a SPEC benchmark. A
SPARCcenter~2000 with 16 CPUs was used in October 1994 for the SDM91
benchmark\footnote{\url{http://www.spec.org/osg/sdm91/results/results.html}}.
The benchmark simulates a number of users working on a UNIX server (editing
files, compiling \dots) and measures the number of script executions per
hour.

First, select the demo dataset with the data from the SPEC SDM91 benchmark.
<<>>=
library(usl)

data(specsdm91)
specsdm91
@

The data provides the measurements made during the benchmark. The column
``load'' shows the number of virtual users that were simulated by the
benchmark and the column ``throughput'' has the measured number of script
executions per hour for that load.

Next we create the USL model for this dataset by calling the \kbd{usl()}
function. Again we specify a symbolic description of the model and the
dataset with the measurements. But this time we choose a different method
for the analysis.

<<>>=
usl.model <- usl(throughput ~ load, specsdm91, method = "nls")
@

There are currently three possible values for the \texttt{method} parameter:

\begin{labeling}{\texttt{default}:}
\item[\texttt{nls}:] This method uses the \kbd{nls()} function of the stats
  package for a nonlinear regression model. It estimates the coefficients
  $\alpha$, $\beta$ and $\gamma$. The nonlinear regression uses constraints
  for its parameters, which means the ``port'' algorithm is used internally
  to solve the model, so all restrictions of the ``port'' algorithm apply.
\item[\texttt{nlxb}:] A nonlinear regression model is also used in this
  case. But instead of the \kbd{nls()} function it uses the \kbd{nlxb()}
  function from the nlsr package (cf. \cite{R:nlsr}). It is expected to be
  more robust than the \texttt{nls} method.
\item[\texttt{default}:] The default method traditionally used a
  transformation into a \nth{2} degree polynomial and required the data to
  be normalized. This was the original algorithm for the two-parameter
  Universal Scalability Law (cf. chapter 5.2.3 of \cite{Gun07}). Starting
  with version 2.0.0 of the \usl{} package this algorithm is no longer
  available. The \kbd{nlxb()} function is used instead.
\end{labeling}

We also use the \kbd{summary()} function to look at the details of the
analysis.

<<>>=
summary(usl.model)
@

Looking at the coefficients we notice that $\alpha$ is about
\Sexpr{signif(coef(usl.model)[['alpha']], 2)} and $\beta$ is about
\Sexpr{signif(coef(usl.model)[['beta']], 1)}. The parameter $\alpha$
indicates that about \Sexpr{signif(100*coef(usl.model)[['alpha']], 2)}
percent of the execution time is strictly serial. As a rule of thumb we can
say that $3$ percent is an excellent value for a reasonably complex
application. Note that this serial fraction is the same quantity that
appears in Amdahl's Law.

We hypothesize that a proposed change to the system --- maybe a redesign of
the cache architecture or the elimination of a point-to-point communication
--- could reduce $\beta$ by half. Now we want to predict how the scalability
of the system would change. We can calculate the point of maximum
scalability for the current system and for the hypothetical system with the
\kbd{peak.scalability()} function.

<<>>=
peak.scalability(usl.model)
peak.scalability(usl.model, beta = 0.00005)
@

The function accepts the optional arguments \texttt{alpha} and
\texttt{beta}. They are useful for what-if analyses. Setting these
parameters overrides the calculated model coefficients and shows how the
system would behave with a different contention or coherency coefficient.
In this case we learn that the point of peak scalability would move from
around \Sexpr{signif(peak.scalability(usl.model), 3)} to about
\Sexpr{signif(peak.scalability(usl.model, beta=0.00005), 3)} virtual users
if we were actually able to build the system with the assumed optimization.

Both calculated scalability functions can be plotted using the \kbd{plot()}
or \kbd{curve()} functions.
The following commands create a graph of the original data points and the
derived scalability functions. To completely include the scalability of the
hypothetical system, we have to increase the range of the plotted values
with the first command.

<<'spplot1', fig.show='hide'>>=
plot(specsdm91, pch = 16, ylim = c(0,2500))
plot(usl.model, add = TRUE)

# Create function cache.scale to perform calculations with the model
cache.scale <- scalability(usl.model, beta = 0.00005)

curve(cache.scale, lty = 2, add = TRUE)
@

We used the function \kbd{scalability()} here. It is a higher-order
function, returning a function and not just a single value. The returned
function is assigned to \kbd{cache.scale}, which makes it possible to use
the \kbd{curve()} function to plot the values over the specified range.

\Cref{fig:spplot2} shows the measured throughput in scripts per hour for a
given load, i.e., the number of simulated users. The solid line indicates
the derived USL model while the dashed line represents our hypothetical
system using the proposed optimization.

<<'spplot2', echo=FALSE, fig.cap='The result of the SPEC SDM91 benchmark for a SPARCcenter 2000 (dots) together with the calculated scalability function (solid line) and a hypothetical scalability function (dashed line)'>>=
<<spplot1>>
@

From the figure we can see that the scalability really peaks at one point.
Increasing the load beyond that point leads to retrograde behavior, i.e.,
the throughput decreases again. As we have calculated earlier, the measured
system will reach this point sooner than the hypothetical system.

We can combine the \kbd{cache.scale()} (defined before) and
\kbd{peak.scalability()} functions to get the predicted throughput values at
the respective peaks.

<<>>=
scalability(usl.model)(peak.scalability(usl.model))

# Use cache.scale function defined before
cache.scale(peak.scalability(usl.model, beta = 0.00005))
@

This illustrates how the Universal Scalability Law can help decide whether
the system is currently limited more by contention or by coherency issues,
and what impact a proposed change would have.

The \kbd{predict()} function can also be used to calculate confidence bands
for the scalability function at a specified level. To get a smoother graph
it is advisable to predict the values for a higher number of points. Let's
start by creating a data frame with the required load values.

<<>>=
load <- with(specsdm91, expand.grid(load = seq(min(load), max(load))))
@

We use the data frame to determine the fitted values and also the upper and
lower confidence bounds at the requested level. The result will be a matrix
with column names \kbd{fit} for the fitted values, \kbd{lwr} for the lower
and \kbd{upr} for the upper bounds.

<<>>=
fit <- predict(usl.model, newdata = load,
               interval = "confidence", level = 0.95)
@

The matrix is used to define the coordinates of a polygon containing the
area between the lower and the upper bounds. The polygon connects the points
of the lower bounds from lower to higher values and then back using the
points of the upper bounds.

<<>>=
usl.polygon <- matrix(c(load[, 1], rev(load[, 1]),
                        fit[, 'lwr'], rev(fit[, 'upr'])),
                      nrow = 2 * nrow(load))
@

The plot is composed of multiple single plots. The first plot initializes
the canvas and creates the axes. Then the polygon is plotted as a gray
area. In the next step the measured values are added as points. Finally, a
solid line is plotted to indicate the fitted scalability function. See
\cref{fig:ciplot1} for the entire plot.
<<'ciplot1', fig.cap='The result of the SPEC SDM91 benchmark with confidence bands for the scalability function at the 95\\% level'>>=
# Create empty plot (define canvas size, axis, ...)
plot(specsdm91, xlab = names(specsdm91)[1], ylab = names(specsdm91)[2],
     ylim = c(0, 2000), type = "n")

# Plot gray polygon indicating the confidence interval
polygon(usl.polygon, border = NA, col = "gray")

# Plot the measured throughput
points(specsdm91, pch = 16)

# Plot the fit
lines(load[, 1], fit[, 'fit'])
@

Another way to illustrate the impact of the parameters $\alpha$ and $\beta$
on the scalability is by looking at the achievable speedup when a fixed load
is parallelized. A naive estimation would be that doubling the degree of
parallelization should cut the execution time in half. Unfortunately it
doesn't work this way.

In general there is a range where doubling the parallelization will actually
improve the execution time. But the improvement will get smaller and smaller
when the degree of parallelism is increased further. This is also an effect
of \emph{diminishing returns} as already seen in \cref{dimret}.

The real execution time is in fact the sum of the ideal execution time and
the overhead for dealing with contention and coherency delays. The total
execution time of a parallelized workload depends on the degree of
parallelism $p$ and the coefficients $\alpha$ and $\beta$ of the associated
USL model. The magnitude of the three components --- given as fractions of
the serial execution time $T_1$ --- that account for the total execution
time of the parallelized workload can be calculated as follows
(cf. \cite{DBLP:journals/corr/abs-0808-1431} eqn. 26).

\begin{align}
T_{ideal}      &= \frac{1}{p} T_1\label{eqn:ideal} \\
T_{contention} &= \alpha \left(\frac{p-1}{p}\right) T_1\label{eqn:contention} \\
T_{coherency}  &= \beta \frac{1}{2} (p-1) T_1\label{eqn:coherency}
\end{align}

The function \kbd{overhead()} can be used to calculate the corresponding
fractions for a given model. The function has the same interface as the
\kbd{predict()} function. Calling it with only the model as argument will
calculate the overhead for the fitted values. It can also be called with a
data frame as second argument; in that case the data frame is used to
determine the values for the calculation.

Let's use our current model to calculate the overhead for a load of $10$,
$20$, $100$ and $200$ simulated users. We create a data frame with the
number of users and use the \kbd{overhead()} function to estimate the
overhead.

<<>>=
load <- data.frame(load = c(10, 20, 100, 200))

ovhd <- overhead(usl.model, newdata = load)
ovhd
@

We can see that the ideal execution time for running $10$ jobs in parallel
is $\nicefrac{1}{10}$ of the execution time of running the jobs
unparallelized. To get the total fraction we have to add the overhead for
contention ($\Sexpr{100*signif(ovhd[1,'contention'], 2)}\%$) and for
coherency delays ($\Sexpr{100*signif(ovhd[1,'coherency'], 2)}\%$). This
gives a total of $\Sexpr{100*signif(sum(ovhd[1,]), 4)}\%$. So with $10$ jobs
in parallel we are only about $\Sexpr{signif(1/sum(ovhd[1,]), 2)}$ times
faster than running the same workload in a serial way.

\Cref{eqn:contention} shows that the fraction of time spent dealing with
contention will converge to the value of $\alpha$. \Cref{eqn:coherency}
explains that coherency delays will grow beyond any limit if the degree of
parallelism is large enough.
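Since the coherency fraction keeps growing with $p$ while the ideal fraction
shrinks, the achievable speedup, i.e., the reciprocal of the summed
fractions, eventually levels off and then decreases again. As a quick
cross-check (just a small sketch reusing the \kbd{ovhd} matrix computed
above) we can derive the implied speedup for each of the four loads:

<<>>=
# Implied speedup over serial execution: reciprocal of the summed fractions
1 / rowSums(ovhd)
@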
This corresponds to the observation that adding more parallelism will
sometimes make performance worse.

A stacked barplot can be used to visualize how the different effects change
with an increasing degree of parallelism. Note that the result matrix must
be transposed to match the format needed for the \kbd{barplot()} command.

<<'ovplot1', fig.cap='Decomposition of the execution time for parallelized workloads of the SPEC SDM91 benchmark. The time is measured as a fraction of the time needed for serial execution of the workload.'>>=
barplot(height = t(ovhd), names.arg = load[, 1],
        xlab = names(load), legend.text = TRUE)
@

\Cref{fig:ovplot1} shows the resulting plot. It clearly illustrates the
decrease in ideal execution time when the degree of parallelism is
increased. It also shows how initially almost only contention contributes to
the total execution time. For higher degrees of parallelism the impact of
coherency delays grows. Note how the difference in ideal execution time
between $100$ and $200$ parallel jobs effectively has no effect on the total
execution time.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection{Case Study: Multi-valued Data}

It is very common to use multi-valued data for a scalability analysis. These
measurements are often taken from a live system and may include many
different data points for similar load values. This could be the result of a
non-homogeneous workload and an analyst has to decide how to take that into
account. But for a production system there is usually no feasible way to
create a homogeneous workload.

The following data shows a subset of performance data gathered from an
Oracle database system providing a login service for multiple web
applications. For the analysis we focus on only two of the available
metrics:

\begin{labeling}{\texttt{txn\_rate}:}
\item[\texttt{txn\_rate}:] The average number of processed database
  transactions. This metric is given as transactions per second.
\item[\texttt{db\_time}:] The average time spent inside the database either
  working on a CPU or waiting for resources (I/O, locks, buffers \dots). The
  time is expressed as seconds per second, so two sessions working for
  exactly one quarter of a second each will contribute a total of half a
  second for that second. Oracle has coined the term \emph{Average Active
  Sessions} (AAS) for this metric.
\end{labeling}

Let's have a look at the first couple of data points in our data set. For
each time interval of two minutes there is a corresponding value for the
average database time per second and for the average number of transactions
per second in this interval.

<<>>=
data(oracledb)

head(subset(oracledb, select = c(timestamp, db_time, txn_rate)))
@

A naive approach would be a plot of the data as a time series (see
\cref{fig:oraplot1}). This plot shows the familiar pattern of an OLTP
application that is mostly used during office hours. Unfortunately this type
of plot is largely useless for a scalability analysis.

<<'oraplot1', echo=FALSE, fig.cap='Transaction rates of an Oracle database system during the day of January 19th, 2012'>>=
plot(txn_rate ~ timestamp, oracledb, pch = 20,
     xlab = "Time of day", ylab = "Txn / sec")
@

The Universal Scalability Law correlates a throughput with a load. In this
case the throughput is clearly given by the transaction rate and the
database time is taken as the load metric.
The definition above states that the total time spent --- either running on
a CPU or waiting --- is a measure of the average number of active sessions.
So we use that to express the load on the database system.

As usual, we call the \kbd{usl()} function to carry out the analysis. See
\cref{fig:orausl1} for the scatterplot of the data including the plot of the
estimated scalability function.

<<'orausl1', fig.cap='Relationship between the transaction rate and the number of average active sessions in an Oracle database system'>>=
plot(txn_rate ~ db_time, oracledb,
     xlab = "Average active sessions", ylab = "Txn / sec")

usl.oracle <- usl(txn_rate ~ db_time, oracledb)

plot(usl.oracle, add = TRUE)
@

Now we can retrieve the coefficients for this model.

<<>>=
coef(usl.oracle)
@

Here our $\alpha$ is about an order of magnitude bigger than what we have
seen in the previous sections. This indicates a major issue with some kind
of serialization or queueing that severely limits the scalability. In fact
it is so bad that the impact is already visible with only a few active
sessions working at the same time: according to the model the peak
throughput is reached at about
$\Sexpr{signif(peak.scalability(usl.oracle), 2)}$ sessions.

<<>>=
peak.scalability(usl.oracle)
@

The confidence intervals confirm that there is only a small uncertainty
about the magnitude of the calculated coefficients.

<<>>=
confint(usl.oracle)
@

This analysis shows how we can use some of the metrics provided by a live
Oracle database system to learn about its scalability. Note that neither the
Oracle software nor the application needed any additional instrumentation to
collect this data. Also, the analysis was done without any internal
knowledge about the way the application was using the database.

% Looking at the plot of the scalability function in \cref{fig:orausl1}, we
% could argue that the estimated function seems to be too pessimistic. There
% are many data points showing a much higher transaction rate than the
% scalability function suggests. Numerically this is caused by using the
% \emph{least squares} method which considers all data points equally.

% Let's estimate the scalability when only the alleged \emph{good} data
% points are taken into account. Here is a constructive description to
% isolate the points that make up the upper boundary of the comprising area.
% \begin{enumerate}
% \item Determine the convex hull of the data points
% \item Locate the data points with the smallest and the largest abscissa
% \item Find the straight line passing through these two points
% \item Use only the data points on the convex hull that are located on or above
%   this line
% \end{enumerate}

% <<'orausl2', echo=FALSE, fig.cap='Using the convex hull to set the boundary between upper and lower data points'>>=
% p <- with(oracledb, data.frame(db_time, txn_rate))
%
% # Plot all points
% plot(txn_rate ~ db_time, p,
%      xlab = "Average active sessions", ylab = "Transactions per second")
%
% # Calculate convex hull
% idx <- chull(cbind(p$db_time, p$txn_rate))
% pts <- p[idx, ]
%
% # Get point with smallest abscissa
% p1 <- which.min(pts$db_time)
% x1 <- pts[p1, 1]
% y1 <- pts[p1, 2]
%
% # Get point with largest abscissa
% p2 <- which.max(pts$db_time)
% x2 <- pts[p2, 1]
% y2 <- pts[p2, 2]
%
% # Slope–intercept equation: y=mx+b
% m <- (y2 - y1) / (x2 - x1)
% b <- (-x1 * m) + y1
%
% # Plot points on convex hull
% points(pts, pch = 16)
%
% # Remove points below the line and sort ascending by db_time
% top <- subset(pts, txn_rate >= (m * db_time) + b)
% top <- top[order(top$db_time), ]
%
% lines(top$db_time, top$txn_rate, lty = 1)
% abline(b, m, lty = 2)
% @

% \Cref{fig:orausl2} uses solid black points to mark the points of the convex
% hull. The dashed line, defined by the coordinates
% $\left(\Sexpr{signif(x1,2)},\Sexpr{signif(y1,2)}\right)$ and
% $\left(\Sexpr{signif(x2,2)},\Sexpr{signif(y2,2)}\right)$, is the boundary
% between the lower and the upper part.

% We will now use only the data points on the upper part of the convex hull
% for an additional analysis.

% <<>>=
% usl.oracle2 <- usl(txn_rate ~ db_time, top, method = "nlxb")
% coef(usl.oracle2)
% confint(usl.oracle2)
% @

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\bibliographystyle{plain}
\bibliography{usl}

\end{document}