So far, the basics of R are introduced. Writing R code is not difficult, in fact quite intuitive. However, using R efficiently is not easy. R scripts are interpretive, thus R can’t compete with compiled languages, such as C and Fortran, in terms of computational efficiency. That becomes especially important for numerical analyses. Interestingly, many approaches we learn from compiled languages are not necessarily applicable to R programming. Here, I will introduce a few techniques to make R codes more efficient. I highly recommend the online book Efficient R programming for details of this topic. For example, one should use factors when possible, as they are good for statistical analysis. A general principle is to avoid computation in R scripts. A good practice of R programming is to avoid intensive/iterative calculations directly in R code. The goal is to do these calculations with more efficient functions, typically written in C or Fortran.

Avoid growing vectors

The following code is extremely slow for large n.

n = 10 # Set the value of n to 10
v = c() # Create an empty vector v
for (i in seq_len(n)) { # Use a for loop to iterate from 1 to n
  v = c(v, i^2) # Append the square of i to vector v
}
v # Print the resulting vector v

##  [1]   1   4   9  16  25  36  49  64  81 100

A better way creates an array of final length first.

n = 10
v = numeric(n) # Create a numeric vector v of length n filled with zeros
for (i in seq_len(n)) {
  v[i] = i^2 # Assign the square of i to the corresponding element in the vector v
}
v

##  [1]   1   4   9  16  25  36  49  64  81 100

Vectorize codes

An even better approach is the following. It uses vector operations instead.

n = 10
v = seq_len(n) # Generate a sequence from 1 to n
v = v * v # Square each element in the vector v
v

##  [1]   1   4   9  16  25  36  49  64  81 100

Iteration is commonly used in programming. Unfortunately, iteration, e.g., using a For Loop, is very slow in R. For example, the following code calculates the mean and standard deviation (SD) of a series of numbers. Please note that the definition of SD here is the mathematical definition.

\[\sigma(\mathbf{x})=\sqrt{\frac{\sum_{j=1}^{n}(\mathbf{x_j}-\mathbf{\overline{x}})^2}{n}}\] It can be shown that the following formula is equivalent and more efficient to be coded.

\[\sigma(\mathbf{x})=\sqrt{\frac{\sum_{j=1}^{n}\mathbf{x_j}^2}{n} - \left(\frac{\sum_{j=1}^{n}\mathbf{x_j}}{n}\right)^2}\]

# Initialize variables
my_sum = 0
my_sum2 = 0
num = 100

for (i in seq_len(num)) { # Loop through a sequence from 1 to num
  my_sum = my_sum + i
  my_sum2 = my_sum2 + i^2
}

# Calculate mean and standard deviation
my_mean = my_sum / num
my_sd = sqrt(my_sum2 / num - my_mean^2)

# Print the mean and standard deviation
my_mean

## [1] 50.5

my_sd

## [1] 28.86607

While the above code is typical for C or Fortran, a better approach for R is to use vector operations.

num = 100 # Define num
v = seq_len(num) # Create a sequence from 1 to num

# Calculate mean, mean square, and standard deviation
my_mean = mean(v)
my_mean_square = mean(v^2)
my_sd = sqrt(my_mean_square - my_mean^2)

# Print the mean and standard deviation
my_mean

## [1] 50.5

my_sd

## [1] 28.86607

Apply can be used to perform operations for columns (or rows) of a matrix.

library(MASS) # Load the necessary library for matrix operations

mat = matrix(rnorm(16), nrow = 4) # Generate a random matrix of size 4x4

# Calculate means, means square, and standard deviations
means = apply(mat, 2, mean)
my_mean_square = apply(mat * mat, 2, mean)
sd = sqrt(my_mean_square - means^2)

sd # Print the standard deviations

## [1] 0.6588735 1.0592286 0.9403377 1.2060033

In statistics, a lightly different definition is typically used.

\[\sigma(\mathbf{x})=\sqrt{\frac{\sum_{j=1}^{n}(\mathbf{x_j}-\mathbf{\overline{x}})^2}{n-1}}\] For example, the sd function in R uses this definition.

apply(mat, 2, sd)

## [1] 0.7608016 1.2230918 1.0858084 1.3925726

Byte compile package

Any function can be compiled into byte code for an easy performance boost.

library(compiler)
my_sd =  function(x) {   # define the function
  my_mean = mean(x)
  my_mean2 = mean(x*x)
  return(sqrt(my_mean2 - my_mean**2))
}
cmp_sd = cmpfun(my_sd)   # compile it to byte code
cmp_sd(seq_len(100))   # same calculations as a previous example, but with compiled function

## [1] 28.86607

apply(mat, 2, cmp_sd)  # apply to a matrix

## [1] 0.6588735 1.0592286 0.9403377 1.2060033

Use C/Fortran codes

Many R packages are written in a compiled language. It is important to use them when possible. It is also feasible to write functions in C, C++, or Fortran, and call these functions from R. Here is an example of Fortran subroutines, provided in the file “extra/src/sub_cal_sd.f90”. Note that only integer, double precision and logical arguments are allowed in the Fortran-R interface approach.

subroutine cal_sd(n,x,sd)
  implicit none
  integer,intent(in) :: n
  double precision, intent(in) :: x(n)
  double precision, intent(out) :: sd
  integer :: i
  double precision :: mean, mean2
  mean = 0.d0
  mean2 = 0.d0
  do i = 1, n
    mean = mean + x(i)
    mean2 = mean2 + x(i)**2
  end do
  mean = mean/n
  mean2 = mean2/n
  sd = sqrt(mean2 - mean**2)
end subroutine cal_sd

The Fortran code needs to be first compiled into a .so file. This needs to be done in command line with a Fortran compiler (gfortran from gcc).

gfortran -fpic -shared extra/src/sub_cal_sd.f90 -o extra/src/sub_cal_sd.so

Then, in R load the shared library

dyn.load("extra/src/sub_cal_sd.so")

cal_sd can be called in R below. The arguments of the Fortran subroutines can be found in the output list.

n = 100
v = seq_len(n)
my_sd = 0.0
results = .Fortran("cal_sd",as.integer(n),as.double(v), my_sd)
results             # [[3]] outputs the sd

## [[1]]
## [1] 100
## 
## [[2]]
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100
## 
## [[3]]
## [1] 28.86607

To make it even more convenient, one can write an R wrapper.

 my_sd_R <- function(x) {
   if (!is.numeric(x))
           stop("argument x must be numeric")
   my_sd = 0.0
   results =  .Fortran("cal_sd", as.integer(length(x)), as.double(x), my_sd)
   return(results[[3]])
 }
my_sd_R(seq_len(100))

## [1] 28.86607

More details can be found in this page.

Parallelization

Lastly, parallelization is another way to improve the performance of R. If simple parallelization is needed, we can use Apply functions. Simply setup the desired calculations in a list. Instead of using lapply, use mclapply and specify the number of cores in use. As demonstrated below, the code with parallelization is much faster.

library(parallel)   # load the parallel library
library(microbenchmark)  # use microbenchmark library for timing
numCores = detectCores()     # the number of available cores 
numCores

## [1] 14

tests = seq(1, 10000)
my_test <- function(test_id) {
  v = rnorm(1000)
  return(my_sd_R(v))
}
my_benchmark = microbenchmark(results_noparallel = lapply(tests, my_test),
               results_withparallel = mclapply(tests, my_test, mc.cores = numCores),
               times = 10, unit = "s")

## Warning in microbenchmark(results_noparallel = lapply(tests, my_test),
## results_withparallel = mclapply(tests, : less accurate nanosecond times to
## avoid potential integer overflows

microbenchmark:::autoplot.microbenchmark(my_benchmark)

## Loading required namespace: ggplot2

For additional read, please check this page

Part 1: Introduction to R Programming

B. Efficient R Programming

Mingyang Lu

10/12/2021

Avoid growing vectors

Vectorize codes

Byte compile package

Use C/Fortran codes

Parallelization