So far, the basics of R are introduced. Writing R code is not difficult, in fact quite intuitive. However, using R efficiently is not easy. R scripts are interpretive, thus R can’t compete with compiled languages, such as C and Fortran, in terms of computational efficiency. That becomes especially important for numerical analyses. Interestingly, many approaches we learn from compiled languages are not necessarily applicable to R programming. Here, I will introduce a few techniques to make R codes more efficient. I highly recommend the online book Efficient R programming for details of this topic. For example, one should use factors when possible, as they are good for statistical analysis. A general principle is to avoid computation in R scripts. A good practice of R programming is to avoid intensive/iterative calculations directly in R code. The goal is to do these calculations with more efficient functions, typically written in C or Fortran.
The following code is extremely slow for large n.
n = 10 # Set the value of n to 10
v = c() # Create an empty vector v
for (i in seq_len(n)) { # Use a for loop to iterate from 1 to n
v = c(v, i^2) # Append the square of i to vector v
}
v # Print the resulting vector v
## [1] 1 4 9 16 25 36 49 64 81 100
A better way creates an array of final length first.
n = 10
v = numeric(n) # Create a numeric vector v of length n filled with zeros
for (i in seq_len(n)) {
v[i] = i^2 # Assign the square of i to the corresponding element in the vector v
}
v
## [1] 1 4 9 16 25 36 49 64 81 100
An even better approach is the following. It uses vector operations instead.
n = 10
v = seq_len(n) # Generate a sequence from 1 to n
v = v * v # Square each element in the vector v
v
## [1] 1 4 9 16 25 36 49 64 81 100
Iteration is commonly used in programming. Unfortunately, iteration, e.g., using a For Loop, is very slow in R. For example, the following code calculates the mean and standard deviation (SD) of a series of numbers. Please note that the definition of SD here is the mathematical definition.
\[\sigma(\mathbf{x})=\sqrt{\frac{\sum_{j=1}^{n}(\mathbf{x_j}-\mathbf{\overline{x}})^2}{n}}\] It can be shown that the following formula is equivalent and more efficient to be coded.
\[\sigma(\mathbf{x})=\sqrt{\frac{\sum_{j=1}^{n}\mathbf{x_j}^2}{n} - \left(\frac{\sum_{j=1}^{n}\mathbf{x_j}}{n}\right)^2}\]
# Initialize variables
my_sum = 0
my_sum2 = 0
num = 100
for (i in seq_len(num)) { # Loop through a sequence from 1 to num
my_sum = my_sum + i
my_sum2 = my_sum2 + i^2
}
# Calculate mean and standard deviation
my_mean = my_sum / num
my_sd = sqrt(my_sum2 / num - my_mean^2)
# Print the mean and standard deviation
my_mean
## [1] 50.5
my_sd
## [1] 28.86607
While the above code is typical for C or Fortran, a better approach for R is to use vector operations.
num = 100 # Define num
v = seq_len(num) # Create a sequence from 1 to num
# Calculate mean, mean square, and standard deviation
my_mean = mean(v)
my_mean_square = mean(v^2)
my_sd = sqrt(my_mean_square - my_mean^2)
# Print the mean and standard deviation
my_mean
## [1] 50.5
my_sd
## [1] 28.86607
Apply can be used to perform operations for columns (or rows) of a matrix.
library(MASS) # Load the necessary library for matrix operations
mat = matrix(rnorm(16), nrow = 4) # Generate a random matrix of size 4x4
# Calculate means, means square, and standard deviations
means = apply(mat, 2, mean)
my_mean_square = apply(mat * mat, 2, mean)
sd = sqrt(my_mean_square - means^2)
sd # Print the standard deviations
## [1] 0.6588735 1.0592286 0.9403377 1.2060033
In statistics, a lightly different definition is typically used.
\[\sigma(\mathbf{x})=\sqrt{\frac{\sum_{j=1}^{n}(\mathbf{x_j}-\mathbf{\overline{x}})^2}{n-1}}\] For example, the sd function in R uses this definition.
apply(mat, 2, sd)
## [1] 0.7608016 1.2230918 1.0858084 1.3925726
Any function can be compiled into byte code for an easy performance boost.
library(compiler)
my_sd = function(x) { # define the function
my_mean = mean(x)
my_mean2 = mean(x*x)
return(sqrt(my_mean2 - my_mean**2))
}
cmp_sd = cmpfun(my_sd) # compile it to byte code
cmp_sd(seq_len(100)) # same calculations as a previous example, but with compiled function
## [1] 28.86607
apply(mat, 2, cmp_sd) # apply to a matrix
## [1] 0.6588735 1.0592286 0.9403377 1.2060033
Many R packages are written in a compiled language. It is important to use them when possible. It is also feasible to write functions in C, C++, or Fortran, and call these functions from R. Here is an example of Fortran subroutines, provided in the file “extra/src/sub_cal_sd.f90”. Note that only integer, double precision and logical arguments are allowed in the Fortran-R interface approach.
subroutine cal_sd(n,x,sd)
implicit none
integer,intent(in) :: n
double precision, intent(in) :: x(n)
double precision, intent(out) :: sd
integer :: i
double precision :: mean, mean2
mean = 0.d0
mean2 = 0.d0
do i = 1, n
mean = mean + x(i)
mean2 = mean2 + x(i)**2
end do
mean = mean/n
mean2 = mean2/n
sd = sqrt(mean2 - mean**2)
end subroutine cal_sd
The Fortran code needs to be first compiled into a .so file. This needs to be done in command line with a Fortran compiler (gfortran from gcc).
gfortran -fpic -shared extra/src/sub_cal_sd.f90 -o extra/src/sub_cal_sd.so
Then, in R load the shared library
dyn.load("extra/src/sub_cal_sd.so")
cal_sd can be called in R below. The arguments of the Fortran subroutines can be found in the output list.
n = 100
v = seq_len(n)
my_sd = 0.0
results = .Fortran("cal_sd",as.integer(n),as.double(v), my_sd)
results # [[3]] outputs the sd
## [[1]]
## [1] 100
##
## [[2]]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100
##
## [[3]]
## [1] 28.86607
To make it even more convenient, one can write an R wrapper.
my_sd_R <- function(x) {
if (!is.numeric(x))
stop("argument x must be numeric")
my_sd = 0.0
results = .Fortran("cal_sd", as.integer(length(x)), as.double(x), my_sd)
return(results[[3]])
}
my_sd_R(seq_len(100))
## [1] 28.86607
More details can be found in this page.
Lastly, parallelization is another way to improve the performance of R. If simple parallelization is needed, we can use Apply functions. Simply setup the desired calculations in a list. Instead of using lapply, use mclapply and specify the number of cores in use. As demonstrated below, the code with parallelization is much faster.
library(parallel) # load the parallel library
library(microbenchmark) # use microbenchmark library for timing
numCores = detectCores() # the number of available cores
numCores
## [1] 14
tests = seq(1, 10000)
my_test <- function(test_id) {
v = rnorm(1000)
return(my_sd_R(v))
}
my_benchmark = microbenchmark(results_noparallel = lapply(tests, my_test),
results_withparallel = mclapply(tests, my_test, mc.cores = numCores),
times = 10, unit = "s")
## Warning in microbenchmark(results_noparallel = lapply(tests, my_test),
## results_withparallel = mclapply(tests, : less accurate nanosecond times to
## avoid potential integer overflows
microbenchmark:::autoplot.microbenchmark(my_benchmark)
## Loading required namespace: ggplot2
For additional read, please check this page