UQRUG -

Why parallelize?

Have you ever noticed that RStudio never reaches 100% CPU usage even when running a very demanding task?

By default, R runs on a single CPU thread.

Is it the most efficient way to run functions?

Independent operations

results <- rep(0, 10)

for (num in 1:10) {
  results[num] <- num^2
}

results

 [1]   1   4   9  16  25  36  49  64  81 100
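Each iteration above depends only on its own value of num, which is what makes the loop a candidate for parallelization. A minimal sketch of the same computation written as a function applied to each element, the form the parallel tools expect (the helper name square is hypothetical, not from the slides):

```r
# Each iteration depends only on its own input, so execution order does not matter.
# `square` is a hypothetical helper name used for illustration.
square <- function(num) num^2

# Loop version, as in the slide
results_loop <- rep(0, 10)
for (num in 1:10) {
  results_loop[num] <- square(num)
}

# Same computation expressed as a function applied to each element
results_apply <- sapply(1:10, square)

all.equal(results_loop, results_apply)
```

Once the work is in this shape, swapping sapply for a parallel equivalent is a small change.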

Parallelization in R

It is possible to parallelize processes in R using specialized packages.

parallel

The most widely used package for parallelization.
Ships with base R, so it does not need to be installed separately.

library(parallel)

Cores in our PC and management of clusters

Basic concepts

Core: an individual processing unit within a CPU
Cluster: a set of background R sessions that allows processes to run in parallel.

Load the packages
  library(parallel)
  library(doParallel)   # provides registerDoParallel()

Number of cores
  cores <- detectCores()

Make the cluster
  clust <- makeCluster(cores)

Register the cluster as the foreach backend
  registerDoParallel(clust)

Status of the cluster connections
  showConnections()

Close the cluster
  stopCluster(cl = clust)
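The steps above use every detected core. A hedged variant, assuming you want to leave one core free for the rest of the system and to guard against detectCores() returning NA (the name n_workers and the cap of 4 are choices made for this sketch only):

```r
library(parallel)

# detectCores() returns NA when the number of cores is unknown, so fall back to 1
cores <- detectCores()
if (is.na(cores)) cores <- 1

# Assumption: keep one core free, and cap at 4 only to keep this demonstration light
n_workers <- max(1, min(cores - 1, 4))

clust <- makeCluster(n_workers)
length(clust)          # number of worker sessions in the cluster
stopCluster(clust)
```

The right number of workers depends on the machine and the task; this is one conservative starting point rather than a rule.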

Methods of Parallelization

There are two main ways to parallelize code: via sockets or via forking.

Socket approach: launches a new R session on each core.
Forking approach: copies the entire current R session and runs the copy on a new core.

Socket pros and cons

Pros
- Works on every OS.
- Each process on each node is fully independent.
Cons
- Each process starts as a fresh R session, so it is slower.
- Variables and packages must be exported to the worker sessions.
- More complicated to implement.
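The export step listed among the cons can be sketched as follows. This is a minimal example, assuming a two-worker socket cluster; offset and add_offset are hypothetical names, and the stats package stands in for whatever package the workers would need:

```r
library(parallel)

clust <- makeCluster(2)    # socket cluster: two fresh R sessions

# Objects defined in the main session (hypothetical names)
offset <- 10
add_offset <- function(x) x + offset

# Workers start empty, so copy over the objects they need
clusterExport(clust, varlist = c("offset", "add_offset"), envir = environment())

# Load a package on every worker (stats is only a placeholder here)
invisible(clusterEvalQ(clust, library(stats)))

res <- parSapply(clust, 1:3, function(x) add_offset(x))

stopCluster(clust)
res
```

Without the clusterExport() call, the workers would typically fail with an "object not found" error, because they do not share the main session's workspace.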

Forking pros and cons

Pros
- Faster.
- Not necessary to import the variables and packages.
- Relatively easier to implement.
Cons
- Does NOT work on Windows
- Processes are not totally independent, which can cause unexpected behavior when run in RStudio.
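A minimal forking sketch using mclapply() from the parallel package. The variable scale is a hypothetical name, and the fallback to a single worker on Windows is an assumption made so the snippet runs everywhere:

```r
library(parallel)

# Forking is unavailable on Windows, where mc.cores must stay at 1
n_workers <- if (.Platform$OS.type == "windows") 1 else 2

scale <- 3   # defined in the main session; forked workers see it without exporting

res <- mclapply(1:4, function(x) x * scale, mc.cores = n_workers)
unlist(res)
```

No cluster is created and nothing is exported, which is why forking is usually the shorter option on Linux and macOS.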

parallel and apply

parallel is designed to work with functions, and its use is analogous to that of apply and its derivatives lapply and sapply.

Equivalent functions to the apply family
apply family	parallel equivalent	Input	Output
apply	parApply (parRapply, parCapply)	data.frame, matrix	vector, list, array
sapply	parSapply	list, vector, data.frame	vector, matrix
lapply	parLapply	list, vector, data.frame	list
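The correspondence in the table can be sketched side by side. This assumes a two-worker socket cluster; the object names (m, rows_seq, rows_par, sq_seq, sq_par) are hypothetical:

```r
library(parallel)

clust <- makeCluster(2)

m <- matrix(1:6, nrow = 2)    # 2 rows, 3 columns

# apply() -> parApply(): sum each row of a matrix
rows_seq <- apply(m, 1, sum)
rows_par <- parApply(clust, m, 1, sum)

# lapply() -> parLapply(): both return a list
sq_seq <- lapply(1:3, function(x) x^2)
sq_par <- parLapply(clust, 1:3, function(x) x^2)

stopCluster(clust)

all.equal(as.vector(rows_seq), as.vector(rows_par))
all.equal(sq_seq, sq_par)
```

In each pair the only differences are the function name and the cluster passed as the first argument.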

foreach

foreach is a package designed for looping. It also allows results to be combined in different formats.

library(foreach)
foreach(i=1:2) %do% exp(i)

[[1]]
[1] 2.718282

[[2]]
[1] 7.389056

library(foreach)
foreach(i=1:2, .combine='c') %do% exp(i)

[1] 2.718282 7.389056

library(foreach)
foreach(a=1:1000, b=rep(10, 2), .combine='c') %do% {a+b}

[1] 11 12
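The .combine argument also accepts other functions than 'c'. A short sketch, assuming the foreach package is installed; total and tab are hypothetical names:

```r
library(foreach)

# Add the results together instead of concatenating them
total <- foreach(i = 1:3, .combine = "+") %do% i

# Stack one row per iteration into a matrix
tab <- foreach(i = 1:3, .combine = rbind) %do% c(i, i^2)

total
dim(tab)
```

Here total is the sum 1 + 2 + 3, and tab has one row per iteration and one column per returned value.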

foreach

By itself, foreach does not parallelize, but it can be combined with parallel and doParallel to enable parallelization.

library(foreach)
library(parallel)
library(doParallel)

clust <- makeCluster(2)
registerDoParallel(clust)

foreach(i=1:2, .combine='c') %dopar% exp(i)

stopCluster(cl = clust)

[1] 2.718282 7.389056

Example

Determine which numbers in a sample are prime

Function:

isprime <- function(num) {
    if (num < 2) return(FALSE)   # 0 and 1 are not prime
    prime <- TRUE
    i <- 2                       # Start from 2: a prime is divisible only by 1 and itself
    while (i < num) {            # Test every candidate divisor below num
      if ((num %% i) == 0) {     # %% gives the remainder; a remainder of 0 means i divides num
        prime <- FALSE
        break                    # A divisor was found, so stop searching
      }
      i <- i + 1
    }
    return(prime)
  }

Data (10,000 numbers):

listnumbers <- sample(1:100000,10000)

for

primes <- rep(TRUE, length(listnumbers))

for (i in seq_along(listnumbers)) {
  primes[i] <- isprime(listnumbers[i])
}

result <- data.frame(number = listnumbers, is_prime = primes)

[1] "Time difference of 5.308078 secs"
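The slides report timings but do not show how they were measured. One plausible way, sketched here with Sys.time() and a hypothetical stand-in task (slow_task is not from the slides):

```r
# Hypothetical stand-in for the work being timed
slow_task <- function() {
  Sys.sleep(0.2)
  invisible(NULL)
}

start <- Sys.time()
slow_task()
elapsed <- Sys.time() - start     # a "difftime" object

print(paste("Time difference of", format(elapsed)))

# system.time() is an alternative that reports user, system and elapsed time
timing <- system.time(slow_task())
timing[["elapsed"]]
```

Timings vary between runs and machines, so comparisons between methods are best read as rough indications.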

foreach

library(foreach)

primes_fe <- foreach(i = seq_along(listnumbers), .combine = "c") %do% {
  isprime(listnumbers[i])
}

result_fe <- data.frame(number = listnumbers, is_prime = primes_fe)

[1] "Time difference of 6.629166 secs"

foreach parallelized

library(parallel)
library(foreach)
library(doParallel)

cores <- detectCores()                 
clust <- parallel::makeCluster(cores)  
registerDoParallel(clust)  
 
primes_par_fe <- foreach(i = seq_along(listnumbers), .combine = "c") %dopar% {
  isprime(listnumbers[i])
}

result_par_fe<-data.frame(number=listnumbers, is_prime=primes_par_fe)

parallel::stopCluster(cl = clust)

[1] "Time difference of 2.964536 secs"

sapply

primes_sa <- sapply(listnumbers, isprime) 

result_sa<-data.frame(number=listnumbers, is_prime=primes_sa)

[1] "Time difference of 5.478974 secs"

parSapply

library(parallel)

cores <- detectCores()     
clust <- parallel::makeCluster(cores)

prime_par_sa <- parSapply(clust, listnumbers, isprime)             

result_par_sa<-data.frame(number=listnumbers, is_prime=prime_par_sa)

parallel::stopCluster(cl = clust)

[1] "Time difference of 1.725008 secs"
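A speed-up is only useful if the answer is unchanged. A small sketch that checks the sequential and parallel results agree, assuming a two-worker cluster; isprime_small is a compact, hypothetical variant of the function above, used only to keep the snippet self-contained:

```r
library(parallel)

# Compact trial-division check (hypothetical helper, not the slide's isprime)
isprime_small <- function(num) {
  if (num < 2) return(FALSE)
  if (num < 4) return(TRUE)            # 2 and 3 are prime
  all(num %% 2:(num - 1) != 0)         # no divisor between 2 and num - 1
}

set.seed(1)                            # assumption: fixed seed for repeatability
nums <- sample(1:1000, 50)

seq_res <- sapply(nums, isprime_small)

clust <- makeCluster(2)
par_res <- parSapply(clust, nums, isprime_small)
stopCluster(clust)

all.equal(seq_res, par_res)
```

Running a check like this on a small sample before timing the full data set helps catch missing exports early.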

Has the processing time improved?
