
Parallelization is not necessarily implemented nicely in R. However, it is far better to use R's batch processing (e.g. R CMD BATCH or Rscript) than to open 10 RStudio sessions as you described, since it is less of a resource drain per task.

Cores, cores, where art thou?

The first thing I would do is find out how many cores you have access to. Within this script, 4 seem to be allocated. It is very important that you request no more than the total number of cores on your system, so that the parallel jobs can be scheduled appropriately.

# Find out how many cores exist
parallel::detectCores()
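
As an aside, it is a common convention (not a requirement) to leave one core free so the operating system and your main R session stay responsive. A minimal sketch:

# Reserve one core for the OS; never request fewer than one worker
n_workers <- max(1, parallel::detectCores() - 1)
cl <- parallel::makeCluster(n_workers)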

Speed

Speed-wise, I find the foreach iterator to consume more time than I really care for. This is because it has a lot of unnecessary overhead, and its combine operations tend to dynamically grow objects. So, I normally just interface directly with the parLapply function from R's parallel package.
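
To see the difference for yourself, here is a rough benchmarking sketch; it assumes the foreach and doParallel packages are installed, and exact timings will vary by machine:

library(parallel)
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

x <- 1:10000

# foreach: per-iteration scheduling plus a combine step that grows a vector
system.time(res_fe <- foreach(i = x, .combine = c) %dopar% sqrt(i))

# parLapply: one scatter/gather, results returned as a list
system.time(res_pl <- parLapply(cl, x, sqrt))

stopCluster(cl)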

Code

I realize that this is just a condensed example; however, without seeing everything, we really cannot give very specific feedback.

With that being said, you will probably end up with an error running this code, since i is not defined within check(); only x is in the function's scope.

Try to return some value to the foreach loop, e.g. the i value from check() or a "pass-i"/"fail-i" string, so you can figure out whether a process failed.
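
One way to do that is to wrap the body of check() in tryCatch() so every iteration reports its outcome instead of failing silently. A sketch along the lines of your function:

check <- function(x){
  tryCatch({
    df <- readRDS(paste0("./", x))
    df$var <- df$var * 2
    saveRDS(df, paste0("./temp/", x))
    paste0("pass-", x)                              # returned on success
  }, error = function(e) {
    paste0("fail-", x, ": ", conditionMessage(e))   # returned on any error
  })
}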

Another bit to note is that the script is currently set to output to the input directory. Try to create dedicated directories for input and output. The reason for this is that, if you were to re-run your script, the processed files would be included in the next run (unless the pattern specified to list.files() is specific enough to exclude them).
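
For instance, with hypothetical input/ and output/ directories, the setup might look like this:

# Keep raw files and processed files apart so re-runs stay clean
dir.create("output", showWarnings = FALSE)
files <- list.files("input", pattern = "\\.rds$")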

Also, it seems like you are using spatial data. Is this a raster image that is stored within the .rds file? If so, could you get away with using a stack instead of reading the whole raster into memory?
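
For reference, if the data lived in a raster file format rather than inside an .rds, the raster package could reference it lazily from disk (file name hypothetical):

library(raster)

# stack() keeps the layers on disk and reads cells only when needed
r <- stack("elevation.tif")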

Aside

If you are interested in learning more about how to parallelize with R, I would recommend looking over this slide deck (disclaimer: I wrote it).

Edit

Per the submitter's comment that using lapply is not possible: please note that, in this case, a for loop in R is the same as using lapply. There are some benefits (speed being one) that typically make lapply better than a for loop. Furthermore, foreach here is really a wrapper over the parXapply statements.

library(parallel)
cl <- makeCluster(detectCores())

# Function that reads a file, transforms it, and writes the result
check <- function(x){
  df <- readRDS(paste0("./", x))
  df$var <- df$var * 2
  saveRDS(df, paste0("./temp/", x))
}

# Make sure the output directory exists before anything tries to write to it
dir.create("./temp", showWarnings = FALSE)

# List the *.rds files in the input directory
files <- list.files("./", pattern = "\\.rds$")

# Obtain unique file names
i <- unique(files)

# Sequential version: lapply
lapply(i, FUN = check)

# Parallel version: parLapply does the same work across the cluster
parLapply(cl, X = i, fun = check)

# Release the workers when done
stopCluster(cl)
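
To make the equivalence between a for loop and lapply concrete, here is a minimal sketch reusing i and check() from the block above; both forms do the same work, but lapply builds the results list for you:

# for loop: you manage the result container yourself
res <- vector("list", length(i))
for (k in seq_along(i)) {
  res[[k]] <- check(i[k])
}

# lapply: equivalent, with the results list assembled automatically
res <- lapply(i, check)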

Edit 2

This edit is meant to show how to export functions or variables, per the submitter's issue. Since the submitter has not made available the functions or variables needed within the parallelization, this is a generic example.

library(parallel)
cl <- makeCluster(detectCores())

# Load packages on the cluster's R sessions
clusterEvalQ(cl, library(pkgname))

# Define a function and a variable in the main session
func <- function(a){
  out <- a * a
  return(out)
}

v <- 1:10

# Export the function and variable to the cluster's R sessions
clusterExport(cl, c("func", "v"))
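
As a quick, hypothetical sanity check, the workers can now evaluate the exported objects; remember to shut the cluster down afterwards:

# Each worker evaluates func() on an element of the exported v
clusterApply(cl, 1:4, function(i) func(v[i]))

stopCluster(cl)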