Overview
I have 165,000 *.rds files, each around 200,000 obs x 13 variables. Each file corresponds to a unique grid cell (2.5 square miles of data), so the files must stay separated by their grid numbers. Currently I load each file individually, apply a function to the data, and resave it to another directory; this took up to a week to process, and I'm trying to improve the code to run in parallel. I recently started 10 separate RStudio sessions and divided the grids among them. That worked fine and took only a day to run through, but it is not very efficient, and I would like to get the job running in parallel within a single R session.
The following code is a simplified example that closely resembles the basic function of the loop. It is not the exact code, since posting all the files and all the code would not be useful.
Question
Is this the proper way to run in parallel when loading individual files, applying a function to the data, and re-saving them to a directory? It seems to work, but I don't know whether it is the correct approach.
Code
# Parallel backend with 4 workers
library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)
# Function: read a file, transform it, and write it to ./temp/
check <- function(x){
  df <- readRDS(paste0("./", x))
  df$var <- df$var*2
  saveRDS(df, paste0("./temp/", x))
}
# Directory where the *.rds files are (filter so ./temp and other
# non-rds entries are not picked up)
files <- list.files("./", pattern = "\\.rds$")
# Sequential for loop
for (i in files){
  check(i)
}
# Parallel foreach loop
foreach(i = files) %dopar% check(i)
# Shut down the workers when done
stopCluster(cl)
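For reference, here is a minimal sketch of the same job using base parallel::parLapply instead of foreach, assuming the same check() function and directory layout as above. Because PSOCK workers start with empty environments, check() has to be exported to them first.

library(parallel)
cl <- makeCluster(4)
# Ship the function to the worker processes
clusterExport(cl, "check")
files <- list.files("./", pattern = "\\.rds$")
# check() is called for its side effects (writing files), so discard the result
invisible(parLapply(cl, files, check))
stopCluster(cl)

On Linux or macOS, parallel::mclapply(files, check, mc.cores = 4) should do the same thing with forked workers, in which case no export step is needed.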