New Approach
My previous question may have had a little too much going on and I realized I could simplify the problem by constructing the data a bit differently thanks to @Gentian Kasa . Previously, the code was filtering data constantly and causing a big bottle neck in the processing time. I have now constructed the data in a way that both the main station, and local stations have the same number of days, so instead of filtering the code it now simply processes through the data.frames.
Problem
There is 1 main station (df) and 3 local stations (s) stacked in a single data.frame with values for three days. The idea is to take each day from the main station, find the relative anomaly of the three local stations, and smooth it using inverse distance weighting (IDW) from the phylin package. This is then applied to the value in the main station by multiplication.
This code is working fine and it is certainly an improvement from before, but I would like to see if there is a better/faster way using an optimized package/method (e.g. data.table, dplyr, apply). I still don't know how to approach this problem without the cumbersome for loop.
The original data set has around 19,000 days, with 3 different variables, for 20,000 stations totaling 1.14 trillion observations. You can imagine how long this might take -- prior estimates were at 14 days;although, I have not checked with this updated code.
Data
Main Station : df
id lat long year month day value
1 12345 100 50 1900 1 1 54.87800
2 12345 100 50 1900 1 2 106.96603
3 12345 100 50 1900 1 3 98.31988
Local Stations: s
id lat long year month day value
1 USC00031152 33.5900 -92.8236 1900 1 1 63.31576
2 USC00034638 34.7392 -90.7664 1900 1 1 86.04906
3 USC00036352 35.2833 -93.1000 1900 1 1 76.50639
4 USC00031152 33.5900 -92.8236 1900 1 2 71.37608
5 USC00034638 34.7392 -90.7664 1900 1 2 89.91196
6 USC00036352 35.2833 -93.1000 1900 1 2 76.35352
7 USC00031152 33.5900 -92.8236 1900 1 3 53.72596
8 USC00034638 34.7392 -90.7664 1900 1 3 61.79896
9 USC00036352 35.2833 -93.1000 1900 1 3 85.89112
dput
s <- structure(list(id = c("USC00031152", "USC00034638", "USC00036352",
"USC00031152", "USC00034638", "USC00036352", "USC00031152", "USC00034638",
"USC00036352"), lat = c(33.59, 34.7392, 35.2833, 33.59, 34.7392,
35.2833, 33.59, 34.7392, 35.2833), long = c(-92.8236, -90.7664,
-93.1, -92.8236, -90.7664, -93.1, -92.8236, -90.7664, -93.1),
year = c(1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900,
1900), month = c(1, 1, 1, 1, 1, 1, 1, 1, 1), day = c(1L,
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), value = c(63.3157576809045,
86.0490598902219, 76.506386949066, 71.3760752788486, 89.9119576975542,
76.3535163951321, 53.7259645981243, 61.7989638892985, 85.8911224149051
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-9L), .Names = c("id", "lat", "long", "year", "month", "day",
"value"))
df <- structure(list(id = c(12345, 12345, 12345), lat = c(100, 100,
100), long = c(50, 50, 50), year = c(1900, 1900, 1900), month = c(1,
1, 1), day = 1:3, value = c(54.8780020601509, 106.966029162171,
98.3198828955801)), row.names = c(NA, -3L), class = "data.frame", .Names = c("id",
"lat", "long", "year", "month", "day", "value"))
Code
library(phylin)
nearest <- function(i, loc){
# Stack 3 local stations
stack <- s[loc:(loc+2),]
# Get 1 main station
station <- df[i,]
# Check for NA and build relative anomaly (r)
stack <- stack[!is.na(stack$value),]
stack$r <- stack$value/station$value
# Use IDW and return v
v <- as.numeric(ifelse(dim(stack)[1] == 1,
stack$r,
idw(stack$r, stack[,c(2,3,8)], station[,2:3])))
return(v)
}
ncdc <- 1
for (i in 1:nrow(df)){
# Get relative anomaly from function
r <- nearest(i, ncdc)
# Get value from main station and apply anomaly
p <- df[i,7]
df[i,7] <- p*r
# Iterate to next 3 local stations
ncdc <- ncdc + 3
}
Output
id lat long year month day value
1 12345 100 50 1900 1 1 75.40086
2 12345 100 50 1900 1 2 79.31592
3 12345 100 50 1900 1 3 67.12082
Nof local stations? Are the two setsMandNdistinct, identical, or do they overlap? \$\endgroup\$Nstations. You pick one of them as the main station and consider all the other as local (or maybe only a subset: the 3 closest or those within a specified distance of the main station). But the important point is that you will end up repeating the process for each of theNstations. So my point is that it would be useful to compute aN-by-Nmatrix of distances to start with. Do I understand correctly? \$\endgroup\$Mmain stations andNlocal stations. EachMstation has 3Nlocal stations associated with it that are the closest by proximity. ThenIDWis computed to smooth out relative anomalies for each day of the year. After thisMstation has all their relative anomalies applied to thevaluethe process moves to the nextMstation. Is this more clear? \$\endgroup\$