Consider a "large-ish" data set (~2-5M rows) that goes through multiple stages of cleaning/processing:
library(dplyr)
largedat %>%
  mutate(
    # overwrite v1 based on others
    v1 = somefunc(v1, v2, v3, v4),
    errv2 = anotherfunc(v2, v5)
  ) %>%
  group_by(v5) %>%
  mutate(
    v6 = otherfunc(v7, v8, v9),
    errv7 = fourthfunc(v7, v9)
  ) %>%
  ungroup() %>%
  mutate(
    v2 = if_else(errv2, NA, v2),
    v7 = if_else(errv7, NA, v7)
  )
With some hand-waving, assume there is sufficient need to keep things broken out like this (and that some portions might be faster if done manually in base R). The functions here are clearly "functional": they have no side effects, are given explicit vectors as arguments, and return a vector of the same length (or of length 1). In that sense, they are clean. The downside is the potential for a lot of copying of the data, depending on the operations involved.
With data.table, in-place operations are standard: the side effect is by design, an intentional decision that provides considerable improvements in memory use and speed.
A more "functional" approach is still quite possible:
library(data.table)
setDT(largedat)
largedat[, newv1 := somefunc(v1, v2, v3, v4)]
errv2 <- largedat[, anotherfunc(v2,v5)]
# grouped by v5, to match the group_by(v5) step in the dplyr version
largedat[, v6 := otherfunc(v7, v8, v9), by = "v5"]
# ...
# eventually using the changes
largedat[, c("v2", "v7") := list(ifelse(errv2, NA, v2), ifelse(errv7, NA, v7)) ]
This preserves the functional, side-effect-free use of the functions, but it can be slightly cumbersome. If at least one of these functions returns a full data.table instead of just a vector, it gets a little more complicated, especially when grouping with by="...", since the returned result is ordered by group rather than in the original row order (ref: https://stackoverflow.com/q/11680579/3358272).
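To illustrate the ordering concern, here is a minimal sketch on toy data. The values and the body of fourthfunc are invented for illustration (the real function is not shown in this question); only the column names follow the example above.

```r
library(data.table)

# Toy stand-in for largedat; the values are made up
dt <- data.table(v5 = c("b", "a", "b", "a"), v7 = c(1, 2, 3, 4))

# Hypothetical flagging function: TRUE where a value exceeds its group mean
fourthfunc <- function(x) x > mean(x)

# Functional style: rows come back ordered by group (first appearance),
# so res$errv7 does not line up with the rows of dt
res <- dt[, .(errv7 = fourthfunc(v7)), by = "v5"]

# Assignment by reference with `by`: values land on the original rows
dt[, errv7 := fourthfunc(v7), by = "v5"]

res$errv7  # FALSE TRUE FALSE TRUE  (group "b" first, then group "a")
dt$errv7   # FALSE FALSE TRUE TRUE  (original row order)
```

So a grouped result captured as a free-standing vector cannot be reused safely in a later ungrouped step unless the original row order is restored somehow, whereas `:=` with `by` sidesteps the issue at the cost of the side effect.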
Another attempt might be to adapt the functions to be in-place operators, something like:
somefunc(largedat) # replaces v1
anotherfunc(largedat) # optionally nullifies v2
# ...
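A minimal sketch of what such an in-place function might look like, assuming a made-up body for somefunc (v1 + v2 is purely illustrative). It relies on `:=` modifying the caller's table by reference, and on copy() as the way for a caller to opt out of the side effect.

```r
library(data.table)

# Toy stand-in for largedat; the values are made up
dt <- data.table(v1 = c(1, 2, 3), v2 = c(10, 20, 30))

# Hypothetical in-place version: modifies its argument by reference
somefunc <- function(DT) {
  DT[, v1 := v1 + v2]
  invisible(DT)  # returned invisibly so calls can still be chained
}

somefunc(dt)     # no assignment needed; dt$v1 is now 11 22 33

# A caller who wants to keep the original passes an explicit copy
safe <- somefunc(copy(dt))
safe$v1          # 21 42 63
dt$v1            # still 11 22 33
```

The trade-off is visible here: the call site gives no hint that dt has changed, so the function's name and documentation have to carry that information.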
or perhaps
out <- largedat[, somefunc(.SD)
  ][, anotherfunc(.SD)
  ][, otherfunc(.SD), by = "v5"
  ][, fourthfunc(.SD), by = "v5"]
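For the chained form, a sketch of one grouped step, again with an invented body for otherfunc. As far as I understand it, .SD is locked inside j, so `:=` cannot be applied to it directly. The function here therefore builds and returns a new table, which means each link in the chain allocates a fresh result instead of working in place.

```r
library(data.table)

# Toy stand-in for largedat; the values are made up
dt <- data.table(v5 = c("a", "a", "b"), v7 = c(1, 5, 3))

# Hypothetical: takes the per-group .SD and returns a new table,
# leaving .SD itself untouched
otherfunc <- function(sd) {
  sd[, .(v7, v6 = v7 - mean(v7))]
}

out <- dt[, otherfunc(.SD), by = "v5"]

names(out)  # "v5" "v7" "v6"
out$v6      # -2 2 0
names(dt)   # "v5" "v7" (the original is unchanged)
```

Note that with by = "v5" the grouping column is not part of .SD, so a function written this way cannot see v5 unless it is passed in separately.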
For simple projects, whatever works (reliably) is often best, but for longer-living packages where flexibility and reliability are required, are there distinct (dis)advantages to the in-place side-effect-based functions as used in the last two code samples?