
I am trying dtplyr & data.table for the first time to speed up my existing dplyr code.

Issue: if I use a data.table / dtplyr data object, I am unable to plot with ggplot. If I convert the data.table / dtplyr object into a tibble just before plotting in the pipe/chain, ggplot works, but then the pipeline takes even more time than working with a data.frame/tibble throughout, as shown later in this post.

library(tidyverse)
library(dtplyr)
library(data.table)
library(scales)   # comma()
library(tidytext) # reorder_within(), scale_y_reordered()
library(ggthemes) # scale_fill_tableau()
library(lubridate)
library(bench)

My Code attempts & time benchmarks:

data:

data.frame object

df_ind_stacked_daily <- read.csv(url("https://raw.githubusercontent.com/johnsnow09/covid19-df_stack-code/main/df_ind_stacked_daily.csv")) %>% 
  mutate(Date = ymd(Date))

data.table object

df_ind_stacked_daily2 <- setDT(df_ind_stacked_daily)

Plot with data.table/dtplyr object:

df_ind_stacked_daily2 %>% 
  filter(Daily_cases_type == "Daily_confirmed",
         Date >= max(Date) - 6 & Date <= max(Date),
         State.UnionTerritory != "India") %>%
  group_by(Date) %>%
  slice_max(order_by = Daily_cases_counts, n = 10) %>% 
  ungroup() %>% 
  # as_tibble() %>%
  ggplot(aes(x = Daily_cases_counts, 
             y = reorder_within(State.UnionTerritory, 
                                by = Daily_cases_counts, within = Date),
             fill = State.UnionTerritory)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Date, scales = "free_y") +
  geom_text(aes(label = Daily_cases_counts), size = 3, color = "white",
            hjust = 1.2) +
  # theme_minimal() +
  theme(legend.position = "none") +
  scale_x_continuous(labels = comma) + # or unit_format(scale = 1e-3, unit = "k")
  scale_fill_tableau(palette = "Tableau 20") +
  scale_y_reordered() +
  coord_cartesian(clip = "off")

Error: `data` must be a data frame, or other object coercible by `fortify()`, not an S3 object with class dtplyr_step_group/dtplyr_step.

P.S. If I uncomment as_tibble() in the above code chunk, ggplot works.
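For anyone hitting the same error, here is a minimal sketch of the failure mode, using mtcars as a stand-in for the COVID data (so the dataset and column names are illustrative only): with dtplyr attached, the pipe builds a lazy dtplyr_step object rather than a data frame, and collecting it with as_tibble(), collect() or as.data.table() is what lets ggplot accept it.

library(data.table)
library(dtplyr)
library(dplyr)
library(ggplot2)

# The pipeline below is only a recipe of translated steps, not data
lazy <- lazy_dt(as.data.table(mtcars)) %>%
  filter(cyl == 4)

class(lazy)
#> "dtplyr_step_subset" "dtplyr_step" (or similar) -- not coercible by fortify()

# ggplot(lazy, aes(mpg, wt)) + geom_point()   # errors, as above

# Collecting the result first makes ggplot accept it:
ggplot(as_tibble(lazy), aes(mpg, wt)) + geom_point()
# collect(lazy) and as.data.table(lazy) work the same way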


Code Time Benchmarks:

  1. data.table/dtplyr object without converting to tibble
library(bench)

bench::mark(
  df_ind_stacked_daily2 %>% 
    
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India"
    ) %>%
    
    group_by( Date) %>%
    
    slice_max(order_by = Daily_cases_counts, n = 10) %>% 
    ungroup() 
    # as_tibble()
)
expression       min    median itr/sec
<S3: bench_expr> 2.45ms 2.75ms 320.3396
  2. data.table/dtplyr object after converting to tibble
library(bench)

bench::mark(
  df_ind_stacked_daily2 %>% 
    
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India"
    ) %>%
    
    group_by( Date) %>%
    
    slice_max(order_by = Daily_cases_counts, n = 10) %>% 
    ungroup() %>%
    as_tibble()
)
expression       min    median itr/sec
<S3: bench_expr> 12.7ms 14ms   65.41098
  3. data.frame or tibble object
library(bench)

bench::mark(
  df_ind_stacked_daily %>% 
    
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India"
    ) %>%
    
    group_by( Date) %>%
    
    slice_max(order_by = Daily_cases_counts, n = 10) %>% 
    ungroup()
)
expression       min    median itr/sec
<S3: bench_expr> 6.71ms 7.97ms   120.3636

Question: How can I make ggplot work with a data.table / dtplyr object without converting it to a data.frame / tibble?


                               ############################

(UPDATE: Response to Answer)

Thanks @teunbrand. I am mostly using your code below, added another function alongside it, and ran everything in 3 scenarios.

I have created two functions: (1) one that performs the processing and does not coerce to a tibble, (2) one that coerces to a tibble after processing.

And I ran these in 3 scenarios overall: (1) data.table, (2) data.table converted to a tibble after processing, (3) using a tibble from the beginning.

# 1. Function that does not convert to tibble
fun <- function(x) {
  x %>%
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India") %>%
    group_by(Date) %>%
    slice_max(order_by = Daily_cases_counts, n = 10) %>%
    ungroup() # no as_tibble() here
}

# 2. Function that converts to tibble after all processing
fun_to_tbl <- function(x) {
  x %>%
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India") %>%
    group_by(Date) %>%
    slice_max(order_by = Daily_cases_counts, n = 10) %>%
    ungroup() %>%
    as_tibble() # always coerce to tibble
}


# Make data larger
dt  <- do.call(rbind, rep(list(as.data.table(df_ind_stacked_daily)), 20))
tbl_df <- do.call(rbind, rep(list(as_tibble(df_ind_stacked_daily)), 20))

# Run data.table on single thread
setDTthreads(1)

For unknown reasons my benchmarks didn't run simultaneously in a single bench::mark() call, so I had to run them one by one.

(bm <- bench::mark(
  dt_res = fun(dt), # bench dt
  min_iterations = 20
))

expression       min    median itr/sec    mem_alloc
<S3: bench_expr> 4.35ms 6.05ms   148.1923 5.12KB
(bm <- bench::mark(
  dt_to_tbl_res = fun_to_tbl(dt), # bench dt converted to tibble at end
  min_iterations = 20
))

expression       min    median itr/sec    mem_alloc
<S3: bench_expr> 65.8ms 72.2ms   12.28566 47.6MB

(bm <- bench::mark(
  tbl_res =  fun(tbl_df),   # bench tbl
  min_iterations = 20
))

expression       min    median itr/sec  mem_alloc
<S3: bench_expr> 55ms 67.8ms   13.70603 47.4MB

Objective: My main objective was to incorporate this into a shiny app with dynamic variable selection, so I wanted to optimize it with data.table. But I guess there is no way for ggplot to work with these lazy S3 objects without collecting them first.

And the only time difference I am getting is when I use data.table and keep it as a data.table throughout; otherwise there is no benefit.
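For the shiny objective, one pattern worth noting is to collect the lazy pipeline once inside a reactive() and let renderPlot() reuse the materialised result, so the coercion cost is paid once per data update rather than on every re-render. This is a hypothetical sketch, not code from the discussion above: the UI layout, output id, and the reuse of fun() and dt from the benchmarks are assumptions.

library(shiny)

# Assumes fun() and dt from the benchmarks above, plus the
# tidyverse/dtplyr/data.table libraries, are already loaded.
ui <- fluidPage(plotOutput("top10"))

server <- function(input, output, session) {
  # Collect once: runs the data.table steps and coerces a single time
  top_states <- reactive({
    as_tibble(fun(dt))
  })

  output$top10 <- renderPlot({
    ggplot(top_states(),
           aes(x = Daily_cases_counts, y = State.UnionTerritory)) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~Date, scales = "free_y")
  })
}

shinyApp(ui, server)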


1 Answer


There are a few observations to be made here:

  1. As far as I understand dtplyr, along your piping chain it accumulates operations that aren't evaluated; they are just translated from dplyr to data.table syntax. Until you realise your pipe as a data.frame, data.table or tibble, your computer doesn't run the operations (see the sketch after this list). This underestimates the runtime of your first benchmark.

  2. Because you're using setDT() to convert a data.frame to a data.table, what you are benchmarking as a data.frame is not actually a data.frame. If you read the documentation in ?setDT, you'll see that the object is converted in memory and without copying, meaning that your df_ind_stacked_daily is also a data.table.

  3. The data.table package makes use of multiple threads by default. We should prevent this to make a fair comparison.

  4. Your first filtering operation goes from medium-sized data (75748 rows) to small (252 rows). For the majority of your pipe you are not working with a lot of data, whereas large data is where data.table shines.
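To make points 1 and 2 concrete, here is a small sketch on mtcars (an illustration, not part of the original benchmarks): show_query() prints the translated data.table code without running it, and data.table::address() confirms that setDT() returns the very object it was given.

library(dtplyr)
library(dplyr)
library(data.table)

# Point 1: a lazy_dt pipeline is translated, not evaluated, until collected
lazy_dt(mtcars) %>%
  filter(cyl == 4) %>%
  group_by(gear) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  show_query()
#> prints something like:
#> `_DT1`[cyl == 4, .(mean_mpg = mean(mpg)), keyby = .(gear)]

# Point 2: setDT() converts in place, so both names point at one object
df <- data.frame(x = 1:3)
dt <- setDT(df)
address(df) == address(dt)  # TRUE: no copy was made
class(df)                   # "data.table" "data.frame": df is a data.table now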

Adjusting for some of these things, I find that there is no difference in speed.

library(tidyverse)
library(dtplyr)
library(data.table)
library(lubridate)
library(bench)

df <- read.csv(url("https://raw.githubusercontent.com/johnsnow09/covid19-df_stack-code/main/df_ind_stacked_daily.csv")) %>% 
  mutate(Date = ymd(Date))

fun <- function(x) {
  x %>%
    filter(Daily_cases_type == "Daily_confirmed",
           Date >= max(Date) - 6 & Date <= max(Date),
           State.UnionTerritory != "India"
    ) %>%
    
    group_by( Date) %>%
    
    slice_max(order_by = Daily_cases_counts, n = 10) %>% 
    ungroup() %>%
    as_tibble() # Always coerce to tibble
}

# Make data larger
dt  <- do.call(rbind, rep(list(as.data.table(df)), 20))
tbl <- do.call(rbind, rep(list(as_tibble(df)), 20))

# Run data.table on single thread
setDTthreads(1)

# Benchmark simultaneously
(bm <- bench::mark(
  dt = fun(dt),
  tbl = fun(tbl),
  min_iterations = 20
))
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dt           41.1ms   42.5ms      23.4    72.2MB     35.2
#> 2 tbl          40.7ms   41.5ms      24.0      71MB     36.0
plot(bm)

Created on 2021-08-19 by the reprex package (v1.0.0)


Comments

Thanks @teunbrand for putting in so much effort, detail & explanation. I have tried your code, added another function alongside it, and updated the post (UPDATE: Response to Answer). Basically I wanted to optimize this for my dynamic shiny plots, but if ggplot doesn't accept data.table then the whole point is gone.
ggplot2 does accept regular data.table objects; it's just that dtplyr wraps data.tables in another class with lazy_dt() that does not go directly into ggplot2, because it has unevaluated computations that get resolved upon coercion. The ?lazy_dt documentation explains a bit more.
Ohhh I see, that means working entirely in data.table instead of lazy_dt() for shiny plots can be really fruitful for time optimization.
I don't think it will make much of a difference with this size of data: data.table is benchmarked on millions of rows. The most time-consuming step in your code is probably the filtering, and you may get some minor speed benefit if you do the Daily_cases_type == "Daily_confirmed" step first instead of simultaneously.
Yes, for this single plot alone it may not be a lot, but I have multiple plots on a shiny page created from multiple datasets, so the combined time saved across all of them may reduce initial page loading time by a few seconds. And I may not be using large data now, but in future I should consider this approach for larger datasets. And I forgot to say it earlier: I really appreciate your help in learning this.
