Speed up duplicate detection in as_epi_df() #560

@brookslogan

Description

When writing some code for archive-to-archive slides, as_epi_df() was taking most of the time. I can/should probably avoid that with new_epi_df() or an as_epi_df.data.table method, but it'd still be nice to speed this up in case we/users want the convenience/security of as_epi_df().

Most of the time in as_epi_df appears to be spent in duplicate detection:
[screenshot: profiling output showing the duplicate check dominating as_epi_df() runtime]
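(For reference, a sketch of how a profile like that could be reproduced; the exact profiling call isn't from the issue, and this assumes profvis is installed:)

library(profvis)
library(epiprocess)
library(dplyr)

# Re-run the full tibble -> epi_df conversion, including the duplicate check.
profvis(as_epi_df(as_tibble(covid_case_death_rates_extended)))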

Here's some limited testing of duplicate-check approaches; it looks like we can speed the check up by >50x, at least for "medium"-sized inputs.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(epiprocess)
#> Registered S3 method overwritten by 'tsibble':
#>   method               from 
#>   as_tibble.grouped_df dplyr
#> 
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#> 
#>     filter

dup_check1 <- function(x, other_keys) {
  # Baseline dplyr approach: group by the epikey columns and keep rows
  # belonging to any group of size > 1.
  duplicated_time_values <- x %>%
    group_by(across(all_of(c("geo_value", "time_value", other_keys)))) %>%
    filter(dplyr::n() > 1) %>%
    ungroup()
  nrow(duplicated_time_values) != 0
}

dup_check2 <- function(x, other_keys) {
  # Base R: anyDuplicated() on just the key columns.
  anyDuplicated(x[c("geo_value", "time_value", other_keys)]) != 0L
}

dup_check3 <- function(x, other_keys) {
  if (nrow(x) <= 1L) {
    # 0 or 1 rows can't contain a duplicate.
    FALSE
  } else {
    # Sort by the epikey columns, then compare each row to its predecessor;
    # any equal adjacent pair means a duplicated key.
    epikeytime_names <- c("geo_value", "time_value", other_keys)
    arranged <- arrange(x, across(all_of(epikeytime_names)))
    arranged_epikeytimes <- arranged[epikeytime_names]
    any(vctrs::vec_equal(
      arranged_epikeytimes[-1L, ],
      arranged_epikeytimes[-nrow(arranged_epikeytimes), ]
    ))
  }
}

test_tbl <- as_tibble(covid_case_death_rates_extended)

bench::mark(
  dup_check1(test_tbl, character()),
  dup_check2(test_tbl, character()),
  dup_check3(test_tbl, character())
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dup_check1(test_tbl, character… 295.55ms 299.13ms      3.34        NA     13.4
#> 2 dup_check2(test_tbl, character… 168.25ms 170.59ms      5.85        NA     21.5
#> 3 dup_check3(test_tbl, character…   4.09ms   4.56ms    194.          NA     22.0

Created on 2024-10-31 with reprex v2.1.1

vctrs::vec_equal() should keep this pretty general, though I don't know how it compares speed-wise to less general approaches.
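(To illustrate the generality with a toy example not from the issue: vec_equal() compares data frames row-wise and understands vctrs types, so the adjacent-row comparison in dup_check3() should handle key columns like Dates, not just character vectors.)

a <- tibble::tibble(
  geo_value = c("ak", "al"),
  time_value = as.Date(c("2020-03-01", "2020-03-01"))
)
b <- tibble::tibble(
  geo_value = c("ak", "al"),
  time_value = as.Date(c("2020-03-01", "2020-03-02"))
)
# Row-wise comparison: row 1 matches, row 2 differs in time_value.
vctrs::vec_equal(a, b)
#> [1]  TRUE FALSE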

I'm not immediately PR-ing this because it probably needs a bit more correctness and performance testing on inputs of different sizes.
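(A rough sketch of what that testing could look like; the synthetic inputs, sizes, and the make_input() helper are made up for illustration, and this assumes tidyr and bench are available:)

make_input <- function(n_geos, n_times, dup = FALSE) {
  # Full geo x time grid with a value column; optionally re-append the
  # first row to inject a single duplicated key.
  tbl <- tidyr::expand_grid(
    geo_value = sprintf("geo%05d", seq_len(n_geos)),
    time_value = as.Date("2020-01-01") - 1L + seq_len(n_times)
  ) %>%
    mutate(value = rnorm(dplyr::n()))
  if (dup) tbl <- bind_rows(tbl, tbl[1L, ])
  tbl
}

# Correctness: all three checks should agree, with and without a duplicate.
for (dup in c(FALSE, TRUE)) {
  tbl <- make_input(50L, 100L, dup)
  stopifnot(
    dup_check1(tbl, character()) == dup,
    dup_check2(tbl, character()) == dup,
    dup_check3(tbl, character()) == dup
  )
}

# Performance across input sizes:
bench::press(
  n_geos = c(10L, 100L, 1000L),
  {
    tbl <- make_input(n_geos, 365L)
    bench::mark(
      dup_check1(tbl, character()),
      dup_check2(tbl, character()),
      dup_check3(tbl, character())
    )
  }
)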
