Speed up duplicate detection in as_epi_df() #560

@brookslogan

Description

When writing some code for archive-to-archive slides, as_epi_df() was taking most of the time. I can/should probably avoid that with new_epi_df() or an as_epi_df.data.table method, but it'd still be nice to speed this up in case we/users want the convenience/security of as_epi_df().

Most of the time in as_epi_df appears to be spent in duplicate detection:
[screenshot: profiling output showing the duplicate check dominating as_epi_df() runtime]
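(For reference, a sketch of how a profile like that could be reproduced; the exact profiling call isn't from the issue, and this assumes profvis is installed:)

library(profvis)
library(epiprocess)
library(dplyr)

# Re-run the full tibble -> epi_df conversion, including the duplicate check.
profvis(as_epi_df(as_tibble(covid_case_death_rates_extended)))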

Here's some limited testing of duplicate-check approaches; it looks like we can speed the check up by >50x, at least for "medium"-sized inputs.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(epiprocess)
#> Registered S3 method overwritten by 'tsibble':
#>   method               from 
#>   as_tibble.grouped_df dplyr
#> 
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#> 
#>     filter

dup_check1 <- function(x, other_keys) {
  # Baseline dplyr approach: group by the epikey columns and keep rows
  # belonging to any group of size > 1.
  duplicated_time_values <- x %>%
    group_by(across(all_of(c("geo_value", "time_value", other_keys)))) %>%
    filter(dplyr::n() > 1) %>%
    ungroup()
  nrow(duplicated_time_values) != 0
}

dup_check2 <- function(x, other_keys) {
  # Base R: anyDuplicated() on just the key columns.
  anyDuplicated(x[c("geo_value", "time_value", other_keys)]) != 0L
}

dup_check3 <- function(x, other_keys) {
  if (nrow(x) <= 1L) {
    # 0 or 1 rows can't contain a duplicate.
    FALSE
  } else {
    # Sort by the epikey columns, then compare each row to its predecessor;
    # any equal adjacent pair means a duplicated key.
    epikeytime_names <- c("geo_value", "time_value", other_keys)
    arranged <- arrange(x, across(all_of(epikeytime_names)))
    arranged_epikeytimes <- arranged[epikeytime_names]
    any(vctrs::vec_equal(
      arranged_epikeytimes[-1L, ],
      arranged_epikeytimes[-nrow(arranged_epikeytimes), ]
    ))
  }
}

test_tbl <- as_tibble(covid_case_death_rates_extended)

bench::mark(
  dup_check1(test_tbl, character()),
  dup_check2(test_tbl, character()),
  dup_check3(test_tbl, character())
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dup_check1(test_tbl, character… 295.55ms 299.13ms      3.34        NA     13.4
#> 2 dup_check2(test_tbl, character… 168.25ms 170.59ms      5.85        NA     21.5
#> 3 dup_check3(test_tbl, character…   4.09ms   4.56ms    194.          NA     22.0

Created on 2024-10-31 with reprex v2.1.1

vctrs::vec_equal() should keep this pretty general, though I don't know how it compares speed-wise to less general approaches.
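(To illustrate the generality with a toy example not from the issue: vec_equal() compares data frames row-wise and understands vctrs types, so the adjacent-row comparison in dup_check3() should handle key columns like Dates, not just character vectors.)

a <- tibble::tibble(
  geo_value = c("ak", "al"),
  time_value = as.Date(c("2020-03-01", "2020-03-01"))
)
b <- tibble::tibble(
  geo_value = c("ak", "al"),
  time_value = as.Date(c("2020-03-01", "2020-03-02"))
)
# Row-wise comparison: row 1 matches, row 2 differs in time_value.
vctrs::vec_equal(a, b)
#> [1]  TRUE FALSE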

I'm not immediately PR-ing this because it probably needs a bit more correctness and performance testing on inputs of different sizes.
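(A rough sketch of what that testing could look like; the synthetic inputs, sizes, and the make_input() helper are made up for illustration, and this assumes tidyr and bench are available:)

make_input <- function(n_geos, n_times, dup = FALSE) {
  # Full geo x time grid with a value column; optionally re-append the
  # first row to inject a single duplicated key.
  tbl <- tidyr::expand_grid(
    geo_value = sprintf("geo%05d", seq_len(n_geos)),
    time_value = as.Date("2020-01-01") - 1L + seq_len(n_times)
  ) %>%
    mutate(value = rnorm(dplyr::n()))
  if (dup) tbl <- bind_rows(tbl, tbl[1L, ])
  tbl
}

# Correctness: all three checks should agree, with and without a duplicate.
for (dup in c(FALSE, TRUE)) {
  tbl <- make_input(50L, 100L, dup)
  stopifnot(
    dup_check1(tbl, character()) == dup,
    dup_check2(tbl, character()) == dup,
    dup_check3(tbl, character()) == dup
  )
}

# Performance across input sizes:
bench::press(
  n_geos = c(10L, 100L, 1000L),
  {
    tbl <- make_input(n_geos, 365L)
    bench::mark(
      dup_check1(tbl, character()),
      dup_check2(tbl, character()),
      dup_check3(tbl, character())
    )
  }
)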
