-
Notifications
You must be signed in to change notification settings - Fork 8
Description
When writing some code for archive to archive slides, as_epi_df was taking most of the time. I can/should probably avoid that with new_epi_df or an as_epi_df.data.table, but it'd probably still be nice to speed this up in case we/users want to have the convenience/security of as_epi_df.
Most of the time in as_epi_df appears to be spent in duplicate detection:

Here's some limited testing on duplicate check approaches; looks like we can speed duplicate checks up by >50x, for "medium"-sized inputs at least.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(epiprocess)
#> Registered S3 method overwritten by 'tsibble':
#> method from
#> as_tibble.grouped_df dplyr
#>
#> Attaching package: 'epiprocess'
#> The following object is masked from 'package:stats':
#>
#> filter
dup_check1 <- function(x, other_keys) {
duplicated_time_values <- x %>%
group_by(across(all_of(c("geo_value", "time_value", other_keys)))) %>%
filter(dplyr::n() > 1) %>%
ungroup()
nrow(duplicated_time_values) != 0
}
dup_check2 <- function(x, other_keys) {
anyDuplicated(x[c("geo_value", "time_value", other_keys)]) != 0L
}
dup_check3 <- function(x, other_keys) {
if (nrow(x) <= 1L) {
FALSE
} else {
epikeytime_names <- c("geo_value", "time_value", other_keys)
arranged <- arrange(x, across(all_of(epikeytime_names)))
arranged_epikeytimes <- arranged[epikeytime_names]
any(vctrs::vec_equal(arranged_epikeytimes[-1L,], arranged_epikeytimes[-nrow(arranged_epikeytimes),]))
}
}
test_tbl <- as_tibble(covid_case_death_rates_extended)
bench::mark(
dup_check1(test_tbl, character()),
dup_check2(test_tbl, character()),
dup_check3(test_tbl, character())
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dup_check1(test_tbl, character… 295.55ms 299.13ms 3.34 NA 13.4
#> 2 dup_check2(test_tbl, character… 168.25ms 170.59ms 5.85 NA 21.5
#> 3 dup_check3(test_tbl, character… 4.09ms 4.56ms 194. NA 22.0Created on 2024-10-31 with reprex v2.1.1
vctrs::vec_equal should keep this pretty general, though I don't know how it compares to less general approaches speed-wise.
I'm not immediately PR-ing this because it probably needs a bit more correctness and performance testing on different sizes.