Make backfill corrections faster pt 1 #1802

nmdefries · 2023-03-02T22:22:07Z

Description

Starting to identify places to make backfill corrections faster.

Reduce the number of calls in QC to functions that check for row duplication.
Pipes, rlang pronouns, and str_interp are slower than base/standard approaches, so remove them where they don't make the code more readable.
Date columns are slow to handle in various operations because they are repeatedly converted back to string format. For speed, convert them from string to Date only after QC happens. (We may want to keep date columns in string format the entire time and convert to Date explicitly when needed, but that needs more investigation.)
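
As a rough illustration of the three changes above, here is a minimal base R sketch. It is not the code from this PR: the data frame, column names, and message text are hypothetical, and the tidyverse forms shown in comments are only there for comparison.

```r
# Hypothetical example data; column names are illustrative only.
df <- data.frame(
  time_value = c("2023-01-01", "2023-01-02", "2023-01-02"),
  issue_date = c("2023-01-05", "2023-01-05", "2023-01-05"),
  value = c(10, 12, 12),
  stringsAsFactors = FALSE
)

# 1. Check row duplication once and reuse the result, rather than
#    calling duplicated() separately inside each QC check.
is_dup <- duplicated(df)
n_dup <- sum(is_dup)
df <- df[!is_dup, , drop = FALSE]

# 2. Build messages with paste0() instead of stringr::str_interp().
#    Comparable tidyverse form: str_interp("Dropped ${n_dup} duplicate rows")
msg <- paste0("Dropped ", n_dup, " duplicate rows")

# 3. Combine filters into one base subsetting step instead of piping
#    through several filter() calls that use the .data pronoun.
#    Comparable tidyverse form:
#    df %>% filter(!is.na(.data$value)) %>% filter(.data$value >= 0)
df <- df[!is.na(df$value) & df$value >= 0, , drop = FALSE]

# 4. Validate dates while they are still strings. ISO 8601 date strings
#    (YYYY-MM-DD) sort lexically in chronological order, so comparisons
#    work without converting to Date first.
iso_pattern <- "^[0-9]{4}-[0-9]{2}-[0-9]{2}$"
stopifnot(all(grepl(iso_pattern, df$time_value)))
stopifnot(all(grepl(iso_pattern, df$issue_date)))
stopifnot(all(df$time_value <= df$issue_date))

# Convert to Date only after QC has passed.
df$time_value <- as.Date(df$time_value)
df$issue_date <- as.Date(df$issue_date)
```

Whether each base form is actually faster depends on the data size and the surrounding code, so any change like this is worth profiling before and after.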

Based on a smaller test example (a full run takes too long to reasonably profile and uses too much memory to profile locally), this makes setup (data read + validation) 3x faster and reduces training/testing runtime by 25%. Setup doesn't take much time, but it is responsible for peak memory usage for the entire pipeline; this change also reduces memory usage there. Since this was run on a smaller dataset, the speedup in production may differ somewhat.

jingjtang

LGTM

nmdefries · 2023-03-13T17:34:47Z

@korlaxxalrok This is ready to merge.

nmdefries added 5 commits February 27, 2023 18:16

msg fn to concat automatically; replace str_interp with paste

96f2ecf

remove rlang pronouns for speed

1202646

remove unnecessary pipes and combine filters

86e1163

perform data validation on dates-as-strings

52f747a

call validation once to reduce calls to "duplicate"

523b1a6

nmdefries marked this pull request as ready for review March 3, 2023 15:44

nmdefries requested a review from jingjtang March 3, 2023 15:44

jingjtang approved these changes Mar 13, 2023

korlaxxalrok merged commit b4caa58 into main Mar 14, 2023

korlaxxalrok deleted the ndefries/backfill/speed branch March 14, 2023 14:30

nmdefries mentioned this pull request Mar 17, 2023

[Backfill corrections] Efficient date handling and rolling averages #1807

Merged

krivard mentioned this pull request Mar 29, 2023

Release covidcast-indicators 0.3.34 #1816

Merged
