@nmdefries nmdefries commented Mar 2, 2023

Description

Starting to identify places to make backfill corrections faster.

  • Reduce the number of calls in QC to functions that check for row duplication.
  • Pipes, rlang pronouns, and str_interp are slower than base/standard approaches, so remove them where they don't make the code more readable.
  • Date columns are slow to handle in various operations because they are constantly being converted back to string format. For speed, convert them from string -> Date only after QC happens. (We may want to keep date columns in string format the entire time and convert to Date explicitly when needed, but that needs more investigation.)
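
As a rough illustration of the pattern described in the list above (not code from this PR): the sketch below checks for duplicated rows once while the date columns are still strings, converts them to Date only afterwards, and builds a message with base `paste0` rather than `str_interp`. The data frame and its column names (`time_value`, `issue_date`, `value`) are hypothetical and chosen only for the example.

```r
# Hypothetical input; column names are illustrative, not taken from the pipeline.
df <- data.frame(
  time_value = c("2023-01-01", "2023-01-01", "2023-01-02"),
  issue_date = c("2023-01-05", "2023-01-05", "2023-01-06"),
  value = c(10, 10, 12),
  stringsAsFactors = FALSE
)

# QC step: check row duplication once, on the string columns,
# using base indexing instead of pipes and rlang pronouns.
is_dup <- duplicated(df[, c("time_value", "issue_date")])
df <- df[!is_dup, , drop = FALSE]

# Convert string -> Date only after QC has finished.
df$time_value <- as.Date(df$time_value, format = "%Y-%m-%d")
df$issue_date <- as.Date(df$issue_date, format = "%Y-%m-%d")

# Base string building in place of stringr::str_interp.
msg <- paste0("kept ", nrow(df), " rows")
```

Whether this ordering is faster on real data would need to be checked by profiling; the example only shows the shape of the change.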

Based on a smaller test example (a full run takes too long to reasonably profile and uses too much memory to profile locally), this reduces setup (data read + validation) runtime by about 3x and training/testing runtime by about 25%. Setup doesn't take much time, but it is responsible for peak memory usage for the entire pipeline; this change also reduces memory usage there. Since this was measured on a smaller dataset, the speedup in production may differ somewhat.

@nmdefries nmdefries marked this pull request as ready for review March 3, 2023 15:44
@nmdefries nmdefries requested a review from jingjtang March 3, 2023 15:44

@jingjtang jingjtang left a comment


LGTM

@nmdefries

@korlaxxalrok This is ready to merge.
