Conversation

@nmdefries (Contributor) commented Oct 10, 2024

Description

Add nchs-mortality raw data backups and backup export utility

Changelog

  • add create_backup_csv fn in delphi_utils/export.py
  • use the utility in nchs_mortality's pull_nchs_mortality_data fn
  • related tests

Associated Issue(s)

Context and writeup

@nmdefries (Contributor, Author) commented Oct 10, 2024

I guess the test is failing (on linting, with delphi_nchs_mortality/pull.py:11:0: E0611: No name 'create_backup_csv' in module 'delphi_utils' (no-name-in-module)) because the new fn is being added to delphi_utils at the same time.

Also, tests for the new create_backup_csv fn still need to be added, but this shows the idea of how it should work. Adding backups to other indicators should be faster once the utility is in place.
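
For concreteness, here is a minimal sketch of how such a utility could fit together, pieced from the snippets quoted in this thread; the exact signature, filename format, and log call are assumptions rather than the merged code.

# Hypothetical sketch of the backup utility; argument names beyond those
# visible in this thread are assumptions.
import gzip
from datetime import datetime
from os.path import join

import pandas as pd


def create_backup_csv(df: pd.DataFrame, backup_dir: str, custom_run: bool,
                      issue=None, geo_res=None, table_name=None,
                      metric=None, sensor=None, logger=None):
    """Save a gzipped CSV backup of the raw data, skipped on custom runs."""
    if custom_run:
        # Don't write backups during ad hoc / custom runs.
        return

    # Label the file with today's date (the date the data was fetched).
    if not issue:
        issue = datetime.today().strftime("%Y%m%d")
    backup_filename = [issue, geo_res, table_name, metric, sensor]
    backup_file = join(
        backup_dir,
        "_".join(part for part in backup_filename if part) + ".csv.gz",
    )

    with gzip.open(backup_file, "wt", newline="") as f:
        df.to_csv(f, index=False, na_rep="NA")

    if logger:
        logger.info("Backup file created at %s", backup_file)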

# Label the file with today's date (the date the data was fetched).
if not issue:
    issue = datetime.today().strftime('%Y%m%d')
backup_filename = [issue, geo_res, table_name, metric, sensor]
@nmdefries (Contributor, Author) commented Oct 11, 2024

suggestion: For simplicity when using the backup data later, I would prefer to compress all tables for a given issue into a single archive.
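
A rough illustration of that suggestion (the helper name and archive layout below are hypothetical, not part of this PR):

# Hypothetical sketch of the single-archive-per-issue idea: bundle every
# table for one issue date into one gzipped tar instead of one file per table.
import io
import tarfile


def write_issue_archive(tables, issue, backup_dir):
    """Write a dict of {table_name: DataFrame} into a single archive."""
    archive_path = f"{backup_dir}/{issue}.tar.gz"
    with tarfile.open(archive_path, "w:gz") as tar:
        for table_name, df in tables.items():
            data = df.to_csv(index=False, na_rep="NA").encode("utf-8")
            info = tarfile.TarInfo(name=f"{issue}_{table_name}.csv")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))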

@minhkhul (Contributor) left a review

Appreciate the custom run flag!

changes requested:

  • Add file compression.
  • Add some logging to note which indicator stashing is done for.
  • Adjust the params.json.template in nchs_mortality as well.

suggestion: When I wrote and ran a similar script to stash the nssp source, the small VM ran out of disk space at one point. To save disk space, apart from adding zipping, I also added a check that compares the dataset against the latest csv.gz already on disk and only saves a new version after confirming there's a difference. That's helpful for a weekly signal like nssp. I think it'd be nice to add that here, but it's not needed.
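
For reference, the kind of check described here might look roughly like the sketch below (names and comparison details are assumptions; as noted later in the thread, this idea was ultimately dropped):

# Hypothetical sketch of the "only save if changed" idea: compare the new
# pull against the most recent .csv.gz on disk and skip the write if equal.
import glob

import pandas as pd


def dataset_changed(df: pd.DataFrame, backup_dir: str) -> bool:
    """Return True if df differs from the most recent backup on disk."""
    previous = sorted(glob.glob(f"{backup_dir}/*.csv.gz"))
    if not previous:
        return True
    old = pd.read_csv(previous[-1])  # compression inferred from .gz suffix
    # Treat any difference in shape or values as a change worth saving.
    return not df.reset_index(drop=True).equals(old.reset_index(drop=True))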

@nmdefries (Contributor, Author) commented Oct 11, 2024

Thanks for your quick feedback @minhkhul!

Add some logging to note which indicator stashing is done for.

Agreed. Related to this, @korlaxxalrok suggested including metadata in each day's backup data, or unique IDs we could use to track the provenance of downstream data. Designing that will likely be too complex, and thus take too long, for getting V1 of data backups out, but it could be very useful in the future.

Adjust the params.json.template in nchs_mortality as well.

I don't have strong feelings about this, but given the default value the custom_run param takes in the code, we don't necessarily need to add it to params.json.
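
That is, something along these lines in the indicator code, where the second argument to .get supplies the default (the exact key layout is an assumption):

# Hypothetical: a default here means params.json can omit the flag entirely.
custom_run = params["common"].get("custom_run", False)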

suggestion: When I wrote and ran a similar script to stash the nssp source, the small VM ran out of disk space at one point. To save disk space, apart from adding zipping, I also added a check that compares the dataset against the latest csv.gz already on disk and only saves a new version after confirming there's a difference. That's helpful for a weekly signal like nssp. I think it'd be nice to add that here, but it's not needed.

Hm, so we've found that saving data like this causes storage issues. Since you refer to a "vm", I wonder if the limit you hit was that of the VM (O(1 GB)) rather than that of the host machine (O(100 GB)). How big is that entire collection of backups?

RE "only sav[ing] the latest new version of the dataset after confirming there's a difference" with the last backup, do we think this is safe/robust enough to do? One initial concern is that this is starting to sound like "archive differ V2". Of course, it's simpler than the current one, but any extra code increases the risk of introducing bugs. To know how to balance the risk, we'd want an estimate of how big the data backups would be.

@minhkhul (Contributor) commented

Yep, I very much agree about the potential for an "archive differ v2" problem. Let's scratch that for now.

@minhkhul previously approved these changes Oct 14, 2024
@minhkhul dismissed their stale review October 18, 2024 18:32

need to add compression

@minhkhul requested a review from nolangormley October 21, 2024 21:33
with gzip.open(backup_file, "wt", newline="") as f:
    df.to_csv(f, index=False, na_rep="NA")

if logger:
@aysim319 (Contributor) commented Oct 22, 2024

Why is logger optional? We want to keep track of whether a backup was created or not, right?

@nmdefries (Contributor, Author) replied

This behavior is also copied from create_export_csv.

A contributor replied:

Also, the pull method originally doesn't take logger as an argument at all, so I wasn't sure if we should force it to within the scope of this PR.

@minhkhul requested a review from aysim319 October 22, 2024 15:55
@minhkhul (Contributor) commented Oct 22, 2024

Also, I've been running this locally daily since yesterday, at the same time as the normal nchs run, and keeping the backup files, so we can take our time with this PR.

@nolangormley (Contributor) left a review

LGTM!

@minhkhul merged commit 3450dfc into main Oct 28, 2024
16 checks passed