Add nchs-mortality raw data backups and backup export utility
#2065
Conversation
I guess the test is failing (on linting, with ...). Also, tests for the new ...
# Label the file with today's date (the date the data was fetched).
if not issue:
    issue = datetime.today().strftime('%Y%m%d')
backup_filename = [issue, geo_res, table_name, metric, sensor]
suggestion: To make the backup data simpler to use later, I'd prefer to compress all tables for a given issue into a single compressed archive.
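Something like this sketch is what I have in mind; the names here (tables, backup_all_tables, backup_dir) are just illustrative, not from this PR:

import os
import zipfile

def backup_all_tables(tables, backup_dir, issue):
    """Write every table for one issue date into a single zip archive."""
    archive_path = os.path.join(backup_dir, f"{issue}.zip")
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for table_name, df in tables.items():
            # One CSV member per table inside the issue's archive.
            zf.writestr(f"{issue}_{table_name}.csv", df.to_csv(index=False, na_rep="NA"))
    return archive_path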
minhkhul
left a comment
Appreciate the custom run flag!
Changes requested:
- Add file compression.
- Add some logging to note which indicator the stashing is done for.
- Adjust the params.json.template in nchs_mortality as well.
suggestion: When I wrote and ran a similar script to stash the nssp source, the small VM ran out of disk space at one point. To save disk space, apart from adding zipping, I also added a feature that checks whether the dataset has changed at all compared to the latest csv.gz on disk, and only saves the new version of the dataset after confirming there is a difference (see the sketch below). It's helpful for a weekly signal like nssp. I think it'd be nice to add, but it's not needed.
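Roughly what I mean, as a sketch only — the file layout, glob pattern, and save_if_changed name are illustrative assumptions, not the nssp code:

import glob
import gzip
import os

def save_if_changed(df, backup_dir, backup_filename):
    """Only write a new backup when the data differs from the latest one on disk."""
    new_csv = df.to_csv(index=False, na_rep="NA")
    existing = sorted(glob.glob(os.path.join(backup_dir, "*.csv.gz")))
    if existing:
        with gzip.open(existing[-1], "rt") as f:
            if f.read() == new_csv:
                # Identical to the most recent backup; skip the write.
                return None
    out_path = os.path.join(backup_dir, backup_filename + ".csv.gz")
    with gzip.open(out_path, "wt", newline="") as f:
        f.write(new_csv)
    return out_path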
Thanks for your quick feedback @minhkhul!
Agreed. Related to this, @korlaxxalrok suggested including metadata in each day's backup data, or unique IDs we can use to track the provenance of downstream data. Designing that will likely be too complex, and thus take too long, to get V1 of the data backups out, but it could be very useful in the future.
I don't have strong feelings about this, but given the default the
Hm, so we've found that saving data like this causes storage issues. Since you refer to a "vm", I wonder if the limit you hit was that of the VM (O(1 GB)) rather than that of the host machine (O(100 GB)). How big is that entire collection of backups? RE "only sav[ing] the latest new version of the dataset after confirming there's a difference" with the last backup, do we think this is safe/robust enough to do? One initial concern is that this is starting to sound like "archive differ V2". Of course, it's simpler than the current one, but any extra code increases the risk of introducing bugs. To know how to balance the risk, we'd want an estimate of how big the data backups would be.
Yep, I very much agree with the potential for an archive differ v2 problem. Let's scratch that for now.
with gzip.open(backup_file, "wt", newline="") as f:
    df.to_csv(f, index=False, na_rep="NA")

if logger:
Why is logger optional? We want to keep track of whether the backup was created or not, right?
This behavior is also copied from create_export_csv.
Also, the original pull method doesn't take logger as a variable at all, so I wasn't sure if we should force it to within the scope of this PR.
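For reference, the pattern being copied looks roughly like this — a simplified sketch, not the exact delphi_utils signature: the logger defaults to None, and logging only happens when the caller supplies one.

import gzip
import os
from datetime import datetime

def create_backup_csv(df, backup_dir, issue=None, logger=None):
    """Write a gzipped CSV backup; log only if a logger was provided."""
    if issue is None:
        # Label the file with today's date (the date the data was fetched).
        issue = datetime.today().strftime("%Y%m%d")
    backup_file = os.path.join(backup_dir, f"{issue}.csv.gz")
    with gzip.open(backup_file, "wt", newline="") as f:
        df.to_csv(f, index=False, na_rep="NA")
    if logger:
        # Mirrors create_export_csv: callers without a logger still work.
        logger.info("Backup file created at %s", backup_file)
    return backup_file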
Also, I've been running this locally every day since yesterday, at the same time as the normal nchs run, and keeping the backup files, so we can take our time with this PR.
nolangormley
left a comment
LGTM!
Description
Add nchs-mortality raw data backups and backup export utility.
Changelog
- create_backup_csv fn in delphi_utils/export.py
- nchs_mortality's pull_nchs_mortality_data fn
Associated Issue(s)
Context and writeup
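As a rough illustration of how the changelog items fit together — parameter names and the stand-in data are assumptions, and this reuses the create_backup_csv sketch above rather than the actual delphi_utils code — the pull function can stash the raw frame before any processing:

import pandas as pd

def pull_nchs_mortality_data(token, backup_dir, custom_run=False, logger=None):
    # Hypothetical stand-in for the real Socrata pull.
    df = pd.DataFrame({"state": ["ak"], "covid_19_deaths": [0]})
    if not custom_run:
        # Stash the raw pull before any cleaning, so the source data
        # can be replayed later (uses the create_backup_csv sketch above).
        create_backup_csv(df, backup_dir, logger=logger)
    # ... the usual cleaning / reshaping of df would follow ...
    return df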