Conversation

@nmdefries (Contributor) commented Oct 10, 2024

Description

Add nchs-mortality raw data backups and backup export utility

Changelog

  • add create_backup_csv fn in delphi_utils/export.py
  • use the utility in nchs_mortality's pull_nchs_mortality_data fn
  • related tests

Associated Issue(s)

Context and writeup

@nmdefries (Contributor, Author) commented Oct 10, 2024

I guess the test is failing (on linting, with delphi_nchs_mortality/pull.py:11:0: E0611: No name 'create_backup_csv' in module 'delphi_utils' (no-name-in-module)) because the new fn is being added to delphi_utils at the same time.

Also, tests for the new create_backup_csv fn still need to be added, but this shows the idea of how it should work. Adding backups to other indicators should be faster once the utility is in place.
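
For concreteness, here is a minimal sketch of how such a utility could fit together, pieced from the snippets quoted in this thread; the exact signature, filename format, and log call are assumptions rather than the merged code.

# Hypothetical sketch of the backup utility; argument names beyond those
# visible in this thread are assumptions.
import gzip
from datetime import datetime
from os.path import join

import pandas as pd


def create_backup_csv(df: pd.DataFrame, backup_dir: str, custom_run: bool,
                      issue=None, geo_res=None, table_name=None,
                      metric=None, sensor=None, logger=None):
    """Save a gzipped CSV backup of the raw data, skipped on custom runs."""
    if custom_run:
        # Don't write backups during ad hoc / custom runs.
        return

    # Label the file with today's date (the date the data was fetched).
    if not issue:
        issue = datetime.today().strftime("%Y%m%d")
    backup_filename = [issue, geo_res, table_name, metric, sensor]
    backup_file = join(
        backup_dir,
        "_".join(part for part in backup_filename if part) + ".csv.gz",
    )

    with gzip.open(backup_file, "wt", newline="") as f:
        df.to_csv(f, index=False, na_rep="NA")

    if logger:
        logger.info("Backup file created at %s", backup_file)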

# Label the file with today's date (the date the data was fetched).
if not issue:
    issue = datetime.today().strftime('%Y%m%d')
backup_filename = [issue, geo_res, table_name, metric, sensor]
@nmdefries (Contributor, Author) commented Oct 11, 2024

suggestion: For simplicity when using the backup data later, I would prefer to compress all tables for a given issue into a single archive.
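
A rough illustration of that suggestion (the helper name and archive layout below are hypothetical, not part of this PR):

# Hypothetical sketch of the single-archive-per-issue idea: bundle every
# table for one issue date into one gzipped tar instead of one file per table.
import io
import tarfile


def write_issue_archive(tables, issue, backup_dir):
    """Write a dict of {table_name: DataFrame} into a single archive."""
    archive_path = f"{backup_dir}/{issue}.tar.gz"
    with tarfile.open(archive_path, "w:gz") as tar:
        for table_name, df in tables.items():
            data = df.to_csv(index=False, na_rep="NA").encode("utf-8")
            info = tarfile.TarInfo(name=f"{issue}_{table_name}.csv")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))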

@minhkhul (Contributor) left a review

Appreciate the custom run flag!

changes requested:

  • Add file compression.
  • Add some logging to note which indicator stashing is done for.
  • Adjust the params.json.template in nchs_mortality as well.

suggestion: When I wrote and ran a similar script to stash the nssp source, the small VM ran out of disk space at one point. To save disk space, apart from adding zipping, I also added a check that compares the dataset against the latest csv.gz already on disk and only saves a new version after confirming there's a difference. That's helpful for a weekly signal like nssp. I think it'd be nice to add that here, but it's not needed.
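
For reference, the kind of check described here might look roughly like the sketch below (names and comparison details are assumptions; as noted later in the thread, this idea was ultimately dropped):

# Hypothetical sketch of the "only save if changed" idea: compare the new
# pull against the most recent .csv.gz on disk and skip the write if equal.
import glob

import pandas as pd


def dataset_changed(df: pd.DataFrame, backup_dir: str) -> bool:
    """Return True if df differs from the most recent backup on disk."""
    previous = sorted(glob.glob(f"{backup_dir}/*.csv.gz"))
    if not previous:
        return True
    old = pd.read_csv(previous[-1])  # compression inferred from .gz suffix
    # Treat any difference in shape or values as a change worth saving.
    return not df.reset_index(drop=True).equals(old.reset_index(drop=True))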

@nmdefries (Contributor, Author) commented Oct 11, 2024

Thanks for your quick feedback @minhkhul!

Add some logging to note which indicator stashing is done for.

Agreed. Related to this, @korlaxxalrok suggested including metadata in each day's backup data, or unique IDs we could use to track the provenance of downstream data. Designing that will likely be too complex, and thus take too long, for getting V1 of data backups out, but it could be very useful in the future.

Adjust the params.json.template in nchs_mortality as well.

I don't have strong feelings about this, but given the default value the custom_run param takes in the code, we don't necessarily need to add it to params.json.
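
That is, something along these lines in the indicator code, where the second argument to .get supplies the default (the exact key layout is an assumption):

# Hypothetical: a default here means params.json can omit the flag entirely.
custom_run = params["common"].get("custom_run", False)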

suggestion: When I wrote and ran a similar script to stash the nssp source, the small VM ran out of disk space at one point. To save disk space, apart from adding zipping, I also added a check that compares the dataset against the latest csv.gz already on disk and only saves a new version after confirming there's a difference. That's helpful for a weekly signal like nssp. I think it'd be nice to add that here, but it's not needed.

Hm, so we've found that saving data like this causes storage issues. Since you refer to a "vm", I wonder if the limit you hit was that of the VM (O(1 GB)) rather than that of the host machine (O(100 GB)). How big is that entire collection of backups?

RE "only sav[ing] the latest new version of the dataset after confirming there's a difference" with the last backup, do we think this is safe/robust enough to do? One initial concern is that this is starting to sound like "archive differ V2". Of course, it's simpler than the current one, but any extra code increases the risk of introducing bugs. To know how to balance the risk, we'd want an estimate of how big the data backups would be.

@minhkhul (Contributor) commented

Yep, I very much agree about the potential for an "archive differ v2" problem. Let's scratch that for now.

@minhkhul previously approved these changes Oct 14, 2024
@minhkhul dismissed their stale review October 18, 2024 18:32

need to add compression

@minhkhul requested a review from nolangormley October 21, 2024 21:33
with gzip.open(backup_file, "wt", newline="") as f:
    df.to_csv(f, index=False, na_rep="NA")

if logger:
@aysim319 (Contributor) commented Oct 22, 2024

Why is logger optional? We want to keep track of whether a backup was created or not, right?

@nmdefries (Contributor, Author) replied

This behavior is also copied from create_export_csv.

A contributor replied:

Also, the pull method originally doesn't take logger as an argument at all, so I wasn't sure if we should force it to within the scope of this PR.

@minhkhul requested a review from aysim319 October 22, 2024 15:55
@minhkhul (Contributor) commented Oct 22, 2024

Also, I've been running this locally daily since yesterday, at the same time as the normal nchs run, and keeping the backup files, so we can take our time with this PR.

@nolangormley (Contributor) left a review

LGTM!

@minhkhul merged commit 3450dfc into main Oct 28, 2024
16 checks passed