Diffing and archiving utilities #185
Conversation
```python
# Report failures: someone should probably look at them
for exported_file in fails:
    print(f"Failed to archive '{exported_file}'")
```
I believe these will show in the Delphi Automation log, but we should think about how to alert a human about them.
```python
To be implemented by specific archiving backends.
Should set self._cache_updated = True after verifying cache is updated.
"""
raise NotImplementedError
```
I have some basic architecture questions about this codebase. In terms of the process here, it is:
- first we update the cache; just to make sure I understand this, does this mean downloading a past version of the data from an S3 bucket or something?
- then we use covidcast to export the most recent snapshot of the data into CSVs in the folder `./receiving`
- finally we diff the latest data and our cache snapshot; this allows us to upload only the changes onto the cache
Do I have this right?
If so, then is this update_cache implementation waiting for a decision on which service to use for the caching?
(If these questions seem basic, that's probably because they are. I am totally new to this codebase and want to make sure the model in my head is correct 🙂).
Conveniently, we have a record of the CSV files that represent the most recent issue for each day available in the API, in `archive/successful` -- we can use those to prepopulate the cache folder.
- Yeah that is right. Just in case the local cache folder gets emptied out for some reason, this lets us re-download whatever we have stored in S3 up to the previous run into the cache
- Also right. The receiving folder's contents will eventually get ingested as new data in the API that is issued today, whether new or back-filled.
- We do diff the latest data and the cache snapshot; however, we retain only the changes in the `./receiving` folder for ingesting. Before we retain only the changes, the cache snapshot in `./cache` is updated with the just-generated output in `./receiving`.
I left `update_cache()` and `archive_exports()` unimplemented because I was using `ArchiveDiffer` as a base class that contains the common operations of diffing and replacing/filtering files, independent of the archiving backend. So in terms of usage, only the more specific child classes `S3ArchiveDiffer` and `GitArchiveDiffer` should be used. I think we are mainly using S3, but I happened to already have the git code written, so I thought not to let it go to waste haha.
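For reference, a minimal sketch of the class layout described above; only the method names `update_cache` and `archive_exports` and the use of `self._cache_updated` are taken from this PR, while the constructor arguments and docstrings are assumptions:

```python
class ArchiveDiffer:
    """Common diffing and file replace/filter logic, backend-independent."""

    def __init__(self, cache_dir, export_dir):
        self.cache_dir = cache_dir
        self.export_dir = export_dir
        self._cache_updated = False

    def update_cache(self):
        """To be implemented by specific archiving backends."""
        raise NotImplementedError

    def archive_exports(self, exported_files):
        """To be implemented by specific archiving backends."""
        raise NotImplementedError


class S3ArchiveDiffer(ArchiveDiffer):
    """Backend that archives exports to, and restores the cache from, S3."""


class GitArchiveDiffer(ArchiveDiffer):
    """Backend that archives exports as commits in a git repository."""
```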
```python
deleted_idx = before_df.index.difference(after_df.index)
common_idx = before_df.index.intersection(after_df.index)
added_idx = after_df.index.difference(before_df.index)
```
So as I understand it, the diff between two CSV files is being measured in terms of the difference in the index, where the index is `geo_id`. If I understand correctly, this means that we can detect when new geographic locations are added and old ones removed. What is the anticipated way of handling a changed value in an already-existing index? Say `"ca"` is already present in both cache and export, but the source updates the value?
Ideally we want to retain all new regions and all regions with updated values.
We want to identify deleted regions but how we handle them is TBD.
So regarding these index operations, I use `common_idx` to do an index-aligned comparison in the next few lines between the before and after dataframes. In particular, `same_mask` later on tells us (as a boolean mask) which values changed among the common indices; I then use this mask to return a dataframe of only the changed values (with the after values).
I see, the `same_mask` code below handles the value comparisons and returns the changed-rows dataframe. Got it.
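As an illustration, here is a minimal, self-contained sketch of the diffing logic described in this thread; the toy data, column names, and the NaN-tolerant comparison are assumptions, not the PR's exact code:

```python
import pandas as pd

# Toy before/after snapshots indexed by geo_id (hypothetical data).
before_df = pd.DataFrame({"val": [1.0, 2.0, 3.0]}, index=["ak", "ca", "ny"])
after_df = pd.DataFrame({"val": [1.0, 2.5, 3.0, 4.0]}, index=["ak", "ca", "ny", "tx"])

deleted_idx = before_df.index.difference(after_df.index)
common_idx = before_df.index.intersection(after_df.index)
added_idx = after_df.index.difference(before_df.index)

# Index-aligned comparison over the common rows; treating NaN == NaN as
# "same" keeps missing values from showing up as spurious changes.
before_common, after_common = before_df.loc[common_idx], after_df.loc[common_idx]
same_mask = (before_common == after_common) | (before_common.isna() & after_common.isna())

# Keep rows where any column changed, plus the newly added rows.
changed_df = after_common.loc[~same_mask.all(axis=1)]
diff_df = pd.concat([changed_df, after_df.loc[added_idx]])
print(diff_df)  # "ca" (changed value) and "tx" (added region)
```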
```python
print(f"Updating cache with {cached_file}")
obj.Object().download_file(cached_file)
# ...
self._cache_updated = True
```
Oh! So this is implemented! Makes sense. So I believe this is the spot that needs a change to handle the case when the bucket is empty.
Yeah, that is true: if the bucket is empty, nothing will happen now, and all output will be treated as "new".
From what I gathered during the meeting, we can determine which snapshot of the output (from a separate `archive/successful` folder on the server) we want to initially populate the bucket with if we don't want this empty-cache behavior.
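A minimal sketch of what an S3-backed `update_cache()` might look like with a guard for the empty-bucket case; only the `print`/`download_file` calls are quoted from this PR, and attribute names like `bucket_name`, `indicator_prefix`, and `cache_dir` are assumptions:

```python
import os
import boto3

def update_cache(self):
    """Download the previous run's CSVs from S3 into the local cache folder."""
    bucket = boto3.resource("s3").Bucket(self.bucket_name)
    objects = list(bucket.objects.filter(Prefix=self.indicator_prefix))
    if not objects:
        # Empty bucket: the cache stays empty and every export gets
        # treated as "new" -- the case discussed above.
        print("Bucket is empty; cache not pre-populated")
    for obj in objects:
        cached_file = os.path.join(self.cache_dir, os.path.basename(obj.key))
        print(f"Updating cache with {cached_file}")
        obj.Object().download_file(cached_file)
    self._cache_updated = True
```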
| "csv1": pd.DataFrame({ | ||
| "geo_id": ["1", "2", "4"], | ||
| "val": [1.0, 2.1, 4.0], | ||
| "se": [0.1, 0.21, 0.4], |
Might be a good idea to add a couple of NaNs to `csv1` to make sure that part works.
Yeah, that makes sense, thank you! I have added some NaNs to the test case.
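For instance, the updated fixture might look something like this (hypothetical values, not the exact test data):

```python
import numpy as np
import pandas as pd

# csv1 with a couple of NaNs mixed in, so the tests exercise the
# diffing logic's handling of missing values.
csv1 = pd.DataFrame({
    "geo_id": ["1", "2", "4"],
    "val": [1.0, np.nan, 4.0],
    "se": [0.1, 0.21, np.nan],
})
```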
```python
# Replace exports where diff file was generated
else:
    replace(diff_file, exported_file)
```
So there must be a piece of code that knows how to handle a diff `.csv` file when uploading to the API?
So each `.csv.diff` file that is created is actually a regular CSV too, in the same format as the original `.csv`; it just contains only the new and modified rows of the original. This step replaces an original `.csv` with its `.csv.diff`, so at the end there should only be `.csv` files left to ingest.
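A minimal sketch of that replace step; `os.replace(diff_file, exported_file)` is the call quoted above, while the function name, the list argument, and the `.diff` suffix convention are illustrative assumptions:

```python
from os import replace
from os.path import exists

def filter_exports(exported_files):
    """Keep only new/changed rows in each export by swapping in its diff."""
    for exported_file in exported_files:
        diff_file = exported_file + ".diff"
        # No diff file means the export is entirely new: keep it as-is.
        if not exists(diff_file):
            continue
        # Replace exports where a diff file was generated, so the .csv
        # that remains contains only the new and modified rows.
        replace(diff_file, exported_file)
```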
dshemetov left a comment
Looks good to me! The tests run on my machine. Just needs the update to handle filling the cache when the bucket is empty.
Utilities to address #166