Improvements to acquisition / CSV file loading. #150
SQL batching improves ingestion speed significantly.

Loads files in name-sorted order for determinism and saner progress tracking.

Using an individual transaction for each file (instead of one transaction for the full collection of files) means:

- an UPSERT failure causes the entire file to be skipped (previously, a failure on a single line would preserve the preceding lines and still attempt the following lines of the same file);
- the lock should be released more frequently, so read access should not block as much while acquisition is in progress.
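For illustration, here is a minimal sketch of that pattern, assuming `mysql.connector` and a hypothetical `signal_data` table (the schema, column names, and connection parameters are invented for the example, not the PR's actual code):

```python
# Sketch: per-file transactions with batched UPSERTs over name-sorted CSVs.
# The signal_data schema below is hypothetical.
import csv
import glob

import mysql.connector

UPSERT = (
    "INSERT INTO signal_data (source, signal_name, time_value, geo_value, value) "
    "VALUES (%s, %s, %s, %s, %s) "
    "ON DUPLICATE KEY UPDATE value = VALUES(value)"
)

def load_all(path_glob, batch_size=1000):
    cnx = mysql.connector.connect(user="user", password="pass", database="epidata")
    cur = cnx.cursor()
    # Name-sorted order makes runs deterministic and progress easier to follow.
    for path in sorted(glob.glob(path_glob)):
        try:
            with open(path, newline="") as f:
                rows = [tuple(line) for line in csv.reader(f)]
            # executemany() sends many rows per round trip instead of one at a time.
            for i in range(0, len(rows), batch_size):
                cur.executemany(UPSERT, rows[i:i + batch_size])
            cnx.commit()  # one transaction per file
        except mysql.connector.Error:
            cnx.rollback()  # a failed UPSERT discards this file entirely
    cur.close()
    cnx.close()
```

The key points are `executemany()` for batching and one `commit()`/`rollback()` per file, which gives the all-or-nothing-per-file and more-frequent-lock-release behavior described above.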
I started but then resisted the urge to rewrite more of this stuff in favor of a smaller changelist... There is some dead code that could be removed, but I thought leaving it might provide context as to why the call structure is the way it is. I also began to look into better handling of import failures, but decided that should go into its own issue, and I wasn't sure how representative my sample data files are. Can we try this out on the staging server? If so, how can I do that?
Some TODOs I didn't put in the code:
I'm seeing a 2-4x speedup for acquisition depending on which subsets of sample data I provide it with, though that is on a beefy but shared test machine. @AlSaeed, is there a particular tool used to get this profiling output, or was that just something you did by hand? #133 (comment)
This was done by hand, by inserting boilerplate timing code around specific code blocks.
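A minimal sketch of that kind of boilerplate timing code, in plain Python (the context-manager form and the `timed` name here are just one convenient way to wrap a block, not the code actually used):

```python
# Hand-rolled timing boilerplate: wrap a code block, print its wall-clock time.
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.3f}s")

# Usage: surround the specific block being profiled.
with timed("load_csv_files"):
    rows = [line for line in open("sample.csv")]
```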
krivard left a comment
There's a bunch of suspicious stuff here, so I pulled down a copy of your fork and ran the unit and integration tests, which do indeed fail. I commented on the errors I found, but to catch all of them you'll want to make sure the tests pass. Instructions are in the dev guide; I also put together a convenience script, which can be run as follows (as root or whatever Docker-empowered user you have):
```sh
# ops_epidata_devcycle.sh [database] [web] \
    [test repos/delphi/delphi-epidata/{tests|integration}]
```
Include `database` if the database schema has changed; include `web` if the PHP has changed; include `test` if the Python has changed. It automatically shuts down the relevant Docker containers, rebuilds them, and starts them back up again; then, if you're testing, it runs the test utility against either the unit tests directory (`tests`) or the integration tests directory (`integration`). To run the other set of tests without rebuilding the Python container, you can just run, e.g.:
```sh
docker run --rm --network delphi-net delphi_python python3 -m undefx.py3tester.py3tester repos/delphi/delphi-epidata/tests
```