
Conversation

@melange396
Collaborator

SQL batching improves ingestion speed significantly.
Loads files in name-sorted order for determinism and saner progress tracking.
Individual transactions for each file (instead of one transaction for the full collection of files) result in:
- an UPSERT failure means the entire file will not be loaded (previously, a single-line failure would preserve the preceding lines and still attempt the following lines of the same file).
- the lock should be released more frequently, meaning read access should not block as much while acquisition is in progress.
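The per-file-transaction behavior described above can be sketched as follows. This is a minimal illustration, not the project's actual loader: SQLite stands in for the real MySQL connection, and the table, columns, and `files` mapping are all hypothetical.

```python
import sqlite3  # stand-in for the project's real MySQL connection


def load_files(conn, files):
    """Load each file's rows in its own transaction, in name-sorted order.

    `files` maps a filename to a list of (geo, value) rows, standing in for
    reading CSVs from disk; table and column names are illustrative only.
    """
    loaded, failed = [], []
    for name in sorted(files):           # deterministic, name-sorted order
        try:
            with conn:                   # one transaction per file
                conn.executemany(        # SQL batching: one batched call per file
                    "INSERT INTO rows (geo, value) VALUES (?, ?)",
                    files[name],
                )
            loaded.append(name)
        except sqlite3.Error:
            failed.append(name)          # whole file rolled back on failure
    return loaded, failed
```

A failure partway through a file rolls back everything in that file while leaving previously committed files intact, which matches the tradeoff described in the bullets above.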

@melange396
Collaborator Author

i started but then resisted the urge to rewrite more of this stuff in favor of a smaller changelist... there is some dead code that could be removed, but i thought leaving it might provide context as to why the call structure is the way it is. i also began to look into better handling of import failures, but decided that should go into its own issue, and i wasn't sure how representative my sample data files are. can we try this out on the staging server? if so, how can i do that?

@melange396
Collaborator Author

some TODOs i didn't put in the code:
--do more speed testing with UPDATEs instead of clean INSERTs (UPDATE will be slower than INSERT in any case, but it's unclear how much batching improves this)
--better deal with failed files - save the error message somewhere, alert someone that errors happened, perhaps preserve a summary of which files were/weren't loaded in each run, etc...
--ON DUPLICATE KEY - update (zero out) timestamp2 & direction ?? [ database.py:104 -ish ]
--make sure directories exist before writing files? it might do this already?
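The ON DUPLICATE KEY TODO above could look roughly like the sketch below. This is hypothetical, not the project's schema: MySQL would spell the statement `INSERT ... ON DUPLICATE KEY UPDATE`; SQLite's equivalent `ON CONFLICT` clause is used here only so the example runs standalone, and every name except `timestamp2` and `direction` is made up.

```python
import sqlite3

# Zero out timestamp2 and NULL-out direction whenever an existing row is
# overwritten, so stale secondary values don't survive a re-import.
UPSERT = """
    INSERT INTO signals (geo, value, timestamp2, direction)
    VALUES (?, ?, 0, NULL)
    ON CONFLICT (geo) DO UPDATE SET
        value = excluded.value,
        timestamp2 = 0,    -- reset the secondary timestamp
        direction = NULL   -- force direction to be recomputed later
"""


def upsert_row(conn, geo, value):
    with conn:  # each upsert commits atomically
        conn.execute(UPSERT, (geo, value))
```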

@melange396
Collaborator Author

i'm seeing a 2-4x speed-up for acquisition depending on which subsets of sample data i provide it with, though that is on a beefy but shared test machine. @AlSaeed is there a particular tool used to get this profiling output, or was that just something you did by hand? #133 (comment)

@AlSaeed
Contributor

AlSaeed commented Jul 9, 2020

> im seeing 2-4x speed up for acquisition depending on which subsets of sample data i provide it with, though that is on a beefy but shared test machine. @AlSaeed is there a particular tool used to get this profiling output, or was that just something you did by hand? #133 (comment)

This was done by hand via inserting boilerplate timing code around specific code blocks.

Contributor

@krivard krivard left a comment


Bunch of suspicious stuff, so I pulled down a copy of your fork and ran the unit and integration tests, which, indeed, fail -- I commented on the errors I found, but to catch all of them you'll want to make sure the tests pass. Instructions are in the dev guide; I also put together a convenience script which can be run as follows (as root or whatever Docker-empowered user you have):

# ops_epidata_devcycle.sh [database] [web] \
  [test repos/delphi/delphi-epidata/{tests|integration}]

Include database if the database schema has changed; include web if the PHP has changed; include test if the python has changed. It automatically shuts down the relevant Docker containers, rebuilds them, and starts them back up again, then if you're testing it runs the test utility against either the unit tests directory (tests) or the integration tests directory (integration). To run the other set of tests without rebuilding the python container, you can just run e.g.

docker run --rm --network delphi-net delphi_python python3 -m undefx.py3tester.py3tester repos/delphi/delphi-epidata/tests

@krivard krivard self-requested a review July 22, 2020 18:51
@krivard krivard merged commit 45d34ca into cmu-delphi:main Jul 23, 2020
@melange396 melange396 deleted the csv_acquisition branch July 23, 2020 18:05
