Improvements to acquisition / CSV file loading. #150
SQL batching improves ingestion speed significantly.

Loads files in name-sorted order for determinism and saner progress tracking.

Using an individual transaction for each file (instead of one transaction for the full collection of files) means:

- an UPSERT failure causes the entire file to be skipped (previously, a failure on a single line would preserve the preceding lines and still attempt the following lines of the same file);
- the lock should be released more frequently, so read access should not block as much while acquisition is in progress.
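For illustration, here is a minimal sketch of that pattern, assuming `mysql.connector` and a hypothetical `signal_data` table (the schema, column names, and connection parameters are invented for the example, not the PR's actual code):

```python
# Sketch: per-file transactions with batched UPSERTs over name-sorted CSVs.
# The signal_data schema below is hypothetical.
import csv
import glob

import mysql.connector

UPSERT = (
    "INSERT INTO signal_data (source, signal_name, time_value, geo_value, value) "
    "VALUES (%s, %s, %s, %s, %s) "
    "ON DUPLICATE KEY UPDATE value = VALUES(value)"
)

def load_all(path_glob, batch_size=1000):
    cnx = mysql.connector.connect(user="user", password="pass", database="epidata")
    cur = cnx.cursor()
    # Name-sorted order makes runs deterministic and progress easier to follow.
    for path in sorted(glob.glob(path_glob)):
        try:
            with open(path, newline="") as f:
                rows = [tuple(line) for line in csv.reader(f)]
            # executemany() sends many rows per round trip instead of one at a time.
            for i in range(0, len(rows), batch_size):
                cur.executemany(UPSERT, rows[i:i + batch_size])
            cnx.commit()  # one transaction per file
        except mysql.connector.Error:
            cnx.rollback()  # a failed UPSERT discards this file entirely
    cur.close()
    cnx.close()
```

The key points are `executemany()` for batching and one `commit()`/`rollback()` per file, which gives the all-or-nothing-per-file and more-frequent-lock-release behavior described above.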
I started but then resisted the urge to rewrite more of this stuff in favor of a smaller changelist... There is some dead code that could be removed, but I thought leaving it might provide context as to why the call structure is the way it is. I also began to look into better handling of import failures, but decided that should go into its own issue, and I wasn't sure how representative my sample data files are. Can we try this out on the staging server? If so, how can I do that?
Some TODOs I didn't put in the code:
I'm seeing a 2-4x speedup for acquisition depending on which subsets of sample data I provide it with, though that is on a beefy but shared test machine. @AlSaeed, is there a particular tool used to get this profiling output, or was that just something you did by hand? #133 (comment)
This was done by hand, by inserting boilerplate timing code around specific code blocks.
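A minimal sketch of that kind of boilerplate timing code, in plain Python (the context-manager form and the `timed` name here are just one convenient way to wrap a block, not the code actually used):

```python
# Hand-rolled timing boilerplate: wrap a code block, print its wall-clock time.
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.3f}s")

# Usage: surround the specific block being profiled.
with timed("load_csv_files"):
    rows = [line for line in open("sample.csv")]
```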
krivard left a comment
There's a bunch of suspicious stuff here, so I pulled down a copy of your fork and ran the unit and integration tests, which do indeed fail. I commented on the errors I found, but to catch all of them you'll want to make sure the tests pass. Instructions are in the dev guide; I also put together a convenience script, which can be run as follows (as root or whatever Docker-empowered user you have):
```sh
# ops_epidata_devcycle.sh [database] [web] \
    [test repos/delphi/delphi-epidata/{tests|integration}]
```
Include `database` if the database schema has changed; include `web` if the PHP has changed; include `test` if the Python has changed. It automatically shuts down the relevant Docker containers, rebuilds them, and starts them back up again; then, if you're testing, it runs the test utility against either the unit tests directory (`tests`) or the integration tests directory (`integration`). To run the other set of tests without rebuilding the Python container, you can just run, e.g.:
```sh
docker run --rm --network delphi-net delphi_python python3 -m undefx.py3tester.py3tester repos/delphi/delphi-epidata/tests
```