Feature/optimize direction computation #133
Conversation
* Supporting infra in acquisition; extend all data insertion test cases to include toy issue and lag; test that we can insert a new issue for an extant (source, signal, date, geo).
* …st-to-store-issue-date: Add columns to covidcast for issue and lag.
* Added a method that retrieves all records from all time-series with potentially stale direction, with an option to store the result in a temporary table.
* Added three methods: 1) `update_direction`: updates the `direction` field of the `covidcast` table, given a single new value and a list of row ids; 2) `drop_temporary_table`; 3) `update_timestamp2_from_temporary_table`: takes `tmp_table_name` as input and updates the `timestamp2` field of `covidcast` rows whose id is present in `tmp_table_name`.
* Added the `optimized_update_loop` function.
* Corrected a bug in the implementation of selecting the keys of the potentially stale time-series.
* Optimized some pandas calls; renamed some methods.
* Added comments to the new methods in database.py and direction_updater.py.
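The temporary-table update described above can be sketched as a single joined `UPDATE`, which avoids embedding millions of row ids in the statement text. This is a minimal illustration, not the project's actual implementation; the function name and the fixed-timestamp parameter are assumptions.

```python
def build_timestamp2_update_sql(tmp_table_name, new_timestamp2):
    """Build one UPDATE that joins `covidcast` against a temporary table
    of row ids and sets `timestamp2` for every matched row.

    Joining on the temporary table keeps the statement short no matter
    how many rows are updated. Names here are illustrative only.
    """
    return (
        f"UPDATE `covidcast` c "
        f"JOIN `{tmp_table_name}` t ON c.`id` = t.`id` "
        f"SET c.`timestamp2` = {int(new_timestamp2)}"
    )
```

A single statement like this replaces what would otherwise be many `WHERE id IN (...)` batches.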
These are the time measurements for the entire dataset (12,786,416 rows; 139,887 time-series):
krivard left a comment:
Minor nits to fix but otherwise legit. Looks like we get a 25% speedup overall and 43% ignoring the linear regression computations, which will be addressed in a separate effort. Great work!
Since this code depends on the issue date changes, should we wait to merge it until those are complete? Just in case something comes up in the server/client implementation that requires an upstream change.
`new_direction_value`: the new value to be stored in the `direction` column.
`id_list`: a list of ids of rows that will change to `new_direction_value`.
`batch_size`: batch size of the update query.
"""
Can you add a brief explanation here on why batch_size is necessary, and guidance on how to decide whether to use the default or specify a higher or lower value?
I added `batch_size` because there is a maximum limit on the length of an SQL statement, and with a list of ~4M ids that limit would certainly be exceeded. Choosing `batch_size` involves a tradeoff: with a larger `batch_size`, each update query takes longer to execute, but fewer queries are issued. The only way I can think of to choose it is experimentally. I see your point that it might be more appropriate to hardcode it than to expose it as a parameter.
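The batching described here can be sketched as a generator that splits the id list into chunks and yields one bounded-length `UPDATE` per chunk. This is a hedged illustration of the approach, not the PR's actual code; the function name is hypothetical.

```python
def build_update_direction_sql(new_direction_value, id_list, batch_size=1000):
    """Yield UPDATE statements that set `direction` for ids in batches.

    Batching keeps each statement comfortably under the server's maximum
    statement length, at the cost of issuing more queries as batch_size
    shrinks. Illustrative sketch; names are assumptions.
    """
    for start in range(0, len(id_list), batch_size):
        batch = id_list[start:start + batch_size]
        ids = ", ".join(str(i) for i in batch)
        yield (
            f"UPDATE `covidcast` SET `direction` = {new_direction_value} "
            f"WHERE `id` IN ({ids})"
        )
```

With 4M ids and `batch_size=1000`, this issues 4000 statements, each with a bounded `IN (...)` list.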
Replaced by #149
Performance profiling of the existing implementation and the optimized implementation. This is derived from the average time taken over 5 runs on a table containing 1000 randomly chosen time-series (all stale, with `direction == NULL` and `timestamp2 == 0`). On average these tables contained 90,054.6 rows.

(Table comparing `direction` query and `timestamp2` query timings not reproduced here.)

Remaining points:

* Should the `get_data_stdev_across_locations` method in Database be updated to only account for rows of the latest issue date for any (time-series key, time_value) combination?
* Around 60% of the time in the setting profiled is spent computing the direction, so that is probably where we should look next for performance gains.
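The first remaining point, restricting a computation to the latest issue per (time-series key, time_value), can be sketched in plain Python. The column names follow the covidcast schema discussed in this PR, but the function itself is hypothetical, not part of the codebase.

```python
def latest_issue_rows(rows, key=("source", "signal", "geo_value", "time_value")):
    """Keep only the row with the largest `issue` for each key combination.

    `rows` is an iterable of dicts with at least the key columns and an
    `issue` column. Illustrative sketch of the deduplication rule only.
    """
    latest = {}
    for row in rows:
        k = tuple(row[c] for c in key)
        # Replace the stored row whenever a later issue is seen.
        if k not in latest or row["issue"] > latest[k]["issue"]:
            latest[k] = row
    return list(latest.values())
```

A standard-deviation computation such as `get_data_stdev_across_locations` would then run over the filtered rows rather than all issues.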