
Conversation

@AlSaeed (Contributor) commented Jun 17, 2020

Performance profiling of the existing implementation and the optimized implementation. The figures are averages over 5 runs on a table containing 1000 randomly chosen time-series (all stale, with direction == NULL and timestamp2 == 0); on average these tables contained 90,054.6 rows.

| Implementation | Retrieval Queries | Updating `direction` Queries | Updating `timestamp2` Queries | Direction Computation | Other | Total Time |
| --- | --- | --- | --- | --- | --- | --- |
| Existing | 0.96 s | 21.50 s | 1.019 s | 18.91 s | 0.30 s | 42.69 s |
| Optimized | 3.08 s | 1.83 s | 1.14 s | 20.11 s | 7.09 s | 33.24 s |

Remaining Points:

  • Adding unit and integration tests for the optimized implementation.
  • Reporting performance for the whole dataset (~12M rows).
  • The get_data_stdev_across_locations method in Database should probably be updated to only account for rows with the latest issue date for each (time-series key, time_value) combination.

Around 60% of the time in the profiled setting is spent computing the direction, so that is probably where we should look next for performance gains.

krivard and others added 10 commits June 10, 2020 17:08
* Supporting infra in acquisition
* Extend all data insertion test cases to include toy issue and lag
* Test that we can insert a new issue for an extant (source,signal,date,geo)
…st-to-store-issue-date

Add columns to covidcast for issue and lag.
Add a method that retrieves all records from all time-series with potentially stale direction, with an option to store the result in a temporary table.
Added three methods (a sketch of the third appears after this commit list):
1) update_direction: updates the direction field of the `covidcast` table, given a single new value and a list of row ids.
2) drop_temporary_table.
3) update_timestamp2_from_temporary_table: takes tmp_table_name as input and updates the timestamp2 field of `covidcast` rows whose id is present in tmp_table_name.
Added optimized_update_loop function.
Corrected a bug in the implementation of selecting the keys of the potentially stale time-series.
* Optimized some pandas calls.
* Renamed some methods.
Added comments to the new methods in database.py and direction_updater.py.
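For illustration, here is a minimal sketch of the join-based timestamp2 update named in the commit list above, assuming a MySQL DB-API cursor; the function name and tables follow the commit messages, but the exact implementation in database.py may differ (in particular, writing the current epoch time into timestamp2 is an assumption here).

```python
# Illustrative sketch only -- not necessarily the exact code in database.py.
# Updates timestamp2 for every `covidcast` row whose id appears in the
# temporary table, using a single join-based UPDATE instead of per-row queries.
def update_timestamp2_from_temporary_table(cursor, tmp_table_name):
    # tmp_table_name is assumed to be a trusted, internally generated name.
    sql = f'''
        UPDATE `covidcast` t
        JOIN `{tmp_table_name}` tmp ON tmp.`id` = t.`id`
        SET t.`timestamp2` = UNIX_TIMESTAMP(NOW())
    '''
    cursor.execute(sql)
```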
@AlSaeed (Contributor, Author) commented Jun 17, 2020

These are the time measurements for the entire dataset (12,786,416 rows; 139,887 time-series):

| Implementation | Retrieval Queries | Updating `direction` Queries | Updating `timestamp2` Queries | Direction Computation | Other | Total Time |
| --- | --- | --- | --- | --- | --- | --- |
| Existing | 317.59 s | 3600.79 s | 152.56 s | 2834.59 s | 44.82 s | 6950.36 s |
| Optimized | 868.32 s | 341.64 s | 267.12 s | 2785.32 s | 859.63 s | 5122.02 s |

@krivard (Contributor) left a comment:

Minor nits to fix but otherwise legit. Looks like we get a 25% speedup overall and 43% ignoring the linear regression computations, which will be addressed in a separate effort. Great work!

Since this code depends on the issue date changes, should we wait to merge it until those are complete, just in case something comes up in the server/client implementation that requires an upstream change?

`new_direction_value`: the new value to be stored in `direction` column.
`id_list`: a list of ids of rows that will change to `new_direction_value`.
`batch_size`: batch size of the update query.
"""
@krivard (Contributor) commented:
Can you add a brief explanation here on why batch_size is necessary, and guidance on how to decide whether to use the default or specify a higher or lower value?

@AlSaeed (Contributor, Author) replied:

The reason I added batch_size is that there is a maximum limit on the length of an SQL statement, so in the case where the list contains ~4M ids that limit would certainly be exceeded. Choosing batch_size is a tradeoff: with a larger batch_size each update query takes longer to execute, but fewer queries are issued. The only way I can think of to choose batch_size is experimentally. I see your point that it might be more appropriate to hardcode it than to expose it as a parameter.
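As a rough sketch of that batching approach (assuming a DB-API style cursor such as mysql.connector's; the actual signature and default batch size in database.py may differ):

```python
# Illustrative sketch only: split a potentially huge id list into batches so
# that no single UPDATE statement exceeds the server's maximum statement size.
def update_direction(cursor, new_direction_value, id_list, batch_size=100000):
    for start in range(0, len(id_list), batch_size):
        batch = id_list[start:start + batch_size]
        placeholders = ', '.join(['%s'] * len(batch))
        sql = f'UPDATE `covidcast` SET `direction` = %s WHERE `id` IN ({placeholders})'
        cursor.execute(sql, [new_direction_value] + list(batch))
```

A larger batch_size means fewer round trips but longer individual statements, which is exactly the tradeoff described above.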

@krivard (Contributor) commented Jul 9, 2020

Replaced by #149
