
Conversation

@AlSaeed (Contributor) commented Jun 17, 2020

Performance profiling of the existing implementation and the optimized implementation. The figures are averages over 5 runs on a table containing 1000 randomly chosen time-series (all stale, with direction == NULL and timestamp2 == 0); on average these tables contained 90,054.6 rows.

| Implementation | Retrieval Queries | Updating `direction` Queries | Updating `timestamp2` Queries | Direction Computation | Other | Total Time |
| --- | --- | --- | --- | --- | --- | --- |
| Existing | 0.96 s | 21.50 s | 1.019 s | 18.91 s | 0.30 s | 42.69 s |
| Optimized | 3.08 s | 1.83 s | 1.14 s | 20.11 s | 7.09 s | 33.24 s |

Remaining Points:

  • Adding unit and integration tests for the optimized implementation.
  • Reporting performance for the whole dataset (~12M rows).
  • The get_data_stdev_across_locations method in Database should probably be updated to only account for rows with the latest issue date for each (time-series key, time_value) combination.

Around 60% of the time in the profiled setting is spent computing the direction, so that is probably where we should look next for performance gains.

krivard and others added 10 commits June 10, 2020 17:08
* Supporting infra in acquisition
* Extend all data insertion test cases to include toy issue and lag
* Test that we can insert a new issue for an extant (source,signal,date,geo)
…st-to-store-issue-date

Add columns to covidcast for issue and lag.
Add a method that retrieves all records from all time-series with potentially stale direction, with an option to store the result in a temporary table.
Added three methods (a sketch of the third appears after this commit list):
1) update_direction: updates the direction field of the `covidcast` table, given a single new value and a list of row ids.
2) drop_temporary_table.
3) update_timestamp2_from_temporary_table: takes tmp_table_name as input and updates the timestamp2 field of `covidcast` rows whose id is present in tmp_table_name.
Added optimized_update_loop function.
Corrected a bug in the implementation of selecting the keys of the potentially stale time-series.
* Optimized some pandas calls.
* Renamed some methods.
Added comments to the new methods in database.py and direction_updater.py.
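For illustration, here is a minimal sketch of the join-based timestamp2 update named in the commit list above, assuming a MySQL DB-API cursor; the function name and tables follow the commit messages, but the exact implementation in database.py may differ (in particular, writing the current epoch time into timestamp2 is an assumption here).

```python
# Illustrative sketch only -- not necessarily the exact code in database.py.
# Updates timestamp2 for every `covidcast` row whose id appears in the
# temporary table, using a single join-based UPDATE instead of per-row queries.
def update_timestamp2_from_temporary_table(cursor, tmp_table_name):
    # tmp_table_name is assumed to be a trusted, internally generated name.
    sql = f'''
        UPDATE `covidcast` t
        JOIN `{tmp_table_name}` tmp ON tmp.`id` = t.`id`
        SET t.`timestamp2` = UNIX_TIMESTAMP(NOW())
    '''
    cursor.execute(sql)
```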
@AlSaeed (Contributor, Author) commented Jun 17, 2020

These are the time measurements for the entire dataset (12,786,416 rows; 139,887 time-series):

| Implementation | Retrieval Queries | Updating `direction` Queries | Updating `timestamp2` Queries | Direction Computation | Other | Total Time |
| --- | --- | --- | --- | --- | --- | --- |
| Existing | 317.59 s | 3600.79 s | 152.56 s | 2834.59 s | 44.82 s | 6950.36 s |
| Optimized | 868.32 s | 341.64 s | 267.12 s | 2785.32 s | 859.63 s | 5122.02 s |

@krivard (Contributor) left a comment:

Minor nits to fix but otherwise legit. Looks like we get a 25% speedup overall and 43% ignoring the linear regression computations, which will be addressed in a separate effort. Great work!

Since this code depends on the issue date changes, should we wait to merge it until those are complete, just in case something comes up in the server/client implementation that requires an upstream change?

`new_direction_value`: the new value to be stored in `direction` column.
`id_list`: a list of ids of rows that will change to `new_direction_value`.
`batch_size`: batch size of the update query.
"""
@krivard (Contributor) commented:
Can you add a brief explanation here on why batch_size is necessary, and guidance on how to decide whether to use the default or specify a higher or lower value?

@AlSaeed (Contributor, Author) replied:

The reason I added batch_size is that there is a maximum limit on the length of an SQL statement, so in the case where the list contains ~4M ids that limit would certainly be exceeded. Choosing batch_size is a tradeoff: with a larger batch_size each update query takes longer to execute, but fewer queries are issued. The only way I can think of to choose batch_size is experimentally. I see your point that it might be more appropriate to hardcode it than to expose it as a parameter.
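As a rough sketch of that batching approach (assuming a DB-API style cursor such as mysql.connector's; the actual signature and default batch size in database.py may differ):

```python
# Illustrative sketch only: split a potentially huge id list into batches so
# that no single UPDATE statement exceeds the server's maximum statement size.
def update_direction(cursor, new_direction_value, id_list, batch_size=100000):
    for start in range(0, len(id_list), batch_size):
        batch = id_list[start:start + batch_size]
        placeholders = ', '.join(['%s'] * len(batch))
        sql = f'UPDATE `covidcast` SET `direction` = %s WHERE `id` IN ({placeholders})'
        cursor.execute(sql, [new_direction_value] + list(batch))
```

A larger batch_size means fewer round trips but longer individual statements, which is exactly the tradeoff described above.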

@krivard (Contributor) commented Jul 9, 2020

Replaced by #149
