Conversation

@sgratzl
Member

sgratzl commented Aug 10, 2020

closes #51

TODO

  • adapt Python client
  • adapt CoffeeScript client
  • recompile JavaScript -> where are the docs on how to do that?
  • adapt R client
  • adapt documentation

@capnrefsmmat
Contributor

So my understanding is that pagination with LIMIT and OFFSET can be slow because the database has to calculate the entire result set and then throw away the first $offset rows: https://mariadb.com/kb/en/pagination-optimization/

This means that if a person wants to request all data over N days, and if $MAX_RESULTS is such that they issue roughly one query per day, the database will do roughly O(N^2) computation. (Not exactly that much -- the first day will be fast and the last day will be slow, but the overall effect grows quadratically.)
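
The quadratic growth is easy to see with a back-of-envelope calculation (the per-day row count below is a made-up figure for illustration):

```python
# Illustration of why OFFSET-based paging does O(N^2) work in total.
# Assume M rows per day and one page per day over N days: page k uses
# OFFSET (k-1)*M, so the database walks past (k-1)*M rows before
# returning M more, i.e. it touches k*M rows for page k.

def total_rows_scanned(n_days, rows_per_day):
    """Sum of rows the database must touch across all pages."""
    return sum(k * rows_per_day for k in range(1, n_days + 1))

# 200 days at 10,000 rows/day: ~201M rows touched to return 2M rows.
print(total_rows_scanned(200, 10_000))  # 201000000
```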

If our goal is to limit database load, do we want some other approach? I see some options:

  • As the MariaDB docs suggest, return some kind of unique ID for the "next" row, and then do the pagination based on that ID. If the ID is indexed, the pagination is fast. (I think we already have an ID for every database row, right?)
  • Paginate by a time window. Take the time values in a query and iterate over them. Then we're using an indexed column in the database (or at least I hope it's indexed?) and there's no O(N^2) behavior. This replicates what the COVIDcast clients are doing, but at the server level. It's also wasteful if we're only returning a few results per day, like a query for just one geography.
  • Do some performance testing. Find the worst API query someone could possibly make (all days of JHU in all geos, as_of some date, plus...), see if it actually hurts the database to do a thousand a second, and if not, just make the maximum result limit stratospherically high and avoid this problem entirely.
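
A minimal sketch of the first option (keyset or "seek" pagination), using an in-memory SQLite table as a stand-in; the table name, columns, and page size are all assumptions, not the actual epidata schema:

```python
import sqlite3

# Stand-in table: the real schema differs, but the technique is the same.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signals "
             "(id INTEGER PRIMARY KEY, time_value INTEGER, value REAL)")
conn.executemany("INSERT INTO signals VALUES (?, ?, ?)",
                 [(i, 20200100 + i % 30, float(i)) for i in range(1, 101)])

def fetch_page(last_id, page_size):
    # WHERE id > ? seeks directly via the primary-key index, so no rows
    # are skipped over; contrast with LIMIT ? OFFSET ?, which computes
    # and discards the first `offset` rows on every request.
    return conn.execute(
        "SELECT id, time_value, value FROM signals "
        "WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, page_size)).fetchall()

last_id, pages = 0, []
while True:
    page = fetch_page(last_id, 25)
    if not page:
        break
    pages.append(page)
    last_id = page[-1][0]   # cursor the client echoes back for the next request

print(len(pages))  # 4 pages of 25 rows each
```

The cursor (`last_id` here) would be returned to the client as the "next" token, so each request costs one index seek plus one page of rows regardless of how deep into the result set it is.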

@krivard
Contributor

krivard commented Aug 19, 2020

We do have an ID for every row, but API queries have ORDER BY time_value, geo_value, issue, so I don't think it will be directly usable here. We could use time_value as an ersatz ID, not just segmenting by each value but essentially "rounding" the result set to the nearest time_value boundary. That's in the index so it should be fast.
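
The "rounding" idea above could look something like this sketch (a stand-in SQLite schema; the soft limit and column set are assumptions): fill the page up to the row limit, then extend it to the end of the current time_value so the next request can resume with an indexed WHERE time_value > ?.

```python
import sqlite3

# Stand-in data: 5 days x 7 geographies = 35 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signals (time_value INTEGER, geo_value TEXT, value REAL)")
rows = [(20200800 + d, f"geo{g}", 1.0) for d in range(1, 6) for g in range(7)]
conn.executemany("INSERT INTO signals VALUES (?, ?, ?)", rows)

def fetch_page(after_time, soft_limit):
    cur = conn.execute(
        "SELECT time_value, geo_value, value FROM signals "
        "WHERE time_value > ? ORDER BY time_value, geo_value", (after_time,))
    page = []
    for row in cur:
        # Stop only once the limit is hit AND we crossed a time_value
        # boundary, so the page always ends on a complete time_value.
        if len(page) >= soft_limit and row[0] != page[-1][0]:
            break
        page.append(row)
    return page

page = fetch_page(0, 10)   # soft limit 10 rounds up to 14 rows (2 full days)
print(len(page), page[-1][0])
```

Because every page ends on a time_value boundary, the next page's WHERE clause only needs the indexed time_value column, at the cost of pages being somewhat larger than the nominal limit.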

We should 100% do some performance benchmarking. The largest dataset retrievable using the current set of API parameters and ignoring pagination limits contains ~24.2M rows:

  • Source: jhu-csse
  • Signal: confirmed_cumulative_num
  • Geo type: county
  • Geo value: *
  • Time type: day
  • Time values: 20200122-20200817
  • Issues: 20200122-20200817
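
For scale, a rough page count for that worst case under a hypothetical per-request row limit (the 3,650 figure below is an assumption for illustration, not the API's actual cap):

```python
import math

# Back-of-envelope sizing for the ~24.2M-row worst-case query.
TOTAL_ROWS = 24_200_000
PAGE_SIZE = 3_650          # hypothetical per-request limit, not the real cap

pages = math.ceil(TOTAL_ROWS / PAGE_SIZE)
# Under OFFSET paging, page k scans k * PAGE_SIZE rows, so the total is
# PAGE_SIZE * pages * (pages + 1) / 2 -- tens of billions of row touches.
scanned = PAGE_SIZE * pages * (pages + 1) // 2

print(pages, scanned)
```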

@capnrefsmmat
Contributor

Is it necessary to maintain the ordering of API results? I don't know if any clients do (or should) rely on results being returned in a specific order, so we could order by the unique ID.

@krivard
Contributor

krivard commented Aug 20, 2020

Any deterministic order should be fine, but it does tend to take a couple of days to revise all the unit tests -- we shouldn't depend on being able to make ordering changes immediately and/or repeatedly.

@sgratzl
Member Author

sgratzl commented Dec 16, 2020

closing in favor of #337

sgratzl closed this Dec 16, 2020


Development

Successfully merging this pull request may close these issues.

Feature Request: Support streaming rather than truncate results
