Conversation

@sgratzl
Member

sgratzl commented Aug 10, 2020

closes #51

TODO

  • adapt Python client
  • adapt CoffeeScript client
  • recompile JavaScript -> where are the docs on how to do that?
  • adapt R client
  • adapt documentation

@capnrefsmmat
Contributor

So my understanding is that pagination with LIMIT and OFFSET can be slow because the database has to calculate the entire result set and then throw away the first $offset rows: https://mariadb.com/kb/en/pagination-optimization/

This means that if a person wants to request all data over N days, and if $MAX_RESULTS is such that they issue roughly one query per day, the database will do roughly O(N^2) computation. (Not exactly that much -- the first day will be fast and the last day will be slow, but the overall effect grows quadratically.)
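
The quadratic growth is easy to see with a back-of-envelope calculation (the per-day row count below is a made-up figure for illustration):

```python
# Illustration of why OFFSET-based paging does O(N^2) work in total.
# Assume M rows per day and one page per day over N days: page k uses
# OFFSET (k-1)*M, so the database walks past (k-1)*M rows before
# returning M more, i.e. it touches k*M rows for page k.

def total_rows_scanned(n_days, rows_per_day):
    """Sum of rows the database must touch across all pages."""
    return sum(k * rows_per_day for k in range(1, n_days + 1))

# 200 days at 10,000 rows/day: ~201M rows touched to return 2M rows.
print(total_rows_scanned(200, 10_000))  # 201000000
```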

If our goal is to limit database load, do we want some other approach? I see some options:

  • As the MariaDB docs suggest, return some kind of unique ID for the "next" row, and then do the pagination based on that ID. If the ID is indexed, the pagination is fast. (I think we already have an ID for every database row, right?)
  • Paginate by a time window. Take the time values in a query and iterate over them. Then we're using an indexed column in the database (or at least I hope it's indexed?) and there's no O(N^2) behavior. This replicates what the COVIDcast clients are doing, but at the server level. It's also wasteful if we're only returning a few results per day, like a query for just one geography.
  • Do some performance testing. Find the worst API query someone could possibly make (all days of JHU in all geos, as_of some date, plus...), see if it actually hurts the database to do a thousand a second, and if not, just make the maximum result limit stratospherically high and avoid this problem entirely.
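
A minimal sketch of the first option (keyset or "seek" pagination), using an in-memory SQLite table as a stand-in; the table name, columns, and page size are all assumptions, not the actual epidata schema:

```python
import sqlite3

# Stand-in table: the real schema differs, but the technique is the same.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signals "
             "(id INTEGER PRIMARY KEY, time_value INTEGER, value REAL)")
conn.executemany("INSERT INTO signals VALUES (?, ?, ?)",
                 [(i, 20200100 + i % 30, float(i)) for i in range(1, 101)])

def fetch_page(last_id, page_size):
    # WHERE id > ? seeks directly via the primary-key index, so no rows
    # are skipped over; contrast with LIMIT ? OFFSET ?, which computes
    # and discards the first `offset` rows on every request.
    return conn.execute(
        "SELECT id, time_value, value FROM signals "
        "WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, page_size)).fetchall()

last_id, pages = 0, []
while True:
    page = fetch_page(last_id, 25)
    if not page:
        break
    pages.append(page)
    last_id = page[-1][0]   # cursor the client echoes back for the next request

print(len(pages))  # 4 pages of 25 rows each
```

The cursor (`last_id` here) would be returned to the client as the "next" token, so each request costs one index seek plus one page of rows regardless of how deep into the result set it is.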

@krivard
Contributor

krivard commented Aug 19, 2020

We do have an ID for every row, but API queries have ORDER BY time_value, geo_value, issue, so I don't think it will be directly usable here. We could use time_value as an ersatz ID, not just segmenting by each value but essentially "rounding" the result set to the nearest time_value boundary. That's in the index so it should be fast.
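
The "rounding" idea above could look something like this sketch (a stand-in SQLite schema; the soft limit and column set are assumptions): fill the page up to the row limit, then extend it to the end of the current time_value so the next request can resume with an indexed WHERE time_value > ?.

```python
import sqlite3

# Stand-in data: 5 days x 7 geographies = 35 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signals (time_value INTEGER, geo_value TEXT, value REAL)")
rows = [(20200800 + d, f"geo{g}", 1.0) for d in range(1, 6) for g in range(7)]
conn.executemany("INSERT INTO signals VALUES (?, ?, ?)", rows)

def fetch_page(after_time, soft_limit):
    cur = conn.execute(
        "SELECT time_value, geo_value, value FROM signals "
        "WHERE time_value > ? ORDER BY time_value, geo_value", (after_time,))
    page = []
    for row in cur:
        # Stop only once the limit is hit AND we crossed a time_value
        # boundary, so the page always ends on a complete time_value.
        if len(page) >= soft_limit and row[0] != page[-1][0]:
            break
        page.append(row)
    return page

page = fetch_page(0, 10)   # soft limit 10 rounds up to 14 rows (2 full days)
print(len(page), page[-1][0])
```

Because every page ends on a time_value boundary, the next page's WHERE clause only needs the indexed time_value column, at the cost of pages being somewhat larger than the nominal limit.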

We should 100% do some performance benchmarking. The largest dataset retrievable using the current set of API parameters and ignoring pagination limits contains ~24.2M rows:

  • Source: jhu-csse
  • Signal: confirmed_cumulative_num
  • Geo type: county
  • Geo value: *
  • Time type: day
  • Time values: 20200122-20200817
  • Issues: 20200122-20200817
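
For scale, a rough page count for that worst case under a hypothetical per-request row limit (the 3,650 figure below is an assumption for illustration, not the API's actual cap):

```python
import math

# Back-of-envelope sizing for the ~24.2M-row worst-case query.
TOTAL_ROWS = 24_200_000
PAGE_SIZE = 3_650          # hypothetical per-request limit, not the real cap

pages = math.ceil(TOTAL_ROWS / PAGE_SIZE)
# Under OFFSET paging, page k scans k * PAGE_SIZE rows, so the total is
# PAGE_SIZE * pages * (pages + 1) / 2 -- tens of billions of row touches.
scanned = PAGE_SIZE * pages * (pages + 1) // 2

print(pages, scanned)
```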

@capnrefsmmat
Contributor

Is it necessary to maintain the ordering of API results? I don't know if any clients do (or should) rely on results being returned in a specific order, so we could order by the unique ID.

@krivard
Contributor

krivard commented Aug 20, 2020

Any deterministic order should be fine, but it does tend to take a couple of days to revise all the unit tests -- we shouldn't depend on being able to make ordering changes immediately and/or repeatedly.

@sgratzl
Member Author

sgratzl commented Dec 16, 2020

closing in favor of #337

sgratzl closed this Dec 16, 2020


Development

Successfully merging this pull request may close these issues.

Feature Request: Support streaming rather than truncate results
