Incorrect values in jhu-csse:confirmed_incidence_num #1685

@dshemetov

Description

While working on the JIT A/B tests (cmu-delphi/delphi-epidata#947), I found a mismatch between the incidence and cumulative signals.

Actual behavior

The confirmed_incidence_num values for county '02100' don't agree with the day-over-day differences of confirmed_cumulative_num.

from epidatpy.request import EpiRange, Epidata

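# Fetch cumulative counts starting one day earlier; diff() leaves a NaN in the first row, which [1:] drops.
manual_incidence = Epidata.covidcast("jhu-csse", "confirmed_cumulative_num", "day", "county", EpiRange(20200620, 20200630), "02100").df().value.diff()[1:]
# Fetch the API's incidence signal for the resulting dates (2020-06-21 to 2020-06-30).
api_incidence    = Epidata.covidcast("jhu-csse", "confirmed_incidence_num",  "day", "county", EpiRange(20200621, 20200630), "02100").df().value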
manual_incidence.to_numpy()
# array([ 0.,  0.,  1.,  2.,  0.,  0.,  0.,  0.,  0., -3.])
api_incidence.to_numpy()
# array([ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

Expected behavior

These should match. The dates line up.
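
The same comparison can be done date by date, so that the mismatching days are named explicitly. This is a minimal sketch that hard-codes the two arrays printed in this issue (no API call is made); `find_mismatches` is a hypothetical helper, not part of epidatpy.

```python
from datetime import date, timedelta

def find_mismatches(dates, expected, reported):
    """Return (date, expected, reported) for each date where the two series disagree."""
    return [(d, e, r) for d, e, r in zip(dates, expected, reported) if e != r]

# Values copied from the arrays printed in this issue (county 02100).
dates = [date(2020, 6, 21) + timedelta(days=i) for i in range(10)]
manual_incidence = [0.0, 0.0, 1.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, -3.0]
api_incidence = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

mismatches = find_mismatches(dates, manual_incidence, api_incidence)
for d, e, r in mismatches:
    print(d.isoformat(), e, r)
# 2020-06-24 2.0 0.0
# 2020-06-30 -3.0 0.0
```

On these values the two series disagree on 2020-06-24 and on 2020-06-30.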

Context

Looking at confirmed_cumulative_num, there's definitely a jump of 2 cases (from 3.0 to 5.0) on 2020-06-24:

Epidata.covidcast("jhu-csse", "confirmed_cumulative_num", "day", "county", EpiRange(20200620, 20200630), "02100").df()

#       source                    signal geo_type geo_value time_type time_value      issue  lag  value stderr sample_size direction  missing_value  missing_stderr  missing_sample_size
# 0   jhu-csse  confirmed_cumulative_num   county     02100       day 2020-06-20 2020-06-21    1    2.0   None        None      None              0               0                    0
# 1   jhu-csse  confirmed_cumulative_num   county     02100       day 2020-06-21 2020-10-29  130    2.0   None        None      None              0               0                    0
# 2   jhu-csse  confirmed_cumulative_num   county     02100       day 2020-06-22 2020-06-23    1    2.0   None        None      None              0               0                    0
# 3   jhu-csse  confirmed_cumulative_num   county     02100       day 2020-06-23 2020-06-24    1    3.0   None        None      None              0               0                    0
# 4   jhu-csse  confirmed_cumulative_num   county     02100       day 2020-06-24 2021-04-01  281    5.0   None        None      None              0               0                    0
# 5   jhu-csse  confirmed_cumulative_num   county     02100       day 2020-06-25 2021-04-01  280    5.0   None        None      None              0               0                    0
# 6   jhu-csse  confirmed_cumulative_num   county     02100       day 2020-06-26 2021-04-01  279    5.0   None        None      None              0               0                    0
# 7   jhu-csse  confirmed_cumulative_num   county     02100       day 2020-06-27 2021-04-01  278    5.0   None        None      None              0               0                    0
# 8   jhu-csse  confirmed_cumulative_num   county     02100       day 2020-06-28 2021-04-01  277    5.0   None        None      None              0               0                    0
# 9   jhu-csse  confirmed_cumulative_num   county     02100       day 2020-06-29 2021-04-01  276    5.0   None        None      None              0               0                    0
# 10  jhu-csse  confirmed_cumulative_num   county     02100       day 2020-06-30 2020-07-01    1    2.0   None        None      None              0               0                    0

but confirmed_incidence_num doesn't reflect that; it reports 0.0 for 2020-06-24:

Epidata.covidcast("jhu-csse", "confirmed_incidence_num", "day", "county", EpiRange(20200620, 20200630), "02100").df()

#       source                   signal geo_type geo_value time_type time_value      issue  lag  value stderr sample_size direction  missing_value  missing_stderr  missing_sample_size
# 0   jhu-csse  confirmed_incidence_num   county     02100       day 2020-06-20 2020-06-21    1    0.0   None        None      None              0               0                    0
# 1   jhu-csse  confirmed_incidence_num   county     02100       day 2020-06-21 2020-10-29  130    0.0   None        None      None              0               0                    0
# 2   jhu-csse  confirmed_incidence_num   county     02100       day 2020-06-22 2020-06-23    1    0.0   None        None      None              0               0                    0
# 3   jhu-csse  confirmed_incidence_num   county     02100       day 2020-06-23 2020-06-24    1    1.0   None        None      None              0               0                    0
# 4   jhu-csse  confirmed_incidence_num   county     02100       day 2020-06-24 2021-04-01  281    0.0   None        None      None              0               0                    0
# 5   jhu-csse  confirmed_incidence_num   county     02100       day 2020-06-25 2021-04-01  280    0.0   None        None      None              0               0                    0
# 6   jhu-csse  confirmed_incidence_num   county     02100       day 2020-06-26 2021-04-01  279    0.0   None        None      None              0               0                    0
# 7   jhu-csse  confirmed_incidence_num   county     02100       day 2020-06-27 2021-04-01  278    0.0   None        None      None              0               0                    0
# 8   jhu-csse  confirmed_incidence_num   county     02100       day 2020-06-28 2021-04-01  277    0.0   None        None      None              0               0                    0
# 9   jhu-csse  confirmed_incidence_num   county     02100       day 2020-06-29 2021-04-01  276    0.0   None        None      None              0               0                    0
# 10  jhu-csse  confirmed_incidence_num   county     02100       day 2020-06-30 2020-07-01    1    0.0   None        None      None              0               0                    0

Judging by the issue column, the dates 2020-06-24 through 2020-06-29 were last updated on 2021-04-01, unlike the surrounding dates. Maybe this was a patch that wasn't computed correctly?
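
To see how the mismatches relate to the issue dates, one option is to group the disagreeing dates by the issue of their row. The sketch below hard-codes the time_value, issue, and value columns from the two tables above (both tables show the same issue per date); `mismatches_by_issue` is a hypothetical helper, and it assumes the rows are consecutive days in order.

```python
from collections import defaultdict

# (time_value, issue, cumulative value, incidence value), copied from the tables above.
rows = [
    ("2020-06-20", "2020-06-21", 2.0, 0.0),
    ("2020-06-21", "2020-10-29", 2.0, 0.0),
    ("2020-06-22", "2020-06-23", 2.0, 0.0),
    ("2020-06-23", "2020-06-24", 3.0, 1.0),
    ("2020-06-24", "2021-04-01", 5.0, 0.0),
    ("2020-06-25", "2021-04-01", 5.0, 0.0),
    ("2020-06-26", "2021-04-01", 5.0, 0.0),
    ("2020-06-27", "2021-04-01", 5.0, 0.0),
    ("2020-06-28", "2021-04-01", 5.0, 0.0),
    ("2020-06-29", "2021-04-01", 5.0, 0.0),
    ("2020-06-30", "2020-07-01", 2.0, 0.0),
]

def mismatches_by_issue(rows):
    """Map issue date -> dates whose incidence differs from the diff of cumulative."""
    grouped = defaultdict(list)
    # Pair each row with its predecessor to compute the day-over-day difference.
    for (_, _, prev_cum, _), (day, issue, cum, inc) in zip(rows, rows[1:]):
        if cum - prev_cum != inc:
            grouped[issue].append(day)
    return dict(grouped)

result = mismatches_by_issue(rows)
print(result)
# {'2021-04-01': ['2020-06-24'], '2020-07-01': ['2020-06-30']}
```

On these rows, both mismatches sit at the edges of the block issued on 2021-04-01: 2020-06-24 is its first date, and 2020-06-30 is the first date after it.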

But what surprises me is that the jhu indicator processes all the dates every time it runs and writes them to CSV files, so a discrepancy like this should've been caught in the archiver diffs. So either:

  • the diff isn't correctly computed in the indicator, or
  • the archiver isn't detecting this change correctly

Metadata

Labels

data quality: Missing data, weird data, broken data
