Closed
Labels
data quality: Missing data, weird data, broken data
Description
While working on the JIT A/B tests (cmu-delphi/delphi-epidata#947), I found this.
Actual Behavior:
The incidence values for county '02100' don't agree with their cumulative counterparts.
from epidatpy.request import EpiRange, Epidata

# Difference the cumulative signal by hand. The request starts one day early
# (2020-06-20) so that the first usable diff lands on 2020-06-21.
manual_incidence = Epidata.covidcast("jhu-csse", "confirmed_cumulative_num", "day", "county", EpiRange(20200620, 20200630), "02100").df().value.diff()[1:]
# Incidence as served by the API for 2020-06-21 through 2020-06-30.
api_incidence = Epidata.covidcast("jhu-csse", "confirmed_incidence_num", "day", "county", EpiRange(20200621, 20200630), "02100").df().value
manual_incidence.to_numpy()
# array([ 0., 0., 1., 2., 0., 0., 0., 0., 0., -3.])
api_incidence.to_numpy()
# array([ 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.])
Expected behavior
These should match. The dates line up.
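To make the mismatch easier to pin to specific dates, here is a minimal offline sketch of the same comparison. The values are copied by hand from the tables in this issue rather than fetched from the API, and `find_mismatches` is a hypothetical helper name, not part of epidatpy.

```python
from datetime import date, timedelta

# Values copied from the tables in this issue (county 02100),
# one entry per day from 2020-06-20 through 2020-06-30.
start = date(2020, 6, 20)
cumulative = [2, 2, 2, 3, 5, 5, 5, 5, 5, 5, 2]
api_incidence = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
dates = [start + timedelta(days=i) for i in range(len(cumulative))]


def find_mismatches(dates, cumulative, incidence):
    """Return (date, diff of cumulative, reported incidence) for each disagreeing day."""
    mismatches = []
    # Start at index 1: the first day has no predecessor to difference against.
    for i in range(1, len(cumulative)):
        manual = cumulative[i] - cumulative[i - 1]
        if manual != incidence[i]:
            mismatches.append((dates[i].isoformat(), manual, incidence[i]))
    return mismatches


mismatches = find_mismatches(dates, cumulative, api_incidence)
for row in mismatches:
    print(row)
```

With the values above, the only disagreeing days are 2020-06-24 (diff of 2 versus reported 0) and 2020-06-30 (diff of -3 versus reported 0).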
Context
Looking at confirmed_cumulative_num, there's definitely a jump of 2 cases on 2020-06-24 (from 3.0 to 5.0):
Epidata.covidcast("jhu-csse", "confirmed_cumulative_num", "day", "county", EpiRange(20200620, 20200630), "02100").df()
# source signal geo_type geo_value time_type time_value issue lag value stderr sample_size direction missing_value missing_stderr missing_sample_size
# 0 jhu-csse confirmed_cumulative_num county 02100 day 2020-06-20 2020-06-21 1 2.0 None None None 0 0 0
# 1 jhu-csse confirmed_cumulative_num county 02100 day 2020-06-21 2020-10-29 130 2.0 None None None 0 0 0
# 2 jhu-csse confirmed_cumulative_num county 02100 day 2020-06-22 2020-06-23 1 2.0 None None None 0 0 0
# 3 jhu-csse confirmed_cumulative_num county 02100 day 2020-06-23 2020-06-24 1 3.0 None None None 0 0 0
# 4 jhu-csse confirmed_cumulative_num county 02100 day 2020-06-24 2021-04-01 281 5.0 None None None 0 0 0
# 5 jhu-csse confirmed_cumulative_num county 02100 day 2020-06-25 2021-04-01 280 5.0 None None None 0 0 0
# 6 jhu-csse confirmed_cumulative_num county 02100 day 2020-06-26 2021-04-01 279 5.0 None None None 0 0 0
# 7 jhu-csse confirmed_cumulative_num county 02100 day 2020-06-27 2021-04-01 278 5.0 None None None 0 0 0
# 8 jhu-csse confirmed_cumulative_num county 02100 day 2020-06-28 2021-04-01 277 5.0 None None None 0 0 0
# 9 jhu-csse confirmed_cumulative_num county 02100 day 2020-06-29 2021-04-01 276 5.0 None None None 0 0 0
# 10 jhu-csse confirmed_cumulative_num county 02100 day 2020-06-30 2020-07-01 1 2.0 None None None 0 0 0
but incidence doesn't reflect that:
Epidata.covidcast("jhu-csse", "confirmed_incidence_num", "day", "county", EpiRange(20200620, 20200630), "02100").df()
# source signal geo_type geo_value time_type time_value issue lag value stderr sample_size direction missing_value missing_stderr missing_sample_size
# 0 jhu-csse confirmed_incidence_num county 02100 day 2020-06-20 2020-06-21 1 0.0 None None None 0 0 0
# 1 jhu-csse confirmed_incidence_num county 02100 day 2020-06-21 2020-10-29 130 0.0 None None None 0 0 0
# 2 jhu-csse confirmed_incidence_num county 02100 day 2020-06-22 2020-06-23 1 0.0 None None None 0 0 0
# 3 jhu-csse confirmed_incidence_num county 02100 day 2020-06-23 2020-06-24 1 1.0 None None None 0 0 0
# 4 jhu-csse confirmed_incidence_num county 02100 day 2020-06-24 2021-04-01 281 0.0 None None None 0 0 0
# 5 jhu-csse confirmed_incidence_num county 02100 day 2020-06-25 2021-04-01 280 0.0 None None None 0 0 0
# 6 jhu-csse confirmed_incidence_num county 02100 day 2020-06-26 2021-04-01 279 0.0 None None None 0 0 0
# 7 jhu-csse confirmed_incidence_num county 02100 day 2020-06-27 2021-04-01 278 0.0 None None None 0 0 0
# 8 jhu-csse confirmed_incidence_num county 02100 day 2020-06-28 2021-04-01 277 0.0 None None None 0 0 0
# 9 jhu-csse confirmed_incidence_num county 02100 day 2020-06-29 2021-04-01 276 0.0 None None None 0 0 0
# 10 jhu-csse confirmed_incidence_num county 02100 day 2020-06-30 2020-07-01 1 0.0 None None None 0 0 0
Looking at the issue column, the dates 2020-06-24 through 2020-06-29 were last updated on 2021-04-01, unlike the earlier dates. Maybe this was a patch that didn't get computed correctly?
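The issue-date pattern can be seen at a glance by grouping the reference dates by the issue that last touched them. This is a small sketch using the (time_value, issue) pairs copied from the tables in this issue, which are identical for both signals. The variable names are illustrative only.

```python
from collections import defaultdict

# (time_value, issue) pairs copied from the tables in this issue.
rows = [
    ("2020-06-20", "2020-06-21"),
    ("2020-06-21", "2020-10-29"),
    ("2020-06-22", "2020-06-23"),
    ("2020-06-23", "2020-06-24"),
    ("2020-06-24", "2021-04-01"),
    ("2020-06-25", "2021-04-01"),
    ("2020-06-26", "2021-04-01"),
    ("2020-06-27", "2021-04-01"),
    ("2020-06-28", "2021-04-01"),
    ("2020-06-29", "2021-04-01"),
    ("2020-06-30", "2020-07-01"),
]

# Collect the reference dates that each issue date last updated.
by_issue = defaultdict(list)
for time_value, issue in rows:
    by_issue[issue].append(time_value)

# ISO date strings sort chronologically, so plain sorting is enough here.
for issue in sorted(by_issue):
    print(issue, by_issue[issue])
```

The 2021-04-01 issue covers exactly the six days from 2020-06-24 through 2020-06-29, which starts on the same day as the unreflected jump in the cumulative signal.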
But what's surprising to me is that the JHU indicator processes all the dates every time it runs and writes them to CSV files, so this discrepancy should have been caught in the archiver diffs. So either:
- the diff isn't computed correctly in the indicator, or
- the archiver isn't detecting this change correctly