Merged (33 commits)
a00c125
add prefix wip
Jul 27, 2020
15bf81e
update test_run for wip signals
Jul 27, 2020
6dcd1bc
update signal names and export end date
Jul 30, 2020
df302b3
fixed errors in case:no new data
Aug 1, 2020
f8f0bbe
update signal names in DETAILS
Aug 3, 2020
d883c0c
update DETAILS
Aug 4, 2020
e0a6124
remove prefix wip_
Aug 4, 2020
156476f
update the export start date to generate reports from -45 days to -5days
Aug 4, 2020
c730742
deleted white spaces'
Aug 4, 2020
aefcfb5
Update Exceptions
jingjtang Aug 5, 2020
3457802
add wip test_per_device)
Aug 6, 2020
de7b82d
update unit tests
Aug 13, 2020
779c5d6
Added new constants.py file
vishakha1812 Aug 14, 2020
7c50aa1
Updated run.py: added new usecase for signal names
vishakha1812 Aug 14, 2020
e6d1ff6
Updated params.json.template
vishakha1812 Aug 14, 2020
4ffff29
Updated setup.py
vishakha1812 Aug 14, 2020
b3f8977
Updated tests/params.json.template
vishakha1812 Aug 14, 2020
cf2f725
Added new test case to check signal names
vishakha1812 Aug 14, 2020
bfb422e
Updated test_run.py to include new sensor names
vishakha1812 Aug 14, 2020
baac1ca
Added new file to handle signal naming
vishakha1812 Aug 14, 2020
566e7d7
Added missing files in /static
vishakha1812 Aug 14, 2020
e62f772
Update quidel_covidtest/params.json.template
vishakha1812 Aug 18, 2020
148fa7b
Update quidel_covidtest/tests/params.json.template
vishakha1812 Aug 18, 2020
e52e9b0
added a dry-run mode
Aug 18, 2020
71cb57d
Add files via upload
jingjtang Aug 19, 2020
16d6de4
Delete test_data_tools.py
jingjtang Aug 19, 2020
b0814ba
Add files via upload
jingjtang Aug 19, 2020
fd66b79
resolved a conflict caused by accident
Aug 19, 2020
3637c8e
Solved the problems in pylint test
Aug 22, 2020
9d22aa0
Added explainations to TestDate and StorageDate
Aug 26, 2020
a18184e
commented out test_per_device signals
Aug 26, 2020
d01ade2
Fixed the error in the documentation of se
jingjtang Aug 28, 2020
2167231
uploaded test_data for unit tests
Aug 28, 2020
28 changes: 23 additions & 5 deletions quidel_covidtest/DETAILS.md
@@ -4,8 +4,8 @@
Starting May 9, 2020, we began receiving Quidel COVID Test data, and we started reporting it on May 26, 2020 due to limitations in the data volume. The data contains a number of features for every test, including localization at the 5-digit zip code level, a TestDate and StorageDate, patient age, and several identifiers that uniquely identify the device on which the test was performed (SofiaSerNum), the individual test (FluTestNum), and the result (ResultID). Multiple tests are stored on each device. The present Quidel COVID Test sensor concerns the positive rate in the test results.

### Signal names
- raw_pct_positive: estimates of the percentage of positive tests in total tests
- smoothed_pct_positive: same as in the first one, but where the estimates are formed by pooling together the last 7 days of data
- covid_ag_raw_pct_positive: percent of tests returning positive that day
- covid_ag_smoothed_pct_positive: same as above, but for the moving average of the most recent 7 days

### Estimating percent positive test proportion
Let n be the number of total COVID tests taken over a given time period and a given location (the test result can be negative/positive/invalid). Let x be the number of tests taken with positive results in this location over the given time period. We are interested in estimating the percentage of positive tests which is defined as:
@@ -35,10 +35,28 @@ p = 100 * X / N

The estimated standard error is simply:
```
se = 1/100 * sqrt{ p*(1-p)/N }
se = 100 * sqrt{ p/100 *(1-p/100)/N }
```
where we assume for each time point, the estimates follow a binomial distribution.
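The corrected estimate and standard error can be sketched in numpy (a minimal illustration on the percentage scale; the daily counts below are made up):

```python
import numpy as np

# Hypothetical daily counts for one location (illustrative values only).
x = np.array([12.0, 30.0, 45.0])    # positive tests
n = np.array([60.0, 120.0, 150.0])  # total tests

p = 100 * x / n  # percent positive
# Binomial standard error, reported on the same percentage scale as p:
se = 100 * np.sqrt((p / 100) * (1 - p / 100) / n)
```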


### Temporal Pooling
Additionally, as with the Quidel COVID Test signal, we consider smoothed estimates formed by pooling data over time. That is, daily, for each location, we first pool all data available in that location over the last 7 days, and we then recompute everything described in the last two subsections. Pooling in this data makes estimates available in more geographic areas, as many areas report very few tests per day, but have enough data to report when 7 days are considered.
### Temporal and Spatial Pooling
We conduct temporal and spatial pooling for the smoothed signal. The spatial pooling is described in the previous section where we shrink the estimates to the state's mean if the total test number is smaller than 50 for a certain location on a certain day. Additionally, as with the Quidel COVID Test signal, we consider smoothed estimates formed by pooling data over time. That is, daily, for each location, we first pool all data available in that location over the last 7 days, and we then recompute everything described in the last two subsections. Pooling in this data makes estimates available in more geographic areas.
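The temporal pooling step amounts to a trailing-window sum over the last 7 days before the estimates are recomputed. A minimal sketch (this simple helper is illustrative; the package's `_slide_window_sum` may treat early, partial windows differently):

```python
import numpy as np

def trailing_window_sum(x, k):
    """Sum of the last k values (including today) at each day t."""
    out = np.empty(len(x))
    for t in range(len(x)):
        out[t] = x[max(0, t - k + 1):t + 1].sum()
    return out

daily_tests = np.array([2.0, 0.0, 5.0, 1.0, 3.0, 4.0, 0.0, 6.0])
pooled = trailing_window_sum(daily_tests, 7)  # 7-day pooled totals
```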

### Exceptions
There are 9 special zip codes that appear in the raw Quidel COVID data but are temporarily excluded from our reports, since we do not have enough mapping information for them.

|zip |State| Number of Tests|
|---|-------|------|
|78086 |TX|98|
|20174 | VA|17|
|48824 |MI|14|
|32313 |FL|37|
|29486 |SC|69|
|75033 |TX|2318|
|79430 |TX|43|
|44325 |OH|56|
|75072 |TX|63|

* Number of tests counted through 08-05-2020.
* Through 08-05-2020, these zip codes account for only 2,715 of 942,293 total tests.
33 changes: 33 additions & 0 deletions quidel_covidtest/delphi_quidel_covidtest/constants.py
@@ -0,0 +1,33 @@
"""Registry for constants"""
# global constants
MIN_OBS = 50 # minimum number of observations in order to compute a proportion.
POOL_DAYS = 7 # number of days in the past (including today) to pool over
END_FROM_TODAY_MINUS = 5 # report data up to this many days before today
EXPORT_DAY_RANGE = 40 # Number of dates to report
# Signal names
SMOOTHED_POSITIVE = "covid_ag_smoothed_pct_positive"
RAW_POSITIVE = "covid_ag_raw_pct_positive"
SMOOTHED_TEST_PER_DEVICE = "covid_ag_smoothed_test_per_device"
RAW_TEST_PER_DEVICE = "covid_ag_raw_test_per_device"
# Geo types
COUNTY = "county"
MSA = "msa"
HRR = "hrr"

GEO_RESOLUTIONS = [
    COUNTY,
    MSA,
    HRR
]
SENSORS = [
    SMOOTHED_POSITIVE,
    RAW_POSITIVE
    # SMOOTHED_TEST_PER_DEVICE,
    # RAW_TEST_PER_DEVICE
]
SMOOTHERS = {
    SMOOTHED_POSITIVE: (False, True),
    RAW_POSITIVE: (False, False)
    # SMOOTHED_TEST_PER_DEVICE: (True, True),
    # RAW_TEST_PER_DEVICE: (True, False)
}
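Reading the values against the sensor names, each `SMOOTHERS` entry appears to pair a signal with a `(device, smooth)` flag tuple that downstream code can unpack to pick an estimator. The tuple order is my inference from the values, not documented in this file:

```python
# Hypothetical dispatch sketch using the mapping above;
# the (device, smooth) tuple order is an assumption.
SMOOTHERS = {
    "covid_ag_smoothed_pct_positive": (False, True),
    "covid_ag_raw_pct_positive": (False, False),
}

labels = {}
for sensor, (device, smooth) in SMOOTHERS.items():
    kind = "test_per_device" if device else "pct_positive"
    labels[sensor] = ("smoothed " if smooth else "raw ") + kind
```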
123 changes: 123 additions & 0 deletions quidel_covidtest/delphi_quidel_covidtest/data_tools.py
@@ -250,3 +250,126 @@ def smoothed_positive_prop(positives, tests, min_obs, pool_days,
pooled_tests = tpooled_tests
## STEP 2: CALCULATE AS THOUGH THEY'RE RAW
return raw_positive_prop(pooled_positives, pooled_tests, min_obs)


def raw_tests_per_device(devices, tests, min_obs):
    '''
    Calculates the tests per device for a single geographic
    location, without any temporal smoothing.

    If on any day t, tests[t] < min_obs, then we report np.nan.
    The second and third returned np.ndarray are the standard errors
    (currently all np.nan) and the sample size.
    Args:
        devices: np.ndarray[float]
            Number of devices, ordered in time, where each array element
            represents a subsequent day. If there were no devices, this should
            be zero (never np.nan).
        tests: np.ndarray[float]
            Number of tests performed. If there were no tests performed, this
            should be zero (never np.nan).
        min_obs: int
            Minimum number of observations in order to compute a ratio
    Returns:
        np.ndarray
            Tests per device on each day, with the same length
            as devices and tests.
        np.ndarray
            Placeholder for standard errors
        np.ndarray
            Sample size used to compute estimates.
    '''
    devices = devices.astype(float)
    tests = tests.astype(float)
    if np.any(np.isnan(devices)) or np.any(np.isnan(tests)):
        raise ValueError('devices and tests should be non-negative '
                         'with no np.nan')
    if min_obs <= 0:
        raise ValueError('min_obs should be positive')
    tests[tests < min_obs] = np.nan
    tests_per_device = tests / devices
    se = np.repeat(np.nan, len(devices))
    sample_size = tests

    return tests_per_device, se, sample_size
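The masking rule can be exercised standalone. This is a simplified restatement of the function for illustration, not an import of the actual module:

```python
import numpy as np

def raw_tests_per_device_sketch(devices, tests, min_obs):
    # Days with fewer than min_obs tests are masked out as np.nan.
    devices = devices.astype(float)
    tests = tests.astype(float).copy()
    tests[tests < min_obs] = np.nan
    ratio = tests / devices
    se = np.repeat(np.nan, len(devices))  # placeholder standard errors
    return ratio, se, tests               # tests doubles as the sample size

ratio, se, n = raw_tests_per_device_sketch(
    np.array([2, 4]), np.array([30, 100]), min_obs=50)
# Day 0 has only 30 tests (< 50), so its ratio is masked out.
```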

def smoothed_tests_per_device(devices, tests, min_obs, pool_days,
                              parent_devices=None, parent_tests=None):
    """
    Calculates the ratio of tests per device for a single geographic
    location, with temporal smoothing.
    For a given day t, if sum(tests[(t-pool_days+1):(t+1)]) < min_obs, then we
    'borrow' min_obs - sum(tests[(t-pool_days+1):(t+1)]) observations from the
    parents over the same timespan. Importantly, it will make sure NOT to
    borrow observations that are _already_ in the current geographic partition
    being considered.
    If min_obs is specified but not satisfied over the pool_days, and
    parent arrays are not provided, then we report np.nan.
    The second and third returned np.ndarray are the standard errors
    (currently all placeholder np.nan) and the reported sample_size.
    Args:
        devices: np.ndarray[float]
            Number of devices, ordered in time, where each array element
            represents a subsequent day. If there were no devices, this should
            be zero (never np.nan).
        tests: np.ndarray[float]
            Number of tests performed. If there were no tests performed, this
            should be zero (never np.nan).
        min_obs: int
            Minimum number of observations in order to compute a ratio
        pool_days: int
            Number of days in the past (including today) over which to pool data.
        parent_devices: np.ndarray
            Like devices, but for the parent geographic partition (e.g., state).
            If None, the parent is treated as having 0 devices uniformly.
        parent_tests: np.ndarray
            Like tests, but for the parent geographic partition (e.g., state).
            If None, the parent is treated as having 0 tests uniformly.
    Returns:
        np.ndarray
            Tests per device after the pool_days pooling, with the same
            length as devices and tests.
        np.ndarray
            Standard errors, currently uniformly np.nan (placeholder).
        np.ndarray
            Effective sample size (after temporal and geographic pooling).
    """
    devices = devices.astype(float)
    tests = tests.astype(float)
    if (parent_devices is None) or (parent_tests is None):
        has_parent = False
    else:
        has_parent = True
        parent_devices = parent_devices.astype(float)
        parent_tests = parent_tests.astype(float)
    if np.any(np.isnan(devices)) or np.any(np.isnan(tests)):
        raise ValueError('devices and tests '
                         'should be non-negative with no np.nan')
    if has_parent:
        if (np.any(np.isnan(parent_devices))
                or np.any(np.isnan(parent_tests))):
            raise ValueError('parent devices and parent tests '
                             'should be non-negative with no np.nan')
    if min_obs <= 0:
        raise ValueError('min_obs should be positive')
    if (pool_days <= 0) or not isinstance(pool_days, int):
        raise ValueError('pool_days should be a positive int')
    # STEP 0: DO THE TEMPORAL POOLING
    tpooled_devices = _slide_window_sum(devices, pool_days)
    tpooled_tests = _slide_window_sum(tests, pool_days)
    if has_parent:
        tpooled_pdevices = _slide_window_sum(parent_devices, pool_days)
        tpooled_ptests = _slide_window_sum(parent_tests, pool_days)
        borrow_prop = _geographical_pooling(tpooled_tests, tpooled_ptests,
                                            min_obs)
        pooled_devices = (tpooled_devices
                          + borrow_prop * tpooled_pdevices)
        pooled_tests = (tpooled_tests
                        + borrow_prop * tpooled_ptests)
    else:
        pooled_devices = tpooled_devices
        pooled_tests = tpooled_tests
    ## STEP 2: CALCULATE AS THOUGH THEY'RE RAW
    return raw_tests_per_device(pooled_devices, pooled_tests, min_obs)
3 changes: 2 additions & 1 deletion quidel_covidtest/delphi_quidel_covidtest/export.py
@@ -32,7 +32,8 @@ def export_csv(df, geo_name, sensor, receiving_dir, start_date, end_date):
t = pd.to_datetime(str(date))
date_short = t.strftime('%Y%m%d')
export_fn = f"{date_short}_{geo_name}_{sensor}.csv"
result_df = df[df["timestamp"] == date][["geo_id", "val", "se", "sample_size"]].dropna()
result_df = df[df["timestamp"] == date][["geo_id", "val", "se", "sample_size"]]
result_df = result_df[result_df["sample_size"].notnull()]
result_df.to_csv(f"{receiving_dir}/{export_fn}",
index=False,
float_format="%.8f")
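This filter change matters for the new test-per-device signals: their `se` column is a uniform NaN placeholder, so the old `dropna()` (which drops rows with a NaN in *any* column) would discard every row, while filtering on `sample_size` alone keeps them. A small illustration with made-up values:

```python
import numpy as np
import pandas as pd

result_df = pd.DataFrame({
    "geo_id": ["ca", "ny"],
    "val": [3.2, np.nan],
    "se": [np.nan, np.nan],         # placeholder standard errors
    "sample_size": [120.0, np.nan],
})

old_rows = result_df.dropna()                             # drops every row
new_rows = result_df[result_df["sample_size"].notnull()]  # keeps "ca"
```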
120 changes: 90 additions & 30 deletions quidel_covidtest/delphi_quidel_covidtest/generate_sensor.py
@@ -3,85 +3,145 @@
Functions to help generate sensor for different geographical levels
"""
import pandas as pd
from .data_tools import fill_dates, raw_positive_prop, smoothed_positive_prop
from .data_tools import (fill_dates, raw_positive_prop,
smoothed_positive_prop,
smoothed_tests_per_device,
raw_tests_per_device)

MIN_OBS = 50 # minimum number of observations in order to compute a proportion.
POOL_DAYS = 7

def generate_sensor_for_states(state_data, smooth, first_date, last_date):
def generate_sensor_for_states(state_groups, smooth, device, first_date, last_date):
"""
fit over states
Args:
state_data: pd.DataFrame
state_groups: pd.groupby.generic.DataFrameGroupBy
state_key: "state_id"
smooth: bool
Consider raw or smooth
device: bool
Consider test_per_device or pct_positive
Returns:
df: pd.DataFrame
"""
state_df = pd.DataFrame(columns=["geo_id", "val", "se", "sample_size", "timestamp"])
state_groups = state_data.groupby("state_id")
state_list = list(state_groups.groups.keys())
for state in state_list:
state_group = state_groups.get_group(state)
state_group = state_group.drop(columns=["state_id"])
state_group.set_index("timestamp", inplace=True)
state_group = fill_dates(state_group, first_date, last_date)

if smooth:
stat, se, sample_size = smoothed_positive_prop(tests=state_group['totalTest'].values,
positives=state_group['positiveTest'].values,
min_obs=MIN_OBS, pool_days=POOL_DAYS)
# smoothed test per device
if device & smooth:
stat, se, sample_size = smoothed_tests_per_device(
devices=state_group["numUniqueDevices"].values,
tests=state_group['totalTest'].values,
min_obs=MIN_OBS, pool_days=POOL_DAYS)
# raw test per device
elif device & (not smooth):
stat, se, sample_size = raw_tests_per_device(
devices=state_group["numUniqueDevices"].values,
tests=state_group['totalTest'].values,
min_obs=MIN_OBS)
# smoothed pct positive
elif (not device) & smooth:
stat, se, sample_size = smoothed_positive_prop(
tests=state_group['totalTest'].values,
positives=state_group['positiveTest'].values,
min_obs=MIN_OBS, pool_days=POOL_DAYS)
stat = stat * 100
# raw pct positive
else:
stat, se, sample_size = raw_positive_prop(tests=state_group['totalTest'].values,
positives=state_group['positiveTest'].values,
min_obs=MIN_OBS)
stat = stat * 100
stat, se, sample_size = raw_positive_prop(
tests=state_group['totalTest'].values,
positives=state_group['positiveTest'].values,
min_obs=MIN_OBS)
stat = stat * 100

se = se * 100
state_df = state_df.append(pd.DataFrame({"geo_id": state,
"timestamp": state_group.index,
"val": stat,
"se": se,
"sample_size": sample_size}))
return state_df, state_groups
return state_df

def generate_sensor_for_other_geores(state_groups, data, res_key, smooth, first_date, last_date):
def generate_sensor_for_other_geores(state_groups, data, res_key, smooth,
device, first_date, last_date):
"""
fit over counties/HRRs/MSAs
Args:
data: pd.DataFrame
res_key: "fips", "cbsa_id" or "hrrnum"
smooth: bool
Consider raw or smooth
device: bool
Consider test_per_device or pct_positive
Returns:
df: pd.DataFrame
"""
has_parent = True
res_df = pd.DataFrame(columns=["geo_id", "val", "se", "sample_size"])
res_groups = data.groupby(res_key)
loc_list = list(res_groups.groups.keys())
for loc in loc_list:
res_group = res_groups.get_group(loc)
has_parent = True
parent_state = res_group['state_id'].values[0]
parent_group = state_groups.get_group(parent_state)
res_group = res_group.merge(parent_group, how="left",
on="timestamp", suffixes=('', '_parent'))
res_group = res_group.drop(columns=[res_key, "state_id", "state_id" + '_parent'])
try:
parent_group = state_groups.get_group(parent_state)
res_group = res_group.merge(parent_group, how="left",
on="timestamp", suffixes=('', '_parent'))
res_group = res_group.drop(columns=[res_key, "state_id", "state_id" + '_parent'])
except KeyError:
has_parent = False
res_group = res_group.drop(columns=[res_key, "state_id"])
res_group.set_index("timestamp", inplace=True)
res_group = fill_dates(res_group, first_date, last_date)

if smooth:
stat, se, sample_size = smoothed_positive_prop(
tests=res_group['totalTest'].values,
positives=res_group['positiveTest'].values,
min_obs=MIN_OBS, pool_days=POOL_DAYS,
parent_tests=res_group["totalTest_parent"].values,
parent_positives=res_group['positiveTest_parent'].values)
if has_parent:
if device:
stat, se, sample_size = smoothed_tests_per_device(
devices=res_group["numUniqueDevices"].values,
tests=res_group['totalTest'].values,
min_obs=MIN_OBS, pool_days=POOL_DAYS,
parent_devices=res_group["numUniqueDevices_parent"].values,
parent_tests=res_group["totalTest_parent"].values)
else:
stat, se, sample_size = smoothed_positive_prop(
tests=res_group['totalTest'].values,
positives=res_group['positiveTest'].values,
min_obs=MIN_OBS, pool_days=POOL_DAYS,
parent_tests=res_group["totalTest_parent"].values,
parent_positives=res_group['positiveTest_parent'].values)
stat = stat * 100
else:
if device:
stat, se, sample_size = smoothed_tests_per_device(
devices=res_group["numUniqueDevices"].values,
tests=res_group['totalTest'].values,
min_obs=MIN_OBS, pool_days=POOL_DAYS)
else:
stat, se, sample_size = smoothed_positive_prop(
tests=res_group['totalTest'].values,
positives=res_group['positiveTest'].values,
min_obs=MIN_OBS, pool_days=POOL_DAYS)
stat = stat * 100
else:
stat, se, sample_size = raw_positive_prop(
tests=res_group['totalTest'].values,
positives=res_group['positiveTest'].values,
min_obs=MIN_OBS)
stat = stat * 100
se = se * 100
if device:
stat, se, sample_size = raw_tests_per_device(
devices=res_group["numUniqueDevices"].values,
tests=res_group['totalTest'].values,
min_obs=MIN_OBS)
else:
stat, se, sample_size = raw_positive_prop(
tests=res_group['totalTest'].values,
positives=res_group['positiveTest'].values,
min_obs=MIN_OBS)
stat = stat * 100

se = se * 100
res_df = res_df.append(pd.DataFrame({"geo_id": loc,
"timestamp": res_group.index,
"val": stat,