diff --git a/validator/PLANS.md b/validator/PLANS.md
new file mode 100644
index 000000000..531d62112
--- /dev/null
+++ b/validator/PLANS.md
@@ -0,0 +1,89 @@
+# Validator checks and features
+
+## Current checks for indicator source data
+
+* Missing dates within the selected range
+* Recognized file name format
+* Recognized geographical type (county, state, etc.)
+* Recognized geo_id format (e.g. state is two lowercase letters)
+* Specific geo_id has been seen before, in historical data
+* Missing geo type + signal + date combos, based on the geo type + signal combos that COVIDcast metadata says should be available
+* Missing ‘val’ values
+* Negative ‘val’ values
+* Out-of-range ‘val’ values (>= 0 for all signals, <= 100 for percents, <= 100,000 for proportions)
+* Missing ‘se’ values
+* Appropriate ‘se’ values, within a calculated reasonable range
+* ‘se’ != 0
+* Whether ‘val’ and ‘se’ are both 0 (seen in Quidel data due to lack of Jeffreys correction, [issue 255](https://github.com/cmu-delphi/covidcast-indicators/issues/255#issuecomment-692196541))
+* Missing ‘sample_size’ values
+* Appropriate ‘sample_size’ values, >= 100 (default) or user-defined threshold
+* Most recent date seen in source data is recent enough, < 1 day ago (default) or user-defined on a per-signal basis
+* Most recent date seen in source data is not in the future
+* Most recent date seen in source data is not older than most recent date seen in reference data
+* Similar number of obs per day as recent API data (static threshold)
+* Similar average value as API data (static threshold)
+* Source data for specified date range is empty
+* API data for specified date range is empty
+
+
+## Current features
+
+* Errors and warnings are summarized in a class attribute and printed on exit
+* If any non-suppressed errors are raised, the validation process exits with non-zero status
+* Various check settings are controllable via indicator-specific params.json files
+* User can manually disable specific checks for specific datasets using a field in the params.json file
+* User can enable test mode (checks only a small number of data files) using a field in the params.json file
+
+## Checks + features wishlist, and problems to think about
+
+### Starter/small issues
+
+* Check for duplicate rows (see the sketch at the end of this section)
+* Backfill problems, especially with JHU and USA Facts, where a change to old data results in a datapoint that doesn’t agree with surrounding data ([JHU examples](https://delphi-org.slack.com/archives/CF9G83ZJ9/p1600729151013900)) or is very different from the value it replaced. If a date is already in the API, check whether any values have changed significantly within the "backfill" window (use the span_length setting). See [this discussion](https://github.com/cmu-delphi/covidcast-indicators/pull/155#discussion_r504195207) for context.
+* Run check_missing_date_files (or similar) on every geo type-signal type combination separately in the comparative checks loop.
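+
+A minimal sketch of the duplicate-row check (the method name is hypothetical; it assumes the `ValidationError` and `increment_total_checks` conventions already used in `validate.py`):
+
+```
+def check_duplicate_rows(self, df_to_test, filename):
+    """Warn if the source data file contains any exact duplicate rows."""
+    duplicates = df_to_test[df_to_test.duplicated()]
+    if not duplicates.empty:
+        self.raised_warnings.append(ValidationError(
+            ("check_duplicate_rows", filename),
+            duplicates, "Duplicate rows found in source data"))
+    self.increment_total_checks()
+```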
+### Larger issues
+
+* Expand framework to support nchs_mortality, which is provided on a weekly basis and has some differences from the daily data. E.g. filenames use a different format ("weekly_YYYYWW_geotype_signalname.csv")
+* Make a backtesting framework so new checks can be run individually on historical indicator data to tune false positives, output verbosity, understand frequency of error raising, etc. Should pull data from the API the first time and save it locally in the `cache` dir.
+* Add DETAILS.md doc with detailed descriptions of what each check does and how. Will be especially important for statistical/anomaly detection checks.
+* Improve errors and error report
+  * Check if [errors raised from validating all signals](https://docs.google.com/spreadsheets/d/1_aRBDrNeaI-3ZwuvkRNSZuZ2wfHJk6Bxj35Ol_XZ9yQ/edit#gid=1226266834) are correct, not false positives, not overly verbose or repetitive
+  * Easier suppression of many errors at once
+  * Maybe store errors as a dict of dicts. Keys could be check strings (e.g. "check_bad_se"), then the next layer geo type, etc.
+  * Nicer formatting for the error “report”.
+    * E.g. if a single type of error is raised for many different datasets, summarize all error messages into a single message? But it still has to be clear how to suppress each individually
+* Check for erratic data sources that wrongly report all zeroes
+  * E.g. the error with the Wisconsin data for the 10/26 forecasts
+  * Wary of a purely static check for this
+  * Are there any geo regions where this might cause false positives? E.g. small counties or MSAs, certain signals (deaths, since it's << cases)
+  * This test is partially captured by checking avgs in source vs reference data, unless erroneous zeroes continue for more than a week
+  * Also partially captured by outlier checking. If zeroes aren't outliers, then it's hard to say that they're erroneous at all.
+* Outlier detection (in progress)
+  * Current approach is tuned to daily cases and daily deaths; use it just on those signals?
+  * prophet (package) detection is flexible, but needs 2-3 months of historical data to fit on. May make sense to use if other statistical checks also need that much data.
+* Use known erroneous/anomalous days of source data to tune static thresholds and test behavior
+* If we can't get data from the API, do we want to use substitute data for the comparative checks instead?
+  * E.g. most recent successful API pull -- might end up being a couple weeks older
+  * Currently, any API fetch problem skips the comparative checks entirely.
+* Improve performance and reduce runtime (no particular goal, just avoid being painfully slow!)
+  * Profiling (iterate)
+  * Save intermediate files?
+  * Currently a bottleneck at the "individual file checks" section. Parallelize?
+  * Make `all_frames` MultiIndex-ed by geo type and signal name? Make a dict of data indexed by geo type and signal name? May improve performance or may just make access more readable.
+* Ensure validator runs on signals that require AWS credentials (iterate)
+
+### Longer-term issues
+
+* Data correctness and consistency over longer time periods (weeks to months). Compare data against long-ago (3 months?) API data for changes in trends.
+  * Long-term trends and correlations between time series. Currently, checks only look at a data window of a few days
+  * Do any relevant anomaly detection packages already exist?
+  * What sorts of hypothesis tests to use? See [time series trend analysis](https://www.genasis.cz/time-series/index.php?pg=home--trend-analysis).
+  * See data-quality GitHub issues, Ryan’s [correlation notebook](https://github.com/cmu-delphi/covidcast/tree/main/R-notebooks), and Dmitry's [indicator validation notebook](https://github.com/cmu-delphi/covidcast-indicators/blob/deploy-jhu/testing_utils/indicator_validation.template.ipynb) for ideas
+    * E.g. doctor visits' decreasing correlation with cases
+    * E.g. WY/RI missing or very low compared to historical data
+* Use hypothesis-test p-values to decide when to raise an error, instead of static thresholds. Many low but individually non-significant p-values should also raise an error. See [here](https://delphi-org.slack.com/archives/CV1SYBC90/p1601307675021000?thread_ts=1600277030.103500&cid=CV1SYBC90) and [here](https://delphi-org.slack.com/archives/CV1SYBC90/p1600978037007500?thread_ts=1600277030.103500&cid=CV1SYBC90) for more background.
+  * Order raised exceptions by p-value
+  * Raise errors when one p-value (per geo region, e.g.) is significant OR when a bunch of p-values for that same type of test (different geo regions, e.g.) are "close" to significant
+  * Correct p-values for multiple testing (see the sketch below)
+    * Bonferroni would be easy but is sensitive to the choice of "family" of tests; Benjamini-Hochberg is a bit more involved but is less sensitive to the choice of "family"; [comparison of the two](https://delphi-org.slack.com/archives/D01A9KNTPKL/p1603294915000500)
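+
+A minimal sketch of the Benjamini-Hochberg idea in plain Python (the function name and input format are illustrative, not part of the module):
+
+```
+def benjamini_hochberg(pvalues, alpha=0.05):
+    """Return indices of p-values significant at false discovery rate alpha."""
+    m = len(pvalues)
+    # Sort p-values in ascending order, remembering their original positions.
+    order = sorted(range(m), key=lambda i: pvalues[i])
+    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
+    max_k = 0
+    for rank, i in enumerate(order, start=1):
+        if pvalues[i] <= rank / m * alpha:
+            max_k = rank
+    # Reject (i.e. flag as significant) the max_k smallest p-values.
+    return sorted(order[:max_k])
+```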
diff --git a/validator/README.md b/validator/README.md
new file mode 100644
index 000000000..8b72a7ab9
--- /dev/null
+++ b/validator/README.md
@@ -0,0 +1,112 @@
+# Validator
+
+The validator performs two main tasks:
+1) Sanity checks on daily data generated from the pipeline of a specific data
+   source.
+2) Comparative analysis with recent data from the API
+   to detect any anomalies, such as spikes or significant value differences.
+
+It checks new source data in CSV format against data pulled from the [COVIDcast API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html).
+
+
+## Running the Validator
+
+The validator is run by executing the Python module contained in this
+directory from the main directory of the indicator of interest.
+
+The safest way to do this is to create a virtual environment,
+install the common DELPHI tools, install the indicator module and its
+dependencies, and then install the validator module and its
+dependencies to the virtual environment.
+
+To do this, navigate to the main directory of the indicator of interest and run the following code:
+
+```
+python -m venv env
+source env/bin/activate
+pip install ../_delphi_utils_python/.
+pip install .
+pip install ../validator
+```
+
+To execute the module and validate source data (by default, in `receiving`), run the indicator to generate data files, then run
+the validator, as follows:
+
+```
+env/bin/python -m delphi_INDICATORNAME
+env/bin/python -m delphi_validator
+```
+
+Once you are finished with the code, you can deactivate the virtual environment
+and (optionally) remove the environment itself.
+
+```
+deactivate
+rm -r env
+```
+
+### Customization
+
+All of the user-changeable parameters are stored in the `validation` field of the indicator's `params.json` file. If `params.json` does not already include a `validation` field, please copy the one provided in this module's `params.json.template`.
+
+Please update the following settings:
+
+* `data_source`: should match the [formatting](https://cmu-delphi.github.io/delphi-epidata/api/covidcast_signals.html) as used in COVIDcast API calls
+* `end_date`: specifies the last date to be checked; if set to "latest", `end_date` will always be the current date
+* `span_length`: specifies the number of days before the `end_date` to check. `span_length` should be long enough to contain all recent source data that is still in the process of being updated (i.e. in the backfill period). For example, if the data source of interest has a 2-week lag before all reports are in for a given date, `span_length` should be 14 days.
+* `smoothed_signals`: list of the names of the signals that are smoothed (e.g. 7-day average)
+* `expected_lag`: dictionary of signal name-int pairs specifying the number of days of expected lag (time between event occurrence and when data about that event is published) for each signal
+* `test_mode`: boolean; `true` checks only a small number of data files
+* `suppressed_errors`: list of lists uniquely specifying errors that have been manually verified as false positives or acceptable deviations from expectations (see the example below)
+
+All other fields contain working defaults, to be modified as needed.
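+
+For example, a minimal `validation` block might look like the following (values are illustrative; see `params.json.template` for a complete example):
+
+```
+"validation": {
+    "data_source": "usa-facts",
+    "end_date": "latest",
+    "span_length": 3,
+    "smoothed_signals": ["confirmed_7dav_incidence_num"],
+    "expected_lag": {"confirmed_7dav_incidence_num": 1},
+    "test_mode": false,
+    "suppressed_errors": [
+        ["check_val_lt_0", "20200906_county_deaths_7dav_incidence_num.csv"]]
+}
+```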
+## Testing the code
+
+To test the code, please create a new virtual environment in the main module directory using the following procedure, similar to above:
+
+```
+python -m venv env
+source env/bin/activate
+pip install ../_delphi_utils_python/.
+pip install .
+```
+
+To do a static test of the code style, it is recommended to run **pylint** on
+the module. To do this, run the following from the main module directory:
+
+```
+env/bin/pylint delphi_validator
+```
+
+The most aggressive checks are turned off; only relatively important issues
+should be raised and they should be manually checked (or better, fixed).
+
+Unit tests are also included in the module. To execute these, run the following command from this directory:
+
+```
+(cd tests && ../env/bin/pytest --cov=delphi_validator --cov-report=term-missing)
+```
+
+The output will show the number of unit tests that passed and failed, along with the percentage of code covered by the tests. None of the tests should fail, and the portion of code not covered by unit tests should be small and should not include critical sub-routines.
+
+
+## Code tour
+
+* run.py: reads the params.json fields and runs the validation process
+* datafetcher.py: methods for loading source and API data
+* validate.py: methods for validating data. Includes the individual check methods and supporting functions.
+* errors.py: custom errors
+
+
+## Adding checks
+
+To add a new validation check, define the check as a `Validator` class method in `validate.py`. Each check should append a descriptive error message to the `raised_errors` attribute if triggered. All checks should allow the user to override exception raising for a specific file using the `suppressed_errors` setting in `params.json`.
+
+This feature requires that the `check_data_id` defined for an error uniquely identifies that combination of check and test data. This usually takes the form of a tuple of strings: the check method name, followed by the test data's filename, or its date, geo type, and signal name.
+
+Add the newly defined check to the `validate()` method so it is executed. It should go in one of three sections (a sketch of a new check follows this list):
+
+* data sanity checks, where a data file is compared against static format settings,
+* data trend and value checks, where a set of source data (can be one or several days) is compared against recent API data from the previous few days,
+* data trend and value checks, where a set of source data is compared against long-term API data from the last few months
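+
+For illustration, a new check might look like the following (the method name and message are hypothetical):
+
+```
+def check_val_not_all_zero(self, df_to_test, filename):
+    """Flag files whose val column is zero for every row."""
+    if (df_to_test["val"] == 0).all():
+        self.raised_errors.append(ValidationError(
+            ("check_val_not_all_zero", filename),
+            None, "val is zero for every row in the file"))
+    self.increment_total_checks()
+```
\ No newline at end of file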
diff --git a/validator/REVIEW.md b/validator/REVIEW.md
new file mode 100644
index 000000000..d7dd2ce77
--- /dev/null
+++ b/validator/REVIEW.md
@@ -0,0 +1,37 @@
+## Code Review (Python)
+
+A code review of this module should include a careful look at the code and the
+output. To assist in the process, but certainly not in place of it, please
+check the following items.
+
+**Documentation**
+
+- [ ] the README.md file template is filled out and currently accurate; it is
+possible to load and test the code using only the instructions given
+- [ ] minimal docstrings (one line describing what the function does) are
+included for all functions; full docstrings describing the inputs and expected
+outputs should be given for non-trivial functions
+
+**Structure**
+
+- [ ] code should use 4 spaces for indentation; other style decisions are
+flexible, but be consistent within a module
+- [ ] any required metadata files are checked into the repository and placed
+within the directory `static`
+- [ ] any intermediate files that are created and stored by the module should
+be placed in the directory `cache`
+- [ ] all options and API keys are passed through the file `params.json`
+- [ ] template parameter file (`params.json.template`) is checked into the
+code; no personal (i.e., usernames) or private (i.e., API keys) information is
+included in this template file
+
+**Testing**
+
+- [ ] module can be installed in a new virtual environment
+- [ ] pylint with the default `.pylint` settings run over the module produces
+minimal warnings; warnings that do exist have been confirmed as false positives
+- [ ] reasonably high level of unit test coverage covering all of the main logic
+of the code (e.g., missing coverage for raised errors that do not currently seem
+possible to reach is okay; missing coverage for options that will be needed is
+not)
+- [ ] all unit tests run without errors
diff --git a/validator/delphi_validator/__init__.py b/validator/delphi_validator/__init__.py
new file mode 100644
index 000000000..04a4ece92
--- /dev/null
+++ b/validator/delphi_validator/__init__.py
@@ -0,0 +1,13 @@
+# -*- coding: utf-8 -*-
+"""Module to validate indicator source data before uploading to the public COVIDcast API.
+
+This file defines the functions that are made public by the module. As the
+module is intended to be executed through the main method, these are primarily
+for testing.
+"""
+
+from __future__ import absolute_import
+
+from . import run
+
+__version__ = "0.1.0"
diff --git a/validator/delphi_validator/__main__.py b/validator/delphi_validator/__main__.py
new file mode 100644
index 000000000..c7cca0ec9
--- /dev/null
+++ b/validator/delphi_validator/__main__.py
@@ -0,0 +1,10 @@
+# -*- coding: utf-8 -*-
+"""Call the function run_module when executed.
+
+This file indicates that running the module (`python -m delphi_validator`) will
+call the function `run_module` found within the run.py file.
+"""
+
+from .run import run_module
+
+run_module()
diff --git a/validator/delphi_validator/datafetcher.py b/validator/delphi_validator/datafetcher.py
new file mode 100644
index 000000000..b920259e4
--- /dev/null
+++ b/validator/delphi_validator/datafetcher.py
@@ -0,0 +1,97 @@
+# -*- coding: utf-8 -*-
+"""
+Functions to get CSV filenames and data.
+"""
+
+import re
+from os import listdir
+from os.path import isfile, join
+from itertools import product
+import pandas as pd
+import numpy as np
+
+import covidcast
+from .errors import APIDataFetchError
+
+# Matches CSV names like "20200624_county_smoothed_nohh_cmnty_cli.csv" and
+# captures the date, geo type, and signal name as named groups.
+filename_regex = re.compile(
+    r'^(?P<date>\d{8})_(?P<geo_type>\w+?)_(?P<signal>\w+)\.csv$')
+
+
+def read_filenames(path):
+    """
+    Return a list of tuples pairing each filename in the specified directory
+    with its regex match against the CSV filename format (None if no match).
+
+    Arguments:
+    - path: path to the directory containing CSV data files.
+
+    Returns:
+    - list of tuples
+    """
+    daily_filenames = [(f, filename_regex.match(f))
+                       for f in listdir(path) if isfile(join(path, f))]
+    return daily_filenames
+
+
+def load_csv(path):
+    """
+    Load CSV with specified column types.
+    """
+    return pd.read_csv(
+        path,
+        dtype={
+            'geo_id': str,
+            'val': float,
+            'se': float,
+            'sample_size': float,
+        })
+
+
+def get_geo_signal_combos(data_source):
+    """
+    Get the list of geo type-signal type combinations that we expect to see,
+    based on combinations reported as available by COVIDcast metadata.
+    """
+    meta = covidcast.metadata()
+    source_meta = meta[meta['data_source'] == data_source]
+    unique_signals = source_meta['signal'].unique().tolist()
+    unique_geotypes = source_meta['geo_type'].unique().tolist()
+
+    geo_signal_combos = list(product(unique_geotypes, unique_signals))
+    print("Number of expected geo region-signal combinations:",
+          len(geo_signal_combos))
+
+    return geo_signal_combos
+
+
+def fetch_api_reference(data_source, start_date, end_date, geo_type, signal_type):
+    """
+    Get and process API data for use as a reference. Formatting is changed
+    to match that of source data CSVs.
+    """
+    api_df = covidcast.signal(
+        data_source, signal_type, start_date, end_date, geo_type)
+
+    if not isinstance(api_df, pd.DataFrame):
+        custom_msg = (f"Error fetching data from {start_date} to {end_date} "
+                      f"for data source: {data_source}, "
+                      f"signal type: {signal_type}, geo type: {geo_type}")
+
+        raise APIDataFetchError(custom_msg)
+
+    column_names = ["geo_id", "val",
+                    "se", "sample_size", "time_value"]
+
+    # Replace None with NA to make numerical manipulation easier.
+    # Rename and reorder columns to match those in df_to_test.
+    api_df = api_df.replace(
+        to_replace=[None], value=np.nan
+    ).rename(
+        columns={'geo_value': "geo_id", 'stderr': 'se', 'value': 'val'}
+    ).drop(
+        ['direction', 'issue', 'lag'], axis=1
+    ).reindex(columns=column_names)
+
+    return api_df
diff --git a/validator/delphi_validator/errors.py b/validator/delphi_validator/errors.py
new file mode 100644
index 000000000..aa688ab54
--- /dev/null
+++ b/validator/delphi_validator/errors.py
@@ -0,0 +1,37 @@
+# -*- coding: utf-8 -*-
+"""
+Custom validator exceptions.
+"""
+
+
+class APIDataFetchError(Exception):
+    """Exception raised when reading API data goes wrong.
+
+    Attributes:
+        custom_msg -- parameters which caused the error
+    """
+
+    def __init__(self, custom_msg):
+        self.custom_msg = custom_msg
+        super().__init__(self.custom_msg)
+
+    def __str__(self):
+        return '{}'.format(self.custom_msg)
+
+
+class ValidationError(Exception):
+    """ Error raised when a validation check fails. """
+
+    def __init__(self, check_data_id, expression, message):
+        """
+        Arguments:
+        - check_data_id: str or tuple/list of str uniquely identifying the
+        check that was run and on what data
+        - expression: relevant variables to message, e.g., if a date doesn't
+        pass a check, provide the date
+        - message: str explaining why an error was raised
+        """
+        self.check_data_id = tuple(check_data_id) if isinstance(
+            check_data_id, (tuple, list)) else (check_data_id,)
+        self.expression = expression
+        self.message = message
+        # Pass the id and message up to Exception so that printing the error
+        # produces a readable summary instead of an empty string.
+        super().__init__(self.check_data_id, message)
diff --git a/validator/delphi_validator/run.py b/validator/delphi_validator/run.py
new file mode 100644
index 000000000..74371518b
--- /dev/null
+++ b/validator/delphi_validator/run.py
@@ -0,0 +1,16 @@
+# -*- coding: utf-8 -*-
+"""Functions to call when running the tool.
+
+This module should contain a function called `run_module` that is executed
+when the module is run with `python -m delphi_validator`.
+"""
+from delphi_utils import read_params
+from .validate import Validator
+
+
+def run_module():
+    """Read the validation settings from params.json and run all checks."""
+    parent_params = read_params()
+    params = parent_params['validation']
+
+    validator = Validator(params)
+    validator.validate(parent_params["export_dir"])
diff --git a/validator/delphi_validator/validate.py b/validator/delphi_validator/validate.py
new file mode 100644
index 000000000..448a3a847
--- /dev/null
+++ b/validator/delphi_validator/validate.py
@@ -0,0 +1,975 @@
+# -*- coding: utf-8 -*-
+"""
+Tools to validate CSV source data, including various check methods.
+"""
+import sys
+import re
+import math
+import threading
+from os.path import join
+from datetime import date, datetime, timedelta
+import pandas as pd
+
+from .errors import ValidationError, APIDataFetchError
+from .datafetcher import filename_regex, \
+    read_filenames, load_csv, get_geo_signal_combos, \
+    fetch_api_reference
+
+# Recognized geo types and the regex patterns their geo_ids must match.
+geo_regex_dict = {
+    'county': r'^\d{5}$',
+    'hrr': r'^\d{1,3}$',
+    'msa': r'^\d{5}$',
+    'dma': r'^\d{3}$',
+    'state': r'^[a-zA-Z]{2}$',
+    'national': r'^[a-zA-Z]{2}$'
+}
+
+
+def relative_difference_by_min(x, y):
+    """
+    Calculate the relative difference between two numbers, scaled by the smaller of the two.
+    """
+    return (x - y) / min(x, y)
+
+
+def make_date_filter(start_date, end_date):
+    """
+    Create a function to return a boolean of whether a filename of appropriate
+    format contains a date within (inclusive) the specified date range.
+
+    Arguments:
+    - start_date: datetime date object
+    - end_date: datetime date object
+
+    Returns:
+    - Custom function object
+    """
+    # Convert dates from datetime format to int.
+    start_code = int(start_date.strftime("%Y%m%d"))
+    end_code = int(end_date.strftime("%Y%m%d"))
+
+    def custom_date_filter(match):
+        """
+        Return a boolean of whether a filename of appropriate format contains a date
+        within the specified date range.
+
+        Arguments:
+        - match: regex match object based on filename_regex applied to a filename str
+
+        Returns:
+        - boolean
+        """
+        # If the regex match doesn't exist, the current filename is not an
+        # appropriately formatted source data file.
+        if not match:
+            return False
+
+        # Convert the date found in the CSV name to int.
+        code = int(match.groupdict()['date'])
+
+        # Return True if the current file date "code" is within the defined date range.
+        return start_code <= code <= end_code
+
+    return custom_date_filter
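+
+# Illustrative usage of make_date_filter (the filename below is made up):
+#   date_filter = make_date_filter(date(2020, 8, 1), date(2020, 8, 31))
+#   match = filename_regex.match("20200815_county_smoothed_cli.csv")
+#   date_filter(match)  # -> True, since 20200815 falls within the range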
+
+
+class Validator():
+    """ Class containing the validate() function and supporting functions. Stores a list
+    of all raised errors, and user settings. """
+
+    def __init__(self, params):
+        """
+        Initialize object and set parameters.
+
+        Arguments:
+        - params: dictionary of user settings; settings that are not provided fall
+        back to defaults, except for the required data_source, span_length,
+        end_date, and expected_lag fields
+
+        Attributes:
+        - data_source: str; data source name, one of
+        https://cmu-delphi.github.io/delphi-epidata/api/covidcast_signals.html
+        - start_date: beginning date of data to check, in datetime date format
+        - span_length: number of days before the end date to include in checking
+        - end_date: end date of data to check, in datetime date format
+        - generation_date: date that this df_to_test was generated; typically 1 day
+        after the last date in df_to_test
+        - max_check_lookbehind: number of days back to perform sanity checks, starting
+        from the last date appearing in df_to_test
+        - minimum_sample_size: int
+        - missing_se_allowed: boolean indicating if missing standard errors should
+        raise an exception or not
+        - missing_sample_size_allowed: boolean indicating if missing sample sizes should
+        raise an exception or not
+        - sanity_check_rows_per_day: boolean; check flag
+        - sanity_check_value_diffs: boolean; check flag
+        - smoothed_signals: set of strings; names of signals that are smoothed (7-day
+        avg, etc)
+        - expected_lag: dict of signal name: int pairs; how many days behind do we
+        expect each signal to be
+        - suppressed_errors: set of check_data_ids used to identify error messages to ignore
+        - raised_errors: list to append data upload-blocking errors to as they are raised
+        - total_checks: incremental counter to track total number of checks run
+        - raised_warnings: list to append non-data upload-blocking errors to as they are raised
+        """
+        # Get user settings from params or, if not provided, set defaults.
+        self.data_source = params['data_source']
+        self.validator_static_file_dir = params.get(
+            'validator_static_file_dir', '../validator/static')
+
+        # Date/time settings
+        self.span_length = timedelta(days=params['span_length'])
+        self.end_date = date.today() if params['end_date'] == "latest" else datetime.strptime(
+            params['end_date'], '%Y-%m-%d').date()
+        self.start_date = self.end_date - self.span_length
+        self.generation_date = date.today()
+
+        # General options: flags, thresholds
+        self.max_check_lookbehind = timedelta(
+            days=params.get("ref_window_size", 7))
+        self.minimum_sample_size = params.get('minimum_sample_size', 100)
+        self.missing_se_allowed = params.get('missing_se_allowed', False)
+        self.missing_sample_size_allowed = params.get(
+            'missing_sample_size_allowed', False)
+
+        self.sanity_check_rows_per_day = params.get(
+            'sanity_check_rows_per_day', True)
+        self.sanity_check_value_diffs = params.get(
+            'sanity_check_value_diffs', True)
+        self.test_mode = params.get("test_mode", False)
+
+        # Signal-specific settings
+        self.smoothed_signals = set(params.get("smoothed_signals", []))
+        self.expected_lag = params["expected_lag"]
+
+        self.suppressed_errors = {tuple(item) if isinstance(item, (tuple, list))
+                                  else (item,)
+                                  for item in params.get('suppressed_errors', [])}
+
+        # Output
+        self.raised_errors = []
+        self.total_checks = 0
+
+        self.raised_warnings = []
+
+    def increment_total_checks(self):
+        """ Add 1 to the total_checks counter """
+        self.total_checks += 1
+
+    def check_missing_date_files(self, daily_filenames):
+        """
+        Check for missing dates between the specified start and end dates.
+
+        Arguments:
+        - daily_filenames: list of tuples, each containing a CSV source data filename
+        and the regex match object corresponding to filename_regex.
+
+        Returns:
+        - None
+        """
+        number_of_dates = self.end_date - self.start_date + timedelta(days=1)
+
+        # Create set of all expected dates.
+        date_seq = {self.start_date + timedelta(days=x)
+                    for x in range(number_of_dates.days)}
+        # Create set of all dates seen in CSV names.
+        unique_dates = {datetime.strptime(
+            daily_filename[0][0:8], '%Y%m%d').date() for daily_filename in daily_filenames}
+
+        # Diff expected and observed dates.
+        check_dateholes = list(date_seq.difference(unique_dates))
+        check_dateholes.sort()
+
+        if check_dateholes:
+            self.raised_errors.append(ValidationError(
+                "check_missing_date_files",
+                check_dateholes,
+                "Missing dates are observed; if these dates are" +
+                " already in the API they would not be updated"))
+
+        self.increment_total_checks()
+
+    def check_settings(self):
+        """
+        Perform some automated format & sanity checks of parameters.
+
+        Arguments:
+        - None
+
+        Returns:
+        - None
+        """
+        if not isinstance(self.max_check_lookbehind, timedelta):
+            self.raised_errors.append(ValidationError(
+                "check_type_max_check_lookbehind",
+                self.max_check_lookbehind,
+                "max_check_lookbehind must be of type datetime.timedelta"))

+
+        self.increment_total_checks()
+
+        if not isinstance(self.generation_date, date):
+            self.raised_errors.append(ValidationError(
+                "check_type_generation_date", self.generation_date,
+                "generation_date must be a datetime.date type"))
+
+        self.increment_total_checks()
+
+        if self.generation_date > date.today():
+            self.raised_errors.append(ValidationError(
+                "check_future_generation_date", self.generation_date,
+                "generation_date must not be in the future"))
+
+        self.increment_total_checks()
+
+    def check_df_format(self, df_to_test, nameformat):
+        """
+        Check the basic format of a source data CSV df.
+
+        Arguments:
+        - df_to_test: pandas dataframe of a single CSV of source data
+        (one day-signal-geo_type combo)
+        - nameformat: str CSV name; for example, "20200624_county_smoothed_nohh_cmnty_cli.csv"
+
+        Returns:
+        - None
+        """
+        pattern_found = filename_regex.match(nameformat)
+        if not nameformat or not pattern_found:
+            self.raised_errors.append(ValidationError(
+                ("check_filename_format", nameformat),
+                nameformat, 'nameformat not recognized'))
+
+        self.increment_total_checks()
+
+        if not isinstance(df_to_test, pd.DataFrame):
+            self.raised_errors.append(ValidationError(
+                ("check_file_data_format", nameformat),
+                type(df_to_test), 'df_to_test must be a pandas dataframe.'))
+
+        self.increment_total_checks()
+
+    def check_bad_geo_id_value(self, df_to_test, filename, geo_type):
+        """
+        Check for bad geo_id values, by comparing to a list of known values (drawn from
+        historical data).
+
+        Arguments:
+        - df_to_test: pandas dataframe of CSV source data containing the geo_id column to check
+        - filename: str CSV name, used to identify the check in error messages
+        - geo_type: string from CSV name specifying the geo type (state, county, msa, etc.)
+        of the data
+        """
+        file_path = join(self.validator_static_file_dir, geo_type + '_geo.csv')
+        valid_geo_df = pd.read_csv(file_path, dtype={'geo_id': str})
+        valid_geos = valid_geo_df['geo_id'].values
+        unexpected_geos = [geo for geo in df_to_test['geo_id']
+                           if geo.lower() not in valid_geos]
+        if len(unexpected_geos) > 0:
+            self.raised_errors.append(ValidationError(
+                ("check_bad_geo_id_value", filename),
+                unexpected_geos, "Unrecognized geo_ids (not in historical data)"))
+        self.increment_total_checks()
+
+        upper_case_geos = [
+            geo for geo in df_to_test['geo_id'] if geo.lower() != geo]
+        if len(upper_case_geos) > 0:
+            self.raised_warnings.append(ValidationError(
+                ("check_geo_id_lowercase", filename),
+                upper_case_geos, "geo_id contains uppercase characters. Lowercase is preferred."))
+        self.increment_total_checks()
+
+    def check_bad_geo_id_format(self, df_to_test, nameformat, geo_type):
+        """
+        Check the validity of the geo type and the format of geo_ids, according to the
+        regex patterns in geo_regex_dict.
+
+        Arguments:
+        - df_to_test: pandas dataframe of CSV source data
+        - nameformat: str CSV name, used to identify the check in error messages
+        - geo_type: string from CSV name specifying geo type (state, county, msa, hrr) of data
+
+        Returns:
+        - None
+        """
+        def find_all_unexpected_geo_ids(df_to_test, geo_regex, geo_type):
+            """
+            Check if any geo_ids in df_to_test aren't formatted correctly, according
+            to the geo type dictionary geo_regex_dict.
+            """
+            numeric_geo_types = {"msa", "county", "hrr", "dma"}
+            fill_len = {"msa": 5, "county": 5, "dma": 3}
+
+            if geo_type in numeric_geo_types:
+                # Check if geo_ids were stored as floats (contain a decimal point) and
+                # the contents before the decimal match the specified regex pattern.
+                leftover = [geo[1] for geo in df_to_test["geo_id"].str.split(
+                    ".") if len(geo) > 1 and re.match(geo_regex, geo[0])]
+
+                # If any floats found, remove the decimal and anything after it.
+                if len(leftover) > 0:
+                    df_to_test["geo_id"] = [geo[0]
+                                            for geo in df_to_test["geo_id"].str.split(".")]
+
+                    self.raised_warnings.append(ValidationError(
+                        ("check_geo_id_type", nameformat),
+                        None, "geo_ids saved as floats; strings preferred"))
+
+            if geo_type in fill_len:
+                # Left-pad with zeroes up to the expected length. Fixes missing leading
+                # zeroes caused by FIPS codes saved as numeric.
+                df_to_test["geo_id"] = pd.Series([geo.zfill(fill_len[geo_type])
+                                                  for geo in df_to_test["geo_id"]], dtype=str)
+
+            expected_geos = [geo[0] for geo in df_to_test['geo_id'].str.findall(
+                geo_regex) if len(geo) > 0]
+
+            unexpected_geos = {geo for geo in set(
+                df_to_test['geo_id']) if geo not in expected_geos}
+
+            if len(unexpected_geos) > 0:
+                self.raised_errors.append(ValidationError(
+                    ("check_geo_id_format", nameformat),
+                    unexpected_geos, "Non-conforming geo_ids found"))
+
+        if geo_type not in geo_regex_dict:
+            self.raised_errors.append(ValidationError(
+                ("check_geo_type", nameformat),
+                geo_type, "Unrecognized geo type"))
+        else:
+            find_all_unexpected_geo_ids(
+                df_to_test, geo_regex_dict[geo_type], geo_type)
+
+        self.increment_total_checks()
+
+    def check_bad_val(self, df_to_test, nameformat, signal_type):
+        """
+        Check the value field for validity.
+
+        Arguments:
+        - df_to_test: pandas dataframe of a single CSV of source data
+        - nameformat: str CSV name, used to identify the check in error messages
+        - signal_type: string from CSV name specifying signal type (smoothed_cli, etc) of data
+
+        Returns:
+        - None
+        """
+        # Determine if the signal is a proportion (# of x out of 100k people) or a percent.
+        percent_option = 'pct' in signal_type
+        proportion_option = 'prop' in signal_type
+
+        if percent_option:
+            if not df_to_test[(df_to_test['val'] > 100)].empty:
+                self.raised_errors.append(ValidationError(
+                    ("check_val_pct_gt_100", nameformat),
+                    df_to_test[(df_to_test['val'] > 100)],
+                    "val column can't have any cell greater than 100 for percents"))
+
+            self.increment_total_checks()
+
+        if proportion_option:
+            if not df_to_test[(df_to_test['val'] > 100000)].empty:
+                self.raised_errors.append(ValidationError(
+                    ("check_val_prop_gt_100k", nameformat),
+                    df_to_test[(df_to_test['val'] > 100000)],
+                    "val column can't have any cell greater than 100000 for proportions"))
+
+            self.increment_total_checks()
+
+        if df_to_test['val'].isnull().values.any():
+            self.raised_errors.append(ValidationError(
+                ("check_val_missing", nameformat),
+                None, "val column can't have any cell that is NA"))
+
+        self.increment_total_checks()
+
+        if not df_to_test[(df_to_test['val'] < 0)].empty:
+            self.raised_errors.append(ValidationError(
+                ("check_val_lt_0", nameformat),
+                df_to_test[(df_to_test['val'] < 0)],
+                "val column can't have any cell smaller than 0"))
+
+        self.increment_total_checks()
+
+    def check_bad_se(self, df_to_test, nameformat):
+        """
+        Check standard errors for validity.
+
+        Arguments:
+        - df_to_test: pandas dataframe of a single CSV of source data
+        (one day-signal-geo_type combo)
+        - nameformat: str CSV name; for example, "20200624_county_smoothed_nohh_cmnty_cli.csv"
+
+        Returns:
+        - None
+        """
+        # Add a new se_upper_limit column.
+        df_to_test.eval(
+            'se_upper_limit = (val * sample_size + 50)/(sample_size + 1)', inplace=True)
+
+        df_to_test['se'] = df_to_test['se'].round(3)
+        df_to_test['se_upper_limit'] = df_to_test['se_upper_limit'].round(3)
+
+        if not self.missing_se_allowed:
+            # Find rows not in the allowed range for se.
+            result = df_to_test.query(
+                '~((se > 0) & (se < 50) & (se <= se_upper_limit))')
+
+            if not result.empty:
+                self.raised_errors.append(ValidationError(
+                    ("check_se_not_missing_and_in_range", nameformat),
+                    result, "se must be in (0, min(50,val*(1+eps))] and not missing"))
+
+            self.increment_total_checks()
+
+            if df_to_test["se"].isnull().mean() > 0.5:
+                self.raised_errors.append(ValidationError(
+                    ("check_se_many_missing", nameformat),
+                    None, 'Recent se values are >50% NA'))
+
+            self.increment_total_checks()
+
+        else:
+            result = df_to_test.query(
+                '~(se.isnull() | ((se > 0) & (se < 50) & (se <= se_upper_limit)))')
+
+            if not result.empty:
+                self.raised_errors.append(ValidationError(
+                    ("check_se_missing_or_in_range", nameformat),
+                    result, "se must be NA or in (0, min(50,val*(1+eps))]"))
+
+            self.increment_total_checks()
+
+        result_jeffreys = df_to_test.query('(val == 0) & (se == 0)')
+        result_alt = df_to_test.query('se == 0')
+
+        if not result_jeffreys.empty:
+            self.raised_errors.append(ValidationError(
+                ("check_se_0_when_val_0", nameformat),
+                None,
+                "when the signal value is 0, se must be non-zero. Please " +
+                "use the Jeffreys correction to generate an appropriate se" +
+                " (see wikipedia.org/wiki/Binomial_proportion_confidence" +
+                "_interval#Jeffreys_interval for details)"))
+        elif not result_alt.empty:
+            self.raised_errors.append(ValidationError(
+                ("check_se_0", nameformat),
+                result_alt, "se must be non-zero"))
+
+        self.increment_total_checks()
+
+        # Remove the se_upper_limit column; without inplace=True the drop would
+        # return a copy and leave df_to_test unchanged.
+        df_to_test.drop(columns=["se_upper_limit"], inplace=True)
+
+    def check_bad_sample_size(self, df_to_test, nameformat):
+        """
+        Check sample sizes for validity.
+
+        Arguments:
+        - df_to_test: pandas dataframe of a single CSV of source data
+        (one day-signal-geo_type combo)
+        - nameformat: str CSV name; for example, "20200624_county_smoothed_nohh_cmnty_cli.csv"
+
+        Returns:
+        - None
+        """
+        if not self.missing_sample_size_allowed:
+            if df_to_test['sample_size'].isnull().values.any():
+                self.raised_errors.append(ValidationError(
+                    ("check_n_missing", nameformat),
+                    None, "sample_size must not be NA"))
+
+            self.increment_total_checks()
+
+            # Find rows with sample size less than the minimum allowed.
+            result = df_to_test.query(
+                '(sample_size < @self.minimum_sample_size)')
+
+            if not result.empty:
+                self.raised_errors.append(ValidationError(
+                    ("check_n_gt_min", nameformat),
+                    result, f"sample size must be >= {self.minimum_sample_size}"))
+
+            self.increment_total_checks()
+
+        else:
+            result = df_to_test.query(
+                '~(sample_size.isnull() | (sample_size >= @self.minimum_sample_size))')
+
+            if not result.empty:
+                self.raised_errors.append(ValidationError(
+                    ("check_n_missing_or_gt_min", nameformat),
+                    result,
+                    f"sample size must be NA or >= {self.minimum_sample_size}"))
+
+            self.increment_total_checks()
+
+    def check_min_allowed_max_date(self, max_date, geo_type, signal_type):
+        """
+        Check if the time since the data was generated is reasonable or too long ago.
+
+        Arguments:
+        - max_date: date of most recent data to be validated; datetime format.
+        - geo_type: str; geo type name (county, msa, hrr, state) as in the CSV name
+        - signal_type: str; signal name as in the CSV name
+
+        Returns:
+        - None
+        """
+        thres = timedelta(days=self.expected_lag.get(signal_type, 1))
+
+        if max_date < self.generation_date - thres:
+            self.raised_errors.append(ValidationError(
+                ("check_min_max_date", geo_type, signal_type),
+                max_date,
+                "date of most recent generated file seems too long ago"))
+
+        self.increment_total_checks()
+
+    def check_max_allowed_max_date(self, max_date, geo_type, signal_type):
+        """
+        Check if the time since the data was generated is reasonable or too recent.
+
+        Arguments:
+        - max_date: date of most recent data to be validated; datetime format.
+        - geo_type: str; geo type name (county, msa, hrr, state) as in the CSV name
+        - signal_type: str; signal name as in the CSV name
+
+        Returns:
+        - None
+        """
+        if max_date > self.generation_date:
+            self.raised_errors.append(ValidationError(
+                ("check_max_max_date", geo_type, signal_type),
+                max_date,
+                "date of most recent generated file seems too recent"))
+
+        self.increment_total_checks()
+
+    def check_max_date_vs_reference(self, df_to_test, df_to_reference, checking_date,
+                                    geo_type, signal_type):
+        """
+        Check if the reference data is more recent than the test data.
+
+        Arguments:
+        - df_to_test: pandas dataframe of a single CSV of source data
+        (one day-signal-geo_type combo)
+        - df_to_reference: pandas dataframe of reference data, either from the
+        COVIDcast API or semirecent data
+        - checking_date: datetime date
+        - geo_type: str; geo type name (county, msa, hrr, state) as in the CSV name
+        - signal_type: str; signal name as in the CSV name
+
+        Returns:
+        - None
+        """
+        if df_to_test["time_value"].max() < df_to_reference["time_value"].max():
+            self.raised_errors.append(ValidationError(
+                ("check_max_date_vs_reference",
+                 checking_date.date(), geo_type, signal_type),
+                (df_to_test["time_value"].max(),
+                 df_to_reference["time_value"].max()),
+                'reference df has days beyond the max date in df_to_test; ' +
+                'checks are not constructed to handle this case, and this situation ' +
+                'may indicate that something locally is out of date, or, if the local ' +
+                'working files have already been compared against the reference, ' +
+                'that there is a bug somewhere'))
+
+        self.increment_total_checks()
+
+    def check_rapid_change_num_rows(self, df_to_test, df_to_reference, checking_date,
+                                    geo_type, signal_type):
+        """
+        Compare the number of observations per day in the test dataframe vs the
+        reference dataframe.
+
+        Arguments:
+        - df_to_test: pandas dataframe of CSV source data
+        - df_to_reference: pandas dataframe of reference data, either from the
+        COVIDcast API or semirecent data
+        - checking_date: datetime date
+        - geo_type: str; geo type name (county, msa, hrr, state) as in the CSV name
+        - signal_type: str; signal name as in the CSV name
+
+        Returns:
+        - None
+        """
+        test_rows_per_reporting_day = df_to_test[df_to_test['time_value']
+                                                 == checking_date].shape[0]
+        reference_rows_per_reporting_day = df_to_reference.shape[0] / len(
+            set(df_to_reference["time_value"]))
+
+        try:
+            compare_rows = relative_difference_by_min(
+                test_rows_per_reporting_day,
+                reference_rows_per_reporting_day)
+        except ZeroDivisionError as e:
+            print(checking_date, geo_type, signal_type)
+            raise e
+
+        if abs(compare_rows) > 0.35:
+            self.raised_errors.append(ValidationError(
+                ("check_rapid_change_num_rows",
+                 checking_date, geo_type, signal_type),
+                (test_rows_per_reporting_day, reference_rows_per_reporting_day),
+                "Number of rows per day (considering only days with any rows) seems " +
+                "to have changed rapidly (reference vs test data)"))
+
+        self.increment_total_checks()
+
+    def check_avg_val_vs_reference(self, df_to_test, df_to_reference, checking_date, geo_type,
+                                   signal_type):
+        """
+        Compare average values for each variable in the test dataframe vs the
+        reference dataframe.
+
+        Arguments:
+        - df_to_test: pandas dataframe of CSV source data
+        - df_to_reference: pandas dataframe of reference data, either from the
+        COVIDcast API or semirecent data
+        - checking_date: datetime date
+        - geo_type: str; geo type name (county, msa, hrr, state) as in the CSV name
+        - signal_type: str; signal name as in the CSV name
+
+        Returns:
+        - None
+        """
+        # Average each of val, se, and sample_size over all dates for a given geo_id.
+        # Ignores NA by default.
+ df_to_test = df_to_test.groupby(['geo_id'], as_index=False)[ + ['val', 'se', 'sample_size']].mean() + df_to_test["type"] = "test" + + df_to_reference = df_to_reference.groupby(['geo_id'], as_index=False)[ + ['val', 'se', 'sample_size']].mean() + df_to_reference["type"] = "reference" + + df_all = pd.concat([df_to_test, df_to_reference]) + + # For each variable (val, se, and sample size) where not missing, calculate the + # relative mean difference and mean absolute difference between the test data + # and the reference data across all geographic regions. + # + # Steps: + # - melt: creates a long version of df, where 'variable' specifies variable + # name (val, se, sample size) and 'value' specifies the value of said variable; + # geo_id and type columns are unchanged + # - pivot: each row is the test and reference values for a given geo + # region-variable type combo + # - reset_index: index is set to auto-incrementing int; geo_id and variable + # names are back as normal columns + # - dropna: drop all rows with at least one missing value (makes it + # impossible to compare reference and test) + # - assign: create new temporary columns, raw and abs value of difference + # between test and reference columns + # - groupby: group by variable name + # - agg: for every variable name group (across geo regions), calculate the + # mean of each of the raw difference between test and reference columns, the + # abs value of the difference between test and reference columns, all test + # values, all reference values + # - assign: use the new aggregate vars to calculate the relative mean + # difference, 2 * mean(differences) / sum(means) of two groups. + df_all = pd.melt( + df_all, id_vars=["geo_id", "type"], value_vars=["val", "se", "sample_size"] + ).pivot( + index=("geo_id", "variable"), columns="type", values="value" + ).reset_index( + ("geo_id", "variable") + ).dropna( + ).assign( + type_diff=lambda x: x["test"] - x["reference"], + abs_type_diff=lambda x: abs(x["type_diff"]) + ).groupby( + "variable", as_index=False + ).agg( + mean_type_diff=("type_diff", "mean"), + mean_abs_type_diff=("abs_type_diff", "mean"), + mean_test_var=("test", "mean"), + mean_ref_var=("reference", "mean") + ).assign( + mean_stddiff=lambda x: 2 * + x["mean_type_diff"] / (x["mean_test_var"] + x["mean_ref_var"]), + mean_stdabsdiff=lambda x: 2 * + x["mean_abs_type_diff"] / (x["mean_test_var"] + x["mean_ref_var"]) + )[["variable", "mean_stddiff", "mean_stdabsdiff"]] + + # Set thresholds for raw and smoothed variables. + classes = ['mean_stddiff', 'val_mean_stddiff', 'mean_stdabsdiff'] + raw_thresholds = pd.DataFrame( + [[1.50, 1.30, 1.80]], columns=classes) + smoothed_thresholds = raw_thresholds.apply( + lambda x: x/(math.sqrt(7) * 1.5)) + + switcher = { + 'raw': raw_thresholds, + 'smoothed': smoothed_thresholds, + } + + # Get the selected thresholds from switcher dictionary + smooth_option = "smoothed" if signal_type in self.smoothed_signals else "raw" + thres = switcher.get(smooth_option, lambda: "Invalid smoothing option") + + # Check if the calculated mean differences are high compared to the thresholds. 
+        mean_stddiff_high = (
+            abs(df_all["mean_stddiff"]) > float(thres["mean_stddiff"])).any() or (
+                (df_all["variable"] == "val").any() and
+                (abs(df_all[df_all["variable"] == "val"]["mean_stddiff"])
+                 > float(thres["val_mean_stddiff"])).any()
+        )
+        mean_stdabsdiff_high = (
+            df_all["mean_stdabsdiff"] > float(thres["mean_stdabsdiff"])).any()
+
+        if mean_stddiff_high or mean_stdabsdiff_high:
+            self.raised_errors.append(ValidationError(
+                ("check_test_vs_reference_avg_changed",
+                 checking_date, geo_type, signal_type),
+                (mean_stddiff_high, mean_stdabsdiff_high),
+                'Average differences in variables by geo_id between recent & reference data ' +
+                'seem large --- either a large increase ' +
+                'tending toward one direction or a large mean absolute difference, relative ' +
+                'to average values of corresponding variables. For the former check, ' +
+                'tolerances for `val` are more restrictive than those for other columns.'))
+
+        self.increment_total_checks()
+
+    def validate(self, export_dir):
+        """
+        Run all data checks.
+
+        Arguments:
+        - export_dir: path to data CSVs
+
+        Returns:
+        - None
+        """
+        # Get relevant data file names and info.
+        export_files = read_filenames(export_dir)
+        date_filter = make_date_filter(self.start_date, self.end_date)
+
+        # Make list of tuples of CSV names and regex match objects.
+        validate_files = [(f, m) for (f, m) in export_files if date_filter(m)]
+
+        self.check_missing_date_files(validate_files)
+        self.check_settings()
+
+        all_frames = []
+
+        # Individual file checks
+        # For every daily file, read it in and do some basic format and value checks.
+        for filename, match in validate_files:
+            data_df = load_csv(join(export_dir, filename))
+
+            self.check_df_format(data_df, filename)
+            self.check_bad_geo_id_format(
+                data_df, filename, match.groupdict()['geo_type'])
+            self.check_bad_geo_id_value(
+                data_df, filename, match.groupdict()['geo_type'])
+            self.check_bad_val(data_df, filename, match.groupdict()['signal'])
+            self.check_bad_se(data_df, filename)
+            self.check_bad_sample_size(data_df, filename)
+
+            # Get geo_type, date, and signal name as specified by the CSV name.
+            data_df['geo_type'] = match.groupdict()['geo_type']
+            data_df['time_value'] = datetime.strptime(
+                match.groupdict()['date'], "%Y%m%d").date()
+            data_df['signal'] = match.groupdict()['signal']
+
+            # Add the current CSV data to all_frames.
+            all_frames.append(data_df)
+
+        all_frames = pd.concat(all_frames)
+
+        # recent_lookbehind: starting from the check date and working backward in time,
+        # how many days at a time do we want to check for anomalies?
+        # Choosing 1 day checks just the daily data.
+        recent_lookbehind = timedelta(days=1)
+
+        # semirecent_lookbehind: starting from the check date and working backward
+        # in time, how many days do we use to form the reference statistics.
+        semirecent_lookbehind = timedelta(days=7)
+
+        # Get the list of dates we want to check.
+        date_list = [self.start_date + timedelta(days=days)
+                     for days in range(self.span_length.days + 1)]
+
+        # Get all expected combinations of geo_type and signal.
+        geo_signal_combos = get_geo_signal_combos(self.data_source)
+
+        all_api_df = self.threaded_api_calls(
+            self.start_date - min(semirecent_lookbehind,
+                                  self.max_check_lookbehind),
+            self.end_date, geo_signal_combos)
+
+        # Counter that keeps the script from checking all files in a test run.
+        if self.test_mode:
+            checked_combos = 0
+
+        # Comparison checks
+        # Run checks for recent dates in each geo-sig combo vs semirecent (previous
+        # week) API data.
+        for geo_type, signal_type in geo_signal_combos:
+            geo_sig_df = all_frames.query(
+                "geo_type == @geo_type & signal == @signal_type")
+            # Drop unused columns; assign the result back since drop returns a copy.
+            geo_sig_df = geo_sig_df.drop(columns=["geo_type", "signal"])
+
+            self.increment_total_checks()
+
+            if geo_sig_df.empty:
+                self.raised_errors.append(ValidationError(
+                    ("check_missing_geo_sig_combo", geo_type, signal_type),
+                    None,
+                    "file with geo_type-signal combo does not exist"))
+                continue
+
+            max_date = geo_sig_df["time_value"].max()
+            self.check_min_allowed_max_date(max_date, geo_type, signal_type)
+            self.check_max_allowed_max_date(max_date, geo_type, signal_type)
+
+            # Get relevant reference data from the API dictionary.
+            geo_sig_api_df = all_api_df[(geo_type, signal_type)]
+
+            if geo_sig_api_df is None:
+                continue
+
+            # Check data from a group of dates against recent (previous 7 days,
+            # by default) data from the API.
+            for checking_date in date_list:
+                recent_cutoff_date = checking_date - \
+                    recent_lookbehind + timedelta(days=1)
+                recent_df = geo_sig_df.query(
+                    'time_value <= @checking_date & time_value >= @recent_cutoff_date')
+
+                self.increment_total_checks()
+
+                if recent_df.empty:
+                    self.raised_errors.append(ValidationError(
+                        ("check_missing_geo_sig_date_combo",
+                         checking_date, geo_type, signal_type),
+                        None,
+                        "test data for a given checking date-geo type-signal type" +
+                        " combination is missing. Source data may be missing" +
+                        " for one or more dates"))
+                    continue
+
+                # The reference dataframe runs backwards from the recent_cutoff_date.
+                reference_start_date = recent_cutoff_date - \
+                    min(semirecent_lookbehind, self.max_check_lookbehind) - \
+                    timedelta(days=1)
+                reference_end_date = recent_cutoff_date - timedelta(days=1)
+
+                # Subset API data to the relevant range of dates.
+                reference_api_df = geo_sig_api_df.query(
+                    "time_value >= @reference_start_date & time_value <= @reference_end_date")
+
+                self.increment_total_checks()
+
+                if reference_api_df.empty:
+                    self.raised_errors.append(ValidationError(
+                        ("empty_reference_data",
+                         checking_date, geo_type, signal_type), None,
+                        "reference data is empty; comparative checks could not be performed"))
+                    continue
+
+                self.check_max_date_vs_reference(
+                    recent_df, reference_api_df, checking_date, geo_type, signal_type)
+
+                if self.sanity_check_rows_per_day:
+                    self.check_rapid_change_num_rows(
+                        recent_df, reference_api_df, checking_date, geo_type, signal_type)
+
+                if self.sanity_check_value_diffs:
+                    self.check_avg_val_vs_reference(
+                        recent_df, reference_api_df, checking_date, geo_type, signal_type)
+
+            # Keeps the script from checking all files in a test run.
+            if self.test_mode:
+                checked_combos += 1
+                if checked_combos == 2:
+                    break
+
+        self.exit()
+
+    def get_one_api_df(self, min_date, max_date,
+                       geo_type, signal_type,
+                       api_semaphore, dict_lock, output_dict):
+        """
+        Pull API data for a single geo type-signal combination. Raises an
+        error if the data couldn't be retrieved. Saves the data to the data dict.
+        """
+        api_semaphore.acquire()
+
+        # Pull reference data from the API for all dates.
+        try:
+            geo_sig_api_df = fetch_api_reference(
+                self.data_source, min_date, max_date, geo_type, signal_type)
+
+        except APIDataFetchError as e:
+            self.increment_total_checks()
+            self.raised_errors.append(ValidationError(
+                ("api_data_fetch_error", geo_type, signal_type), None, e))
+
+            geo_sig_api_df = None
+
+        api_semaphore.release()
+
+        # Use a lock so only one thread can access the dictionary.
+ dict_lock.acquire() + output_dict[(geo_type, signal_type)] = geo_sig_api_df + dict_lock.release() + + def threaded_api_calls(self, min_date, max_date, + geo_signal_combos, n_threads=32): + """ + Get data from API for all geo-signal combinations in a threaded way + to save time. + """ + if n_threads > 32: + n_threads = 32 + print("Warning: Don't run more than 32 threads at once due " + + "to API resource limitations") + + output_dict = dict() + dict_lock = threading.Lock() + api_semaphore = threading.Semaphore(value=n_threads) + + thread_objs = [threading.Thread( + target=self.get_one_api_df, args=(min_date, max_date, + geo_type, signal_type, + api_semaphore, + dict_lock, output_dict) + ) for geo_type, signal_type in geo_signal_combos] + + # Start all threads. + for thread in thread_objs: + thread.start() + + # Wait until all threads are finished. + for thread in thread_objs: + thread.join() + + return output_dict + + def exit(self): + """ + If any not-suppressed exceptions were raised, print and exit with non-zero status. + """ + suppressed_counter = 0 + subset_raised_errors = [] + + for val_error in self.raised_errors: + # Convert any dates in check_data_id to strings for the purpose of comparing + # to manually suppressed errors. + raised_check_id = tuple([ + item.strftime("%Y-%m-%d") if isinstance(item, (date, datetime)) + else item for item in val_error.check_data_id]) + + if raised_check_id not in self.suppressed_errors: + subset_raised_errors.append(val_error) + else: + self.suppressed_errors.remove(raised_check_id) + suppressed_counter += 1 + + print(self.total_checks, "checks run") + print(len(subset_raised_errors), "checks failed") + print(suppressed_counter, "checks suppressed") + print(len(self.raised_warnings), "warnings") + + for message in subset_raised_errors: + print(message) + for message in self.raised_warnings: + print(message) + + if len(subset_raised_errors) != 0: + sys.exit(1) + else: + sys.exit(0) diff --git a/validator/params.json.template b/validator/params.json.template new file mode 100644 index 000000000..643dc7838 --- /dev/null +++ b/validator/params.json.template @@ -0,0 +1,42 @@ +{ + "validation": { + "data_source": "usa-facts", + "end_date": "2020-09-08", + "span_length": 3, + "ref_window_size": 7, + "validator_static_file_dir": "../validator/static", + "minimum_sample_size": 100, + "missing_se_allowed": true, + "missing_sample_size_allowed": true, + "smoothed_signals": [ + "confirmed_7dav_cumulative_num", + "confirmed_7dav_cumulative_prop", + "confirmed_7dav_incidence_num", + "confirmed_7dav_incidence_prop", + "deaths_7dav_cumulative_num", + "deaths_7dav_cumulative_prop", + "deaths_7dav_incidence_num", + "deaths_7dav_incidence_prop"], + "expected_lag": { + "confirmed_7dav_cumulative_num": 1, + "confirmed_7dav_cumulative_prop": 1, + "confirmed_7dav_incidence_num": 1, + "confirmed_7dav_incidence_prop": 1, + "deaths_7dav_cumulative_num": 1, + "deaths_7dav_cumulative_prop": 1, + "deaths_7dav_incidence_num": 1, + "deaths_7dav_incidence_prop": 1, + "confirmed_cumulative_num": 1, + "confirmed_cumulative_prop": 1, + "confirmed_incidence_num": 1, + "confirmed_incidence_prop": 1, + "deaths_cumulative_num": 1, + "deaths_cumulative_prop": 1, + "deaths_incidence_num": 1, + "deaths_incidence_prop": 1}, + "test_mode": true, + "suppressed_errors": [ + ["check_min_max_date", "county", "confirmed_7dav_cumulative_prop"], + ["check_val_lt_0", "20200906_county_deaths_7dav_incidence_num.csv"]] + } +} diff --git a/validator/scripts/unique_geoids.R 
b/validator/scripts/unique_geoids.R new file mode 100644 index 000000000..676223be3 --- /dev/null +++ b/validator/scripts/unique_geoids.R @@ -0,0 +1,18 @@ +library(covidcast) +library(dplyr) +meta_info = covidcast_meta() +locations_by_type = meta_info %>% group_by(geo_type) %>% summarize(Value = max(num_locations)) + +results = list() +for (i in 1:nrow(locations_by_type)){ + type = locations_by_type$geo_type[i] + max_locations = locations_by_type$Value[i] + max_row = with(meta_info, meta_info[geo_type == type & num_locations == max_locations,][1,]) + data_source = max_row$data_source + signal = max_row$signal + results[[i]] = covidcast_signal(data_source, signal, geo_type = type) + geo_values = sort(unique(results[[i]]$geo_value)) + file_name = paste0("../static/", type, "_geo.csv") + write.table(geo_values, file = file_name, row.names = F, col.names = "geo_id") +} + diff --git a/validator/setup.py b/validator/setup.py new file mode 100644 index 000000000..2ac236570 --- /dev/null +++ b/validator/setup.py @@ -0,0 +1,28 @@ +from setuptools import setup +from setuptools import find_packages + +required = [ + "numpy", + "pandas", + "pytest", + "pytest-cov", + "pylint", + "delphi-utils", + "covidcast" +] + +setup( + name="delphi_validator", + version="0.1.0", + description="Validates newly generated daily-data against previously issued data", + author="", + author_email="", + url="https://github.com/cmu-delphi/covidcast-indicators", + install_requires=required, + classifiers=[ + "Development Status :: 5 - Production/Stable", + "Intended Audience :: Developers", + "Programming Language :: Python :: 3.7", + ], + packages=find_packages(), +) \ No newline at end of file diff --git a/validator/static/county_geo.csv b/validator/static/county_geo.csv new file mode 100644 index 000000000..3812b9693 --- /dev/null +++ b/validator/static/county_geo.csv @@ -0,0 +1,3283 @@ +"geo_id" +"01000" +"01001" +"01003" +"01005" +"01007" +"01009" +"01011" +"01013" +"01015" +"01017" +"01019" +"01021" +"01023" +"01025" +"01027" +"01029" +"01031" +"01033" +"01035" +"01037" +"01039" +"01041" +"01043" +"01045" +"01047" +"01049" +"01051" +"01053" +"01055" +"01057" +"01059" +"01061" +"01063" +"01065" +"01067" +"01069" +"01071" +"01073" +"01075" +"01077" +"01079" +"01081" +"01083" +"01085" +"01087" +"01089" +"01091" +"01093" +"01095" +"01097" +"01099" +"01101" +"01103" +"01105" +"01107" +"01109" +"01111" +"01113" +"01115" +"01117" +"01119" +"01121" +"01123" +"01125" +"01127" +"01129" +"01131" +"01133" +"02000" +"02013" +"02016" +"02020" +"02050" +"02060" +"02068" +"02070" +"02090" +"02100" +"02105" +"02110" +"02122" +"02130" +"02150" +"02158" +"02164" +"02170" +"02180" +"02185" +"02188" +"02195" +"02198" +"02220" +"02230" +"02240" +"02261" +"02270" +"02275" +"02282" +"02290" +"04000" +"04001" +"04003" +"04005" +"04007" +"04009" +"04011" +"04012" +"04013" +"04015" +"04017" +"04019" +"04021" +"04023" +"04025" +"04027" +"05000" +"05001" +"05003" +"05005" +"05007" +"05009" +"05011" +"05013" +"05015" +"05017" +"05019" +"05021" +"05023" +"05025" +"05027" +"05029" +"05031" +"05033" +"05035" +"05037" +"05039" +"05041" +"05043" +"05045" +"05047" +"05049" +"05051" +"05053" +"05055" +"05057" +"05059" +"05061" +"05063" +"05065" +"05067" +"05069" +"05071" +"05073" +"05075" +"05077" +"05079" +"05081" +"05083" +"05085" +"05087" +"05089" +"05091" +"05093" +"05095" +"05097" +"05099" +"05101" +"05103" +"05105" +"05107" +"05109" +"05111" +"05113" +"05115" +"05117" +"05119" +"05121" +"05123" +"05125" +"05127" +"05129" +"05131" +"05133" +"05135" 
+"05137" +"05139" +"05141" +"05143" +"05145" +"05147" +"05149" +"06000" +"06001" +"06003" +"06005" +"06007" +"06009" +"06011" +"06013" +"06015" +"06017" +"06019" +"06021" +"06023" +"06025" +"06027" +"06029" +"06031" +"06033" +"06035" +"06037" +"06039" +"06041" +"06043" +"06045" +"06047" +"06049" +"06051" +"06053" +"06055" +"06057" +"06059" +"06061" +"06063" +"06065" +"06067" +"06069" +"06071" +"06073" +"06075" +"06077" +"06079" +"06081" +"06083" +"06085" +"06087" +"06089" +"06091" +"06093" +"06095" +"06097" +"06099" +"06101" +"06103" +"06105" +"06107" +"06109" +"06111" +"06113" +"06115" +"08000" +"08001" +"08003" +"08005" +"08007" +"08009" +"08011" +"08013" +"08014" +"08015" +"08017" +"08019" +"08021" +"08023" +"08025" +"08027" +"08029" +"08031" +"08033" +"08035" +"08037" +"08039" +"08041" +"08043" +"08045" +"08047" +"08049" +"08051" +"08053" +"08055" +"08057" +"08059" +"08061" +"08063" +"08065" +"08067" +"08069" +"08071" +"08073" +"08075" +"08077" +"08079" +"08081" +"08083" +"08085" +"08087" +"08089" +"08091" +"08093" +"08095" +"08097" +"08099" +"08101" +"08103" +"08105" +"08107" +"08109" +"08111" +"08113" +"08115" +"08117" +"08119" +"08121" +"08123" +"08125" +"09000" +"09001" +"09003" +"09005" +"09007" +"09009" +"09011" +"09013" +"09015" +"10000" +"10001" +"10003" +"10005" +"11000" +"11001" +"12000" +"12001" +"12003" +"12005" +"12007" +"12009" +"12011" +"12013" +"12015" +"12017" +"12019" +"12021" +"12023" +"12027" +"12029" +"12031" +"12033" +"12035" +"12037" +"12039" +"12041" +"12043" +"12045" +"12047" +"12049" +"12051" +"12053" +"12055" +"12057" +"12059" +"12061" +"12063" +"12065" +"12067" +"12069" +"12071" +"12073" +"12075" +"12077" +"12079" +"12081" +"12083" +"12085" +"12086" +"12087" +"12089" +"12091" +"12093" +"12095" +"12097" +"12099" +"12101" +"12103" +"12105" +"12107" +"12109" +"12111" +"12113" +"12115" +"12117" +"12119" +"12121" +"12123" +"12125" +"12127" +"12129" +"12131" +"12133" +"13000" +"13001" +"13003" +"13005" +"13007" +"13009" +"13011" +"13013" +"13015" +"13017" +"13019" +"13021" +"13023" +"13025" +"13027" +"13029" +"13031" +"13033" +"13035" +"13037" +"13039" +"13043" +"13045" +"13047" +"13049" +"13051" +"13053" +"13055" +"13057" +"13059" +"13061" +"13063" +"13065" +"13067" +"13069" +"13071" +"13073" +"13075" +"13077" +"13079" +"13081" +"13083" +"13085" +"13087" +"13089" +"13091" +"13093" +"13095" +"13097" +"13099" +"13101" +"13103" +"13105" +"13107" +"13109" +"13111" +"13113" +"13115" +"13117" +"13119" +"13121" +"13123" +"13125" +"13127" +"13129" +"13131" +"13133" +"13135" +"13137" +"13139" +"13141" +"13143" +"13145" +"13147" +"13149" +"13151" +"13153" +"13155" +"13157" +"13159" +"13161" +"13163" +"13165" +"13167" +"13169" +"13171" +"13173" +"13175" +"13177" +"13179" +"13181" +"13183" +"13185" +"13187" +"13189" +"13191" +"13193" +"13195" +"13197" +"13199" +"13201" +"13205" +"13207" +"13209" +"13211" +"13213" +"13215" +"13217" +"13219" +"13221" +"13223" +"13225" +"13227" +"13229" +"13231" +"13233" +"13235" +"13237" +"13239" +"13241" +"13243" +"13245" +"13247" +"13249" +"13251" +"13253" +"13255" +"13257" +"13259" +"13261" +"13263" +"13265" +"13267" +"13269" +"13271" +"13273" +"13275" +"13277" +"13279" +"13281" +"13283" +"13285" +"13287" +"13289" +"13291" +"13293" +"13295" +"13297" +"13299" +"13301" +"13303" +"13305" +"13307" +"13309" +"13311" +"13313" +"13315" +"13317" +"13319" +"13321" +"15000" +"15001" +"15003" +"15005" +"15007" +"15009" +"16000" +"16001" +"16003" +"16005" +"16007" +"16009" +"16011" +"16013" +"16015" +"16017" +"16019" +"16021" +"16023" +"16025" +"16027" 
+"16029" +"16031" +"16033" +"16035" +"16037" +"16039" +"16041" +"16043" +"16045" +"16047" +"16049" +"16051" +"16053" +"16055" +"16057" +"16059" +"16061" +"16063" +"16065" +"16067" +"16069" +"16071" +"16073" +"16075" +"16077" +"16079" +"16081" +"16083" +"16085" +"16087" +"17000" +"17001" +"17003" +"17005" +"17007" +"17009" +"17011" +"17013" +"17015" +"17017" +"17019" +"17021" +"17023" +"17025" +"17027" +"17029" +"17031" +"17033" +"17035" +"17037" +"17039" +"17041" +"17043" +"17045" +"17047" +"17049" +"17051" +"17053" +"17055" +"17057" +"17059" +"17061" +"17063" +"17065" +"17067" +"17069" +"17071" +"17073" +"17075" +"17077" +"17079" +"17081" +"17083" +"17085" +"17087" +"17089" +"17091" +"17093" +"17095" +"17097" +"17099" +"17101" +"17103" +"17105" +"17107" +"17109" +"17111" +"17113" +"17115" +"17117" +"17119" +"17121" +"17123" +"17125" +"17127" +"17129" +"17131" +"17133" +"17135" +"17137" +"17139" +"17141" +"17143" +"17145" +"17147" +"17149" +"17151" +"17153" +"17155" +"17157" +"17159" +"17161" +"17163" +"17165" +"17167" +"17169" +"17171" +"17173" +"17175" +"17177" +"17179" +"17181" +"17183" +"17185" +"17187" +"17189" +"17191" +"17193" +"17195" +"17197" +"17199" +"17201" +"17203" +"18000" +"18001" +"18003" +"18005" +"18007" +"18009" +"18011" +"18013" +"18015" +"18017" +"18019" +"18021" +"18023" +"18025" +"18027" +"18029" +"18031" +"18033" +"18035" +"18037" +"18039" +"18041" +"18043" +"18045" +"18047" +"18049" +"18051" +"18053" +"18055" +"18057" +"18059" +"18061" +"18063" +"18065" +"18067" +"18069" +"18071" +"18073" +"18075" +"18077" +"18079" +"18081" +"18083" +"18085" +"18087" +"18089" +"18091" +"18093" +"18095" +"18097" +"18099" +"18101" +"18103" +"18105" +"18107" +"18109" +"18111" +"18113" +"18115" +"18117" +"18119" +"18121" +"18123" +"18125" +"18127" +"18129" +"18131" +"18133" +"18135" +"18137" +"18139" +"18141" +"18143" +"18145" +"18147" +"18149" +"18151" +"18153" +"18155" +"18157" +"18159" +"18161" +"18163" +"18165" +"18167" +"18169" +"18171" +"18173" +"18175" +"18177" +"18179" +"18181" +"18183" +"19000" +"19001" +"19003" +"19005" +"19007" +"19009" +"19011" +"19013" +"19015" +"19017" +"19019" +"19021" +"19023" +"19025" +"19027" +"19029" +"19031" +"19033" +"19035" +"19037" +"19039" +"19041" +"19043" +"19045" +"19047" +"19049" +"19051" +"19053" +"19055" +"19057" +"19059" +"19061" +"19063" +"19065" +"19067" +"19069" +"19071" +"19073" +"19075" +"19077" +"19079" +"19081" +"19083" +"19085" +"19087" +"19089" +"19091" +"19093" +"19095" +"19097" +"19099" +"19101" +"19103" +"19105" +"19107" +"19109" +"19111" +"19113" +"19115" +"19117" +"19119" +"19121" +"19123" +"19125" +"19127" +"19129" +"19131" +"19133" +"19135" +"19137" +"19139" +"19141" +"19143" +"19145" +"19147" +"19149" +"19151" +"19153" +"19155" +"19157" +"19159" +"19161" +"19163" +"19165" +"19167" +"19169" +"19171" +"19173" +"19175" +"19177" +"19179" +"19181" +"19183" +"19185" +"19187" +"19189" +"19191" +"19193" +"19195" +"19197" +"20000" +"20001" +"20003" +"20005" +"20007" +"20009" +"20011" +"20013" +"20015" +"20017" +"20019" +"20021" +"20023" +"20025" +"20027" +"20029" +"20031" +"20033" +"20035" +"20037" +"20039" +"20041" +"20043" +"20045" +"20047" +"20049" +"20051" +"20053" +"20055" +"20057" +"20059" +"20061" +"20063" +"20065" +"20067" +"20069" +"20071" +"20073" +"20075" +"20077" +"20079" +"20081" +"20083" +"20085" +"20087" +"20089" +"20091" +"20093" +"20095" +"20097" +"20099" +"20101" +"20103" +"20105" +"20107" +"20109" +"20111" +"20113" +"20115" +"20117" +"20119" +"20121" +"20123" +"20125" +"20127" +"20129" +"20131" +"20133" +"20135" 
+"20137" +"20139" +"20141" +"20143" +"20145" +"20147" +"20149" +"20151" +"20153" +"20155" +"20157" +"20159" +"20161" +"20163" +"20165" +"20167" +"20169" +"20171" +"20173" +"20175" +"20177" +"20179" +"20181" +"20183" +"20185" +"20187" +"20189" +"20191" +"20193" +"20195" +"20197" +"20199" +"20201" +"20203" +"20205" +"20207" +"20209" +"21000" +"21001" +"21003" +"21005" +"21007" +"21009" +"21011" +"21013" +"21015" +"21017" +"21019" +"21021" +"21023" +"21025" +"21027" +"21029" +"21031" +"21033" +"21035" +"21037" +"21039" +"21041" +"21043" +"21045" +"21047" +"21049" +"21051" +"21053" +"21055" +"21057" +"21059" +"21061" +"21063" +"21065" +"21067" +"21069" +"21071" +"21073" +"21075" +"21077" +"21079" +"21081" +"21083" +"21085" +"21087" +"21089" +"21091" +"21093" +"21095" +"21097" +"21099" +"21101" +"21103" +"21105" +"21107" +"21109" +"21111" +"21113" +"21115" +"21117" +"21119" +"21121" +"21123" +"21125" +"21127" +"21129" +"21131" +"21133" +"21135" +"21137" +"21139" +"21141" +"21143" +"21145" +"21147" +"21149" +"21151" +"21153" +"21155" +"21157" +"21159" +"21161" +"21163" +"21165" +"21167" +"21169" +"21171" +"21173" +"21175" +"21177" +"21179" +"21181" +"21183" +"21185" +"21187" +"21189" +"21191" +"21193" +"21195" +"21197" +"21199" +"21201" +"21203" +"21205" +"21207" +"21209" +"21211" +"21213" +"21215" +"21217" +"21219" +"21221" +"21223" +"21225" +"21227" +"21229" +"21231" +"21233" +"21235" +"21237" +"21239" +"22000" +"22001" +"22003" +"22005" +"22007" +"22009" +"22011" +"22013" +"22015" +"22017" +"22019" +"22021" +"22023" +"22025" +"22027" +"22029" +"22031" +"22033" +"22035" +"22037" +"22039" +"22041" +"22043" +"22045" +"22047" +"22049" +"22051" +"22053" +"22055" +"22057" +"22059" +"22061" +"22063" +"22065" +"22067" +"22069" +"22071" +"22073" +"22075" +"22077" +"22079" +"22081" +"22083" +"22085" +"22087" +"22089" +"22091" +"22093" +"22095" +"22097" +"22099" +"22101" +"22103" +"22105" +"22107" +"22109" +"22111" +"22113" +"22115" +"22117" +"22119" +"22121" +"22123" +"22125" +"22127" +"23000" +"23001" +"23003" +"23005" +"23007" +"23009" +"23011" +"23013" +"23015" +"23017" +"23019" +"23021" +"23023" +"23025" +"23027" +"23029" +"23031" +"24000" +"24001" +"24003" +"24005" +"24009" +"24011" +"24013" +"24015" +"24017" +"24019" +"24021" +"24023" +"24025" +"24027" +"24029" +"24031" +"24033" +"24035" +"24037" +"24039" +"24041" +"24043" +"24045" +"24047" +"24510" +"25000" +"25001" +"25003" +"25005" +"25007" +"25009" +"25011" +"25013" +"25015" +"25017" +"25019" +"25021" +"25023" +"25025" +"25027" +"26000" +"26001" +"26003" +"26005" +"26007" +"26009" +"26011" +"26013" +"26015" +"26017" +"26019" +"26021" +"26023" +"26025" +"26027" +"26029" +"26031" +"26033" +"26035" +"26037" +"26039" +"26041" +"26043" +"26045" +"26047" +"26049" +"26051" +"26053" +"26055" +"26057" +"26059" +"26061" +"26063" +"26065" +"26067" +"26069" +"26071" +"26073" +"26075" +"26077" +"26079" +"26081" +"26083" +"26085" +"26087" +"26089" +"26091" +"26093" +"26095" +"26097" +"26099" +"26101" +"26103" +"26105" +"26107" +"26109" +"26111" +"26113" +"26115" +"26117" +"26119" +"26121" +"26123" +"26125" +"26127" +"26129" +"26131" +"26133" +"26135" +"26137" +"26139" +"26141" +"26143" +"26145" +"26147" +"26149" +"26151" +"26153" +"26155" +"26157" +"26159" +"26161" +"26163" +"26165" +"27000" +"27001" +"27003" +"27005" +"27007" +"27009" +"27011" +"27013" +"27015" +"27017" +"27019" +"27021" +"27023" +"27025" +"27027" +"27029" +"27031" +"27033" +"27035" +"27037" +"27039" +"27041" +"27043" +"27045" +"27047" +"27049" +"27051" +"27053" +"27055" +"27057" +"27059" 
+"27061" +"27063" +"27065" +"27067" +"27069" +"27071" +"27073" +"27075" +"27077" +"27079" +"27081" +"27083" +"27085" +"27087" +"27089" +"27091" +"27093" +"27095" +"27097" +"27099" +"27101" +"27103" +"27105" +"27107" +"27109" +"27111" +"27113" +"27115" +"27117" +"27119" +"27121" +"27123" +"27125" +"27127" +"27129" +"27131" +"27133" +"27135" +"27137" +"27139" +"27141" +"27143" +"27145" +"27147" +"27149" +"27151" +"27153" +"27155" +"27157" +"27159" +"27161" +"27163" +"27165" +"27167" +"27169" +"27171" +"27173" +"28000" +"28001" +"28003" +"28005" +"28007" +"28009" +"28011" +"28013" +"28015" +"28017" +"28019" +"28021" +"28023" +"28025" +"28027" +"28029" +"28031" +"28033" +"28035" +"28037" +"28039" +"28041" +"28043" +"28045" +"28047" +"28049" +"28051" +"28053" +"28055" +"28057" +"28059" +"28061" +"28063" +"28065" +"28067" +"28069" +"28071" +"28073" +"28075" +"28077" +"28079" +"28081" +"28083" +"28085" +"28087" +"28089" +"28091" +"28093" +"28095" +"28097" +"28099" +"28101" +"28103" +"28105" +"28107" +"28109" +"28111" +"28113" +"28115" +"28117" +"28119" +"28121" +"28123" +"28125" +"28127" +"28129" +"28131" +"28133" +"28135" +"28137" +"28139" +"28141" +"28143" +"28145" +"28147" +"28149" +"28151" +"28153" +"28155" +"28157" +"28159" +"28161" +"28163" +"29000" +"29001" +"29003" +"29005" +"29007" +"29009" +"29011" +"29013" +"29015" +"29017" +"29019" +"29021" +"29023" +"29025" +"29027" +"29029" +"29031" +"29033" +"29035" +"29037" +"29039" +"29041" +"29043" +"29045" +"29047" +"29049" +"29051" +"29053" +"29055" +"29057" +"29059" +"29061" +"29063" +"29065" +"29067" +"29069" +"29071" +"29073" +"29075" +"29077" +"29079" +"29081" +"29083" +"29085" +"29087" +"29089" +"29091" +"29093" +"29095" +"29097" +"29099" +"29101" +"29103" +"29105" +"29107" +"29109" +"29111" +"29113" +"29115" +"29117" +"29119" +"29121" +"29123" +"29125" +"29127" +"29129" +"29131" +"29133" +"29135" +"29137" +"29139" +"29141" +"29143" +"29145" +"29147" +"29149" +"29151" +"29153" +"29155" +"29157" +"29159" +"29161" +"29163" +"29165" +"29167" +"29169" +"29171" +"29173" +"29175" +"29177" +"29179" +"29181" +"29183" +"29185" +"29186" +"29187" +"29189" +"29195" +"29197" +"29199" +"29201" +"29203" +"29205" +"29207" +"29209" +"29211" +"29213" +"29215" +"29217" +"29219" +"29221" +"29223" +"29225" +"29227" +"29229" +"29510" +"30000" +"30001" +"30003" +"30005" +"30007" +"30009" +"30011" +"30013" +"30015" +"30017" +"30019" +"30021" +"30023" +"30025" +"30027" +"30029" +"30031" +"30033" +"30035" +"30037" +"30039" +"30041" +"30043" +"30045" +"30047" +"30049" +"30051" +"30053" +"30055" +"30057" +"30059" +"30061" +"30063" +"30065" +"30067" +"30069" +"30071" +"30073" +"30075" +"30077" +"30079" +"30081" +"30083" +"30085" +"30087" +"30089" +"30091" +"30093" +"30095" +"30097" +"30099" +"30101" +"30103" +"30105" +"30107" +"30109" +"30111" +"31000" +"31001" +"31003" +"31005" +"31007" +"31009" +"31011" +"31013" +"31015" +"31017" +"31019" +"31021" +"31023" +"31025" +"31027" +"31029" +"31031" +"31033" +"31035" +"31037" +"31039" +"31041" +"31043" +"31045" +"31047" +"31049" +"31051" +"31053" +"31055" +"31057" +"31059" +"31061" +"31063" +"31065" +"31067" +"31069" +"31071" +"31073" +"31075" +"31077" +"31079" +"31081" +"31083" +"31085" +"31087" +"31089" +"31091" +"31093" +"31095" +"31097" +"31099" +"31101" +"31103" +"31105" +"31107" +"31109" +"31111" +"31113" +"31115" +"31117" +"31119" +"31121" +"31123" +"31125" +"31127" +"31129" +"31131" +"31133" +"31135" +"31137" +"31139" +"31141" +"31143" +"31145" +"31147" +"31149" +"31151" +"31153" +"31155" +"31157" +"31159" +"31161" 
+"31163" +"31165" +"31167" +"31169" +"31171" +"31173" +"31175" +"31177" +"31179" +"31181" +"31183" +"31185" +"32000" +"32001" +"32003" +"32005" +"32007" +"32009" +"32011" +"32013" +"32015" +"32017" +"32019" +"32021" +"32023" +"32027" +"32029" +"32031" +"32033" +"32510" +"33000" +"33001" +"33003" +"33005" +"33007" +"33009" +"33011" +"33013" +"33015" +"33017" +"33019" +"34000" +"34001" +"34003" +"34005" +"34007" +"34009" +"34011" +"34013" +"34015" +"34017" +"34019" +"34021" +"34023" +"34025" +"34027" +"34029" +"34031" +"34033" +"34035" +"34037" +"34039" +"34041" +"35000" +"35001" +"35003" +"35005" +"35006" +"35007" +"35009" +"35011" +"35013" +"35015" +"35017" +"35019" +"35021" +"35023" +"35025" +"35027" +"35028" +"35029" +"35031" +"35033" +"35035" +"35037" +"35039" +"35041" +"35043" +"35045" +"35047" +"35049" +"35051" +"35053" +"35055" +"35057" +"35059" +"35061" +"36000" +"36001" +"36003" +"36005" +"36007" +"36009" +"36011" +"36013" +"36015" +"36017" +"36019" +"36021" +"36023" +"36025" +"36027" +"36029" +"36031" +"36033" +"36035" +"36037" +"36039" +"36041" +"36043" +"36045" +"36047" +"36049" +"36051" +"36053" +"36055" +"36057" +"36059" +"36061" +"36063" +"36065" +"36067" +"36069" +"36071" +"36073" +"36075" +"36077" +"36079" +"36081" +"36083" +"36085" +"36087" +"36089" +"36091" +"36093" +"36095" +"36097" +"36099" +"36101" +"36103" +"36105" +"36107" +"36109" +"36111" +"36113" +"36115" +"36117" +"36119" +"36121" +"36123" +"37000" +"37001" +"37003" +"37005" +"37007" +"37009" +"37011" +"37013" +"37015" +"37017" +"37019" +"37021" +"37023" +"37025" +"37027" +"37029" +"37031" +"37033" +"37035" +"37037" +"37039" +"37041" +"37043" +"37045" +"37047" +"37049" +"37051" +"37053" +"37055" +"37057" +"37059" +"37061" +"37063" +"37065" +"37067" +"37069" +"37071" +"37073" +"37075" +"37077" +"37079" +"37081" +"37083" +"37085" +"37087" +"37089" +"37091" +"37093" +"37095" +"37097" +"37099" +"37101" +"37103" +"37105" +"37107" +"37109" +"37111" +"37113" +"37115" +"37117" +"37119" +"37121" +"37123" +"37125" +"37127" +"37129" +"37131" +"37133" +"37135" +"37137" +"37139" +"37141" +"37143" +"37145" +"37147" +"37149" +"37151" +"37153" +"37155" +"37157" +"37159" +"37161" +"37163" +"37165" +"37167" +"37169" +"37171" +"37173" +"37175" +"37177" +"37179" +"37181" +"37183" +"37185" +"37187" +"37189" +"37191" +"37193" +"37195" +"37197" +"37199" +"38000" +"38001" +"38003" +"38005" +"38007" +"38009" +"38011" +"38013" +"38015" +"38017" +"38019" +"38021" +"38023" +"38025" +"38027" +"38029" +"38031" +"38033" +"38035" +"38037" +"38039" +"38041" +"38043" +"38045" +"38047" +"38049" +"38051" +"38053" +"38055" +"38057" +"38059" +"38061" +"38063" +"38065" +"38067" +"38069" +"38071" +"38073" +"38075" +"38077" +"38079" +"38081" +"38083" +"38085" +"38087" +"38089" +"38091" +"38093" +"38095" +"38097" +"38099" +"38101" +"38103" +"38105" +"39000" +"39001" +"39003" +"39005" +"39007" +"39009" +"39011" +"39013" +"39015" +"39017" +"39019" +"39021" +"39023" +"39025" +"39027" +"39029" +"39031" +"39033" +"39035" +"39037" +"39039" +"39041" +"39043" +"39045" +"39047" +"39049" +"39051" +"39053" +"39055" +"39057" +"39059" +"39061" +"39063" +"39065" +"39067" +"39069" +"39071" +"39073" +"39075" +"39077" +"39079" +"39081" +"39083" +"39085" +"39087" +"39089" +"39091" +"39093" +"39095" +"39097" +"39099" +"39101" +"39103" +"39105" +"39107" +"39109" +"39111" +"39113" +"39115" +"39117" +"39119" +"39121" +"39123" +"39125" +"39127" +"39129" +"39131" +"39133" +"39135" +"39137" +"39139" +"39141" +"39143" +"39145" +"39147" +"39149" +"39151" +"39153" +"39155" +"39157" 
+"39159" +"39161" +"39163" +"39165" +"39167" +"39169" +"39171" +"39173" +"39175" +"40000" +"40001" +"40003" +"40005" +"40007" +"40009" +"40011" +"40013" +"40015" +"40017" +"40019" +"40021" +"40023" +"40025" +"40027" +"40029" +"40031" +"40033" +"40035" +"40037" +"40039" +"40041" +"40043" +"40045" +"40047" +"40049" +"40051" +"40053" +"40055" +"40057" +"40059" +"40061" +"40063" +"40065" +"40067" +"40069" +"40071" +"40073" +"40075" +"40077" +"40079" +"40081" +"40083" +"40085" +"40087" +"40089" +"40091" +"40093" +"40095" +"40097" +"40099" +"40101" +"40103" +"40105" +"40107" +"40109" +"40111" +"40113" +"40115" +"40117" +"40119" +"40121" +"40123" +"40125" +"40127" +"40129" +"40131" +"40133" +"40135" +"40137" +"40139" +"40141" +"40143" +"40145" +"40147" +"40149" +"40151" +"40153" +"41000" +"41001" +"41003" +"41005" +"41007" +"41009" +"41011" +"41013" +"41015" +"41017" +"41019" +"41021" +"41023" +"41025" +"41027" +"41029" +"41031" +"41033" +"41035" +"41037" +"41039" +"41041" +"41043" +"41045" +"41047" +"41049" +"41051" +"41053" +"41055" +"41057" +"41059" +"41061" +"41063" +"41065" +"41067" +"41069" +"41071" +"42000" +"42001" +"42003" +"42005" +"42007" +"42009" +"42011" +"42013" +"42015" +"42017" +"42019" +"42021" +"42023" +"42025" +"42027" +"42029" +"42031" +"42033" +"42035" +"42037" +"42039" +"42041" +"42043" +"42045" +"42047" +"42049" +"42051" +"42053" +"42055" +"42057" +"42059" +"42061" +"42063" +"42065" +"42067" +"42069" +"42071" +"42073" +"42075" +"42077" +"42079" +"42081" +"42083" +"42085" +"42087" +"42089" +"42091" +"42093" +"42095" +"42097" +"42099" +"42101" +"42103" +"42105" +"42107" +"42109" +"42111" +"42113" +"42115" +"42117" +"42119" +"42121" +"42123" +"42125" +"42127" +"42129" +"42131" +"42133" +"44000" +"44001" +"44003" +"44005" +"44007" +"44009" +"45000" +"45001" +"45003" +"45005" +"45007" +"45009" +"45011" +"45013" +"45015" +"45017" +"45019" +"45021" +"45023" +"45025" +"45027" +"45029" +"45031" +"45033" +"45035" +"45037" +"45039" +"45041" +"45043" +"45045" +"45047" +"45049" +"45051" +"45053" +"45055" +"45057" +"45059" +"45061" +"45063" +"45065" +"45067" +"45069" +"45071" +"45073" +"45075" +"45077" +"45079" +"45081" +"45083" +"45085" +"45087" +"45089" +"45091" +"46000" +"46003" +"46005" +"46007" +"46009" +"46011" +"46013" +"46015" +"46017" +"46019" +"46021" +"46023" +"46025" +"46027" +"46029" +"46031" +"46033" +"46035" +"46037" +"46039" +"46041" +"46043" +"46045" +"46047" +"46049" +"46051" +"46053" +"46055" +"46057" +"46059" +"46061" +"46063" +"46065" +"46067" +"46069" +"46071" +"46073" +"46075" +"46077" +"46079" +"46081" +"46083" +"46085" +"46087" +"46089" +"46091" +"46093" +"46095" +"46097" +"46099" +"46101" +"46102" +"46103" +"46105" +"46107" +"46109" +"46111" +"46113" +"46115" +"46117" +"46119" +"46121" +"46123" +"46125" +"46127" +"46129" +"46135" +"46137" +"47000" +"47001" +"47003" +"47005" +"47007" +"47009" +"47011" +"47013" +"47015" +"47017" +"47019" +"47021" +"47023" +"47025" +"47027" +"47029" +"47031" +"47033" +"47035" +"47037" +"47039" +"47041" +"47043" +"47045" +"47047" +"47049" +"47051" +"47053" +"47055" +"47057" +"47059" +"47061" +"47063" +"47065" +"47067" +"47069" +"47071" +"47073" +"47075" +"47077" +"47079" +"47081" +"47083" +"47085" +"47087" +"47089" +"47091" +"47093" +"47095" +"47097" +"47099" +"47101" +"47103" +"47105" +"47107" +"47109" +"47111" +"47113" +"47115" +"47117" +"47119" +"47121" +"47123" +"47125" +"47127" +"47129" +"47131" +"47133" +"47135" +"47137" +"47139" +"47141" +"47143" +"47145" +"47147" +"47149" +"47151" +"47153" +"47155" +"47157" +"47159" +"47161" 
+"47163" +"47165" +"47167" +"47169" +"47171" +"47173" +"47175" +"47177" +"47179" +"47181" +"47183" +"47185" +"47187" +"47189" +"48000" +"48001" +"48003" +"48005" +"48007" +"48009" +"48011" +"48013" +"48015" +"48017" +"48019" +"48021" +"48023" +"48025" +"48027" +"48029" +"48031" +"48033" +"48035" +"48037" +"48039" +"48041" +"48043" +"48045" +"48047" +"48049" +"48051" +"48053" +"48055" +"48057" +"48059" +"48061" +"48063" +"48065" +"48067" +"48069" +"48071" +"48073" +"48075" +"48077" +"48079" +"48081" +"48083" +"48085" +"48087" +"48089" +"48091" +"48093" +"48095" +"48097" +"48099" +"48101" +"48103" +"48105" +"48107" +"48109" +"48111" +"48113" +"48115" +"48117" +"48119" +"48121" +"48123" +"48125" +"48127" +"48129" +"48131" +"48133" +"48135" +"48137" +"48139" +"48141" +"48143" +"48145" +"48147" +"48149" +"48151" +"48153" +"48155" +"48157" +"48159" +"48161" +"48163" +"48165" +"48167" +"48169" +"48171" +"48173" +"48175" +"48177" +"48179" +"48181" +"48183" +"48185" +"48187" +"48189" +"48191" +"48193" +"48195" +"48197" +"48199" +"48201" +"48203" +"48205" +"48207" +"48209" +"48211" +"48213" +"48215" +"48217" +"48219" +"48221" +"48223" +"48225" +"48227" +"48229" +"48231" +"48233" +"48235" +"48237" +"48239" +"48241" +"48243" +"48245" +"48247" +"48249" +"48251" +"48253" +"48255" +"48257" +"48259" +"48261" +"48263" +"48265" +"48267" +"48269" +"48271" +"48273" +"48275" +"48277" +"48279" +"48281" +"48283" +"48285" +"48287" +"48289" +"48291" +"48293" +"48295" +"48297" +"48299" +"48301" +"48303" +"48305" +"48307" +"48309" +"48311" +"48313" +"48315" +"48317" +"48319" +"48321" +"48323" +"48325" +"48327" +"48329" +"48331" +"48333" +"48335" +"48337" +"48339" +"48341" +"48343" +"48345" +"48347" +"48349" +"48351" +"48353" +"48355" +"48357" +"48359" +"48361" +"48363" +"48365" +"48367" +"48369" +"48371" +"48373" +"48375" +"48377" +"48379" +"48381" +"48383" +"48385" +"48387" +"48389" +"48391" +"48393" +"48395" +"48397" +"48399" +"48401" +"48403" +"48405" +"48407" +"48409" +"48411" +"48413" +"48415" +"48417" +"48419" +"48421" +"48423" +"48425" +"48427" +"48429" +"48431" +"48433" +"48435" +"48437" +"48439" +"48441" +"48443" +"48445" +"48447" +"48449" +"48451" +"48453" +"48455" +"48457" +"48459" +"48461" +"48463" +"48465" +"48467" +"48469" +"48471" +"48473" +"48475" +"48477" +"48479" +"48481" +"48483" +"48485" +"48487" +"48489" +"48491" +"48493" +"48495" +"48497" +"48499" +"48501" +"48503" +"48505" +"48507" +"49000" +"49001" +"49003" +"49005" +"49007" +"49009" +"49011" +"49013" +"49015" +"49017" +"49019" +"49021" +"49023" +"49025" +"49027" +"49029" +"49031" +"49033" +"49035" +"49037" +"49039" +"49041" +"49043" +"49045" +"49047" +"49049" +"49051" +"49053" +"49055" +"49057" +"50000" +"50001" +"50003" +"50005" +"50007" +"50009" +"50011" +"50013" +"50015" +"50017" +"50019" +"50021" +"50023" +"50025" +"50027" +"51000" +"51001" +"51003" +"51005" +"51007" +"51009" +"51011" +"51013" +"51015" +"51017" +"51019" +"51021" +"51023" +"51025" +"51027" +"51029" +"51031" +"51033" +"51035" +"51036" +"51037" +"51041" +"51043" +"51045" +"51047" +"51049" +"51051" +"51053" +"51057" +"51059" +"51061" +"51063" +"51065" +"51067" +"51069" +"51071" +"51073" +"51075" +"51077" +"51079" +"51081" +"51083" +"51085" +"51087" +"51089" +"51091" +"51093" +"51095" +"51097" +"51099" +"51101" +"51103" +"51105" +"51107" +"51109" +"51111" +"51113" +"51115" +"51117" +"51119" +"51121" +"51125" +"51127" +"51131" +"51133" +"51135" +"51137" +"51139" +"51141" +"51143" +"51145" +"51147" +"51149" +"51153" +"51155" +"51157" +"51159" +"51161" +"51163" +"51165" +"51167" 
+"51169" +"51171" +"51173" +"51175" +"51177" +"51179" +"51181" +"51183" +"51185" +"51187" +"51191" +"51193" +"51195" +"51197" +"51199" +"51510" +"51520" +"51530" +"51540" +"51550" +"51570" +"51580" +"51590" +"51595" +"51600" +"51610" +"51620" +"51630" +"51640" +"51650" +"51660" +"51670" +"51678" +"51680" +"51683" +"51685" +"51690" +"51700" +"51710" +"51720" +"51730" +"51735" +"51740" +"51750" +"51760" +"51770" +"51775" +"51790" +"51800" +"51810" +"51820" +"51830" +"51840" +"53000" +"53001" +"53003" +"53005" +"53007" +"53009" +"53011" +"53013" +"53015" +"53017" +"53019" +"53021" +"53023" +"53025" +"53027" +"53029" +"53031" +"53033" +"53035" +"53037" +"53039" +"53041" +"53043" +"53045" +"53047" +"53049" +"53051" +"53053" +"53055" +"53057" +"53059" +"53061" +"53063" +"53065" +"53067" +"53069" +"53071" +"53073" +"53075" +"53077" +"54000" +"54001" +"54003" +"54005" +"54007" +"54009" +"54011" +"54013" +"54015" +"54017" +"54019" +"54021" +"54023" +"54025" +"54027" +"54029" +"54031" +"54033" +"54035" +"54037" +"54039" +"54041" +"54043" +"54045" +"54047" +"54049" +"54051" +"54053" +"54055" +"54057" +"54059" +"54061" +"54063" +"54065" +"54067" +"54069" +"54071" +"54073" +"54075" +"54077" +"54079" +"54081" +"54083" +"54085" +"54087" +"54089" +"54091" +"54093" +"54095" +"54097" +"54099" +"54101" +"54103" +"54105" +"54107" +"54109" +"55000" +"55001" +"55003" +"55005" +"55007" +"55009" +"55011" +"55013" +"55015" +"55017" +"55019" +"55021" +"55023" +"55025" +"55027" +"55029" +"55031" +"55033" +"55035" +"55037" +"55039" +"55041" +"55043" +"55045" +"55047" +"55049" +"55051" +"55053" +"55055" +"55057" +"55059" +"55061" +"55063" +"55065" +"55067" +"55069" +"55071" +"55073" +"55075" +"55077" +"55078" +"55079" +"55081" +"55083" +"55085" +"55087" +"55089" +"55091" +"55093" +"55095" +"55097" +"55099" +"55101" +"55103" +"55105" +"55107" +"55109" +"55111" +"55113" +"55115" +"55117" +"55119" +"55121" +"55123" +"55125" +"55127" +"55129" +"55131" +"55133" +"55135" +"55137" +"55139" +"55141" +"56000" +"56001" +"56003" +"56005" +"56007" +"56009" +"56011" +"56013" +"56015" +"56017" +"56019" +"56021" +"56023" +"56025" +"56027" +"56029" +"56031" +"56033" +"56035" +"56037" +"56039" +"56041" +"56043" +"56045" +"60000" +"66000" +"69000" +"70002" +"70003" +"72000" +"72001" +"72003" +"72005" +"72007" +"72009" +"72011" +"72013" +"72015" +"72017" +"72019" +"72021" +"72023" +"72025" +"72027" +"72029" +"72031" +"72033" +"72035" +"72037" +"72039" +"72041" +"72043" +"72045" +"72047" +"72049" +"72051" +"72053" +"72054" +"72055" +"72057" +"72059" +"72061" +"72063" +"72065" +"72067" +"72069" +"72071" +"72073" +"72075" +"72077" +"72079" +"72081" +"72083" +"72085" +"72087" +"72089" +"72091" +"72093" +"72095" +"72097" +"72099" +"72101" +"72103" +"72105" +"72107" +"72109" +"72111" +"72113" +"72115" +"72117" +"72119" +"72121" +"72123" +"72125" +"72127" +"72129" +"72131" +"72133" +"72135" +"72137" +"72139" +"72141" +"72143" +"72145" +"72147" +"72149" +"72151" +"72153" +"72888" +"72999" +"78000" diff --git a/validator/static/dma_geo.csv b/validator/static/dma_geo.csv new file mode 100644 index 000000000..3315ebd11 --- /dev/null +++ b/validator/static/dma_geo.csv @@ -0,0 +1,211 @@ +"geo_id" +"500" +"501" +"502" +"503" +"504" +"505" +"506" +"507" +"508" +"509" +"510" +"511" +"512" +"513" +"514" +"515" +"516" +"517" +"518" +"519" +"520" +"521" +"522" +"523" +"524" +"525" +"526" +"527" +"528" +"529" +"530" +"531" +"532" +"533" +"534" +"535" +"536" +"537" +"538" +"539" +"540" +"541" +"542" +"543" +"544" +"545" +"546" +"547" +"548" +"549" +"550" 
+"551" +"552" +"553" +"554" +"555" +"556" +"557" +"558" +"559" +"560" +"561" +"563" +"564" +"565" +"566" +"567" +"569" +"570" +"571" +"573" +"574" +"575" +"576" +"577" +"581" +"582" +"583" +"584" +"588" +"592" +"596" +"597" +"598" +"600" +"602" +"603" +"604" +"605" +"606" +"609" +"610" +"611" +"612" +"613" +"616" +"617" +"618" +"619" +"622" +"623" +"624" +"625" +"626" +"627" +"628" +"630" +"631" +"632" +"633" +"634" +"635" +"636" +"637" +"638" +"639" +"640" +"641" +"642" +"643" +"644" +"647" +"648" +"649" +"650" +"651" +"652" +"656" +"657" +"658" +"659" +"661" +"662" +"669" +"670" +"671" +"673" +"675" +"676" +"678" +"679" +"682" +"686" +"687" +"691" +"692" +"693" +"698" +"702" +"705" +"709" +"710" +"711" +"716" +"717" +"718" +"722" +"724" +"725" +"734" +"736" +"737" +"740" +"743" +"744" +"745" +"746" +"747" +"749" +"751" +"752" +"753" +"754" +"755" +"756" +"757" +"758" +"759" +"760" +"762" +"764" +"765" +"766" +"767" +"770" +"771" +"773" +"789" +"790" +"798" +"800" +"801" +"802" +"803" +"804" +"807" +"810" +"811" +"813" +"819" +"820" +"821" +"825" +"828" +"839" +"855" +"862" +"866" +"868" +"881" diff --git a/validator/static/hrr_geo.csv b/validator/static/hrr_geo.csv new file mode 100644 index 000000000..4e9042de5 --- /dev/null +++ b/validator/static/hrr_geo.csv @@ -0,0 +1,307 @@ +"geo_id" +"1" +"10" +"101" +"102" +"103" +"104" +"105" +"106" +"107" +"109" +"11" +"110" +"111" +"112" +"113" +"115" +"116" +"118" +"119" +"12" +"120" +"122" +"123" +"124" +"127" +"129" +"130" +"131" +"133" +"134" +"137" +"139" +"14" +"140" +"141" +"142" +"144" +"145" +"146" +"147" +"148" +"149" +"15" +"150" +"151" +"152" +"154" +"155" +"156" +"158" +"16" +"161" +"163" +"164" +"166" +"170" +"171" +"172" +"173" +"175" +"179" +"18" +"180" +"181" +"183" +"184" +"185" +"186" +"187" +"188" +"19" +"190" +"191" +"192" +"193" +"194" +"195" +"196" +"197" +"2" +"200" +"201" +"203" +"204" +"205" +"207" +"208" +"209" +"21" +"210" +"212" +"213" +"214" +"216" +"217" +"218" +"219" +"22" +"220" +"221" +"222" +"223" +"225" +"226" +"227" +"23" +"230" +"231" +"232" +"233" +"234" +"235" +"236" +"238" +"239" +"240" +"242" +"243" +"244" +"245" +"246" +"248" +"249" +"25" +"250" +"251" +"253" +"254" +"256" +"257" +"258" +"259" +"260" +"261" +"262" +"263" +"264" +"267" +"268" +"270" +"273" +"274" +"275" +"276" +"277" +"278" +"279" +"280" +"281" +"282" +"283" +"284" +"285" +"288" +"289" +"291" +"292" +"293" +"295" +"296" +"297" +"299" +"300" +"301" +"303" +"304" +"307" +"308" +"309" +"31" +"311" +"312" +"313" +"314" +"315" +"318" +"319" +"320" +"321" +"322" +"323" +"324" +"325" +"326" +"327" +"328" +"329" +"33" +"330" +"331" +"332" +"334" +"335" +"336" +"339" +"340" +"341" +"342" +"343" +"344" +"345" +"346" +"347" +"350" +"351" +"352" +"354" +"355" +"356" +"357" +"358" +"359" +"360" +"362" +"363" +"364" +"365" +"366" +"367" +"368" +"369" +"370" +"371" +"373" +"374" +"375" +"376" +"377" +"379" +"380" +"382" +"383" +"385" +"386" +"388" +"390" +"391" +"393" +"394" +"396" +"397" +"399" +"400" +"402" +"406" +"411" +"412" +"413" +"416" +"417" +"418" +"420" +"421" +"422" +"423" +"424" +"426" +"427" +"428" +"429" +"43" +"430" +"431" +"432" +"435" +"437" +"438" +"439" +"440" +"441" +"442" +"443" +"444" +"445" +"446" +"447" +"448" +"449" +"450" +"451" +"452" +"456" +"457" +"5" +"56" +"58" +"6" +"62" +"65" +"69" +"7" +"73" +"77" +"78" +"79" +"80" +"81" +"82" +"83" +"85" +"86" +"87" +"89" +"9" +"91" +"96" diff --git a/validator/static/msa_geo.csv b/validator/static/msa_geo.csv new file mode 100644 index 000000000..a8d1043d6 --- /dev/null +++ 
b/validator/static/msa_geo.csv @@ -0,0 +1,393 @@ +"geo_id" +"10180" +"10380" +"10420" +"10500" +"10540" +"10580" +"10740" +"10780" +"10900" +"11020" +"11100" +"11180" +"11260" +"11460" +"11500" +"11540" +"11640" +"11700" +"12020" +"12060" +"12100" +"12220" +"12260" +"12420" +"12540" +"12580" +"12620" +"12700" +"12940" +"12980" +"13020" +"13140" +"13220" +"13380" +"13460" +"13740" +"13780" +"13820" +"13900" +"13980" +"14010" +"14020" +"14100" +"14260" +"14460" +"14500" +"14540" +"14740" +"14860" +"15180" +"15260" +"15380" +"15500" +"15540" +"15680" +"15940" +"15980" +"16020" +"16060" +"16180" +"16220" +"16300" +"16540" +"16580" +"16620" +"16700" +"16740" +"16820" +"16860" +"16940" +"16980" +"17020" +"17140" +"17300" +"17420" +"17460" +"17660" +"17780" +"17820" +"17860" +"17900" +"17980" +"18020" +"18140" +"18580" +"18700" +"18880" +"19060" +"19100" +"19140" +"19180" +"19300" +"19340" +"19430" +"19460" +"19500" +"19660" +"19740" +"19780" +"19820" +"20020" +"20100" +"20220" +"20260" +"20500" +"20700" +"20740" +"20940" +"21060" +"21140" +"21300" +"21340" +"21420" +"21500" +"21660" +"21780" +"21820" +"22020" +"22140" +"22180" +"22220" +"22380" +"22420" +"22500" +"22520" +"22540" +"22660" +"22900" +"23060" +"23420" +"23460" +"23540" +"23580" +"23900" +"24020" +"24140" +"24220" +"24260" +"24300" +"24340" +"24420" +"24500" +"24540" +"24580" +"24660" +"24780" +"24860" +"25020" +"25060" +"25180" +"25220" +"25260" +"25420" +"25500" +"25540" +"25620" +"25860" +"25940" +"25980" +"26140" +"26300" +"26380" +"26420" +"26580" +"26620" +"26820" +"26900" +"26980" +"27060" +"27100" +"27140" +"27180" +"27260" +"27340" +"27500" +"27620" +"27740" +"27780" +"27860" +"27900" +"27980" +"28020" +"28100" +"28140" +"28420" +"28660" +"28700" +"28740" +"28940" +"29020" +"29100" +"29180" +"29200" +"29340" +"29420" +"29460" +"29540" +"29620" +"29700" +"29740" +"29820" +"29940" +"30020" +"30140" +"30300" +"30340" +"30460" +"30620" +"30700" +"30780" +"30860" +"30980" +"31020" +"31080" +"31140" +"31180" +"31340" +"31420" +"31460" +"31540" +"31700" +"31740" +"31860" +"31900" +"32420" +"32580" +"32780" +"32820" +"32900" +"33100" +"33140" +"33220" +"33260" +"33340" +"33460" +"33540" +"33660" +"33700" +"33740" +"33780" +"33860" +"34060" +"34100" +"34580" +"34620" +"34740" +"34820" +"34900" +"34940" +"34980" +"35100" +"35300" +"35380" +"35620" +"35660" +"35840" +"35980" +"36100" +"36140" +"36220" +"36260" +"36420" +"36500" +"36540" +"36740" +"36780" +"36980" +"37100" +"37340" +"37460" +"37620" +"37860" +"37900" +"37980" +"38060" +"38220" +"38300" +"38340" +"38540" +"38660" +"38860" +"38900" +"38940" +"39100" +"39150" +"39300" +"39340" +"39380" +"39460" +"39540" +"39580" +"39660" +"39740" +"39820" +"39900" +"40060" +"40140" +"40220" +"40340" +"40380" +"40420" +"40580" +"40660" +"40900" +"40980" +"41060" +"41100" +"41140" +"41180" +"41420" +"41500" +"41540" +"41620" +"41660" +"41700" +"41740" +"41860" +"41900" +"41940" +"41980" +"42020" +"42100" +"42140" +"42200" +"42220" +"42340" +"42540" +"42660" +"42680" +"42700" +"43100" +"43300" +"43340" +"43420" +"43580" +"43620" +"43780" +"43900" +"44060" +"44100" +"44140" +"44180" +"44220" +"44300" +"44420" +"44700" +"44940" +"45060" +"45220" +"45300" +"45460" +"45500" +"45540" +"45780" +"45820" +"45940" +"46060" +"46140" +"46220" +"46300" +"46340" +"46520" +"46540" +"46660" +"46700" +"47020" +"47220" +"47260" +"47300" +"47380" +"47460" +"47580" +"47900" +"47940" +"48060" +"48140" +"48260" +"48300" +"48540" +"48620" +"48660" +"48700" +"48900" +"49020" +"49180" +"49340" +"49420" +"49500" 
+"49620" +"49660" +"49700" +"49740" diff --git a/validator/static/national_geo.csv b/validator/static/national_geo.csv new file mode 100644 index 000000000..f445fd82d --- /dev/null +++ b/validator/static/national_geo.csv @@ -0,0 +1,2 @@ +"geo_id" +"us" diff --git a/validator/static/state_geo.csv b/validator/static/state_geo.csv new file mode 100644 index 000000000..8bba20eac --- /dev/null +++ b/validator/static/state_geo.csv @@ -0,0 +1,57 @@ +"geo_id" +"ak" +"al" +"ar" +"as" +"az" +"ca" +"co" +"ct" +"dc" +"de" +"fl" +"ga" +"gu" +"hi" +"ia" +"id" +"il" +"in" +"ks" +"ky" +"la" +"ma" +"md" +"me" +"mi" +"mn" +"mo" +"mp" +"ms" +"mt" +"nc" +"nd" +"ne" +"nh" +"nj" +"nm" +"nv" +"ny" +"oh" +"ok" +"or" +"pa" +"pr" +"ri" +"sc" +"sd" +"tn" +"tx" +"ut" +"va" +"vi" +"vt" +"wa" +"wi" +"wv" +"wy" diff --git a/validator/tests/test_checks.py b/validator/tests/test_checks.py new file mode 100644 index 000000000..04ee98c71 --- /dev/null +++ b/validator/tests/test_checks.py @@ -0,0 +1,628 @@ +import pytest +from datetime import date, datetime, timedelta +import numpy as np +import pandas as pd + +from delphi_validator.datafetcher import filename_regex +from delphi_validator.validate import Validator, make_date_filter + + +class TestDateFilter: + + def test_same_day(self): + start_date = end_date = datetime.strptime("20200902", "%Y%m%d") + date_filter = make_date_filter( + start_date, end_date) + + filenames = [(f, filename_regex.match(f)) + for f in ("20200901_county_signal_signal.csv", + "20200902_county_signal_signal.csv", + "20200903_county_signal_signal.csv")] + + subset_filenames = [(f, m) for (f, m) in filenames if date_filter(m)] + + assert len(subset_filenames) == 1 + assert subset_filenames[0][0] == "20200902_county_signal_signal.csv" + + def test_inclusive(self): + start_date = datetime.strptime("20200902", "%Y%m%d") + end_date = datetime.strptime("20200903", "%Y%m%d") + date_filter = make_date_filter( + start_date, end_date) + + filenames = [(f, filename_regex.match(f)) + for f in ("20200901_county_signal_signal.csv", + "20200902_county_signal_signal.csv", + "20200903_county_signal_signal.csv", + "20200904_county_signal_signal.csv")] + + subset_filenames = [(f, m) for (f, m) in filenames if date_filter(m)] + + assert len(subset_filenames) == 2 + + def test_empty(self): + start_date = datetime.strptime("20200902", "%Y%m%d") + end_date = datetime.strptime("20200903", "%Y%m%d") + date_filter = make_date_filter( + start_date, end_date) + + filenames = [(f, filename_regex.match(f)) + for f in ()] + + subset_filenames = [(f, m) for (f, m) in filenames if date_filter(m)] + + assert len(subset_filenames) == 0 + + +class TestValidatorInitialization: + + def test_default_settings(self): + params = {"data_source": "", "span_length": 0, + "end_date": "2020-09-01", "expected_lag": {}} + validator = Validator(params) + + assert validator.max_check_lookbehind == timedelta(days=7) + assert validator.minimum_sample_size == 100 + assert validator.missing_se_allowed == False + assert validator.missing_sample_size_allowed == False + assert validator.sanity_check_rows_per_day == True + assert validator.sanity_check_value_diffs == True + assert len(validator.suppressed_errors) == 0 + assert isinstance(validator.suppressed_errors, set) + assert len(validator.raised_errors) == 0 + + +class TestCheckMissingDates: + + def test_empty_filelist(self): + params = {"data_source": "", "span_length": 8, + "end_date": "2020-09-09", "expected_lag": {}} + validator = Validator(params) + + filenames = list() + 
validator.check_missing_date_files(filenames) + + assert len(validator.raised_errors) == 1 + assert "check_missing_date_files" in [ + err.check_data_id[0] for err in validator.raised_errors] + assert len(validator.raised_errors[0].expression) == 9 + + def test_same_day(self): + params = {"data_source": "", "span_length": 0, + "end_date": "2020-09-01", "expected_lag": {}} + validator = Validator(params) + + filenames = [("20200901_county_signal_signal.csv", "match_obj")] + validator.check_missing_date_files(filenames) + + assert len(validator.raised_errors) == 0 + assert "check_missing_date_files" not in [ + err.check_data_id[0] for err in validator.raised_errors] + + def test_duplicate_dates(self): + params = {"data_source": "", "span_length": 1, + "end_date": "2020-09-02", "expected_lag": {}} + validator = Validator(params) + + filenames = [("20200901_county_signal_signal.csv", "match_obj"), + ("20200903_county_signal_signal.csv", "match_obj"), + ("20200903_usa_signal_signal.csv", "match_obj"), + ("20200903_usa_signal_signal.csv", "match_obj")] + validator.check_missing_date_files(filenames) + + assert len(validator.raised_errors) == 1 + assert "check_missing_date_files" in [ + err.check_data_id[0] for err in validator.raised_errors] + assert len([err.expression[0] for + err in validator.raised_errors if err.check_data_id[0] == + "check_missing_date_files"]) == 1 + assert [err.expression[0] for + err in validator.raised_errors if err.check_data_id[0] == + "check_missing_date_files"][0] == datetime.strptime("20200902", "%Y%m%d").date() + + +class TestNameFormat: + + def test_match_existence(self): + pattern_found = filename_regex.match("20200903_usa_signal_signal.csv") + assert pattern_found + + pattern_found = filename_regex.match("2020090_usa_signal_signal.csv") + assert not pattern_found + + pattern_found = filename_regex.match("20200903_usa_signal_signal.pdf") + assert not pattern_found + + pattern_found = filename_regex.match("20200903_usa_.csv") + assert not pattern_found + + def test_expected_groups(self): + pattern_found = filename_regex.match( + "20200903_usa_signal_signal.csv").groupdict() + assert pattern_found["date"] == "20200903" + assert pattern_found["geo_type"] == "usa" + assert pattern_found["signal"] == "signal_signal" + + +class TestCheckBadGeoIdFormat: + params = {"data_source": "", "span_length": 0, + "end_date": "2020-09-02", "expected_lag": {}} + + def test_empty_df(self): + validator = Validator(self.params) + empty_df = pd.DataFrame(columns=["geo_id"], dtype=str) + validator.check_bad_geo_id_format(empty_df, "name", "county") + + assert len(validator.raised_errors) == 0 + + def test_invalid_geo_type(self): + validator = Validator(self.params) + empty_df = pd.DataFrame(columns=["geo_id"], dtype=str) + validator.check_bad_geo_id_format(empty_df, "name", "hello") + + assert len(validator.raised_errors) == 1 + assert "check_geo_type" in [ + err.check_data_id[0] for err in validator.raised_errors] + assert [err.expression for + err in validator.raised_errors if err.check_data_id[0] == + "check_geo_type"][0] == "hello" + + def test_invalid_geo_id_county(self): + validator = Validator(self.params) + df = pd.DataFrame(["0", "54321", "123", ".0000", + "abc12"], columns=["geo_id"]) + validator.check_bad_geo_id_format(df, "name", "county") + + assert len(validator.raised_errors) == 1 + assert "check_geo_id_format" in validator.raised_errors[0].check_data_id + assert len(validator.raised_errors[0].expression) == 2 + assert "54321" not in validator.raised_errors[0].expression 
+ + def test_invalid_geo_id_msa(self): + validator = Validator(self.params) + df = pd.DataFrame(["0", "54321", "123", ".0000", + "abc12"], columns=["geo_id"]) + validator.check_bad_geo_id_format(df, "name", "msa") + + assert len(validator.raised_errors) == 1 + assert "check_geo_id_format" in validator.raised_errors[0].check_data_id + assert len(validator.raised_errors[0].expression) == 2 + assert "54321" not in validator.raised_errors[0].expression + + def test_invalid_geo_id_hrr(self): + validator = Validator(self.params) + df = pd.DataFrame(["1", "12", "123", "1234", "12345", + "a", ".", "ab1"], columns=["geo_id"]) + validator.check_bad_geo_id_format(df, "name", "hrr") + + assert len(validator.raised_errors) == 1 + assert "check_geo_id_format" in validator.raised_errors[0].check_data_id + assert len(validator.raised_errors[0].expression) == 5 + assert "1" not in validator.raised_errors[0].expression + assert "12" not in validator.raised_errors[0].expression + assert "123" not in validator.raised_errors[0].expression + + def test_invalid_geo_id_state(self): + validator = Validator(self.params) + df = pd.DataFrame(["aa", "hi", "HI", "hawaii", + "Hawaii", "a", "H.I."], columns=["geo_id"]) + validator.check_bad_geo_id_format(df, "name", "state") + + assert len(validator.raised_errors) == 1 + assert "check_geo_id_format" in validator.raised_errors[0].check_data_id + assert len(validator.raised_errors[0].expression) == 4 + assert "aa" not in validator.raised_errors[0].expression + assert "hi" not in validator.raised_errors[0].expression + assert "HI" not in validator.raised_errors[0].expression + + def test_invalid_geo_id_national(self): + validator = Validator(self.params) + df = pd.DataFrame(["usa", "SP", " us", "us", + "usausa", "US"], columns=["geo_id"]) + validator.check_bad_geo_id_format(df, "name", "national") + + assert len(validator.raised_errors) == 1 + assert "check_geo_id_format" in validator.raised_errors[0].check_data_id + assert len(validator.raised_errors[0].expression) == 3 + assert "us" not in validator.raised_errors[0].expression + assert "US" not in validator.raised_errors[0].expression + assert "SP" not in validator.raised_errors[0].expression + + +class TestCheckBadGeoIdValue: + params = {"data_source": "", "span_length": 0, + "end_date": "2020-09-02", "expected_lag": {}, + "validator_static_file_dir": "../static"} + + def test_empty_df(self): + validator = Validator(self.params) + empty_df = pd.DataFrame(columns=["geo_id"], dtype=str) + validator.check_bad_geo_id_value(empty_df, "name", "county") + assert len(validator.raised_errors) == 0 + + def test_invalid_geo_id_county(self): + validator = Validator(self.params) + df = pd.DataFrame(["01001", "88888", "99999"], columns=["geo_id"]) + validator.check_bad_geo_id_value(df, "name", "county") + + assert len(validator.raised_errors) == 1 + assert "check_bad_geo_id_value" in validator.raised_errors[0].check_data_id + assert len(validator.raised_errors[0].expression) == 2 + assert "01001" not in validator.raised_errors[0].expression + assert "88888" in validator.raised_errors[0].expression + assert "99999" in validator.raised_errors[0].expression + + def test_invalid_geo_id_msa(self): + validator = Validator(self.params) + df = pd.DataFrame(["10180", "88888", "99999"], columns=["geo_id"]) + validator.check_bad_geo_id_value(df, "name", "msa") + + assert len(validator.raised_errors) == 1 + assert "check_bad_geo_id_value" in validator.raised_errors[0].check_data_id + assert len(validator.raised_errors[0].expression) == 2 + assert 
"10180" not in validator.raised_errors[0].expression + assert "88888" in validator.raised_errors[0].expression + assert "99999" in validator.raised_errors[0].expression + + def test_invalid_geo_id_hrr(self): + validator = Validator(self.params) + df = pd.DataFrame(["1", "11", "111", "8", "88", + "888"], columns=["geo_id"]) + validator.check_bad_geo_id_value(df, "name", "hrr") + + assert len(validator.raised_errors) == 1 + assert "check_bad_geo_id_value" in validator.raised_errors[0].check_data_id + assert len(validator.raised_errors[0].expression) == 3 + assert "1" not in validator.raised_errors[0].expression + assert "11" not in validator.raised_errors[0].expression + assert "111" not in validator.raised_errors[0].expression + assert "8" in validator.raised_errors[0].expression + assert "88" in validator.raised_errors[0].expression + assert "888" in validator.raised_errors[0].expression + + def test_invalid_geo_id_state(self): + validator = Validator(self.params) + df = pd.DataFrame(["aa", "ak"], columns=["geo_id"]) + validator.check_bad_geo_id_value(df, "name", "state") + + assert len(validator.raised_errors) == 1 + assert "check_bad_geo_id_value" in validator.raised_errors[0].check_data_id + assert len(validator.raised_errors[0].expression) == 1 + assert "ak" not in validator.raised_errors[0].expression + assert "aa" in validator.raised_errors[0].expression + + def test_uppercase_geo_id(self): + validator = Validator(self.params) + df = pd.DataFrame(["ak", "AK"], columns=["geo_id"]) + validator.check_bad_geo_id_value(df, "name", "state") + + assert len(validator.raised_errors) == 0 + assert len(validator.raised_warnings) == 1 + assert "check_geo_id_lowercase" in validator.raised_warnings[0].check_data_id + assert "AK" in validator.raised_warnings[0].expression + + def test_invalid_geo_id_national(self): + validator = Validator(self.params) + df = pd.DataFrame(["us", "zz"], columns=["geo_id"]) + validator.check_bad_geo_id_value(df, "name", "national") + + assert len(validator.raised_errors) == 1 + assert "check_bad_geo_id_value" in validator.raised_errors[0].check_data_id + assert len(validator.raised_errors[0].expression) == 1 + assert "us" not in validator.raised_errors[0].expression + assert "zz" in validator.raised_errors[0].expression + + +class TestCheckBadVal: + params = {"data_source": "", "span_length": 1, + "end_date": "2020-09-02", "expected_lag": {}} + + def test_empty_df(self): + validator = Validator(self.params) + empty_df = pd.DataFrame(columns=["val"]) + validator.check_bad_val(empty_df, "", "") + validator.check_bad_val(empty_df, "", "prop") + validator.check_bad_val(empty_df, "", "pct") + + assert len(validator.raised_errors) == 0 + + def test_missing(self): + validator = Validator(self.params) + df = pd.DataFrame([np.nan], columns=["val"]) + validator.check_bad_val(df, "name", "signal") + + assert len(validator.raised_errors) == 1 + assert "check_val_missing" in validator.raised_errors[0].check_data_id + + def test_lt_0(self): + validator = Validator(self.params) + df = pd.DataFrame([-5], columns=["val"]) + validator.check_bad_val(df, "name", "signal") + + assert len(validator.raised_errors) == 1 + assert "check_val_lt_0" in validator.raised_errors[0].check_data_id + + def test_gt_max_pct(self): + validator = Validator(self.params) + df = pd.DataFrame([1e7], columns=["val"]) + validator.check_bad_val(df, "name", "pct") + + assert len(validator.raised_errors) == 1 + assert "check_val_pct_gt_100" in validator.raised_errors[0].check_data_id + + def 
test_gt_max_prop(self): + validator = Validator(self.params) + df = pd.DataFrame([1e7], columns=["val"]) + validator.check_bad_val(df, "name", "prop") + + assert len(validator.raised_errors) == 1 + assert "check_val_prop_gt_100k" in validator.raised_errors[0].check_data_id + + +class TestCheckBadSe: + params = {"data_source": "", "span_length": 1, + "end_date": "2020-09-02", "expected_lag": {}} + + def test_empty_df(self): + validator = Validator(self.params) + empty_df = pd.DataFrame( + columns=["val", "se", "sample_size"], dtype=float) + validator.check_bad_se(empty_df, "") + + assert len(validator.raised_errors) == 0 + + validator.missing_se_allowed = True + validator.check_bad_se(empty_df, "") + + assert len(validator.raised_errors) == 0 + + def test_missing(self): + validator = Validator(self.params) + validator.missing_se_allowed = True + df = pd.DataFrame([[np.nan, np.nan, np.nan]], columns=[ + "val", "se", "sample_size"]) + validator.check_bad_se(df, "name") + + assert len(validator.raised_errors) == 0 + + validator.missing_se_allowed = False + validator.check_bad_se(df, "name") + + assert len(validator.raised_errors) == 2 + assert "check_se_not_missing_and_in_range" in [ + err.check_data_id[0] for err in validator.raised_errors] + assert "check_se_many_missing" in [ + err.check_data_id[0] for err in validator.raised_errors] + + def test_e_0_missing_allowed(self): + validator = Validator(self.params) + validator.missing_se_allowed = True + df = pd.DataFrame([[1, 0, 200], [1, np.nan, np.nan], [ + 1, np.nan, np.nan]], columns=["val", "se", "sample_size"]) + validator.check_bad_se(df, "name") + + assert len(validator.raised_errors) == 2 + assert "check_se_missing_or_in_range" in [ + err.check_data_id[0] for err in validator.raised_errors] + assert "check_se_0" in [ + err.check_data_id[0] for err in validator.raised_errors] + + def test_e_0_missing_not_allowed(self): + validator = Validator(self.params) + validator.missing_se_allowed = False + df = pd.DataFrame([[1, 0, 200], [1, 0, np.nan], [ + 1, np.nan, np.nan]], columns=["val", "se", "sample_size"]) + validator.check_bad_se(df, "name") + + assert len(validator.raised_errors) == 2 + assert "check_se_not_missing_and_in_range" in [ + err.check_data_id[0] for err in validator.raised_errors] + assert "check_se_0" in [ + err.check_data_id[0] for err in validator.raised_errors] + + def test_jeffreys(self): + validator = Validator(self.params) + validator.missing_se_allowed = False + df = pd.DataFrame([[0, 0, 200], [1, 0, np.nan], [ + 1, np.nan, np.nan]], columns=["val", "se", "sample_size"]) + validator.check_bad_se(df, "name") + + assert len(validator.raised_errors) == 2 + assert "check_se_not_missing_and_in_range" in [ + err.check_data_id[0] for err in validator.raised_errors] + assert "check_se_0_when_val_0" in [ + err.check_data_id[0] for err in validator.raised_errors] + + +class TestCheckBadN: + params = {"data_source": "", "span_length": 1, + "end_date": "2020-09-02", "expected_lag": {}} + + def test_empty_df(self): + validator = Validator(self.params) + empty_df = pd.DataFrame( + columns=["val", "se", "sample_size"], dtype=float) + validator.check_bad_sample_size(empty_df, "") + + assert len(validator.raised_errors) == 0 + + validator.missing_sample_size_allowed = True + validator.check_bad_sample_size(empty_df, "") + + assert len(validator.raised_errors) == 0 + + def test_missing(self): + validator = Validator(self.params) + validator.missing_sample_size_allowed = True + df = pd.DataFrame([[np.nan, np.nan, np.nan]], columns=[ + 
"val", "se", "sample_size"]) + validator.check_bad_sample_size(df, "name") + + assert len(validator.raised_errors) == 0 + + validator.missing_sample_size_allowed = False + validator.check_bad_sample_size(df, "name") + + assert len(validator.raised_errors) == 1 + assert "check_n_missing" in [ + err.check_data_id[0] for err in validator.raised_errors] + + def test_lt_min_missing_allowed(self): + validator = Validator(self.params) + validator.missing_sample_size_allowed = True + df = pd.DataFrame([[1, 0, 10], [1, np.nan, np.nan], [ + 1, np.nan, np.nan]], columns=["val", "se", "sample_size"]) + validator.check_bad_sample_size(df, "name") + + assert len(validator.raised_errors) == 1 + assert "check_n_missing_or_gt_min" in [ + err.check_data_id[0] for err in validator.raised_errors] + + def test_lt_min_missing_not_allowed(self): + validator = Validator(self.params) + validator.missing_sample_size_allowed = False + df = pd.DataFrame([[1, 0, 10], [1, np.nan, 240], [ + 1, np.nan, 245]], columns=["val", "se", "sample_size"]) + validator.check_bad_sample_size(df, "name") + + assert len(validator.raised_errors) == 1 + assert "check_n_gt_min" in [ + err.check_data_id[0] for err in validator.raised_errors] + + +class TestCheckRapidChange: + params = {"data_source": "", "span_length": 1, + "end_date": "2020-09-02", "expected_lag": {}} + + def test_same_df(self): + validator = Validator(self.params) + test_df = pd.DataFrame([date.today()] * 5, columns=["time_value"]) + ref_df = pd.DataFrame([date.today()] * 5, columns=["time_value"]) + validator.check_rapid_change_num_rows( + test_df, ref_df, date.today(), "geo", "signal") + + assert len(validator.raised_errors) == 0 + + def test_0_vs_many(self): + validator = Validator(self.params) + + time_value = datetime.combine(date.today(), datetime.min.time()) + + test_df = pd.DataFrame([time_value] * 5, columns=["time_value"]) + ref_df = pd.DataFrame([time_value] * 1, columns=["time_value"]) + validator.check_rapid_change_num_rows( + test_df, ref_df, time_value, "geo", "signal") + + assert len(validator.raised_errors) == 1 + assert "check_rapid_change_num_rows" in [ + err.check_data_id[0] for err in validator.raised_errors] + + +class TestCheckAvgValDiffs: + params = {"data_source": "", "span_length": 1, + "end_date": "2020-09-02", "expected_lag": {}} + + def test_same_val(self): + validator = Validator(self.params) + + data = {"val": [1, 1, 1, 2, 0, 1], "se": [np.nan] * 6, + "sample_size": [np.nan] * 6, "geo_id": ["1"] * 6} + + test_df = pd.DataFrame(data) + ref_df = pd.DataFrame(data) + + validator.check_avg_val_vs_reference( + test_df, ref_df, date.today(), "geo", "signal") + + assert len(validator.raised_errors) == 0 + + def test_same_se(self): + validator = Validator(self.params) + + data = {"val": [np.nan] * 6, "se": [1, 1, 1, 2, 0, 1], + "sample_size": [np.nan] * 6, "geo_id": ["1"] * 6} + + test_df = pd.DataFrame(data) + ref_df = pd.DataFrame(data) + + validator.check_avg_val_vs_reference( + test_df, ref_df, date.today(), "geo", "signal") + + assert len(validator.raised_errors) == 0 + + def test_same_n(self): + validator = Validator(self.params) + + data = {"val": [np.nan] * 6, "se": [np.nan] * 6, + "sample_size": [1, 1, 1, 2, 0, 1], "geo_id": ["1"] * 6} + + test_df = pd.DataFrame(data) + ref_df = pd.DataFrame(data) + + validator.check_avg_val_vs_reference( + test_df, ref_df, date.today(), "geo", "signal") + + assert len(validator.raised_errors) == 0 + + def test_same_val_se_n(self): + validator = Validator(self.params) + + data = {"val": [1, 1, 1, 2, 0, 
1], "se": [1, 1, 1, 2, 0, 1], + "sample_size": [1, 1, 1, 2, 0, 1], "geo_id": ["1"] * 6} + + test_df = pd.DataFrame(data) + ref_df = pd.DataFrame(data) + + validator.check_avg_val_vs_reference( + test_df, ref_df, date.today(), "geo", "signal") + + assert len(validator.raised_errors) == 0 + + def test_10x_val(self): + validator = Validator(self.params) + test_data = {"val": [1, 1, 1, 20, 0, 1], "se": [np.nan] * 6, + "sample_size": [np.nan] * 6, "geo_id": ["1"] * 6} + ref_data = {"val": [1, 1, 1, 2, 0, 1], "se": [np.nan] * 6, + "sample_size": [np.nan] * 6, "geo_id": ["1"] * 6} + + test_df = pd.DataFrame(test_data) + ref_df = pd.DataFrame(ref_data) + validator.check_avg_val_vs_reference( + test_df, ref_df, + datetime.combine(date.today(), datetime.min.time()), "geo", "signal") + + assert len(validator.raised_errors) == 0 + + def test_100x_val(self): + validator = Validator(self.params) + test_data = {"val": [1, 1, 1, 200, 0, 1], "se": [np.nan] * 6, + "sample_size": [np.nan] * 6, "geo_id": ["1"] * 6} + ref_data = {"val": [1, 1, 1, 2, 0, 1], "se": [np.nan] * 6, + "sample_size": [np.nan] * 6, "geo_id": ["1"] * 6} + + test_df = pd.DataFrame(test_data) + ref_df = pd.DataFrame(ref_data) + validator.check_avg_val_vs_reference( + test_df, ref_df, + datetime.combine(date.today(), datetime.min.time()), "geo", "signal") + + assert len(validator.raised_errors) == 1 + assert "check_test_vs_reference_avg_changed" in [ + err.check_data_id[0] for err in validator.raised_errors] + + def test_1000x_val(self): + validator = Validator(self.params) + test_data = {"val": [1, 1, 1, 2000, 0, 1], "se": [np.nan] * 6, + "sample_size": [np.nan] * 6, "geo_id": ["1"] * 6} + ref_data = {"val": [1, 1, 1, 2, 0, 1], "se": [np.nan] * 6, + "sample_size": [np.nan] * 6, "geo_id": ["1"] * 6} + + test_df = pd.DataFrame(test_data) + ref_df = pd.DataFrame(ref_data) + validator.check_avg_val_vs_reference( + test_df, ref_df, + datetime.combine(date.today(), datetime.min.time()), "geo", "signal") + + assert len(validator.raised_errors) == 1 + assert "check_test_vs_reference_avg_changed" in [ + err.check_data_id[0] for err in validator.raised_errors]