Integration Test Case for is_latest_issue using DB #925
base: dev
Changes from 13 commits
.gitignore:

```diff
@@ -11,3 +11,4 @@ __pycache__/
 /node_modules
 .mypy_cache
 /missing_db_signals.csv
+/dev/local/output.txt
```
New integration test file:

```diff
@@ -0,0 +1,262 @@
+"""Integration tests for covidcast's is_latest_issue boolean."""
+# standard library
+import unittest
+import time
+import threading
+from aiohttp.client_exceptions import ClientResponseError
+import pytest
+from delphi.epidata.acquisition.covidcast.logger import get_structured_logger
+from delphi.epidata.acquisition.covidcast.covidcast_meta_cache_updater import main as update_covidcast_meta_cache
```
For easier maintainability, use e.g. `Database.history_table` instead of `"signal_history"` in these.
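For example, a minimal sketch of that suggestion (it assumes `Database` exposes the table name as a class attribute named `history_table`; the attribute name and value are taken from the comment, not verified against the codebase):

```python
class Database:                      # stand-in for the real Database class
  history_table = 'signal_history'   # assumed class attribute per the comment

# queries reference the attribute, so a future table rename only touches Database
sql = f'SELECT COUNT(1) FROM `{Database.history_table}`'
```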
use `self.viewSignalHistory`?
would it make more sense to check `len(record3)` here?
```diff
-CovidcastRow('src', 'sig', 'day', 'state', 20200414, 'pa', #
+CovidcastRow('src', 'sig', 'day', 'state', 20200414, 'pa',
```
that's not an update!
A pattern like this will be easier to maintain -- we'll be able to add new tests without having to update all subsequent expressions.
```diff
-self.assertEqual(len(list(record)), sigLatestRows + 2) #2 original, 2 added
+record_length = len(list(record))
+self.assertEqual(record_length, sigLatestRows + 2) #2 original, 2 added
+sigLatestRows = record_length
```
Update `sigDimRows` here
Update `sigLatestRows` here
Update `geoDimRows` here
Update `sigLatestRows` here
good catch!
Can one of you clarify for me? I thought the whole point of batch issue uploads was that they included multiple issues.
Oh! I was writing tests with multiple issues but it was not performing as intended so I sought George out for some help. Would it be possible to find a time to sit down together to go through this quickly?
I found it -- `insert_or_update_bulk` is called from within a loop, such that it is called separately for each CSV file. Each CSV contains only one time_value and only one issue.

delphi-epidata/src/acquisition/covidcast/csv_to_database.py, lines 91 to 109 in 61a3d8b:

```python
for path, details in path_details:
  logger.info(event='handling', dest=path)
  path_src, filename = os.path.split(path)
  if not details:
    # file path or name was invalid, source is unknown
    archive_as_failed(path_src, filename, 'unknown', logger)
    continue
  (source, signal, time_type, geo_type, time_value, issue, lag) = details
  csv_rows = csv_importer_impl.load_csv(path, geo_type)
  cc_rows = CovidcastRow.fromCsvRows(csv_rows, source, signal, time_type, geo_type, time_value, issue, lag)
  rows_list = list(cc_rows)
  all_rows_valid = rows_list and all(r is not None for r in rows_list)
  if all_rows_valid:
    try:
      modified_row_count = database.insert_or_update_bulk(rows_list)
```
However, this test also includes `run_dbjobs`, which is not called as part of the csv_to_database script or as part of `insert_or_update_bulk`. We cannot therefore expect that `run_dbjobs` will only deal with one time_value and issue at a time. So it depends on what this test is trying to verify:

- correct behavior of `insert_or_update_bulk` alone? --> then it should not call `run_dbjobs`
- correct behavior of `insert_or_update_bulk` and `run_dbjobs` when run separately with each CSV file? --> then this test doesn't tell us whether `run_dbjobs` will run correctly as part of the intended load pipeline
- correct behavior of `insert_or_update_bulk` when run separately with each CSV file, and of `run_dbjobs` when run after all available CSVs have been loaded into `signal_load`? --> then this test should run `insert_or_update_bulk` once for each time_value x issue configuration, and then `run_dbjobs` once at the end (a sketch of this flow follows the list).
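If the intent is the third option, a rough sketch of that flow (the fixture names and the `self._db` handle are hypothetical, following the pattern described above):

```python
def test_is_latest_issue_batch(self):
  # option 3: one insert_or_update_bulk call per (time_value, issue) "CSV",
  # then a single run_dbjobs once everything is in signal_load
  for issue_rows in (rows_for_issue_20200416, rows_for_issue_20200417):  # hypothetical fixtures
    self._db.insert_or_update_bulk(issue_rows)
  self._db.run_dbjobs()
  # only now assert on is_latest_issue in the latest table
```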
It is a flaw (for some definition of 'flaw') that we cannot properly process multiple issues per datapoint in a single run of `insert_or_update`[+dbjobs] if the issues are not provided in increasing order (or really, if the 'latest' issue is not last in the series). In practice, `insert_or_update_bulk()` is only called by `csv_to_database:upload_archive()` and each call only specifies a single issue (in fact, the only part of the composite key that differs is the geo_value). This is not going to cause us problems for the foreseeable future, but it is still worth discussing to determine if/how we want to account for possibilities later (adding/changing logic in `insert_or_update` (in the SQL or in the python), carrying some of those key-field arguments further down the call stack to prevent unintended consequences, refactoring the larger ingestion pipeline, or otherwise).
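A toy illustration of the ordering hazard, independent of the real SQL: if "latest" is effectively "last loaded" rather than `max(issue)`, an out-of-order load leaves a stale issue marked as latest.

```python
# toy model of the hazard, not the real dbjobs logic
loads = [('pa', 20200417), ('pa', 20200416)]  # the latest issue is NOT last

last_loaded = {}
for geo, issue in loads:
  last_loaded[geo] = issue  # last-write-wins: ends at 20200416 (stale)

true_latest = {}
for geo, issue in loads:
  true_latest[geo] = max(true_latest.get(geo, issue), issue)  # ends at 20200417

assert last_loaded['pa'] == 20200416 and true_latest['pa'] == 20200417
```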
Results of zoom discussion:

Options:
- one CSV per insert, many CSVs per dbjobs
- Decision: one CSV per insert+dbjobs. Have insert call dbjobs itself so csv_to_database doesn't have to bother about it.

Consequences of the decision:
- If someone chooses to run dbjobs on an untrusted `signal_load`, it's their responsibility to ensure that issues are ordered correctly.
- We could add a guard to dbjobs to check whether `signal_load` is sorted correctly, but that check would take a long time, so let's not. Comment it obsessively at the head of dbjobs, though.

Options for fixing at dbjobs time (none of them are good, so let's not):
- get the max issue before loading into latest
- sort while loading into latest

Batch issue uploads do traverse issues in the correct order, but since we're having insert call dbjobs we don't need to care.
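A minimal sketch of the agreed shape (the helper name and method bodies below are stand-ins, not the actual implementation):

```python
class Database:
  """Hypothetical shape of the zoom decision, not the real class."""

  def _load_into_signal_load(self, cc_rows):  # stand-in for the real loader
    return len(cc_rows)

  def run_dbjobs(self):  # stand-in for the real dbjobs step
    pass

  def insert_or_update_bulk(self, cc_rows):
    # one CSV per insert+dbjobs: dbjobs runs inside the insert call, so each
    # run only ever sees a single issue and csv_to_database needn't sequence it
    count = self._load_into_signal_load(cc_rows)
    self.run_dbjobs()
    return count
```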
```diff
-self.assertEqual(record[0][0], 20200417) #20200416 != 20200417
+# Make sure the 4/17 issue is listed even though 4/16 was imported after it
+self.assertEqual(record[0][0], 20200417)
```
requirements.txt:

```diff
@@ -11,3 +11,4 @@ scipy==1.6.2
 tenacity==7.0.0
 newrelic
 epiweeks==2.1.2
+prettytable
```
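`prettytable` is newly added here; presumably it supports readable debug output in the new test (an assumption; the diff context doesn't show the call site). A minimal usage sketch:

```python
from prettytable import PrettyTable

# print a few latest-table rows in a readable grid while debugging
table = PrettyTable(['geo_value', 'issue', 'is_latest_issue'])
table.add_row(['pa', 20200417, 1])
table.add_row(['pa', 20200416, 0])
print(table)
```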