Description
Describe the bug
Quite often, we have a user report an issue to us where one of the following tools is unresponsive due to the Specify 7 Worker process being offline:
- WorkBench
- Record Merging
- Batch Edit
This causes internal issues as well. @lexiclevenger has asked me to restart several instances (Donana, UC Davis, etc.) for conversion work. The worker often becomes unresponsive or fails unexpectedly. The logs for the worker in most, if not all, of these cases provide little information about the failure, and they seem accessible. The only way to reestablish the connection and allow the validation, uploading, and merging processes to resume is to completely restart the worker.
This is not a new issue, and tickets related to this problem date back to 2021 (the introduction of this component). This has been encountered on all versions from v7.6.0 to v7.10.2.2.
In the WorkBench, users are told they should wait for another data set to be finished uploading:
specify7/specifyweb/frontend/js_src/lib/localization/workbench.ts
Lines 1453 to 1458 in fdc2301
    wbStatusPendingSecondDescription: {
      'en-us': `
        If this message persists for longer than 30 seconds, the
        {operationName:string} process is busy with another Data Set. Please try
        again later.
      `,
Of course, if the worker is unresponsive, this message never disappears, and users can be left waiting for hours or days until they contact our team.
This issue is meant to capture such incidents and to call for a proper resolution to prevent these issues going forward.
As Max said in June 2021:
As to the "Beginning the Validating..." message, currently, it can mean at least 3 things and there is no way for the front end to differentiate between them:
- The upload worker is preparing to validate the DS and validation should begin soon
- The upload worker is currently busy with another DS and validation would not begin until that DS is validated
- The upload worker is down or not responding. The validation would not begin and the "Beginning the Validating" would stay up forever
All of this also applies to Upload and Rollback.
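The three cases Max describes could be told apart if the back end reported two extra facts alongside the pending status: when the worker was last seen alive, and how many tasks are queued ahead of the current Data Set. The sketch below is only an illustration under that assumption. The `WorkerStatus` shape, the `classifyPendingState` name, and the 30-second threshold are all hypothetical and are not part of the current Specify 7 API.

```typescript
// Hypothetical status payload. Specify 7 does not currently expose these fields.
interface WorkerStatus {
  // Epoch milliseconds of the worker's last heartbeat, or null if none was ever seen.
  lastHeartbeatMs: number | null;
  // Number of tasks queued ahead of this Data Set.
  tasksAhead: number;
}

type PendingState = 'starting' | 'busy' | 'down';

// Assumed threshold: a worker silent for longer than this is treated as down.
const HEARTBEAT_TIMEOUT_MS = 30000;

function classifyPendingState(
  status: WorkerStatus,
  nowMs: number
): PendingState {
  const { lastHeartbeatMs, tasksAhead } = status;
  // No heartbeat at all, or a stale one: the worker is down or not responding.
  if (
    lastHeartbeatMs === null ||
    nowMs - lastHeartbeatMs > HEARTBEAT_TIMEOUT_MS
  ) {
    return 'down';
  }
  // The worker is alive: either it is busy with other work or about to start.
  return tasksAhead > 0 ? 'busy' : 'starting';
}
```

With a result like this, the front end could keep the existing "busy with another Data Set" text for `'busy'` and show an actionable error for `'down'` instead of asking the user to wait.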
Reported By Institutions
Each bullet point has the subject line of the email/post where an issue was reported due to the Specify Worker being offline.
This is not a complete list, as the problem was often discussed in meetings or described in terms that made it difficult to find in the support tickets. I only went back to late 2023.
This is only what was directly reported to us, which further reduces the sample size.
Natural History Museums of Denmark
- Web server timeouts related to WB validation (2024-06-24)
University of Michigan
We are currently experiencing an issue with multiple Specify 7 databases where Workbenches fail to complete validation. After selecting Validate, the process pauses at the Data Set Validation Status message for several minutes before eventually failing with a 'Failed aborting validation' message.
- Specify 7 Workbench failures at 'Data Set Validation Status' message (2024-03-05)
University of Kansas
From @acbentley (2024-02-21 on Slack):
KU Ornithology just came to indicating that agent merging was not working for them. They showed me an example of two agents with the last name Barbour that they wanted to merge. When starting the merge process the merge dialog appears but then just sits there and doesn't complete as in the screenshot. No error message is ever thrown. I tried the same process with my admin account and got the same result. I even tried a different agent duplicate (name A. A. Alcorn) and got the same result. As I am not getting any error message I am unclear as to how to diagnose the issue. Could someone take a look and see what is going on please?
From @acwhite211 (2024-02-21 on Slack):
it was something to do with the connection between the redis and worker containers. I restarted redis, and then restarted all the worker containers. It's working now.
The error on the worker side was:
    [2024-02-20 22:54:24,034: ERROR/MainProcess] consumer: Cannot connect to redis://redis:6379/0: Error -2 connecting to redis:6379. Name or service not known.. Trying again in 32.00 seconds... (16/100)
But there were no errors on the redis side
College of Idaho
- Validate troubles in Batch Edit (2025-05-09)
- Data Set Validation Woes (2024-10-22)
- More workbench issues (2024-09-15)
From Theresa (2024-10-24):
Grant is out of the office currently and he definitely has a bit more information on what was going on. We did discuss a bit about options to prevent it from happening in the future. While we are attempting to come up with a resolution, please let us know if it happens again.
Donana
- I'm trying to validate this data set, but it keeps getting stuck on the 'data set validation status' dialog (2025-05-07)
University of British Columbia
- Specify worker appears to be stuck (2025-02-12)
Royal Botanic Gardens Edinburgh
- Unable to validate workbench import (2025-02-12)
Museum für Naturkunde, Berlin
- Specify 7 problems + Workshop (2025-04-03)
The Hebrew University of Jerusalem
- Error during validation process (2023-12-07)
Florida Fish and Wildlife
- Saving bug (?) in workbench (2024-07-23)
Canadian Forestry Service
- Stuck on workbench validation (2024-12-04)
The Ohio State University Mollusks
- CO: GUID field bugging out, and 1 other error (2024-09-10)
University of Guam
- Attachment Upload Tool Issue (2025-01-23)
- issue opening jrxml files in Jaspersoft Studio (2024-10-02)
theNAT (San Diego Natural History Museum)
- Worker was not running [Speciforum] (2024-01-29)
University of Minnesota Entomology
- Validation Not Working (2024-09-10)
Montana State University
- Validation not working (2024-06-07)
- Workbench data set validation status (2024-01-12)
Earlham College
- Merge Records Problem (2024-03-11)
To Reproduce
The biggest issue is that we are not sure exactly what causes the worker to become unresponsive.
Steps to reproduce the behavior:
- Kill the specify7-worker container
- Try to validate a data set, perform a merge, or upload data
- See that the user is given no error message and is simply asked to wait, indefinitely
Expected behavior
We need to make the worker processes more resilient to address unexpected and unresolvable outages.
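One possible direction, offered only as a sketch: a small watchdog that runs a periodic health check against the worker and requests a restart after several consecutive failures, so that recovery does not depend on a user reporting the problem. The `WorkerWatchdog` class and its default threshold of 3 are hypothetical. How the health check itself is performed and how the restart is carried out are left to the caller.

```typescript
// Hypothetical watchdog: counts consecutive failed health checks.
class WorkerWatchdog {
  private consecutiveFailures = 0;
  private readonly maxFailures: number;

  // maxFailures is an assumed default; tune it to the health-check interval.
  constructor(maxFailures: number = 3) {
    this.maxFailures = maxFailures;
  }

  // Record one health-check result.
  // Returns true when the caller should restart the worker.
  record(healthy: boolean): boolean {
    if (healthy) {
      // Any successful check clears the failure streak.
      this.consecutiveFailures = 0;
      return false;
    }
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.maxFailures) {
      // Start counting afresh after requesting a restart.
      this.consecutiveFailures = 0;
      return true;
    }
    return false;
  }
}
```

Requiring several consecutive failures avoids restarting the worker on a single slow or dropped check, at the cost of a slightly longer outage before recovery begins.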