My Prometheus scrape jobs stopped getting any data from rack2 after it was updated to omicron commit ae3ca81. Most of the scrape jobs query for data points collected in the most recent 1 or 2 minutes. They returned data consistently until this rack update. I was able to get some data when I used "@now() - 45m" as the interval, but I've also seen no data for a whole hour, e.g.
oxide experimental system timeseries query --query 'get hardware_component:amd_cpu_tctl | filter timestamp > @now() - 1h | last 1'
{
  "tables": [
    {
      "name": "hardware_component:amd_cpu_tctl",
      "timeseries": {}
    }
  ]
}
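To pin down how stale the data is, the same query could be repeated with progressively wider lookback windows. The following is only a rough sketch, not something I have run against the rack: it assumes the `oxide` CLI is on PATH and already authenticated, that the response always has the shape shown above, and that the window values other than 45m and 1h are accepted. The helper names are made up for illustration.

```python
import json
import subprocess

METRIC = "hardware_component:amd_cpu_tctl"


def build_query(metric: str, window: str) -> str:
    """Build the query from this report with the lookback window swapped in."""
    return f"get {metric} | filter timestamp > @now() - {window} | last 1"


def has_data(result: dict) -> bool:
    """Return True if any table in the response has a non-empty "timeseries"."""
    return any(table.get("timeseries") for table in result.get("tables", []))


def run_query(query: str) -> dict:
    """Run the query through the oxide CLI (assumed installed and logged in)."""
    completed = subprocess.run(
        ["oxide", "experimental", "system", "timeseries", "query",
         "--query", query],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)


def smallest_window_with_data(metric: str,
                              windows=("2m", "15m", "45m", "1h")):
    """Try each window in order; return the first that yields data, else None."""
    for window in windows:
        if has_data(run_query(build_query(metric, window))):
            return window
    return None
```

Calling `smallest_window_with_data(METRIC)` would return the narrowest window from the list that still has data, or `None` if even the widest one comes back empty, as in the output above.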
In the oximeter logs, I see errors like this one, which did not appear in earlier logs:
20:29:51.751Z WARN oximeter (oximeter-agent): failed to insert some results into metric DB
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    error = Telemetry database unavailable: SQL query timed out after 30.000955486s
    file = oximeter/collector/src/results_sink.rs:92
There are also frequent errors like the one below, but they were already present before the recent software update:
22:28:10.427Z ERRO oximeter (oximeter-agent): timer-based collection request queue is full! This may indicate that the producer has a sampling interval that is too fast for the amount of data it generates
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/collection_task.rs:845
    interval = 1s
    producer_id = c334fc56-155a-4d7f-a2c9-e104f73603a2
The ClickHouse databases were up and running when I logged into them. I'll check whether their log files show anything useful from the periods when the database was reported unavailable.