feat(uptime): Implement detector handler #91107
Conversation
Force-pushed from 0b984dd to ff24352
Codecov Report
Attention: Patch coverage is ✅ All tests successful. No failed tests found.

Additional details and impacted files:

@@            Coverage Diff             @@
##           master   #91107      +/-   ##
==========================================
+ Coverage   86.69%   87.89%    +1.19%
==========================================
  Files       10237    10239        +2
  Lines      587151   587375      +224
  Branches    22809    22809
==========================================
+ Hits       509028   516261     +7233
+ Misses      77693    70684     -7009
  Partials      430      430
Force-pushed from ff24352 to c14d535
Force-pushed from c14d535 to 9042f7c
Force-pushed from 9042f7c to 9c98bea
#
# Once we've determined that the detector handler is producing issues
# the same as the legacy issue creation, we can remove this.
if features.has("organizations:uptime-detector-create-issues", organization):
This bit is important as it makes sure we don't create multiple occurrences at the same time.
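A minimal sketch of the mutual exclusion being described, assuming the legacy consumer and the detector handler both key off the same flag; the flag name comes from the diff above, while `has_feature` stands in for sentry's `features.has` and both helper names are hypothetical.

def legacy_create_occurrence(has_feature, organization, occurrence):
    # Legacy path: bail when the detector handler owns issue creation.
    if has_feature("organizations:uptime-detector-create-issues", organization):
        return None
    return occurrence


def detector_create_occurrence(has_feature, organization, occurrence):
    # Detector path: only produce an occurrence when the flag is on, so the
    # two paths never both fire for the same check result.
    if not has_feature("organizations:uptime-detector-create-issues", organization):
        return None
    return occurrence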
For metrics we also have the handler inside the grouptype module. It's a bit difficult to keep these separated because there's an interdependency between the two.
return options.get("uptime.active-recovery-threshold")


def build_evidence_display(result: CheckResult) -> list[IssueEvidence]:
Shared now with the issue_platform module (which will be removed later)
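For reference, a hedged sketch of what a shared evidence-building helper could look like. The IssueEvidence stand-in and the specific fields pulled from the check result are assumptions for illustration, not the PR's actual implementation.

from dataclasses import dataclass


@dataclass
class IssueEvidence:
    # Stand-in for sentry's IssueEvidence type, kept minimal so the
    # sketch is self-contained.
    name: str
    value: str
    important: bool


def build_evidence_display(result: dict) -> list[IssueEvidence]:
    # Hypothetical fields; the real CheckResult schema and the evidence
    # chosen by the PR may differ.
    return [
        IssueEvidence(name="Status", value=str(result.get("status")), important=True),
        IssueEvidence(name="Failure reason", value=str(result.get("status_reason")), important=True),
        IssueEvidence(name="Duration", value=f"{result.get('duration_ms')} ms", important=False),
    ]

The next excerpt shows the other shared helper under discussion.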
return evidence_display


def build_event_data(result: CheckResult, detector: Detector) -> EventData:
Also shared now with the issue platform
@override
def build_issue_fingerprint(self, group_key: DetectorGroupKey = None) -> list[str]:
    return build_fingerprint(self.detector)
TODO: This isn't actually the only fingerprint; the stateful detector also adds its own "default" fingerprint, which ends up being just the detector ID. This means these issues could potentially collide with our previous issues, which used the project subscription ID as the fingerprint.
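To make the collision concern concrete, a small sketch of the three fingerprint shapes in play. The behavior of the stateful detector's default fingerprint is an assumption taken from the comment above, not quoted code.

def stateful_detector_default_fingerprint(detector_id: int) -> list[str]:
    # Assumed behavior of the stateful detector's built-in "default"
    # fingerprint: just the detector ID.
    return [str(detector_id)]


def uptime_detector_fingerprint(detector_id: int) -> list[str]:
    # The component added by this PR (see build_detector_fingerprint_component).
    return [f"uptime-detector:{detector_id}"]


def legacy_uptime_fingerprint(project_subscription_id: int) -> list[str]:
    # Legacy issue creation fingerprinted on the project subscription ID, so
    # a detector whose ID happens to equal an old subscription ID could group
    # new occurrences into an unrelated legacy issue.
    return [str(project_subscription_id)]

The flag and option checks in the handler itself follow.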
uptime_subscription = data_packet.packet.subscription
metric_tags = data_packet.packet.metric_tags

detector_issue_creation_enabled = features.has(
    "organizations:uptime-detector-create-issues",
    self.detector.project.organization,
)
issue_creation_flag_enabled = features.has(
    "organizations:uptime-create-issues",
    self.detector.project.organization,
)
restricted_host_provider_ids = options.get(
    "uptime.restrict-issue-creation-by-hosting-provider-id"
)
host_provider_id = uptime_subscription.host_provider_id
host_provider_enabled = host_provider_id not in restricted_host_provider_ids

issue_creation_allowed = (
    detector_issue_creation_enabled
    and issue_creation_flag_enabled
    and host_provider_enabled
)
This is where we check a few different feature flags and options to bail out of issue creation, the same way we do in the consumer.
Force-pushed from 9c98bea to a01173f
This all makes sense to me so far; we're handling the switch with the feature flag so that we only write one occurrence. lgtm
# Bail if we're doing issue creation via detectors, we don't want to
# create issues using the legacy system in this case. If this flag is
# not enabkled the detector will still run, but will not produce an
Nit: Enabled
Will get this in a follow up
def build_detector_fingerprint_component(detector: Detector) -> str:
    return f"uptime-detector:{detector.id}"
I think we need to start sending this new fingerprint type along with issue occurrences now, so that we can backfill later.
We have been, actually; we use the same fingerprint here:
fingerprint=build_fingerprint(detector),
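build_fingerprint itself isn't quoted in this excerpt; a plausible sketch, assuming it simply wraps the detector fingerprint component shown earlier:

def build_detector_fingerprint_component(detector) -> str:
    return f"uptime-detector:{detector.id}"


def build_fingerprint(detector) -> list[str]:
    # Assumed composition: the occurrence fingerprint wraps the detector
    # component in a list; the real helper may include additional parts.
    return [build_detector_fingerprint_component(detector)]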
Force-pushed from a01173f to 002e130
def build_event_data(result: CheckResult, detector: Detector) -> EventData:
    # Default environment when it hasn't been configured
    env = detector.config.get("environment", "prod")
is this normal behavior? Does every detector need this? Should we at least make the default a constant?
This is specific to the uptime detector. Unfortunately we hardcode default environment names in a number of places in the sentry backend, so this is par for the course right now.
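For illustration, a sketch of hoisting the hardcoded default into a module-level constant, as the review suggests; the constant and function names are hypothetical, and the PR currently inlines the "prod" literal.

# Hypothetical constant; not part of the PR as written.
DEFAULT_ENVIRONMENT = "prod"


def get_detector_environment(detector_config: dict) -> str:
    # Default environment when it hasn't been configured on the detector.
    return detector_config.get("environment", DEFAULT_ENVIRONMENT)

The metric emission under discussion is shown in the next excerpt.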
result_creates_issue = isinstance(evaluation.result, IssueOccurrence)
result_resolves_issue = isinstance(evaluation.result, StatusChangeMessage)

if result_creates_issue:
All these metrics seem like they should be provided by the platform
Yeah I agree. But I don’t want to lose the metrics we have right now
For the most part these metrics are provided by the platform: datadog, code.
We will make a metric to denote that we're processing a detector, and whether the detector has a state change. Do we need to split out the difference between creating an issue and resolving an issue? The platform was just merging them into "a threshold was breached", but if we want separate tracking for those cases it should be fairly trivial to add.
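A rough sketch of what that platform-level split could look like; the metric names and the metrics.incr-style interface are assumptions, not existing workflow_engine metrics.

def record_detector_evaluation(metrics, detector_type: str, state_changed: bool, resolved: bool) -> None:
    # One counter per processed detector evaluation.
    metrics.incr("workflow_engine.detector.evaluated", tags={"detector_type": detector_type})
    if state_changed:
        # Split the state change by direction so uptime keeps its existing
        # create/resolve distinction without a separate handler-level metric.
        direction = "resolve" if resolved else "create"
        metrics.incr(
            "workflow_engine.detector.state_change",
            tags={"detector_type": detector_type, "direction": direction},
        )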
I can clean this all up later; I just wanted to make sure we were not losing any metrics as we cut over from our legacy issue creation system.
It’s also valuable for us to have various metric tags. Probably something worth talking about for the handler APIs
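One shape such a handler API could take, sketched with a hypothetical hook; nothing like this exists in the PR or in workflow_engine today.

class UptimeDetectorHandler:
    # Hypothetical hook: a handler-provided set of extra tags the framework
    # could attach to its own metrics. Purely an illustration of the idea.
    def extra_metric_tags(self, data_packet) -> dict[str, str]:
        return dict(data_packet.packet.metric_tags)

The handler's current evaluate logic, where these tags originate, is quoted below.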
uptime_subscription = data_packet.packet.subscription
metric_tags = data_packet.packet.metric_tags

detector_issue_creation_enabled = features.has(
    "organizations:uptime-detector-create-issues",
    self.detector.project.organization,
)
issue_creation_flag_enabled = features.has(
    "organizations:uptime-create-issues",
    self.detector.project.organization,
)
restricted_host_provider_ids = options.get(
    "uptime.restrict-issue-creation-by-hosting-provider-id"
)
host_provider_id = uptime_subscription.host_provider_id
host_provider_enabled = host_provider_id not in restricted_host_provider_ids

issue_creation_allowed = (
    detector_issue_creation_enabled
    and issue_creation_flag_enabled
    and host_provider_enabled
)

# XXX(epurkhiser): We currently are duplicating the detector state onto
# the uptime_subscription when the detector changes state. Once we stop
# using this field we can drop this update logic.
#
# We ONLY do this when detector issue creation is enabled, otherwise we
# let the legacy uptime consumer handle this.
if detector_issue_creation_enabled:
    if evaluation.priority == DetectorPriorityLevel.OK:
        uptime_status = UptimeStatus.OK
    elif evaluation.priority != DetectorPriorityLevel.OK:
        uptime_status = UptimeStatus.FAILED

    uptime_subscription.update(
        uptime_status=uptime_status,
        uptime_status_update_date=django_timezone.now(),
    )

if not host_provider_enabled:
    metrics.incr(
        "uptime.result_processor.restricted_by_provider",
        sample_rate=1.0,
        tags={
            "host_provider_id": host_provider_id,
            **metric_tags,
        },
    )

result_creates_issue = isinstance(evaluation.result, IssueOccurrence)
result_resolves_issue = isinstance(evaluation.result, StatusChangeMessage)

if result_creates_issue:
    metrics.incr(
        "uptime.detector.will_create_issue",
        tags=metric_tags,
        sample_rate=1.0,
    )
    # XXX(epurkhiser): This logging includes the same extra arguments
    # as the `uptime_active_sent_occurrence` log in the consumer for
    # legacy creation
    logger.info(
        "uptime.detector.will_create_issue",
        extra={
            "project_id": self.detector.project_id,
            "url": uptime_subscription.url,
            **data_packet.packet.check_result,
        },
    )
if result_resolves_issue:
    metrics.incr(
        "uptime.detector.will_resolve_issue",
        sample_rate=1.0,
        tags=metric_tags,
    )
    logger.info(
        "uptime.detector.will_resolve_issue",
        extra={
            "project_id": self.detector.project_id,
            "url": uptime_subscription.url,
            **data_packet.packet.check_result,
        },
    )

# Returning an empty dict effectively causes the detector processor to
# bail and not produce an issue occurrence.
if result_creates_issue and not issue_creation_allowed:
    return {}

return result
curious if y'all have any thoughts about this evaluation bit -- is there anything for you that would make this easier to implement? think it's okay to just invoke the super and handle it like this?
For now I think this is fine. Once we remove a lot of the legacy issue creation code for uptime some of this is going to go away and it will be more clear what can be factored and abstracted.
return int(data_packet.packet.check_result["scheduled_check_time_ms"])


@override
def evaluate(
High level, I feel we should aim for this evaluate function to be pure. All the side effects and extra conditionals in here are a design smell. I'd encourage us to generalize these (a rough sketch of the split follows this list):
- metrics should be emitted by the framework layer
- feature flags to test things and turn issue creation on/off seem like something we'll want for most detectors, and the framework should provide this too
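A rough sketch of the suggested split, with illustrative names rather than the actual workflow_engine API: the handler's evaluate stays a pure function of the data packet, while the framework wraps it with metrics and flag gating.

from typing import Callable, Optional


def framework_evaluate(
    pure_evaluate: Callable[[dict], Optional[dict]],
    data_packet: dict,
    issue_creation_enabled: bool,
    incr_metric: Callable[[str], None],
) -> Optional[dict]:
    # Framework-owned wrapper: metrics and gating live out here so the
    # handler's evaluate has no side effects.
    incr_metric("detector.evaluated")
    result = pure_evaluate(data_packet)
    if result is None:
        return None
    incr_metric("detector.state_change")
    if not issue_creation_enabled:
        # Flag gating handled by the framework, not the handler.
        return None
    return result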
Yes I agree. Parts of this have to do with the fact that we’re migrating away from our existing issue creation system for uptime. Once we’ve cleaned up more of that this should simplify and we can clean things up more.
* master: (249 commits)
  feat(source-maps): Do not show pagination together with empty state (#92287)
  ref(project-creation): Introduce useCreateProjectRules hook (#92186)
  feat(agent-insights): Handle new keys (#92613)
  feat(source-maps): Introduce new empty state copies and react-native callout (#92286)
  ref(issues): Remove project from group activity type (#92600)
  feat(ourlogs): Use /trace-logs endpoint (#92577)
  feat(issues): Only update group hasSeen when user is member (#92597)
  fix(workflow_engine): Graceful Data Condition Eval Handling (#92591)
  feat(uptime): Implement detector handler (#91107)
  chore(autofix): Remove logs from response payload (#92589)
  fix(search): Fix issue with tags name 'constructor' (#92586)
  fix(autofix): Fix condition for onboarding check (#92584)
  fix(ourlogs): Return the same format as /events & limit 1000 for trace-logs (#92580)
  fix(autofix): Fix automation onboarding condition (#92579)
  feat(explore): Remove group by timestamp from explore (#92546)
  feat(trace-items): Autocomplete for semver attributes (#92515)
  feat(processing) Define EventProcessingStore.__all__ (#92555)
  feat(autofix): Better errored state (#92571)
  chore(autofix): Seer beta banner copy changes (#92576)
  feat(crons): Add endpoint to return counts by status (#92574)
  ...
Still a work in progress