feat(uptime): Implement detector handler #91107
Conversation
Force-pushed from 0b984dd to ff24352
Codecov Report
Attention: Patch coverage is ✅ All tests successful. No failed tests found.

Additional details and impacted files:

@@            Coverage Diff             @@
##           master   #91107      +/-   ##
==========================================
+ Coverage   86.69%   87.89%    +1.19%
==========================================
  Files       10237    10239        +2
  Lines      587151   587375      +224
  Branches    22809    22809
==========================================
+ Hits       509028   516261     +7233
+ Misses      77693    70684     -7009
  Partials      430      430
Force-pushed from ff24352 to c14d535
Force-pushed from c14d535 to 9042f7c
Force-pushed from 9042f7c to 9c98bea
#
# Once we've determined that the detector handler is producing issues
# the same as the legacy issue creation, we can remove this.
if features.has("organizations:uptime-detector-create-issues", organization):
This bit is important as it makes sure we don't create multiple occurrences at the same time.
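A minimal sketch of the mutual exclusion being described, assuming the legacy consumer and the detector handler both key off the same flag; the flag name comes from the diff above, while `has_feature` stands in for sentry's `features.has` and both helper names are hypothetical.

def legacy_create_occurrence(has_feature, organization, occurrence):
    # Legacy path: bail when the detector handler owns issue creation.
    if has_feature("organizations:uptime-detector-create-issues", organization):
        return None
    return occurrence


def detector_create_occurrence(has_feature, organization, occurrence):
    # Detector path: only produce an occurrence when the flag is on, so the
    # two paths never both fire for the same check result.
    if not has_feature("organizations:uptime-detector-create-issues", organization):
        return None
    return occurrence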
For metrics we also have the handler inside the grouptype module. It's a bit difficult to keep these separated because there's an interdependency between the two.
return options.get("uptime.active-recovery-threshold")


def build_evidence_display(result: CheckResult) -> list[IssueEvidence]:
Shared now with the issue_platform module (which will be removed later)
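For reference, a hedged sketch of what a shared evidence-building helper could look like. The IssueEvidence stand-in and the specific fields pulled from the check result are assumptions for illustration, not the PR's actual implementation.

from dataclasses import dataclass


@dataclass
class IssueEvidence:
    # Stand-in for sentry's IssueEvidence type, kept minimal so the
    # sketch is self-contained.
    name: str
    value: str
    important: bool


def build_evidence_display(result: dict) -> list[IssueEvidence]:
    # Hypothetical fields; the real CheckResult schema and the evidence
    # chosen by the PR may differ.
    return [
        IssueEvidence(name="Status", value=str(result.get("status")), important=True),
        IssueEvidence(name="Failure reason", value=str(result.get("status_reason")), important=True),
        IssueEvidence(name="Duration", value=f"{result.get('duration_ms')} ms", important=False),
    ]

The next excerpt shows the other shared helper under discussion.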
return evidence_display


def build_event_data(result: CheckResult, detector: Detector) -> EventData:
Also shared now with the issue platform
@override
def build_issue_fingerprint(self, group_key: DetectorGroupKey = None) -> list[str]:
    return build_fingerprint(self.detector)
TODO: This isn't actually the only fingerprint; the stateful detector also adds its own "default" fingerprint, which ends up being just the detector ID. This means these issues could potentially collide with our previous issues, which used the project subscription ID as the fingerprint.
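To make the collision concern concrete, a small sketch of the three fingerprint shapes in play. The behavior of the stateful detector's default fingerprint is an assumption taken from the comment above, not quoted code.

def stateful_detector_default_fingerprint(detector_id: int) -> list[str]:
    # Assumed behavior of the stateful detector's built-in "default"
    # fingerprint: just the detector ID.
    return [str(detector_id)]


def uptime_detector_fingerprint(detector_id: int) -> list[str]:
    # The component added by this PR (see build_detector_fingerprint_component).
    return [f"uptime-detector:{detector_id}"]


def legacy_uptime_fingerprint(project_subscription_id: int) -> list[str]:
    # Legacy issue creation fingerprinted on the project subscription ID, so
    # a detector whose ID happens to equal an old subscription ID could group
    # new occurrences into an unrelated legacy issue.
    return [str(project_subscription_id)]

The flag and option checks in the handler itself follow.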
uptime_subscription = data_packet.packet.subscription
metric_tags = data_packet.packet.metric_tags

detector_issue_creation_enabled = features.has(
    "organizations:uptime-detector-create-issues",
    self.detector.project.organization,
)
issue_creation_flag_enabled = features.has(
    "organizations:uptime-create-issues",
    self.detector.project.organization,
)
restricted_host_provider_ids = options.get(
    "uptime.restrict-issue-creation-by-hosting-provider-id"
)
host_provider_id = uptime_subscription.host_provider_id
host_provider_enabled = host_provider_id not in restricted_host_provider_ids

issue_creation_allowed = (
    detector_issue_creation_enabled
    and issue_creation_flag_enabled
    and host_provider_enabled
)
This is where we check a few different feature flags and options to bail out of issue creation, the same way we do in the consumer.
Force-pushed from 9c98bea to a01173f
This all makes sense to me so far; we're handling the switch with the feature flag so that we only write one occurrence. lgtm
# Bail if we're doing issue creation via detectors, we don't want to
# create issues using the legacy system in this case. If this flag is
# not enabkled the detector will still run, but will not produce an
Nit: Enabled
Will get this in a follow up
def build_detector_fingerprint_component(detector: Detector) -> str:
    return f"uptime-detector:{detector.id}"
I think we need to start sending this new fingerprint type along with issue occurrences now, so that we can backfill later.
We have been, actually; we use the same fingerprint here:
fingerprint=build_fingerprint(detector),
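build_fingerprint itself isn't quoted in this excerpt; a plausible sketch, assuming it simply wraps the detector fingerprint component shown earlier:

def build_detector_fingerprint_component(detector) -> str:
    return f"uptime-detector:{detector.id}"


def build_fingerprint(detector) -> list[str]:
    # Assumed composition: the occurrence fingerprint wraps the detector
    # component in a list; the real helper may include additional parts.
    return [build_detector_fingerprint_component(detector)]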
Force-pushed from a01173f to 002e130
def build_event_data(result: CheckResult, detector: Detector) -> EventData:
    # Default environment when it hasn't been configured
    env = detector.config.get("environment", "prod")
is this normal behavior? Does every detector need this? Should we at least make the default a constant?
This is specific to the uptime detector. Unfortunately we hardcode default environment names in a number of places in the sentry backend, so this is par for the course right now.
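For illustration, a sketch of hoisting the hardcoded default into a module-level constant, as the review suggests; the constant and function names are hypothetical, and the PR currently inlines the "prod" literal.

# Hypothetical constant; not part of the PR as written.
DEFAULT_ENVIRONMENT = "prod"


def get_detector_environment(detector_config: dict) -> str:
    # Default environment when it hasn't been configured on the detector.
    return detector_config.get("environment", DEFAULT_ENVIRONMENT)

The metric emission under discussion is shown in the next excerpt.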
result_creates_issue = isinstance(evaluation.result, IssueOccurrence)
result_resolves_issue = isinstance(evaluation.result, StatusChangeMessage)

if result_creates_issue:
All these metrics seem like they should be provided by the platform
Yeah I agree. But I don’t want to lose the metrics we have right now
For the most part these metrics are provided by the platform: datadog, code.
We will make a metric to denote that we're processing a detector, and whether the detector has a state change. Do we need to split out the difference between creating an issue and resolving an issue? The platform was just merging them into "a threshold was breached", but if we want separate tracking for those cases it should be fairly trivial to add.
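A rough sketch of what that platform-level split could look like; the metric names and the metrics.incr-style interface are assumptions, not existing workflow_engine metrics.

def record_detector_evaluation(metrics, detector_type: str, state_changed: bool, resolved: bool) -> None:
    # One counter per processed detector evaluation.
    metrics.incr("workflow_engine.detector.evaluated", tags={"detector_type": detector_type})
    if state_changed:
        # Split the state change by direction so uptime keeps its existing
        # create/resolve distinction without a separate handler-level metric.
        direction = "resolve" if resolved else "create"
        metrics.incr(
            "workflow_engine.detector.state_change",
            tags={"detector_type": detector_type, "direction": direction},
        )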
I can clean this all up later; I just wanted to make sure we were not losing any metrics as we cut over from our legacy issue creation system.
It’s also valuable for us to have various metric tags. Probably something worth talking about for the handler APIs
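One shape such a handler API could take, sketched with a hypothetical hook; nothing like this exists in the PR or in workflow_engine today.

class UptimeDetectorHandler:
    # Hypothetical hook: a handler-provided set of extra tags the framework
    # could attach to its own metrics. Purely an illustration of the idea.
    def extra_metric_tags(self, data_packet) -> dict[str, str]:
        return dict(data_packet.packet.metric_tags)

The handler's current evaluate logic, where these tags originate, is quoted below.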
uptime_subscription = data_packet.packet.subscription
metric_tags = data_packet.packet.metric_tags

detector_issue_creation_enabled = features.has(
    "organizations:uptime-detector-create-issues",
    self.detector.project.organization,
)
issue_creation_flag_enabled = features.has(
    "organizations:uptime-create-issues",
    self.detector.project.organization,
)
restricted_host_provider_ids = options.get(
    "uptime.restrict-issue-creation-by-hosting-provider-id"
)
host_provider_id = uptime_subscription.host_provider_id
host_provider_enabled = host_provider_id not in restricted_host_provider_ids

issue_creation_allowed = (
    detector_issue_creation_enabled
    and issue_creation_flag_enabled
    and host_provider_enabled
)

# XXX(epurkhiser): We currently are duplicating the detector state onto
# the uptime_subscription when the detector changes state. Once we stop
# using this field we can drop this update logic.
#
# We ONLY do this when detector issue creation is enabled, otherwise we
# let the legacy uptime consumer handle this.
if detector_issue_creation_enabled:
    if evaluation.priority == DetectorPriorityLevel.OK:
        uptime_status = UptimeStatus.OK
    elif evaluation.priority != DetectorPriorityLevel.OK:
        uptime_status = UptimeStatus.FAILED

    uptime_subscription.update(
        uptime_status=uptime_status,
        uptime_status_update_date=django_timezone.now(),
    )

if not host_provider_enabled:
    metrics.incr(
        "uptime.result_processor.restricted_by_provider",
        sample_rate=1.0,
        tags={
            "host_provider_id": host_provider_id,
            **metric_tags,
        },
    )

result_creates_issue = isinstance(evaluation.result, IssueOccurrence)
result_resolves_issue = isinstance(evaluation.result, StatusChangeMessage)

if result_creates_issue:
    metrics.incr(
        "uptime.detector.will_create_issue",
        tags=metric_tags,
        sample_rate=1.0,
    )
    # XXX(epurkhiser): This logging includes the same extra arguments
    # as the `uptime_active_sent_occurrence` log in the consumer for
    # legacy creation
    logger.info(
        "uptime.detector.will_create_issue",
        extra={
            "project_id": self.detector.project_id,
            "url": uptime_subscription.url,
            **data_packet.packet.check_result,
        },
    )
if result_resolves_issue:
    metrics.incr(
        "uptime.detector.will_resolve_issue",
        sample_rate=1.0,
        tags=metric_tags,
    )
    logger.info(
        "uptime.detector.will_resolve_issue",
        extra={
            "project_id": self.detector.project_id,
            "url": uptime_subscription.url,
            **data_packet.packet.check_result,
        },
    )

# Returning an empty dict effectively causes the detector processor to
# bail and not produce an issue occurrence.
if result_creates_issue and not issue_creation_allowed:
    return {}

return result
curious if y'all have any thoughts about this evaluation bit -- is there anything for you that would make this easier to implement? think it's okay to just invoke the super and handle it like this?
For now I think this is fine. Once we remove a lot of the legacy issue creation code for uptime some of this is going to go away and it will be more clear what can be factored and abstracted.
return int(data_packet.packet.check_result["scheduled_check_time_ms"])


@override
def evaluate(
High level, I feel we should aim for this evaluate function to be pure. All the side effects and extra conditionals in here are a design smell. I'd encourage us to generalize these (a rough sketch of the split follows this list):
- metrics should be emitted by the framework layer
- feature flags to test things and turn issue creation on/off seem like something we'll want for most detectors, and the framework should provide this too
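A rough sketch of the suggested split, with illustrative names rather than the actual workflow_engine API: the handler's evaluate stays a pure function of the data packet, while the framework wraps it with metrics and flag gating.

from typing import Callable, Optional


def framework_evaluate(
    pure_evaluate: Callable[[dict], Optional[dict]],
    data_packet: dict,
    issue_creation_enabled: bool,
    incr_metric: Callable[[str], None],
) -> Optional[dict]:
    # Framework-owned wrapper: metrics and gating live out here so the
    # handler's evaluate has no side effects.
    incr_metric("detector.evaluated")
    result = pure_evaluate(data_packet)
    if result is None:
        return None
    incr_metric("detector.state_change")
    if not issue_creation_enabled:
        # Flag gating handled by the framework, not the handler.
        return None
    return result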
Yes I agree. Parts of this have to do with the fact that we’re migrating away from our existing issue creation system for uptime. Once we’ve cleaned up more of that this should simplify and we can clean things up more.
* master: (249 commits)
  feat(source-maps): Do not show pagination together with empty state (#92287)
  ref(project-creation): Introduce useCreateProjectRules hook (#92186)
  feat(agent-insights): Handle new keys (#92613)
  feat(source-maps): Introduce new empty state copies and react-native callout (#92286)
  ref(issues): Remove project from group activity type (#92600)
  feat(ourlogs): Use /trace-logs endpoint (#92577)
  feat(issues): Only update group hasSeen when user is member (#92597)
  fix(workflow_engine): Graceful Data Condition Eval Handling (#92591)
  feat(uptime): Implement detector handler (#91107)
  chore(autofix): Remove logs from response payload (#92589)
  fix(search): Fix issue with tags name 'constructor' (#92586)
  fix(autofix): Fix condition for onboarding check (#92584)
  fix(ourlogs): Return the same format as /events & limit 1000 for trace-logs (#92580)
  fix(autofix): Fix automation onboarding condition (#92579)
  feat(explore): Remove group by timestamp from explore (#92546)
  feat(trace-items): Autocomplete for semver attributes (#92515)
  feat(processing) Define EventProcessingStore.__all__ (#92555)
  feat(autofix): Better errored state (#92571)
  chore(autofix): Seer beta banner copy changes (#92576)
  feat(crons): Add endpoint to return counts by status (#92574)
  ...
Still a work in progress