
Conversation

Member
@evanpurkhiser evanpurkhiser commented May 7, 2025

Still a work in progress

@evanpurkhiser evanpurkhiser requested a review from a team as a code owner May 7, 2025 03:06
@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label May 7, 2025
@evanpurkhiser evanpurkhiser force-pushed the evanpurkhiser/feat-uptime-implement-detector-handler branch from 0b984dd to ff24352 Compare May 28, 2025 17:22

codecov bot commented May 28, 2025

Codecov Report

Attention: Patch coverage is 99.58848% with 1 line in your changes missing coverage. Please review.

✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/sentry/uptime/grouptype.py 99.01% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #91107      +/-   ##
==========================================
+ Coverage   86.69%   87.89%   +1.19%     
==========================================
  Files       10237    10239       +2     
  Lines      587151   587375     +224     
  Branches    22809    22809              
==========================================
+ Hits       509028   516261    +7233     
+ Misses      77693    70684    -7009     
  Partials      430      430              

@evanpurkhiser evanpurkhiser force-pushed the evanpurkhiser/feat-uptime-implement-detector-handler branch from ff24352 to c14d535 Compare May 28, 2025 18:01
@evanpurkhiser evanpurkhiser force-pushed the evanpurkhiser/feat-uptime-implement-detector-handler branch from c14d535 to 9042f7c Compare May 28, 2025 21:16
@evanpurkhiser evanpurkhiser force-pushed the evanpurkhiser/feat-uptime-implement-detector-handler branch from 9042f7c to 9c98bea Compare May 29, 2025 21:22
#
# Once we've determined that the detector handler is producing the same
# issues as the legacy issue creation path, we can remove this.
if features.has("organizations:uptime-detector-create-issues", organization):
Member Author
This bit is important as it makes sure we don't create multiple occurrences at the same time.
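A minimal, self-contained sketch of that mutual exclusion (the flag name is from the diff; the helper functions are hypothetical stand-ins for the real `features.has` checks):

```python
FLAG = "organizations:uptime-detector-create-issues"

def detector_creates_issues(enabled_flags: set[str]) -> bool:
    # Detector-based occurrence creation only runs behind the flag.
    return FLAG in enabled_flags

def legacy_creates_issues(enabled_flags: set[str]) -> bool:
    # The legacy consumer bails whenever the detector path owns creation,
    # so the two paths never both emit an occurrence for the same result.
    return not detector_creates_issues(enabled_flags)
```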

Member Author
For metrics we also have the handler inside the grouptype module. It's a bit difficult to keep these separated because there's an interdependency between the two.

return options.get("uptime.active-recovery-threshold")


def build_evidence_display(result: CheckResult) -> list[IssueEvidence]:
Member Author
Shared now with the issue_platform module (which will be removed later)

return evidence_display


def build_event_data(result: CheckResult, detector: Detector) -> EventData:
Member Author
Also shared now with the issue platform


@override
def build_issue_fingerprint(self, group_key: DetectorGroupKey = None) -> list[str]:
    return build_fingerprint(self.detector)
Member Author
TODO: This isn't the only fingerprint, actually; the stateful detector also adds its own "default" fingerprint, which ends up being just the detector ID. This means these issues could potentially collide with our previous issues, which used the project subscription ID as the fingerprint.
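A self-contained sketch of the collision concern (the `uptime-detector:` format is from the diff; the bare-ID default fingerprint is paraphrased from this comment, not taken from the actual stateful detector code):

```python
def detector_fingerprint(detector_id: int) -> str:
    # Explicit component added by this handler
    # (see build_detector_fingerprint_component in the diff).
    return f"uptime-detector:{detector_id}"

def stateful_default_fingerprint(detector_id: int) -> str:
    # Per the comment above, the stateful detector also adds a bare
    # detector-ID fingerprint of its own.
    return str(detector_id)

def legacy_fingerprint(subscription_id: int) -> str:
    # Legacy issues used the project subscription ID, so a bare numeric
    # fingerprint can collide when the two ID spaces overlap.
    return str(subscription_id)
```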

Comment on lines +170 to +193
uptime_subscription = data_packet.packet.subscription
metric_tags = data_packet.packet.metric_tags

detector_issue_creation_enabled = features.has(
    "organizations:uptime-detector-create-issues",
    self.detector.project.organization,
)
issue_creation_flag_enabled = features.has(
    "organizations:uptime-create-issues",
    self.detector.project.organization,
)
restricted_host_provider_ids = options.get(
    "uptime.restrict-issue-creation-by-hosting-provider-id"
)
host_provider_id = uptime_subscription.host_provider_id
host_provider_enabled = host_provider_id not in restricted_host_provider_ids

issue_creation_allowed = (
    detector_issue_creation_enabled
    and issue_creation_flag_enabled
    and host_provider_enabled
)
Member Author
This is where we check a few different feature flags and options to bail out of issue creation, the same way we do in the consumer.
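The same gating, restated as a pure function with plain parameters standing in for the feature-flag and option lookups (a hypothetical helper, not from the diff):

```python
def issue_creation_allowed(
    detector_flag: bool,
    create_flag: bool,
    host_provider_id: str,
    restricted_provider_ids: set[str],
) -> bool:
    # All gates must pass: both org feature flags, and the host provider
    # must not be on the restricted-providers option list.
    return (
        detector_flag
        and create_flag
        and host_provider_id not in restricted_provider_ids
    )
```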

@evanpurkhiser evanpurkhiser force-pushed the evanpurkhiser/feat-uptime-implement-detector-handler branch from 9c98bea to a01173f Compare May 30, 2025 17:05
Member

@wedamija left a comment
This all makes sense to me so far: we're handling the switch with the feature flag so that we only write one occurrence. lgtm


# Bail if we're doing issue creation via detectors, we don't want to
# create issues using the legacy system in this case. If this flag is
# not enabkled the detector will still run, but will not produce an
Member
Nit: Enabled

Member Author
Will get this in a follow up

Comment on lines +51 to +52
def build_detector_fingerprint_component(detector: Detector) -> str:
    return f"uptime-detector:{detector.id}"
Member
I think we need to start sending this new fingerprint type along with issue occurrences now, so then we can backfill

Member Author
We have been, actually; we use the same fingerprint here:

fingerprint=build_fingerprint(detector),

@evanpurkhiser evanpurkhiser force-pushed the evanpurkhiser/feat-uptime-implement-detector-handler branch from a01173f to 002e130 Compare May 30, 2025 21:04
@evanpurkhiser evanpurkhiser merged commit ca00876 into master May 30, 2025
60 checks passed
@evanpurkhiser evanpurkhiser deleted the evanpurkhiser/feat-uptime-implement-detector-handler branch May 30, 2025 21:46

def build_event_data(result: CheckResult, detector: Detector) -> EventData:
    # Default environment when it hasn't been configured
    env = detector.config.get("environment", "prod")
Contributor
is this normal behavior? Does every detector need this? Should we at least make the default a constant?

Member Author
This is specific to the uptime detector. Unfortunately we hardcode default environment names in a number of places in the sentry backend, so this is par for the course right now.
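The reviewer's suggestion might look like this (the constant name is hypothetical; "prod" is the hardcoded default from the diff):

```python
# Hypothetical constant, per the review suggestion; "prod" is the
# default hardcoded in build_event_data.
DEFAULT_ENVIRONMENT = "prod"

def get_detector_environment(config: dict) -> str:
    # Fall back to the default when the detector has no environment configured.
    return config.get("environment", DEFAULT_ENVIRONMENT)
```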

result_creates_issue = isinstance(evaluation.result, IssueOccurrence)
result_resolves_issue = isinstance(evaluation.result, StatusChangeMessage)

if result_creates_issue:
Contributor
All these metrics seem like they should be provided by the platform

Member Author
Yeah I agree. But I don’t want to lose the metrics we have right now

Contributor

@saponifi3d commented May 30, 2025
For the most part these metrics are provided by the platform: datadog, code

we will make a metric to denote that we're processing a detector, and if the detector has a state change. do we need to split out the difference between creating an issue and resolving an issue? the platform was just merging them into "a threshold was breached", but if we want to have separate tracking for those cases it should be fairly trivial to add.

Member Author
I can clean this all up later I just wanted to make sure we were not losing any metrics as we cut over from our legacy issue creation system

Member Author
It’s also valuable for us to have various metric tags. Probably something worth talking about for the handler APIs

Comment on lines +172 to +262
uptime_subscription = data_packet.packet.subscription
metric_tags = data_packet.packet.metric_tags

detector_issue_creation_enabled = features.has(
    "organizations:uptime-detector-create-issues",
    self.detector.project.organization,
)
issue_creation_flag_enabled = features.has(
    "organizations:uptime-create-issues",
    self.detector.project.organization,
)
restricted_host_provider_ids = options.get(
    "uptime.restrict-issue-creation-by-hosting-provider-id"
)
host_provider_id = uptime_subscription.host_provider_id
host_provider_enabled = host_provider_id not in restricted_host_provider_ids

issue_creation_allowed = (
    detector_issue_creation_enabled
    and issue_creation_flag_enabled
    and host_provider_enabled
)

# XXX(epurkhiser): We currently are duplicating the detector state onto
# the uptime_subscription when the detector changes state. Once we stop
# using this field we can drop this update logic.
#
# We ONLY do this when detector issue creation is enabled, otherwise we
# let the legacy uptime consumer handle this.
if detector_issue_creation_enabled:
    if evaluation.priority == DetectorPriorityLevel.OK:
        uptime_status = UptimeStatus.OK
    else:
        uptime_status = UptimeStatus.FAILED

    uptime_subscription.update(
        uptime_status=uptime_status,
        uptime_status_update_date=django_timezone.now(),
    )

if not host_provider_enabled:
    metrics.incr(
        "uptime.result_processor.restricted_by_provider",
        sample_rate=1.0,
        tags={
            "host_provider_id": host_provider_id,
            **metric_tags,
        },
    )

result_creates_issue = isinstance(evaluation.result, IssueOccurrence)
result_resolves_issue = isinstance(evaluation.result, StatusChangeMessage)

if result_creates_issue:
    metrics.incr(
        "uptime.detector.will_create_issue",
        tags=metric_tags,
        sample_rate=1.0,
    )
    # XXX(epurkhiser): This logging includes the same extra arguments
    # as the `uptime_active_sent_occurrence` log in the consumer for
    # legacy creation
    logger.info(
        "uptime.detector.will_create_issue",
        extra={
            "project_id": self.detector.project_id,
            "url": uptime_subscription.url,
            **data_packet.packet.check_result,
        },
    )
if result_resolves_issue:
    metrics.incr(
        "uptime.detector.will_resolve_issue",
        sample_rate=1.0,
        tags=metric_tags,
    )
    logger.info(
        "uptime.detector.will_resolve_issue",
        extra={
            "project_id": self.detector.project_id,
            "url": uptime_subscription.url,
            **data_packet.packet.check_result,
        },
    )

# Returning an empty dict effectively causes the detector processor to
# bail and not produce an issue occurrence.
if result_creates_issue and not issue_creation_allowed:
    return {}

return result
Contributor
curious if y'all have any thoughts about this evaluation bit -- is there anything for you that would make this easier to implement? think it's okay to just invoke the super and handle it like this?

Member Author
For now I think this is fine. Once we remove a lot of the legacy issue creation code for uptime some of this is going to go away and it will be more clear what can be factored and abstracted.

return int(data_packet.packet.check_result["scheduled_check_time_ms"])

@override
def evaluate(
Contributor
High level, I feel we should aim for this evaluate function to be pure. All the side effects and extra conditionals in here are a design smell. I'd encourage us to generalize these:

  • metrics should be emitted by framework layer
  • feature flags to test things and turn on/off issue creation seem like they're something we'll want for most detectors and the framework should provide this too.
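A sketch of that suggestion, with hypothetical names: `evaluate` stays pure while a framework-level wrapper emits metrics and applies issue-creation gating.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Evaluation:
    creates_issue: bool

def framework_process(
    evaluate: Callable[[dict], Evaluation],
    packet: dict,
    issue_creation_enabled: bool,
    metrics_sink: list[str],
) -> Optional[Evaluation]:
    # The detector's evaluate() is pure: no metrics, no flag checks inside.
    evaluation = evaluate(packet)
    if evaluation.creates_issue:
        # Framework-level instrumentation and flag gating, outside evaluate().
        metrics_sink.append("detector.will_create_issue")
        if not issue_creation_enabled:
            return None  # suppressed by the framework, not by the detector
    return evaluation
```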

Member Author
Yes I agree. Parts of this have to do with the fact that we’re migrating away from our existing issue creation system for uptime. Once we’ve cleaned up more of that this should simplify and we can clean things up more.

jan-auer added a commit that referenced this pull request Jun 2, 2025
* master: (249 commits)
  feat(source-maps): Do not show pagination together with empty state (#92287)
  ref(project-creation): Introduce useCreateProjectRules hook (#92186)
  feat(agent-insights): Handle new keys (#92613)
  feat(source-maps): Introduce new empty state copies and react-native callout (#92286)
  ref(issues): Remove project from group activity type (#92600)
  feat(ourlogs): Use /trace-logs endpoint (#92577)
  feat(issues): Only update group hasSeen when user is member (#92597)
  fix(workflow_engine): Graceful Data Condition Eval Handling (#92591)
  feat(uptime): Implement detector handler (#91107)
  chore(autofix): Remove logs from response payload (#92589)
  fix(search): Fix issue with tags name 'constructor' (#92586)
  fix(autofix): Fix condition for onboarding check (#92584)
  fix(ourlogs): Return the same format as /events & limit 1000 for trace-logs (#92580)
  fix(autofix): Fix automation onboarding condition (#92579)
  feat(explore): Remove group by timestamp from explore (#92546)
  feat(trace-items): Autocomplete for semver attributes (#92515)
  feat(processing) Define EventProcessingStore.__all__ (#92555)
  feat(autofix): Better errored state (#92571)
  chore(autofix): Seer beta banner copy changes (#92576)
  feat(crons): Add endpoint to return counts by status (#92574)
  ...
andrewshie-sentry pushed a commit that referenced this pull request Jun 2, 2025
@github-actions github-actions bot locked and limited conversation to collaborators Jun 15, 2025

Labels

Scope: Backend Automatically applied to PRs that change backend components WIP

5 participants