fix(deletions) Remove events when groups/projects are deleted #41937
Conversation
Eagerly purge events when projects are removed. I've also modified the project delete task so that it uses the existing group deletion task instead of doing half the work, as the project deletion task was missing *many* relations of Group.
    eventstream_state = eventstream.start_delete_groups(self.project_id, [self.group_id])
    eventstream.end_delete_groups(eventstream_state)
I'm not sure if there's a problem with issuing a lot of separate replacements like this, instead of batching them
I'm a bit concerned about the volume of replacements this could generate as well. Projects could have thousands of issues.
Only send a replacement to eventstream after all the node data has been removed. Removing the event data early results in reports and events being left behind.
This has a big risk of plugging up the replacement queue when a project is deleted if we send them one by one. How hard is it to batch these things? We should not merge this as-is.
It isn't simple: we would have to replace the group deletion task with a more specialized implementation that quacks like a bulk deletion but uses a mix of instance and bulk deletion for its relations.
The top projects have 1mm+ issues. Is removing that many issues from snuba via replacements even feasible for projects that large?
That could be a reason it has not historically been part of project deletion.
Yeah... if we batched them in groups of 1k, then we'd still issue 1k replacements. @volokluev I don't really know the full impact of replacements. How expensive is a replacement with 1 group vs a replacement with, say, 1k-10k groups?
Yes, removing that many issues is feasible via replacements. In terms of the impact of batching vs. not batching replacements, it's huge.

To get the rows to delete, a SQL query runs: https://github.com/getsentry/snuba/blob/master/snuba/datasets/errors_replacer.py#L810-L815. That query will be batched (thus reducing query load). Then, to write the "tombstone rows" that mark them for deletion, an insert call is made: https://github.com/getsentry/snuba/blob/master/snuba/datasets/errors_replacer.py#L637-L642. That insert call happens once per Kafka message on the replacements topic.

If you were to do these one by one, it would result in as many insert calls as there are groups. Every insert call is a call to ZooKeeper and a newly created file. We don't want that.
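To make the batching concern concrete, here's a minimal sketch of the difference between emitting one replacement per group and one per batch. All names are hypothetical stand-ins for illustration; the real eventstream API lives in the sentry codebase and is not reproduced here.

```python
# Sketch: one replacement message per *batch* of groups instead of one per
# group. FakeEventstream is a hypothetical stand-in that just counts how
# many Kafka-style replacement messages would be produced.

class FakeEventstream:
    def __init__(self):
        self.messages = []  # each entry models one replacement message

    def start_delete_groups(self, project_id, group_ids):
        # A single replacement message covers every group id in the batch.
        self.messages.append((project_id, list(group_ids)))
        return len(self.messages) - 1  # pretend state handle

    def end_delete_groups(self, state):
        pass  # the real API finalizes the replacement here


def delete_groups_batched(eventstream, project_id, group_ids, batch_size=100):
    """Emit one replacement per batch of up to `batch_size` groups."""
    for start in range(0, len(group_ids), batch_size):
        batch = group_ids[start:start + batch_size]
        state = eventstream.start_delete_groups(project_id, batch)
        # ... node data and database rows for `batch` would be removed here ...
        eventstream.end_delete_groups(state)


es = FakeEventstream()
delete_groups_batched(es, project_id=1, group_ids=list(range(250)), batch_size=100)
print(len(es.messages))  # 3 messages for 250 groups, not 250
```

With batching, the number of insert calls (and ZooKeeper round trips) scales with the number of batches rather than the number of groups.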
Is there a recommended max size for a replacement? I assume we can't shove 100k group ids into a single one.
I was thinking more about this last night, and I have an idea on how to batch these deletions for more than a single group at a time. I'll make the batch sizes for groups and events easily configurable so we can tune deletions if batches are too big or too small.
By bulk deleting relations with `in` operations we can delete from multiple groups in a blended batch.
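At the SQL level, the blended-batch idea boils down to a single `DELETE ... WHERE group_id IN (...)` per relation, covering many groups at once. A runnable sketch using sqlite3; the `groupmeta` table and column names are illustrative only, not Sentry's actual schema:

```python
import sqlite3

# Sketch of a blended-batch delete: one DELETE ... WHERE group_id IN (...)
# removes a relation's rows for many groups in a single statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE groupmeta (id INTEGER PRIMARY KEY, group_id INTEGER)")
conn.executemany(
    "INSERT INTO groupmeta (group_id) VALUES (?)",
    [(gid,) for gid in range(10) for _ in range(3)],  # 3 rows per group
)

batch = [0, 1, 2, 3, 4]  # delete this relation's rows for 5 groups at once
placeholders = ",".join("?" * len(batch))
conn.execute(f"DELETE FROM groupmeta WHERE group_id IN ({placeholders})", batch)

remaining = conn.execute("SELECT COUNT(*) FROM groupmeta").fetchone()[0]
print(remaining)  # 15 rows left (groups 5-9 untouched)
```

One statement per relation per batch replaces one statement per relation per group, which is where the savings come from.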
@wedamija I've revised group deletion so that it can operate in batches of 100 groups. Each batch of 100 groups will emit a single replacement message.
/gcbrun
I'm going to try to review this today/tomorrow, but just have a few things going at the moment. In the meantime, @volokluev, what's a reasonable max number of replacements to generate here? Here's a breakdown of the top projects by active group: https://redash.getsentry.net/queries/3373/source. The worst case is 5mm, but there's a long tail of 100k+ orgs. So if we batch in groups of 100, we might issue 50k replacements in the worst case. For other large orgs, we might issue 1k. Wondering if we can bump this up to 1k/10k groups batched just to avoid these edge cases?
wedamija left a comment
This looks good to me, but I would like to get clarification from SNS on what a reasonable chunk size is here, given the info in my previous comment. Just don't want to have an incident come up because some project with 1mm+ groups gets deleted.
    model_list = (models.GroupMeta, models.GroupResolution, models.GroupSnooze)
Why do we remove these? Are they just already handled in the direct group deletion?
Yes. Project now has Group as a child_relation, and Group has these models as child_relations.
This was likely a workaround to be able to bulk delete across multiple groups. That isn't necessary anymore, as I've changed bulk deletions to support `__in` lookups as well.
PR reverted: bff380d |
Revert "fix(deletions) Remove events when groups/projects are deleted (#41937)". This reverts commit 9618872. Co-authored-by: asottile-sentry <[email protected]>
Mulligan on #41937 that doesn't use `more_itertools`, as that is a dev-only dependency.
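The chunking that `more_itertools.chunked` provides can be reproduced with a few lines of the standard library, which is presumably how the re-land avoids the dev-only dependency. A stdlib-only sketch (the helper name mirrors the `more_itertools` one for clarity, but this is not Sentry's actual implementation):

```python
from itertools import islice

def chunked(iterable, n):
    """Yield successive lists of up to n items.

    A stdlib-only replacement for more_itertools.chunked, using
    itertools.islice to pull n items at a time from a single iterator.
    """
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk

print(list(chunked(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```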