[SPARK-31330] Automatically label PRs based on the paths they touch #28114
I took a stab at the rules, but I'm sure y'all will want to make a bunch of changes. |
|
retest this please |
|
WDYT @dongjoon-hyun ? |
|
Currently the labels are mapped from the ones set in JIRA. This PR's approach seems to be based on the directories the changes touch, which I think is fine for now. |
|
retest this please |
The automatic labelling based on PR changes looks nice to me. Btw, thanks for the great work, @nchammas ! |
@nchammas, I will skim the project and refine the list here, and directly add some commits in your PR by pushing them into your branch in your Spark fork if you don't mind.
I would be able to do this within this week, I believe.
No problem. Go ahead. |
|
Thanks for the proposal and the work you've done! While I think this is one huge step forward, there's still some difference compared to manual labelling, as manual labelling considers the module, like "Structured Streaming", which would simply be "SQL" under auto labelling and very hard to filter out from SQL PRs. IMHO, if it's not hard to have our own auto labeler, following the marker for module representation ( |
|
Just to be clear, can you share more specific examples of where you think manual labeling would work better than automatic labeling? What would be the modified paths for a PR that you think automatic labeling wouldn't handle well? Regarding a better commit history, if automatic labeling works well, we could update the merge script to include the labels in the commit message. Then it becomes one less thing that contributors and committers have to worry about. |
|
Please look at the "STRUCTURED STREAMING" labels on Spark PRs: https://github.com/apache/spark/pulls?q=is%3Aopen+is%3Apr+label%3A%22STRUCTURED+STREAMING%22 If we simply label that There are some other cases where the path doesn't work, e.g. #24990: while it proposes to "add" a batch data source, it's for reading the state in "structured streaming", so I think labelling it "structured streaming" is correct, whereas path-based labelling would label it "SQL". |
|
Labels are additive. That is, a single PR can be tagged with multiple labels. Here's an example from the Avro project, which is using the same bot we are proposing to add in this PR: apache/avro#847 And you can filter PRs by combinations of labels. For example:
Does that address part of your concern? With regards to #24990, is there a pattern we could potentially add to the rule for the |
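The additive behavior described in the comment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule table, the label names, and the prefix-only matching are hypothetical and are not the rules or the matching engine used by the bot in this PR.

```python
from typing import Dict, List, Set

# Hypothetical rule table (label -> path prefixes); NOT the rules from this PR.
RULES: Dict[str, List[str]] = {
    "SQL": ["sql/"],
    "STRUCTURED STREAMING": [
        "sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/",
    ],
    "PYTHON": ["python/"],
}


def labels_for(changed_paths: List[str]) -> Set[str]:
    """Return every label whose rule matches at least one changed path."""
    labels: Set[str] = set()
    for label, prefixes in RULES.items():
        # Labels are additive: a match adds a label and never removes another.
        if any(path.startswith(prefix) for path in changed_paths for prefix in prefixes):
            labels.add(label)
    return labels
```

With these assumed rules, a PR that only touches a file under `sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/` gets both `SQL` and `STRUCTURED STREAMING`, so it can still be filtered by the narrower label.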
|
Yeah, as I said, it's still one huge step forward and it's still great to get this in as it is. I'm just saying there are some cases that need classification by a human, and we have always been doing that classification per PR.
I guess not, as this is the counterexample where path-based auto labelling won't work. It still needs a "human being" to classify, and the auto labeler just needs to reflect that classification. |
|
I kind of agree that we'd better distinguish Is it possible to use a regex path, something like And as for #24990, I have no idea why we are using |
I'm sorry but I don't get it. Could you please elaborate what you meant? |
The problem that @HeartSaVioR is pointing out is that some PRs we would consider Structured Streaming PRs don't touch paths that are clearly identifiable as Structured Streaming and not just plain SQL. By the way @HeartSaVioR, some of the files in #24990, like |
I just think, without any background there, that if the datasource is for streaming, why don't we add But since it's a corner case, can't we just update |
The datasource will run in a "batch query", though the input data is from a "streaming query". I might have to reiterate; please don't get me wrong. I don't object to the feature; I said it's one huge step forward. I just wanted to point out that we already require a manual module label in the PR title, and it's fairly accurate (otherwise a committer would fix it), so it seems redundant to do the classification here unless we also automate the PR title. (If we are confident about the classification, then why not?) Yes, that might require another bot implementation, hence I don't feel strongly about it. |
|
@HeartSaVioR I see. And yeah, this reminds me of how I felt when I first saw the GitHub labels. They really do kind of duplicate the manual labels, and I'm already more used to the manual labels than to the GitHub labels. It's also a fact that our GitHub labels aren't very noticeable, given their low-key color. I don't know why we introduced GitHub labels in the first place; maybe for better history searching or something else? cc @dongjoon-hyun |
|
For me, github label is easier to use than searching with |
|
I will make some changes soon. We could discuss further after that change happens. Labeling with both SQL and Structured Streaming (or Streaming) should be fine. We can't label 100% correctly here. The current labeling isn't fully correct already, because it relies on the tags in JIRA, which can have mistakes. |
|
And I was wrong - anyone can see how many issues and PRs are open under such label, which is even better. |
|
@HeartSaVioR, @Ngone51 while you guys are here, can you double-check the list too, in particular the parts you are familiar with? |
  - "python/"
R:
  - "r/"
  - "R/"
According to https://github.com/kaelzhang/node-ignore#optionsignorecase-since-400, which the plugin uses, matching seems to be case-insensitive by default, but I kept both entries anyway.
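To make the point about case sensitivity concrete, here is a minimal Python sketch. It assumes simple prefix matching and a hypothetical `ignore_case` flag; it is not the plugin's or node-ignore's actual implementation, and whether the bot really behaves this way was not verified in this thread.

```python
def matches_prefix(path: str, pattern: str, ignore_case: bool = True) -> bool:
    """Simplified prefix match; `ignore_case` is a hypothetical stand-in for
    the library option mentioned above."""
    if ignore_case:
        path, pattern = path.lower(), pattern.lower()
    return path.startswith(pattern)
```

Under case-insensitive matching, `"r/"` alone already matches a path such as `R/pkg/R/SQLContext.R`, which would make a separate `"R/"` entry redundant but harmless.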
|
@dongjoon-hyun do you have any concern or comments on this? |
|
Test build #121081 has finished for PR 28114 at commit
|
|
Test build #121080 has finished for PR 28114 at commit
|
|
Test build #121082 has finished for PR 28114 at commit
|
nchammas left a comment
There are a few components on Jira that are not currently reflected here. Do we want to add them in now, or just leave it to later?
- Block Manager
- DStreams
- Optimizer
- Scheduler
- Shuffle
- Security
- Web UI
- Windows
.github/autolabeler.yml
Outdated
  - "!/python/pyspark/sql/avro"
  - "!/python/pyspark/sql/streaming.py"
  - "!/python/pyspark/sql/tests/test_streaming.py"
I don't think these ! lines will do what you want.
From https://git-scm.com/docs/gitignore#_pattern_format:
"It is not possible to re-include a file if a parent directory of that file is excluded."
But we can try it out and see what happens.
Oh, actually this syntax comes from the node-ignore package, which the plugin uses. Hopefully it works.
Let's see if it works or not.
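For readers unfamiliar with `!` patterns, the following Python sketch shows the basic "last matching pattern wins" evaluation that gitignore-style rules use. It is a simplified model, not node-ignore itself: the include pattern `/python/pyspark/sql` is a hypothetical example, and the sketch deliberately does not implement the "parent directory excluded" restriction quoted above, which is exactly the open question in this thread.

```python
from fnmatch import fnmatchcase
from typing import List


def is_selected(path: str, patterns: List[str]) -> bool:
    """Simplified gitignore-style evaluation: the last matching pattern wins."""
    selected = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        body = pattern[1:] if negated else pattern
        body = body.lstrip("/")  # treat a leading slash as "relative to repo root"
        # A pattern matches the path itself or anything beneath it.
        if fnmatchcase(path, body) or fnmatchcase(path, body.rstrip("/") + "/*"):
            selected = not negated
    return selected


# Hypothetical rule: include the PySpark SQL package, then carve out one file.
SQL_PATTERNS = ["/python/pyspark/sql", "!/python/pyspark/sql/streaming.py"]
```

In this simplified model, `python/pyspark/sql/functions.py` is selected while `python/pyspark/sql/streaming.py` is not; whether the real bot agrees is what merging the PR is meant to reveal.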
.github/autolabeler.yml
Outdated
MLLIB:
  - "spark/mllib/"
  - "/mllib-local"
  - "/python/pyspark/mllib"
We can collapse both this line and L91 into a single rule, mllib/.
Yeah, let me add another commit. Feel free to edit @nchammas meanwhile :).
Actually, I think we shouldn't collapse to mllib/ because /mllib/ contains both ML and MLlib.
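The difference between a bare `mllib/` rule and a root-anchored `/mllib/` rule can be sketched as follows. This assumes gitignore-like semantics (a pattern with only a trailing slash may match a directory at any depth, while a leading slash anchors it to the repository root) and handles single-component directory names only; the function name and the file paths are illustrative.

```python
from pathlib import PurePosixPath


def matches_dir_pattern(path: str, pattern: str) -> bool:
    """Simplified directory-pattern matching (assumed gitignore-like semantics)."""
    name = pattern.strip("/")
    parts = PurePosixPath(path).parts[:-1]  # directory components only
    if pattern.startswith("/"):
        # Anchored: only the top-level directory may match.
        return len(parts) > 0 and parts[0] == name
    # Unanchored: a directory with that name at any depth matches.
    return name in parts
```

Under these assumptions, the bare `mllib/` matches both `python/pyspark/mllib/feature.py` and everything under the top-level `mllib/` module, which holds ML as well as MLlib code. That is why collapsing the rules into `mllib/` would over-label.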
|
I added |
  - "/bin/spark-sql*"
  - "/bin/beeline*"
  - "/sbin/*thriftserver*.sh"
  - "*SQL*.R"
I was hesitant about adding R with SQL here, because we don't usually label R with SQL for these files, probably because the RDD APIs are private in R. However, I concluded that there's no harm in labeling them SQL, at least for consistency.
|
Test build #121113 has finished for PR 28114 at commit
|
|
Test build #121115 has finished for PR 28114 at commit
|
|
Test build #121114 has finished for PR 28114 at commit
|
|
Looks good to me. |
|
I'll merge this PR in a few days to try it out. Let me know if you guys have any concerns. |
|
I am going to merge this now. It's morning in my time zone, so I'll probably be able to take follow-up action quickly. |
|
Merged to master. |
|
Okay, it seems to have finally started working. So far so good. |
### What changes were proposed in this pull request?
This PR adds some rules that will be used by Probot Auto Labeler to label PRs based on what paths they modify.
### Why are the changes needed?
This should make it easier for committers to organize PRs, and it could also help drive downstream tooling like the PR dashboard.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
We'll only be able to test it, I believe, after merging it in. Given that [the Avro project is using this same bot already](https://github.com/apache/avro/blob/master/.github/autolabeler.yml), I expect it will be straightforward to get this working.
Closes apache#28114 from nchammas/SPARK-31330-auto-label-prs.
Lead-authored-by: Nicholas Chammas <[email protected]>
Co-authored-by: HyukjinKwon <[email protected]>
Co-authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…nd 'dev/.rat-excludes' in BUILD autolabeller
### What changes were proposed in this pull request?
This PR excludes the `ui` directory and the `UI.scala` configuration file from the `CORE` label, and excludes `dev/.rat-excludes` from the `BUILD` label in the autolabeller. See #28218, #28217, #28214 and #28213. There is some context on this in #28114. The syntax is from https://git-scm.com/docs/gitignore#_pattern_format (see also https://github.com/kaelzhang/node-ignore).
### Why are the changes needed?
To label the UI component properly.
### Does this PR introduce any user-facing change?
No, dev-only.
### How was this patch tested?
It uses the same syntax used in other places. I expect to see the actual results after it gets merged, as it's difficult to test it out.
Closes #28228 from HyukjinKwon/SPARK-31330-followup.
Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
What changes were proposed in this pull request?
This PR adds some rules that will be used by Probot Auto Labeler to label PRs based on what paths they modify.
Why are the changes needed?
This should make it easier for committers to organize PRs, and it could also help drive downstream tooling like the PR dashboard.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
We'll only be able to test it, I believe, after merging it in. Given that the Avro project is using this same bot already, I expect it will be straightforward to get this working.