Conversation

@nchammas
Contributor

@nchammas nchammas commented Apr 3, 2020

What changes were proposed in this pull request?

This PR adds some rules that will be used by Probot Auto Labeler to label PRs based on what paths they modify.
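For reference, a minimal sketch of what such a rules file could look like. The label names and patterns here are illustrative only, pieced together from fragments quoted later in this thread, and are not the exact contents of the PR:

```yaml
# .github/autolabeler.yml (sketch; labels and patterns are illustrative)
CORE:
- "/core/"
SQL:
- "/sql/"
- "/bin/spark-sql*"
PYTHON:
- "python/"
R:
- "r/"
- "R/"
```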

Why are the changes needed?

This should make it easier for committers to organize PRs, and it could also help drive downstream tooling like the PR dashboard.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

We'll only be able to test it, I believe, after merging it in. Given that the Avro project is using this same bot already, I expect it will be straightforward to get this working.

@nchammas
Contributor Author

nchammas commented Apr 3, 2020

I took a stab at the rules, but I'm sure y'all will want to make a bunch of changes.


@HyukjinKwon
Member

retest this please

@HyukjinKwon
Member

WDYT @dongjoon-hyun ?

@HyukjinKwon
Member

Currently, the labels are mapped from the components set in JIRA. This PR's approach is based instead on the directories the changes touch, which I think is fine for now.


@maropu
Member

maropu commented Apr 6, 2020

retest this please

@maropu
Member

maropu commented Apr 6, 2020

This PR's approach is based instead on the directories the changes touch, which I think is fine for now.

The automatic labelling based on PR changes looks nice to me. Btw, thanks for the great work, @nchammas !


Member

@HyukjinKwon HyukjinKwon left a comment


@nchammas, I will skim the project and refine the list here, and directly add some commits in your PR by pushing them into your branch in your Spark fork if you don't mind.

I would be able to do this within this week, I believe.

@nchammas
Contributor Author

nchammas commented Apr 7, 2020

I will skim the project and refine the list here, and directly add some commits in your PR by pushing them into your branch in your Spark fork if you don't mind.

No problem. Go ahead.

@HeartSaVioR
Contributor

HeartSaVioR commented Apr 7, 2020

Thanks for the proposal and the work you've done!

While I think this is a huge step forward, there's still a gap compared to manual labelling: manual labelling considers the module. For example, a "Structured Streaming" PR would simply be labelled "SQL" by the auto labeller, making it very hard to filter out from SQL PRs.

IMHO, if it's not hard to have our own auto labeler, following the module marker in the PR title ([CORE], [SQL], and so on) would work more simply, provided the auto labeler can update the label when the PR title changes. Reviewers would correct the marker if it's wrong, which helps not only labelling but also the commit history.

@nchammas
Contributor Author

nchammas commented Apr 7, 2020

Just to be clear, can you share more specific examples of where you think manual labeling would work better than automatic labeling? What would be the modified paths for a PR that you think automatic labeling wouldn't handle well?

Regarding a better commit history, if automatic labeling works well, we could update the merge script to include the labels in the commit message. Then it becomes one less thing that contributors and committers have to worry about.

@HeartSaVioR
Contributor

HeartSaVioR commented Apr 7, 2020

Please see the "STRUCTURED STREAMING" label on Spark PRs:

https://github.com/apache/spark/pulls?q=is%3Aopen+is%3Apr+label%3A%22STRUCTURED+STREAMING%22

If we simply map /sql/ to SQL, PRs for Structured Streaming will just be labelled "SQL", which is the label with the most PRs, so they'll be hidden among SQL PRs. If we instead try to capture the path streaming/, we'd have to differentiate it from DStream.

There's another case where the path doesn't work: #24990. While it proposes to "add" a batch data source, it reads state from "structured streaming", so I think labelling it "structured streaming" is correct, whereas path-based labelling would label it "SQL".

@nchammas
Contributor Author

nchammas commented Apr 8, 2020

Labels are additive. That is, a single PR can be tagged with multiple labels. Here's an example from the Avro project, which is using the same bot we are proposing to add in this PR: apache/avro#847

And you can filter PRs by combinations of labels. For example:

  • is:open is:pr label:SQL label:streaming will search for PRs labeled both SQL and streaming.
  • is:open is:pr label:SQL -label:streaming will search for PRs labeled SQL but not streaming.

Does that address part of your concern?

With regards to #24990, is there a pattern we could potentially add to the rule for the streaming label so that that PR would be labeled correctly?

@HeartSaVioR
Contributor

Yeah, as I said, it's still one huge step forward, and it's still great to get this in as it is. I'm just saying there are some cases which need classification by a human, and we have always done that classification per PR.

With regards to #24990, is there a pattern we could potentially add to the rule for the streaming label so that that PR would be labeled correctly?

I guess not, as this is the counter-case where path-based auto labelling won't work. It still needs a human being to classify, and the auto labeler just needs to reflect that classification.

@Ngone51
Member

Ngone51 commented Apr 8, 2020

I kind of agree that we'd better distinguish SQL and Structured Streaming if possible.

Is it possible to use a path pattern, something like */streaming/* (https://git-scm.com/docs/gitignore#_pattern_format), to catch the streaming-related modules?

And as for #24990, I have no idea why we used state instead of streaming in the package name. But since it should be a corner case, I think we could just update the yml here whenever needed.

@HeartSaVioR
Contributor

And as for #24990, I have no idea why we used state instead of streaming in the package name.

I'm sorry but I don't get it. Could you please elaborate what you meant?

@nchammas
Contributor Author

nchammas commented Apr 8, 2020

Is it possible to use a path pattern, something like */streaming/* (https://git-scm.com/docs/gitignore#_pattern_format), to catch the streaming-related modules?

streaming/ will pick up directories named streaming anywhere in the directory hierarchy. You don't need the asterisks.
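To illustrate, here's a toy model (Python, purely for illustration; this is not the plugin's actual code) of how an unanchored directory pattern like streaming/ behaves:

```python
def dir_pattern_matches(pattern: str, path: str) -> bool:
    """Toy model of a gitignore-style directory pattern with no leading
    slash: it matches any file under a directory of that name, at any depth."""
    name = pattern.rstrip("/")
    # Check the directory components of the changed file's path
    # (everything except the file name itself).
    return name in path.split("/")[:-1]

print(dir_pattern_matches("streaming/", "streaming/src/main/scala/Foo.scala"))     # True
print(dir_pattern_matches("streaming/", "sql/core/src/test/streaming/Bar.scala"))  # True
print(dir_pattern_matches("streaming/", "python/pyspark/sql/functions.py"))        # False
```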

The problem that @HeartSaVioR is pointing out is that some PRs we would consider Structured Streaming PRs don't touch paths that are clearly identifiable as Structured Streaming and not just plain SQL.

By the way @HeartSaVioR, some of the files in #24990, like StateSchemaExtractor.scala, look like they are just for Structured Streaming, so we could add those to the matching rules for the streaming label. It's not a great solution, but it should cover some additional cases.

@Ngone51
Member

Ngone51 commented Apr 9, 2020

I'm sorry but I don't get it. Could you please elaborate what you meant?

I just think, without knowing the background there: if the datasource is for streaming, why don't we add streaming as part of the package name?

But since it's a corner case, can't we just update the yml file here to catch future changes? I guess the yml file won't be updated frequently, since most cases are already covered by the current file.

@HeartSaVioR
Contributor

HeartSaVioR commented Apr 9, 2020

I just think, without knowing the background there: if the datasource is for streaming, why don't we add streaming as part of the package name?

The datasource will run in a "batch query", though the input data is from a "streaming query".

I might have to reiterate; please don't get me wrong. I don't object to the feature; I said it's a huge step forward. I just wanted to point out that we already require a manual module marker in the PR title, and it's fairly accurate (otherwise a committer would fix it), so it seems redundant to do the classification here unless we also automate the PR title. (If we are confident about the classification, then why not?) Yes, that might require another bot implementation, hence I'm not strongly against this.

@Ngone51
Member

Ngone51 commented Apr 9, 2020

@HeartSaVioR I see. And yeah, this recalls how I felt when I first saw the GitHub labels. They really do duplicate the manual labels, and I'm already used to the manual labels rather than the GitHub ones. It's also a fact that our GitHub labels are really not noticeable, with their low-key colors.

I don't know why we introduced GitHub labels in the first place... maybe for better history searching or something else? cc @dongjoon-hyun

@HeartSaVioR
Contributor

HeartSaVioR commented Apr 9, 2020

For me, the GitHub label is easier to use than searching with [...] in titles, and it's even more accurate (that is, GitHub search is not perfect). If I remember correctly, maintainers can see how many PRs are labelled with label X, which could help gauge the overall status.

@HyukjinKwon
Member

I will make some changes soon. We can discuss further after that. Labeling a PR with both SQL and Structured Streaming (or streaming) should be fine; we can't label 100% correctly here.

The current labeling isn't fully correct anyway, because it relies on the tags in JIRA, which can have mistakes.

@HeartSaVioR
Contributor

And I was wrong - anyone can see how many issues and PRs are open under such a label, which is even better.
https://github.com/apache/spark/labels

@HyukjinKwon
Member

@HeartSaVioR, @Ngone51 while you guys are here, can you double-check the list too, in particular the parts you're familiar with?

- "python/"
R:
- "r/"
- "R/"
Member


According to https://github.com/kaelzhang/node-ignore#optionsignorecase-since-400, which the plugin uses, matching seems to be case-insensitive by default, but I kept both entries anyway.

@HyukjinKwon
Member

@dongjoon-hyun do you have any concern or comments on this?

@SparkQA

SparkQA commented Apr 10, 2020

Test build #121081 has finished for PR 28114 at commit aadea31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 10, 2020

Test build #121080 has finished for PR 28114 at commit 84d0c32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 10, 2020

Test build #121082 has finished for PR 28114 at commit 8228cf4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

@nchammas nchammas left a comment


There are a few components on Jira that are not currently reflected here. Do we want to add them in now, or just leave it to later?

  • Block Manager
  • DStreams
  • Optimizer
  • Scheduler
  • Shuffle
  • Security
  • Web UI
  • Windows

Comment on lines 68 to 70
- "!/python/pyspark/sql/avro"
- "!/python/pyspark/sql/streaming.py"
- "!/python/pyspark/sql/tests/test_streaming.py"
Contributor Author


I don't think these ! lines will do what you want.

From https://git-scm.com/docs/gitignore#_pattern_format:

It is not possible to re-include a file if a parent directory of that file is excluded.

But we can try it out and see what happens.
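For what it's worth, the last-match-wins behavior that ! patterns rely on can be sketched like this. This is a simplified Python model under my own assumptions about the node-ignore semantics, and it deliberately ignores the excluded-parent-directory caveat quoted above:

```python
import fnmatch

def label_applies(patterns, path):
    """Simplified last-match-wins evaluation of gitignore-style rules.
    This toy model ignores the caveat that a file cannot be re-included
    once a parent directory of it is excluded."""
    matched = False
    for pat in patterns:
        negated = pat.startswith("!")
        if negated:
            pat = pat[1:]
        pat = pat.lstrip("/")  # treat a leading slash as the repo root
        # A pattern matches the path itself or any file under that directory.
        if fnmatch.fnmatch(path, pat) or path.startswith(pat.rstrip("/") + "/"):
            matched = not negated
    return matched

rules = ["/sql/", "!/sql/core/src/test/"]
print(label_applies(rules, "sql/catalyst/parser/SqlBase.g4"))  # True
print(label_applies(rules, "sql/core/src/test/Foo.scala"))     # False
```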

Member


Oh, actually this syntax is from the node-ignore package, which the plugin uses. Hopefully it works.

Member


Let's see if it works or not.

MLLIB:
- "spark/mllib/"
- "/mllib-local"
- "/python/pyspark/mllib"
Contributor Author


We can collapse both this line and L91 into a single rule, mllib/.

Member


Yeah, let me add another commit. Feel free to edit in the meantime, @nchammas :).

Member


Actually, I think we shouldn't collapse to mllib/ because /mllib/ contains both ML and MLlib.
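One way to keep the two labels separate would be to anchor the rules at the package paths. This is a sketch only; the exact paths are my assumption about the layout, not necessarily the final rules:

```yaml
ML:
- "/mllib/src/main/scala/org/apache/spark/ml/"
- "/python/pyspark/ml/"
MLLIB:
- "/mllib/src/main/scala/org/apache/spark/mllib/"
- "/mllib-local/"
- "/python/pyspark/mllib/"
```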

@HyukjinKwon
Member

HyukjinKwon commented Apr 11, 2020

DStreams is DStream, I believe. And from a rough check of the commit history, Optimizer tends to be just SQL, and Scheduler and Shuffle tend to be just Core. For Security, we can't tell which directory it maps to.

I added Windows and Web UI.

- "/bin/spark-sql*"
- "/bin/beeline*"
- "/sbin/*thriftserver*.sh"
- "*SQL*.R"
Member

@HyukjinKwon HyukjinKwon Apr 11, 2020


I was hesitant to label these R files with SQL, because we don't usually label R PRs with SQL, probably since the RDD APIs are private in R. However, I concluded that there's no harm in labelling them SQL, at least for consistency.

@SparkQA

SparkQA commented Apr 11, 2020

Test build #121113 has finished for PR 28114 at commit 7db3491.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 11, 2020

Test build #121115 has finished for PR 28114 at commit 2ddde69.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 11, 2020

Test build #121114 has finished for PR 28114 at commit 30a09c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nchammas
Contributor Author

Looks good to me.

@HyukjinKwon
Member

I'll merge this PR in a few days to try it out. Let me know if you have any concerns.

@HyukjinKwon
Member

HyukjinKwon commented Apr 13, 2020

I am going to merge this now. It's morning in my time zone, so I will probably be able to take follow-up action quickly.

@HyukjinKwon
Member

Merged to master.

@HyukjinKwon
Member

Okay... it seems to have finally started working. So far so good.

@nchammas nchammas deleted the SPARK-31330-auto-label-prs branch April 13, 2020 13:55
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020

Closes apache#28114 from nchammas/SPARK-31330-auto-label-prs.

Lead-authored-by: Nicholas Chammas <[email protected]>
Co-authored-by: HyukjinKwon <[email protected]>
Co-authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Apr 16, 2020
…nd 'dev/.rat-excludes' in BUILD autolabeller

### What changes were proposed in this pull request?

This PR excludes the `ui` directory and the `UI.scala` configuration file from the `CORE` label, and excludes `dev/.rat-excludes` from the `BUILD` label in the autolabeller. See #28218, #28217, #28214 and #28213

There are some contexts about this #28114.

The syntax is from https://git-scm.com/docs/gitignore#_pattern_format (see also https://github.com/kaelzhang/node-ignore)

### Why are the changes needed?

To label UI component properly.

### Does this PR introduce any user-facing change?

No, dev-only.

### How was this patch tested?

It uses the same syntax used for other places. I expect to see the actual results after it gets merged as it's difficult to test it out.

Closes #28228 from HyukjinKwon/SPARK-31330-followup.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>