Conversation

@nchammas
Contributor

@nchammas nchammas commented Apr 3, 2020

What changes were proposed in this pull request?

This PR adds some rules that will be used by Probot Auto Labeler to label PRs based on what paths they modify.
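For reference, a minimal sketch of what such a rules file could look like. The label names and patterns here are illustrative only, pieced together from fragments quoted later in this thread, and are not the exact contents of the PR:

```yaml
# .github/autolabeler.yml (sketch; labels and patterns are illustrative)
CORE:
- "/core/"
SQL:
- "/sql/"
- "/bin/spark-sql*"
PYTHON:
- "python/"
R:
- "r/"
- "R/"
```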

Why are the changes needed?

This should make it easier for committers to organize PRs, and it could also help drive downstream tooling like the PR dashboard.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

We'll only be able to test it, I believe, after merging it in. Given that the Avro project is using this same bot already, I expect it will be straightforward to get this working.

@nchammas
Contributor Author

nchammas commented Apr 3, 2020

I took a stab at the rules, but I'm sure y'all will want to make a bunch of changes.


@HyukjinKwon
Member

retest this please

@HyukjinKwon
Member

WDYT @dongjoon-hyun ?

@HyukjinKwon
Member

Currently, the labels are mapped from the components set in JIRA. This PR's approach is based instead on the directories the changes touch, which I think is fine for now.


@maropu
Member

maropu commented Apr 6, 2020

retest this please

@maropu
Member

maropu commented Apr 6, 2020

This PR's approach is based instead on the directories the changes touch, which I think is fine for now.

The automatic labelling based on PR changes looks nice to me. Btw, thanks for the great work, @nchammas !


Member

@HyukjinKwon HyukjinKwon left a comment


@nchammas, I will skim the project and refine the list here, and directly add some commits in your PR by pushing them into your branch in your Spark fork if you don't mind.

I would be able to do this within this week, I believe.

@nchammas
Contributor Author

nchammas commented Apr 7, 2020

I will skim the project and refine the list here, and directly add some commits in your PR by pushing them into your branch in your Spark fork if you don't mind.

No problem. Go ahead.

@HeartSaVioR
Contributor

HeartSaVioR commented Apr 7, 2020

Thanks for the proposal and the work you've done!

While I think this is a huge step forward, there's still a gap compared to manual labelling: manual labelling considers the module. For example, a "Structured Streaming" PR would simply be labelled "SQL" by the auto labeller, making it very hard to filter out from SQL PRs.

IMHO, if it's not hard to have our own auto labeler, following the module marker in the PR title ([CORE], [SQL], and so on) would work more simply, provided the auto labeler can update the label when the PR title changes. Reviewers would correct the marker if it's wrong, which helps not only labelling but also the commit history.

@nchammas
Contributor Author

nchammas commented Apr 7, 2020

Just to be clear, can you share more specific examples of where you think manual labeling would work better than automatic labeling? What would be the modified paths for a PR that you think automatic labeling wouldn't handle well?

Regarding a better commit history, if automatic labeling works well, we could update the merge script to include the labels in the commit message. Then it becomes one less thing that contributors and committers have to worry about.

@HeartSaVioR
Contributor

HeartSaVioR commented Apr 7, 2020

Please see the "STRUCTURED STREAMING" label on Spark PRs:

https://github.com/apache/spark/pulls?q=is%3Aopen+is%3Apr+label%3A%22STRUCTURED+STREAMING%22

If we simply map /sql/ to SQL, PRs for Structured Streaming will just be labelled "SQL", which is the label with the most PRs, so they'll be hidden among SQL PRs. If we instead try to capture the path streaming/, we'd have to differentiate it from DStream.

There's another case where the path doesn't work: #24990. While it proposes to "add" a batch data source, it reads state from "structured streaming", so I think labelling it "structured streaming" is correct, whereas path-based labelling would label it "SQL".

@nchammas
Contributor Author

nchammas commented Apr 8, 2020

Labels are additive. That is, a single PR can be tagged with multiple labels. Here's an example from the Avro project, which is using the same bot we are proposing to add in this PR: apache/avro#847

And you can filter PRs by combinations of labels. For example:

  • is:open is:pr label:SQL label:streaming will search for PRs labeled both SQL and streaming.
  • is:open is:pr label:SQL -label:streaming will search for PRs labeled SQL but not streaming.

Does that address part of your concern?

With regards to #24990, is there a pattern we could potentially add to the rule for the streaming label so that that PR would be labeled correctly?

@HeartSaVioR
Contributor

Yeah, as I said, it's still one huge step forward, and it's still great to get this in as it is. I'm just saying there are some cases which need classification by a human, and we have always done that classification per PR.

With regards to #24990, is there a pattern we could potentially add to the rule for the streaming label so that that PR would be labeled correctly?

I guess not, as this is the counter-case where path-based auto labelling won't work. It still needs a human being to classify, and the auto labeler just needs to reflect that classification.

@Ngone51
Member

Ngone51 commented Apr 8, 2020

I kind of agree that we'd better distinguish SQL and Structured Streaming if possible.

Is it possible to use a path pattern, something like */streaming/* (https://git-scm.com/docs/gitignore#_pattern_format), to catch the streaming-related modules?

And as for #24990, I have no idea why we used state instead of streaming in the package name. But since it should be a corner case, I think we could just update the yml here whenever needed.

@HeartSaVioR
Contributor

And as for #24990, I have no idea why we used state instead of streaming in the package name.

I'm sorry but I don't get it. Could you please elaborate what you meant?

@nchammas
Contributor Author

nchammas commented Apr 8, 2020

Is it possible to use a path pattern, something like */streaming/* (https://git-scm.com/docs/gitignore#_pattern_format), to catch the streaming-related modules?

streaming/ will pick up directories named streaming anywhere in the directory hierarchy. You don't need the asterisks.
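To illustrate, here's a toy model (Python, purely for illustration; this is not the plugin's actual code) of how an unanchored directory pattern like streaming/ behaves:

```python
def dir_pattern_matches(pattern: str, path: str) -> bool:
    """Toy model of a gitignore-style directory pattern with no leading
    slash: it matches any file under a directory of that name, at any depth."""
    name = pattern.rstrip("/")
    # Check the directory components of the changed file's path
    # (everything except the file name itself).
    return name in path.split("/")[:-1]

print(dir_pattern_matches("streaming/", "streaming/src/main/scala/Foo.scala"))     # True
print(dir_pattern_matches("streaming/", "sql/core/src/test/streaming/Bar.scala"))  # True
print(dir_pattern_matches("streaming/", "python/pyspark/sql/functions.py"))        # False
```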

The problem that @HeartSaVioR is pointing out is that some PRs we would consider Structured Streaming PRs don't touch paths that are clearly identifiable as Structured Streaming and not just plain SQL.

By the way @HeartSaVioR, some of the files in #24990, like StateSchemaExtractor.scala, look like they are just for Structured Streaming, so we could add those to the matching rules for the streaming label. It's not a great solution, but it should cover some additional cases.

@Ngone51
Member

Ngone51 commented Apr 9, 2020

I'm sorry but I don't get it. Could you please elaborate what you meant?

I just think, without knowing the background there: if the datasource is for streaming, why don't we add streaming as part of the package name?

But since it's a corner case, can't we just update the yml file here to catch future changes? I guess the yml file won't be updated frequently, since most cases are already covered by the current file.

@HeartSaVioR
Contributor

HeartSaVioR commented Apr 9, 2020

I just think, without knowing the background there: if the datasource is for streaming, why don't we add streaming as part of the package name?

The datasource will run in a "batch query", though the input data is from a "streaming query".

I might have to reiterate; please don't get me wrong. I don't object to the feature; I said it's a huge step forward. I just wanted to point out that we already require a manual module marker in the PR title, and it's fairly accurate (otherwise a committer would fix it), so it seems redundant to do the classification here unless we also automate the PR title. (If we are confident about the classification, then why not?) Yes, that might require another bot implementation, hence I'm not strongly against this.

@Ngone51
Member

Ngone51 commented Apr 9, 2020

@HeartSaVioR I see. And yeah, this recalls how I felt when I first saw the GitHub labels. They really do duplicate the manual labels, and I'm already used to the manual labels rather than the GitHub ones. It's also a fact that our GitHub labels are really not noticeable, with their low-key colors.

I don't know why we introduced GitHub labels in the first place... maybe for better history searching or something else? cc @dongjoon-hyun

@HeartSaVioR
Contributor

HeartSaVioR commented Apr 9, 2020

For me, the GitHub label is easier to use than searching with [...] in titles, and it's even more accurate (that is, GitHub search is not perfect). If I remember correctly, maintainers can see how many PRs are labelled with label X, which could help gauge the overall status.

@HyukjinKwon
Member

I will make some changes soon. We can discuss further after that. Labeling a PR with both SQL and Structured Streaming (or streaming) should be fine; we can't label 100% correctly here.

The current labeling isn't fully correct anyway, because it relies on the tags in JIRA, which can have mistakes.

@HeartSaVioR
Contributor

And I was wrong - anyone can see how many issues and PRs are open under such a label, which is even better.
https://github.com/apache/spark/labels

@HyukjinKwon
Member

@HeartSaVioR, @Ngone51 while you guys are here, can you double-check the list too, in particular the parts you're familiar with?

- "python/"
R:
- "r/"
- "R/"
Member


According to https://github.com/kaelzhang/node-ignore#optionsignorecase-since-400, which the plugin uses, matching seems to be case-insensitive by default, but I kept both entries anyway.

@HyukjinKwon
Member

@dongjoon-hyun do you have any concern or comments on this?

@SparkQA

SparkQA commented Apr 10, 2020

Test build #121081 has finished for PR 28114 at commit aadea31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 10, 2020

Test build #121080 has finished for PR 28114 at commit 84d0c32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 10, 2020

Test build #121082 has finished for PR 28114 at commit 8228cf4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

@nchammas nchammas left a comment


There are a few components on Jira that are not currently reflected here. Do we want to add them in now, or just leave it to later?

  • Block Manager
  • DStreams
  • Optimizer
  • Scheduler
  • Shuffle
  • Security
  • Web UI
  • Windows

Comment on lines 68 to 70
- "!/python/pyspark/sql/avro"
- "!/python/pyspark/sql/streaming.py"
- "!/python/pyspark/sql/tests/test_streaming.py"
Contributor Author


I don't think these ! lines will do what you want.

From https://git-scm.com/docs/gitignore#_pattern_format:

It is not possible to re-include a file if a parent directory of that file is excluded.

But we can try it out and see what happens.
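For what it's worth, the last-match-wins behavior that ! patterns rely on can be sketched like this. This is a simplified Python model under my own assumptions about the node-ignore semantics, and it deliberately ignores the excluded-parent-directory caveat quoted above:

```python
import fnmatch

def label_applies(patterns, path):
    """Simplified last-match-wins evaluation of gitignore-style rules.
    This toy model ignores the caveat that a file cannot be re-included
    once a parent directory of it is excluded."""
    matched = False
    for pat in patterns:
        negated = pat.startswith("!")
        if negated:
            pat = pat[1:]
        pat = pat.lstrip("/")  # treat a leading slash as the repo root
        # A pattern matches the path itself or any file under that directory.
        if fnmatch.fnmatch(path, pat) or path.startswith(pat.rstrip("/") + "/"):
            matched = not negated
    return matched

rules = ["/sql/", "!/sql/core/src/test/"]
print(label_applies(rules, "sql/catalyst/parser/SqlBase.g4"))  # True
print(label_applies(rules, "sql/core/src/test/Foo.scala"))     # False
```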

Member


Oh, actually this syntax is from the node-ignore package, which the plugin uses. Hopefully it works.

Member


Let's see if it works or not.

MLLIB:
- "spark/mllib/"
- "/mllib-local"
- "/python/pyspark/mllib"
Contributor Author


We can collapse both this line and L91 into a single rule, mllib/.

Member


Yeah, let me add another commit. Feel free to edit in the meantime, @nchammas :).

Member


Actually, I think we shouldn't collapse to mllib/ because /mllib/ contains both ML and MLlib.
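One way to keep the two labels separate would be to anchor the rules at the package paths. This is a sketch only; the exact paths are my assumption about the layout, not necessarily the final rules:

```yaml
ML:
- "/mllib/src/main/scala/org/apache/spark/ml/"
- "/python/pyspark/ml/"
MLLIB:
- "/mllib/src/main/scala/org/apache/spark/mllib/"
- "/mllib-local/"
- "/python/pyspark/mllib/"
```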

@HyukjinKwon
Member

HyukjinKwon commented Apr 11, 2020

DStreams is DStream, I believe. And from a rough check of the commit history, Optimizer tends to be just SQL, and Scheduler and Shuffle tend to be just Core. For Security, we can't tell which directory it maps to.

I added Windows and Web UI.

- "/bin/spark-sql*"
- "/bin/beeline*"
- "/sbin/*thriftserver*.sh"
- "*SQL*.R"
Member

@HyukjinKwon HyukjinKwon Apr 11, 2020


I was hesitant to label these R files with SQL, because we don't usually label R PRs with SQL, probably since the RDD APIs are private in R. However, I concluded that there's no harm in labelling them SQL, at least for consistency.

@SparkQA

SparkQA commented Apr 11, 2020

Test build #121113 has finished for PR 28114 at commit 7db3491.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 11, 2020

Test build #121115 has finished for PR 28114 at commit 2ddde69.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 11, 2020

Test build #121114 has finished for PR 28114 at commit 30a09c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nchammas
Contributor Author

Looks good to me.

@HyukjinKwon
Member

I'll merge this PR in a few days to try it out. Let me know if you have any concerns.

@HyukjinKwon
Member

HyukjinKwon commented Apr 13, 2020

I am going to merge this now. It's morning in my time zone, so I will probably be able to take follow-up action quickly.

@HyukjinKwon
Member

Merged to master.

@HyukjinKwon
Member

Okay... it seems to have finally started working. So far so good.

@nchammas nchammas deleted the SPARK-31330-auto-label-prs branch April 13, 2020 13:55
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020

Closes apache#28114 from nchammas/SPARK-31330-auto-label-prs.

Lead-authored-by: Nicholas Chammas <[email protected]>
Co-authored-by: HyukjinKwon <[email protected]>
Co-authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Apr 16, 2020
…nd 'dev/.rat-excludes' in BUILD autolabeller

### What changes were proposed in this pull request?

This PR excludes the `ui` directory and the `UI.scala` configuration file from the `CORE` label, and excludes `dev/.rat-excludes` from the `BUILD` label in the autolabeller. See #28218, #28217, #28214 and #28213

There are some contexts about this #28114.

The syntax is from https://git-scm.com/docs/gitignore#_pattern_format (see also https://github.com/kaelzhang/node-ignore)

### Why are the changes needed?

To label UI component properly.

### Does this PR introduce any user-facing change?

No, dev-only.

### How was this patch tested?

It uses the same syntax used for other places. I expect to see the actual results after it gets merged as it's difficult to test it out.

Closes #28228 from HyukjinKwon/SPARK-31330-followup.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>